— January 10, 2018
IBM’s struggle with international SEO
Large enterprises that try to serve the same content to users in different countries have long struggled to do it effectively. At IBM, we have tried just about every tactic to tell Google that a certain country and language version of a page is unique to that country. Early on, we tried Dublin Core metadata. Google ignored it, as it does most of the metadata in the <head> element of an HTML page.
Next, we tried putting the country and language code (cc-lc) in the URL. Google also ignored our country and language codes. As a result, if two pages in the same language were targeted for different countries, Google would choose to rank the one that it found most relevant to the query. In most cases for English, that was the US page because the content originated in the US, and was more frequently updated in the US. So organic search users in the UK would get US pages.
This was a problem when the pages had US-specific offers on them. It was especially a problem when the US page had US contact modules and pricing. Our UK marketing organization had to spend a lot more money on paid search to get the results they needed, because they simply could not outrank our US pages in organic search.
More recently, we tried moving that cc-lc code to the front of the URL string, and registering it in Google Search Console. Google’s own Webmaster Guidelines suggest this tactic, especially for product pages where it is in their best interests to serve the correct page to the correct user in the correct country. For reasons that will be made clear, this worked well enough as long as the pages didn’t have duplicate content.
Of course, we combined the URL tactic with hreflang and canonical tags. Together, these seemed to work for page ranking. In other words, if a page was in Google’s index and these tags were implemented correctly, Google would serve the right page for the right country and language in its search results.
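As a rough illustration of the tagging described above, the following Python sketch builds a canonical tag plus hreflang alternates for one locale version of a page. The example.com URLs, the head_tags helper, and the locale-to-URL mapping are hypothetical and are not IBM's actual implementation. Note that hreflang values use the language-region form (for example, en-gb), the reverse of the cc-lc order described above.

```python
from html import escape
from typing import Dict, List, Optional


def head_tags(page_url: str, alternates: Dict[str, str],
              default_url: Optional[str] = None) -> List[str]:
    """Return canonical and hreflang <link> tags for one locale version of a page.

    page_url    -- URL of the page being rendered (used as its own canonical)
    alternates  -- hreflang code -> URL for every locale version, including this one
    default_url -- optional fallback URL, emitted as hreflang="x-default"
    """
    tags = [f'<link rel="canonical" href="{escape(page_url)}" />']
    # Sort by hreflang code so the output order is stable between builds.
    for code, url in sorted(alternates.items()):
        tags.append(
            f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}" />'
        )
    if default_url is not None:
        tags.append(
            f'<link rel="alternate" hreflang="x-default" href="{escape(default_url)}" />'
        )
    return tags


# Hypothetical URLs for illustration only.
ALTERNATES = {
    "en-us": "https://www.example.com/us-en/products/widget",
    "en-gb": "https://www.example.com/uk-en/products/widget",
}

if __name__ == "__main__":
    # Tags for the UK page, with the US page as the fallback.
    for tag in head_tags(ALTERNATES["en-gb"], ALTERNATES, ALTERNATES["en-us"]):
        print(tag)
```

Every locale version would carry the same set of alternates, including a reference to itself, because hreflang annotations are expected to be reciprocal.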
But earlier this year, Google started aggressively deleting pages that it considered duplicates from its index. This deletion happens outside of the ranking algorithm, so the codes you use to tell Google that a page is for a particular country have no effect on it.
According to Google, the algorithm compares the text in the page. If two pages have a lot of duplicate content blocks, the algorithm marks them as duplicates. Google doesn’t publish its algorithms because that would give black hat SEOs a recipe for manipulating search results. But the consensus among SEOs is that if two pages share upwards of 70 percent of their words, Google considers them duplicates and targets them for removal from its index. In the last year, Google has filtered out millions of IBM.com pages from its index.
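To get a feel for how easily localized pages cross that line, here is a minimal Python sketch that measures word overlap between two page texts. It is not Google's algorithm, which is unpublished. The overlap measure (shared words divided by the longer page's word count), the function names, and the sample strings are all assumptions for illustration, and the 0.70 threshold is simply the SEO rule of thumb mentioned above.

```python
import re
from collections import Counter

# SEO rule of thumb quoted in the text, NOT a value published by Google.
DUPLICATE_THRESHOLD = 0.70


def words(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_overlap(text_a: str, text_b: str) -> float:
    """Fraction of words the two texts share, relative to the longer text."""
    counts_a, counts_b = Counter(words(text_a)), Counter(words(text_b))
    longest = max(sum(counts_a.values()), sum(counts_b.values()))
    if longest == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((counts_a & counts_b).values())
    return shared / longest


def looks_duplicate(text_a: str, text_b: str,
                    threshold: float = DUPLICATE_THRESHOLD) -> bool:
    """True when the overlap meets the (assumed) duplicate threshold."""
    return word_overlap(text_a, text_b) >= threshold


if __name__ == "__main__":
    # Made-up page snippets that differ only in country-specific words.
    us_page = "Contact IBM sales for cloud pricing in US dollars"
    uk_page = "Contact IBM sales for cloud pricing in UK pounds"
    print(round(word_overlap(us_page, uk_page), 2), looks_duplicate(us_page, uk_page))
```

Two pages that differ only in their contact and pricing details, like the made-up snippets here, share most of their words, so a localized copy can look like a duplicate under this kind of measure.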
When we contacted Google to ask why they filtered out so many of our pages, they cited duplicate content as the problem. They recommended that we create unique content for each language and country. That’s not really an option for a company that does business in 150 countries and dozens of languages. We typically create one set of content and localize it in all the relevant markets. In English- and Spanish-speaking markets, that means creating functional duplicates.
The clincher was a crazy aspect of Google’s index filter. When it found duplicates in the index, it left one random duplicate in the index, and that would be the one served in all markets that speak that language. For some of our most important product searches, it served the page for the Bahamas to all English-language markets, including the US. When we found this out, we knew we had to find a solution to our duplicate content problem.
The solution can be found in the official guidance on the Webmaster page linked to above:
If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city. [emphasis mine]
The solution is actually fairly simple. As I said, developing that much unique content in each market is not really an option for us: it would take a larger army of copywriters and editors than we can afford. So we will focus on combining pages into one canonical page.
Rather than publishing a different URL for each language and country combination, we will try publishing one URL per language for each marketing page. We then will use dynamic systems to load the unique aspects of those experiences to the users in particular countries.
For example, users in the US will see all currency information in dollars after the system detects their location from their IP address. Anything that the local market wants to promote on that page can be loaded at the same time, including local offers, events, or partners. For IBM, a language-first approach would also reduce our content footprint by a factor of 10. That is a huge benefit independent of SEO.
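The language-first idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LOCAL_MODULES table, the page_for function, the example.com URL pattern, the placeholder offers, and the USD fallback are all hypothetical, and the country code is assumed to come from whatever IP geolocation service the site already uses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LocalModules:
    """Country-specific pieces loaded into a shared, per-language page."""
    currency: str
    offers: Tuple[str, ...] = ()


# Hypothetical per-country data; real values would come from the local markets.
LOCAL_MODULES = {
    "US": LocalModules(currency="USD", offers=("Placeholder US offer",)),
    "GB": LocalModules(currency="GBP", offers=("Placeholder UK event",)),
}

# Assumed fallback for countries without their own modules.
DEFAULT_MODULES = LocalModules(currency="USD")


def page_for(language: str, path: str, country: Optional[str]) -> dict:
    """Build the page context: one URL per language, local modules per country.

    country is an ISO country code assumed to be resolved upstream from the
    visitor's IP address; None or an unknown code falls back to the defaults.
    """
    local = LOCAL_MODULES.get((country or "").upper(), DEFAULT_MODULES)
    return {
        # The URL depends only on the language, never on the country.
        "url": f"https://www.example.com/{language}/{path}",
        "currency": local.currency,
        "offers": list(local.offers),
    }


if __name__ == "__main__":
    print(page_for("en", "cloud", "US"))
    print(page_for("en", "cloud", "GB"))
```

Both calls return the same URL, so only one English page exists for a search engine to index, while the currency and offers differ by country.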
If you run a smaller enterprise, you could perhaps create unique content for each country and language combination, and serve them through Google’s AMP. But for an enterprise with thousands of marketers creating content every day, there is only one sustainable model: language first.