Artificial Intelligence: a solution for Artificial Content? (Fighting Hidden Keyword Spam)

I was recently doing some research for a blog post I’m writing on screen readers vs. SEO for the RI:Technology Blog. A 2005 post by Matt Cutts of Google, entitled “SEO Mistakes: Unwise Comments,” drew many comments from readers concerned that their use of hidden content would be treated as keyword spam.

There are plenty of legitimate reasons for hiding text from a sighted user on page load; in many cases it is simply a stylistic effect, and the content is surfaced through user interaction. The problem is not the use of the technique, but its misuse. The official Google Webmaster Guidelines do have a page dedicated to hidden text and links, but they also list this basic principle:

Make pages primarily for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.”

So how do we determine whether a technique is being used appropriately? There is always the old standby: disable CSS. Does the page still make sense, or is it littered with content not meant for human consumption? This test covers both “instructional” help for users of assistive technologies and content that stays suppressed until the user opts to display it. (Indeed, it is a progressive enhancement best practice to put the content on the page and then hide it with JavaScript, so that it remains available even if JS is turned off.) This works fine for an intelligent human reviewer, but we all know that in GoogleLand, manual human review is NOT a desired goal.
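To make that “disable CSS” check a little more concrete, here is a minimal sketch in Python using BeautifulSoup. It only inspects inline styles such as display:none and visibility:hidden; real pages also hide content via external stylesheets, class names, and off-screen positioning, so treat the split_visible_hidden helper as an illustration of the idea rather than a working spam detector.

```python
# A rough sketch of the "disable CSS" sanity check (Python + BeautifulSoup).
# It only inspects inline styles, so it illustrates the idea rather than
# implementing a real detector.
from bs4 import BeautifulSoup

HIDDEN_MARKERS = ("display:none", "visibility:hidden")

def split_visible_hidden(html):
    """Return (visible_text, hidden_text), judged by inline styles only."""
    soup = BeautifulSoup(html, "html.parser")
    hidden_chunks = []
    for el in soup.find_all(style=True):
        style = el["style"].lower().replace(" ", "")
        if any(marker in style for marker in HIDDEN_MARKERS):
            hidden_chunks.append(el.get_text(" ", strip=True))
            el.decompose()  # drop it from the "visible" view of the page
    return soup.get_text(" ", strip=True), " ".join(hidden_chunks)

if __name__ == "__main__":
    sample = (
        "<p>Welcome to our little shop.</p>"
        '<div style="display: none">cheap widgets best widgets buy widgets</div>'
    )
    visible, hidden = split_visible_hidden(sample)
    print("Visible:", visible)
    print("Hidden: ", hidden)
```

A human looking at the two buckets can tell in a second whether the hidden text is a skip link, a collapsed help panel, or a pile of keywords; the interesting question is whether a machine can.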

So what about artificial intelligence as an option? Ever so slowly (yet steadily), we are moving forward in the area of natural language processing. What if AI and NLP were used to assess the semantics of a page’s contents? When I access a website, I don’t expect to see a series of keywords. A human reading a page is looking for content, not keywords describing the content. Some intelligence could be used to examine the overall syntax of the text, to ensure it’s legitimate “Content”.
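As a toy illustration of what such a check might measure (this heuristic is entirely my own, not anything a search engine has published): natural prose is full of function words like “the”, “and”, and “to”, while a dumped keyword list is almost all bare content words. Even a simple ratio separates the two.

```python
# A toy heuristic, purely illustrative: score a block of text by the fraction
# of words that are NOT common function words. Real sentences score lower;
# a dumped keyword list scores close to 1.0.
import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "for",
    "is", "are", "was", "were", "it", "this", "that", "with", "as", "at",
    "i", "we", "you", "not", "when",
}

def keyword_spam_score(text):
    """Fraction of words that are not function words (0.0 to 1.0)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    content_words = [w for w in words if w not in FUNCTION_WORDS]
    return len(content_words) / len(words)

if __name__ == "__main__":
    prose = "When I access a website, I expect to read content, not keywords."
    stuffed = "cheap hotels discount hotels best hotels hotel deals book hotels"
    print(f"prose:   {keyword_spam_score(prose):.2f}")   # noticeably lower
    print(f"stuffed: {keyword_spam_score(stuffed):.2f}")  # 1.00
```

A real system would go far beyond counting function words, but the point stands: syntax leaves fingerprints, and keyword dumps don’t have them.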

Naturally, specific page elements would have to be accounted for. A list of navigation links may look suspiciously like keywords. This is where semantic markup comes into play, in particular some of the new tags proposed for HTML5 (nav or section, for example) or roles outlined in WAI-ARIA. A series of (internal) links would be expected in the nav element, but a collection of random words not appearing in proper syntactic form elsewhere in the document would be considered suspect.
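Here is a sketch of that idea, again with BeautifulSoup and with a made-up link threshold of my own choosing: a dense cluster of links inside a nav element (or an element with role="navigation") is expected, but the same cluster elsewhere in the document gets flagged for a closer look.

```python
# Sketch of the "semantic context" idea: link clusters are fine inside
# navigation landmarks, suspicious outside them. The threshold is arbitrary.
from bs4 import BeautifulSoup

def suspicious_link_clusters(html, threshold=5):
    """Yield elements containing many links but sitting outside any
    navigation landmark (<nav> or role="navigation")."""
    soup = BeautifulSoup(html, "html.parser")
    for el in soup.find_all(["div", "p", "ul", "span"]):
        links = el.find_all("a")
        if len(links) < threshold:
            continue
        in_nav = any(
            parent.name == "nav" or parent.get("role") == "navigation"
            for parent in el.parents
        )
        if not in_nav:
            yield el

if __name__ == "__main__":
    page = """
    <nav><ul>
      <li><a href="/">Home</a></li><li><a href="/about">About</a></li>
      <li><a href="/blog">Blog</a></li><li><a href="/contact">Contact</a></li>
      <li><a href="/help">Help</a></li>
    </ul></nav>
    <div>
      <a href="x">cheap widgets</a> <a href="x">best widgets</a>
      <a href="x">buy widgets</a> <a href="x">widget deals</a>
      <a href="x">widgets online</a>
    </div>
    """
    for el in suspicious_link_clusters(page):
        print("Flagged:", el.get_text(" ", strip=True))
```

The navigation list sails through; the anonymous div full of “widgets” links is the one that warrants scrutiny. Semantic markup gives the crawler the context it needs to tell the difference.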

Obviously, whenever there are rules, there will be people setting out to break them. But if we are cognizant of how these black hat techniques differ from legitimate best practices, surely we can filter them out accordingly. It’s a shame to penalize those who are honestly working to enhance the user experience rather than cater to search engines.

Or, as Eric Meyer said at the Spring Break conference last week, the best Google juice is good content that everyone wants to link to. Do it right, and the hits will come organically.