Flash Indexing

No Comments »

As a previously scheduled post on accessibility and indexability went live, a few folks pointed me to some news on searchable/indexable swfs.

A few of the articles I checked out:

  1. Google Now Crawling and Indexing Flash Content
  2. Improved Flash Indexing (Official Google Webmaster Central Blog)
  3. SWF searchability FAQ

I will admit I read the articles with a critical eye; Google has been flirting with retrieving some amount of content from .swfs for quite a while. Yet for the first time, I got a sense there has been real progress.

The premise is that Google and Yahoo! spiders will access the content via an enhanced Flash player. This enhanced player will give the search engine spiders the ability to navigate within the Flash experience, and access and index associated resources.

This is an exciting prospect, as until now many site designers were resigned to duplicating the content that was available from within Flash on the HTML page wrapper that housed the Flash. This followed the web development strategy of ‘progressive enhancement’, where a non-Flash-enabled site visitor (like the Googlebot) would be able to access at least the core content, and the more capabilities the visitor possessed (CSS, rich media), the more enhanced their experience. In addition to potentially increasing maintenance costs (to ensure the two versions were in sync), implementing this method is sometimes not feasible at all, depending on the complexity of the application.

I was eager to see how what I knew about Flash accessibility best practices came into play, and read through the documentation closely. As I did so, however, I found I had more questions than answers. In the Google Webmaster Central Blog, there is an intriguing statement:

we do not generate any anchor text for Flash buttons which target some URL, but which have no associated text.

When I first read this, I believed it meant that some links may not be followed. This makes sense from the standpoint that a button with no associated text would essentially be a hidden link, and following it may inaccurately represent the content of the site. However, the statement actually focuses on the generation of anchor text. I am not clear where this generation would take place; perhaps in a virtual buffer of all the Flash content? How does the content of the link (assuming that it DOES get followed) get associated with the overall Flash content, since there is no anchor text?

Another consideration is the use of tabindices. When coding Flash for accessibility, tabindices may be used to specify reading order. Is this something that search engine spiders will be aware of? Equally, there is a recommendation in the Google docs to “consider replacing the text within an image... [to make] ...less informative content... invisible to [Google]”.

This statement made me question the sophistication of this enhanced player. For years, Google has managed to determine that items such as copyright statements are not significant content items. So why are they suddenly unable to do so when the content comes from a .swf? The recommendation to move content from an accessible to an inaccessible form seems terribly shortsighted and irresponsible.

We are now quite sophisticated in using semantic markup for HTML pages to offer search engine spiders some information about the relative importance of elements. I can only assume that all text being pulled from a Flash element is given equal weighting. If this is the case then, as the Adobe Developer Center documentation notes, we will certainly need to see “best practices emerge over time for creating SWF content that is more optimized for search engine rankings”.

Another major challenge in opening applications up to search is being able to direct the searcher to the relevant section within the experience. This is also a concern with accessible PDFs. Much of the documentation recommended the use of deep-linking. However, it’s not clear to me how the spider is made aware of these deep-links. I will admit that my own exposure to deep-linking with a Flash experience is limited: we did this for the People’s Choice Awards site, where querystring parameters were fed into the .swf using flashVars. While the Adobe Developer Center documentation mentions this practice (“you can create multiple HTML files that provide different variables to the SWF and start your application at the correct subsection”), I hadn’t been aware that Google supported variables in their search result URLs…
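For anyone who hasn’t seen the querystring-to-flashVars hand-off, here is a minimal sketch in TypeScript. It is not the code we used on the People’s Choice Awards site; the function names and the `section` and `nominee` parameters are hypothetical, chosen only to show the idea of whitelisting a few querystring parameters and re-encoding them as a FlashVars string.

```typescript
// Hypothetical whitelist of deep-link parameters the .swf understands.
const ALLOWED_PARAMS: readonly string[] = ["section", "nominee"];

// Decode one querystring component; return null if the encoding is malformed.
function safeDecode(component: string): string | null {
  try {
    // "+" means a space in form-encoded querystrings.
    return decodeURIComponent(component.replace(/\+/g, " "));
  } catch (e) {
    return null;
  }
}

// Build a FlashVars string ("name=value&name=value") from a querystring,
// keeping only whitelisted parameters that have a non-empty value.
function buildFlashVars(
  queryString: string,
  allowed: readonly string[] = ALLOWED_PARAMS
): string {
  const trimmed =
    queryString.charAt(0) === "?" ? queryString.slice(1) : queryString;
  const pairs: string[] = [];
  for (const pair of trimmed.split("&")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip empty names and bare flags
    const name = safeDecode(pair.slice(0, eq));
    const value = safeDecode(pair.slice(eq + 1));
    if (name === null || value === null || value === "") continue;
    if (allowed.indexOf(name) !== -1) {
      pairs.push(encodeURIComponent(name) + "=" + encodeURIComponent(value));
    }
  }
  return pairs.join("&");
}
```

In a browser, the input would presumably be `window.location.search`, and the result would be handed to the embed code as its FlashVars value. The whitelist is a design choice on my part: it keeps arbitrary querystring content from being passed straight into the movie.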

There was also some mention made that external files linked to from within the .swf will be indexed, but separately. The implication is that the contents of a data file will show up in search results, separate from its presentational format (and overall context). While I assume this will be resolved in future releases, a diligent developer will likely want to ensure their “include” files are not accessed on their own. I believe my colleagues did something similar when we launched the Wal-Mart Halloween Flash/HTML Hybrid site last year. They did some great work with deep-linking and history management, and handled orphan content loading (I refer anyone interested in the specifics to Toby Miller). My concern is that based on how this functionality was announced (that developers did not need to do anything for their swfs to be indexed), there will be little motivation to ensure content is always delivered in the proper context.

Obviously, I am very interested to see if this development will enhance the experience of users of assistive technologies. Sadly, I’m not sure it will, as the major breakthrough has been made with the enhanced player. Unless Adobe also plans to work with makers of assistive technologies, I don’t know that any of these benefits will be realized. If anything, site designers may stop some of their earlier practices (textual alternatives).

I’m very interested to know if any of the accessibility properties and best practices have made it into this enhanced search — how great would it be if the use of these properties increased the weighting of content!

what’s the deal with… findability, searchability, indexability and accessibility?

2 Comments »

As a front-end web developer, I often hear the terms “findable”, “searchable”, “indexable” and “accessible” thrown around interchangeably. For many, they mean that the content can be accessed by a non-human, be it a screen reader or a search engine spider. On some level this is true, but there are several significant differences that must not be overlooked.

For the sake of this discussion:

  • Findable: how easily a site can be found when using a search engine (rankings). Yes, I realize that this term also refers to how easily content can be found once the user is on the site, but I’m ignoring that aspect of it for now…
  • Searchable: how easily specific content within a site can be accessed when using a search engine (deep-linking)
  • Indexable: how easily the content of a site may be retrieved and used in search engine results
  • Accessible using AT: how easily someone using assistive technologies can use your site

(ShoeMoney.com has compiled a list of definitions for SEO from some industry experts, as well)

A site created completely in Flash or Flex may be findable thanks to the use of meta-data, but it is not indexable. With some diligent coding, information may be searchable, but this is no guarantee that it will be accessible.

(Not content with these descriptions? Have more to add? Please let me know what you think in the comments!)

As I’ve mentioned, my background is in accessibility: prior to coming to Resource, I worked on large subscription-based web applications. SEO was not a consideration at all. However, accessibility was. When I first came to Resource, I was eager to see how the two complemented and contrasted each other.

Overall, I see some overlap between the areas. However, their focus is different.

SEO is based on a page mentality - this is apparent in the search results that come up. Many common SEO techniques are applied at the page level, via adding meta tags or optimizing title tags. This is how a site that requires login, or is built using a technology like Flash or Flex, can appear in search results. A search engine can access meta information about the page, and use that to rank it. Findability relates to the notion of the discovery of the page itself.

A secondary notion is that of searchability. A web application may be found on Google, but can the specific content that is being sought be retrieved? Searchability refers to the idea that a site visitor can easily navigate to the specific information he’s searching for within the site, once the site itself has been discovered.

Both searchability and indexability deal with how elements of the page can be accessed, but arguably in different directions. Deeplinking into a flash movie may facilitate searchability, helping a site visitor dig into the site at a specific point. In contrast, indexability refers to the ability of a search engine spider to do a broad pull of content from the site.

Where SEO and Accessibility really start to diverge is when we move beyond the retrieval of content itself. A search engine spider is only interested in the data, so that the appropriate search result may be returned to an information seeker. In contrast, accessibility refers to the ability of a site visitor to navigate within an experience. The implications are significant: each interaction must be coded in a way such that a screen reader user can activate the change, and be notified of any changes that occur.

Another important distinction is the extent to which the site content is made available. A site may choose to optimize, or make indexable, only certain aspects of the site. In contrast, accessibility requires that all content be available and usable.

The New Image of Search

1 Comment »

Recently, Mark Scholl tweeted, “I need a new picture to satisfy business purposes.” While I realize now that he probably meant a photo of himself, he’s “the search guy” to me, so I first thought that he wanted an icon that represented what he did.

When we think of search, we think of that trusty magnifying glass. But is that really appropriate today? When was the last time you ‘searched’ for something online and had to try really hard to find it? These days our bigger problem is weeding through the huge results set. The problem now isn’t finding something, it’s filtering to find the best thing.

So what’s an icon for a filter? Other than a sieve, of course. The first thing that comes to mind for me is that little org-chart icon, with the boxy-thing above two other boxy-things. But does that really capture “search”?

Artificial Intelligence: a solution for Artificial Content? (Fighting Hidden Keyword Spam)

No Comments »

I was recently doing some research for a blog post I’m writing on screen readers vs SEO for the RI:Technology Blog. A 2005 blog post by Matt Cutts from Google entitled “SEO Mistakes: Unwise Comments” prompted many readers to voice concerns that hidden content would be considered keyword spam.

There are plenty of legitimate reasons for hiding text from a sighted user on page load, and in many cases, this is simply a stylistic effect and the content will be surfaced as a result of user interaction. It is not about the use of the technique, but rather the misuse. The official Google Webmaster Guidelines do have a page dedicated to hidden text and links, but it also lists as a basic principle:

Make pages primarily for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.”

So how do we determine if a technique is being used appropriately? There has always been the old standby technique to disable CSS. Does the page still make sense, or is it littered with content not meant for human consumption? This would solve our concerns about “instructional” help for users of assistive technologies, and the suppression of content until the user opts to display it. (Indeed, it is a progressive enhancement best practice to have the content on the page and then hide it using javascript anyway, so that it is available even if JS is turned off.) This works fine for intelligent human users, but we all know that in GoogleLand, human reviews are NOT a desired goal.
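The “content on the page, hidden by script” pattern mentioned above might be sketched like this in TypeScript. The names `Panel`, `Trigger` and `enhanceDisclosure` are hypothetical; the two interfaces describe only the small slice of a DOM element that the sketch needs, which I am assuming a real element would satisfy.

```typescript
// Only the parts of an element this sketch relies on.
interface Panel {
  hidden: boolean;
}

interface Trigger {
  addEventListener(type: "click", listener: () => void): void;
}

// The panel's content is already in the markup. Script hides it and wires up
// the toggle, so with scripting turned off the content simply stays visible.
function enhanceDisclosure(trigger: Trigger, panel: Panel): void {
  panel.hidden = true;
  trigger.addEventListener("click", () => {
    panel.hidden = !panel.hidden;
  });
}
```

Because the hiding happens in script rather than in the delivered markup or stylesheet, a visitor (or spider) that never runs the script gets the full content.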

So what about artificial intelligence as an option? Ever so slowly (yet steadily), we are moving forward in the area of natural language processing. What if AI and NLP were used to assess the semantics of page contents? When I access a website, I don’t expect to see a series of keywords. A human accessing a page is looking for content, not keywords describing the content. Some intelligence could be used to identify the overall syntax of the content, to ensure it’s legitimate “Content”.

Naturally, specific page elements would have to be accounted for. A list of navigation links may look suspiciously like keywords. This is where semantic markup comes into play, in particular some of the new tags proposed for HTML5 (nav or section, for example) or roles outlined in WAI-ARIA. A series of (internal) links would be expected in the nav element, but a collection of random words not appearing in proper syntactic form elsewhere in the document would be considered suspect.
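To make the idea a little more concrete, here is a deliberately crude sketch in TypeScript. It is in no way how Google (or anyone else) actually evaluates content, and it is nowhere near real NLP: it simply measures how many of the words in a block are common function words, on the assumption that real prose is full of them and a keyword list is not. The word list, the threshold and the function names are all my own inventions.

```typescript
// A tiny, hand-picked list of English function words (illustrative only).
const FUNCTION_WORDS: readonly string[] = [
  "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "for",
  "with", "is", "are", "was", "were", "it", "this", "that", "as", "at",
  "by", "be",
];

// Split text into lowercase word tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Fraction of tokens that are function words (0 for empty text).
function functionWordRatio(text: string): number {
  const words = tokenize(text);
  if (words.length === 0) return 0;
  let count = 0;
  for (const word of words) {
    if (FUNCTION_WORDS.indexOf(word) !== -1) count++;
  }
  return count / words.length;
}

// Flag a block as a suspected keyword list. Blocks inside a nav element are
// exempt, since a run of link labels is expected there.
function looksLikeKeywordList(
  text: string,
  insideNav: boolean,
  threshold: number = 0.15
): boolean {
  if (insideNav) return false;
  const wordCount = tokenize(text).length;
  return wordCount >= 5 && functionWordRatio(text) < threshold;
}
```

The `insideNav` flag is where the semantic markup earns its keep: the same string of words is treated differently depending on the element that contains it.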

Obviously, whenever there are rules, there will be people setting out to break them. But if we are cognizant of how these black hat techniques differ from legitimate best practices, surely we can filter them out as such. It’s a shame to penalize those who are honestly working to enhance the user experience, not cater to search engines.

Or, as Eric Meyer stated at the Spring Break conference last week, the best Google juice is having good content so that everyone wants to link to you. Do it right, and the hits will come organically.

Google Friend Connect - first (premature) thoughts

2 Comments »

My thoughts are premature, because I haven’t actually seen Google Friend Connect (GFC; can I call it GFC?) in action; I’ve only seen the few screenshots that Google has released. That being said, I thought I’d respond to my impressions or understandings of the service, before seeing what it really is. That way if I’m wrong, I can claim ignorance :)

In a press release, it was stated that “Google Friend Connect is about helping the ‘long tail’ of sites become more social.” The idea was that “without requiring coding experience”, GFC (geez, I’m totally hurting my search engine ranking by not spelling that out) would provide site maintainers with a way to tap into the benefits of social networks, attracting and engaging more visitors.

As a developer, implementation is always on my mind. I’m interested in how a series of widgets or wizards magically adds “social” to your site. When you’re working on a specific platform (say, Facebook or MySpace), you can tap into a known architecture and codebase to aid in the integration (Facebook apps, WordPress plugins). When you’re not, well, is it really an integrated solution?

Pluck already does a good job at offering blogs, forums and other social goodies to sites, either via javascript or an API. People have long been able to add polls and forums to their sites via services like bravenet or dreambook (remember when it was all the rage to have a guestbook? Now THAT was engagement!). The functionality may be basically the same, so what’s the big draw?

It’s the data. Isn’t everything about the data these days? Pluck or any other third-party hosted widget has the data living… somewhere. To a user, it may seem like that blog post or poll is on your site, but if it’s being pulled in using javascript, the good ole Google crawler isn’t going to associate it with your site.

Hm… the google crawler… may not index all the information associated with your site (blog postings, reviews, comments) if it’s hosted by a third party social site, if it gets pulled in dynamically.. but what if google DID? What if google provided the hooks into the social stuff? I will definitely be interested to see if they’ve figured this piece of it out. They wrote the rules, so it will be interesting to see if they get re-written.

Update: an article on ReadWriteWeb states that the social magic will be added in via iframe.. so much for my high hopes of making the social in your site actually seem like your site. I thought we’d all communally agreed back about 5 years ago that iframes were evil? :(

The other consideration about data is related to personal data. Right now a site implementing third-party software retains control of the data. A site integrating a third party product may or may not have the same control. It appears that one limitation with MySpace’s Data Availability initiative is that MySpace retains control over the data it makes available. If a site implements GFC, can the user hook into one or more existing social networks, and how are any actions taking place on the host site being tracked? I think of Disqus, which centralizes blog comments. When I respond (after having authenticated) to a blog posting where the author has set up Disqus, my comments are stored as part of my Disqus profile. Disqus says it “makes your comments more interactive for readers and easier to manage for you — all while connecting your community with other blogs”, but it does this largely on its own domain. Will google.com/friendconnect serve as a landing pad for user behaviours online? Currently it appears that this is the distinguishing feature of Google’s *connect offering, as opposed to the recent offerings by Facebook and MySpace.

Another consideration with the lack of an existing primary platform is how conflicting information will be resolved. I will admit that I don’t yet have a clear understanding of how GFC will tie into the authentication of other sites, whether a user will be able to select one platform with which to associate (and from which his friends and preferences will migrate), or whether he will be able to pick and choose. Just two days ago there was an article in ReadWriteWeb stating that filtering is the next step for social media. We are at a breaking point with too much duplicated information out there, and now we need to go through the tedious work to de-dupe and validate. I don’t have a clear sense of what the GFC strategy is for assigning friends to groups with varying levels of privileges, and how referential integrity across platforms may be ensured (if Melissa de-friends iKeif on Facebook, what happens to his access to her data on my site?).

I will be very interested to see how this all plays out.. I’ll be eager to read the full reports from the few whitelisted sites that will be trying things out.

headers and images - alt text and the weight factor

6 Comments »

I am drafting an article for the RI:Technology blog on Screen readers and Search Engines, and was reviewing a paper a colleague wrote about Search. He mentioned sIFR as a technique “to bring content to search engines”. I asked another colleague about this, as I’d always just considered sIFR as a “stylability” technique.
We started talking about the weight factor of search engines, whether content written to a page and then sIFRized would be weighted more heavily than the alt text of an image. I hadn’t really thought about that before. I then mentioned a habit I have of placing images within a heading tag, i.e.
<h1><img alt="descriptive text" /></h1>. Toby asked if this really worked, if the alt text would be considered the header. I realized that I’d never really verified it before.

So I took a quick look at an example using FANGS, and learned that alternate text really isn’t. Turning off images would cause the alt text to display as the appropriate heading level, but at least for FANGS, the alt text does not get surfaced as the heading (the text should show up before the colon in the screen shot below).

FANGS output with no text associated with the header

Now, obviously FANGS is an emulator, and it’s possible that a screen reader would access that alt text. But this gave me something to consider..
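For clarity, this is the behaviour I had been assuming, written as a small TypeScript sketch. It models a heading as a list of children and falls back to an image’s alt text when building the heading’s text. The `HeadingChild` type and `headingText` function are hypothetical and say nothing about how FANGS, a real screen reader, or a search engine actually works.

```typescript
// A simplified model of what can sit inside a heading.
type HeadingChild =
  | { kind: "text"; text: string }
  | { kind: "img"; alt?: string };

// Build the heading's text, using alt text in place of each image.
// Images without alt text contribute nothing.
function headingText(children: readonly HeadingChild[]): string {
  const parts: string[] = [];
  for (const child of children) {
    const piece = child.kind === "text" ? child.text : child.alt || "";
    if (piece.trim() !== "") parts.push(piece.trim());
  }
  return parts.join(" ");
}
```

Under this assumption, the heading in my example would read as “descriptive text”; the FANGS output suggests that assumption does not hold everywhere.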

RIA - Rich Internet Accessibility?

3 Comments »

It is interesting to be getting into the RIA realm. I see the benefits on several levels (‘stickiness’ from a business perspective, general usability), but I am also very aware of the challenges.

I have a background in accessibility, so that is a large concern for me even as I want to turn to using these new technologies. I engaged a co-worker who works in the search arena, to ask him about ajax/flex/flash. His first response was the same as the accessibility response, that they’re less than ideal. What struck me, however, was the difference in how to deal with the issues.

One thing I need to remember is that there is SOME difference between the two. We were looking at rapidly changing content — that content doesn’t need to be ‘understood’ by a search engine. However, it does need to be available to users of assistive technologies. From a search perspective, a hybrid site in which the ‘main’ content is text, and therefore indexable, is ok. It’s not from an accessibility standpoint.

Avenue-A Razorfish just published an article on SOFA (Search Optimized Flash Architecture). The idea is that the content is written to the page in XHTML and then presented via Flash. The concern is that this could be considered to be cloaking. However, if the content is the same, simply presented in a different format, this should be immaterial.

I am really enjoying my work. I’m learning plenty, and am also finding there are plenty of opportunities to share my thoughts and opinions and explore other areas as well. I don’t know if it is partially due to the small size of the company and the tremendous growth they’re undergoing that processes are still fluid and there is opportunity to make an impact, but it is definitely a great environment for me!

In other news, if anyone is an information architect or a flex developer looking for a new job with a fantastic company, drop me a line… !
