Had a few interesting eye-openers over the past few days on how to be friendly to spiders...and why you would want to be.
Spiders, of course, refer to search engine 'bots. It's interesting that many of the same things that drive me nuts are also pretty hostile to search engines and their metrics. Functionally, Google is a blind man surfing the web. A multi-billionaire blind man with millions of friends who hang on his every word.
So, I present a few arachnophobia awards: sites, designs, and practices that are remarkably toxic to spiders. The result is sites with a correspondingly low Google profile.
First place in my book goes to [link|http://www.trilogyit.com/|TrilogyIT]. The home page is simply a Flash animation. It looks nice enough when viewed...with a Flash-enabled browser. For those of us who don't have, or don't enable, Flash, there's simply nothing there. Google makes [link|http://www.google.com/search?q=site%3Atrilogyit.com+job|no reference] to any content at the site, and finds only three [link|http://www.google.com/search?hl=en&q=link%3Awww.trilogyit.com|links to it]. Unsurprisingly, I had zero results from the company in my recent job search, despite several years working with one of the key members of the company in a prior life.
Another practice that's highly toxic to Google searches is [link|http://www.amazon.com/exec/obidos/tg/browse/-/523851/ref%3Dwg%F5rb%Fb5o/002-0718869-4578410|session IDs]. They break things two ways: first, no two sweeps through a site are the same; second, mangling the session string may break the URL entirely (as in this case with Amazon -- the page linked originally was simply the default entry page). Session IDs are one of several web tricks that let a site produce a virtually unlimited number of distinct URLs for the same content. Because of the risk of turning these into spider traps (the spider enters and never leaves), many spiders simply avoid such sites.
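The spider's-side workaround is to canonicalize URLs by stripping the session string before deciding whether a page has been seen. Here's a minimal Python sketch; the parameter names in the strip-list are my own illustrative guesses, not any actual engine's rules:

```python
# Sketch: canonicalize URLs by stripping session-style query parameters,
# so a crawler sees one page instead of an unlimited family of them.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed/common session parameter names -- illustrative, not exhaustive.
SESSION_PARAMS = {"sessionid", "session_id", "sid", "phpsessid", "jsessionid"}

def canonicalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Keep only query parameters that aren't session identifiers.
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

print(canonicalize("http://example.com/page?sid=4578410&topic=spiders"))
# → http://example.com/page?topic=spiders
```

Two crawls through the same site then collapse to the same set of canonical URLs, which is exactly what the "no two sweeps are the same" problem destroys.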
[link|http://judiciary.senate.gov/beta/|Text as images] is another classic faux pas. Google doesn't OCR (though it does [link|http://www.google.com/search?as_q=file&num=20&btnG=Google+Search&as_epq=&as_oq=&as_eq=&lr=&as_ft=i&as_filetype=pdf&as_qdr=all&as_occt=any&as_dt=i&as_sitesearch=&safe=off|handle PDFs] and other formats), so any keywords presented as graphics might as well be screen noise.
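From the spider's side, the failure is easy to demonstrate mechanically: an image with no alt text contributes zero indexable words. A small Python sketch of that check (the sample markup is made up for illustration):

```python
# Sketch of the spider's-eye view: flag <img> tags with no alt text.
# To a text-only indexer, such images carry no keywords at all.
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []  # src attributes of images lacking alt text

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            d = dict(attrs)
            if not d.get("alt"):
                self.missing.append(d.get("src", "?"))

checker = AltChecker()
checker.feed('<p>Jobs at <img src="jobs-banner.gif"> our firm</p>')
print(checker.missing)  # → ['jobs-banner.gif']
```

Run over a page that renders its headings as graphics, every one of those headings shows up in the `missing` list -- and in Google's index as nothing at all.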
My thought is that, with all the noise about standards and accessibility, it's ultimately going to be search engines that drive these issues. Accessible, standards-compliant websites will be more valuable than any chrome -- Flash, animated gifs, sound, Java, or Javascript. What might make things particularly interesting is if Google stated a clear preference for W3C-conformant HTML and XHTML -- after all, it makes their spidering process easier. Might improve things around here....