Crawling the Rich.....


For a web developer [specifically, someone who respects the beauty of JavaScript] it gets disheartening when the Search Engine Optimizers come along with reports of how JavaScript has harmed the rank of the site on major search engines. All the interactivity and usability built with engineering excellence are simply ignored by these crawlers of the Web 1.0 times.
Though there have been numerous changes to search algorithms, the crawlers have remained a very static part of the equation. Indexing has improved, more machines are thrown at the problem to improve response times, and even semantic search is now becoming mainstream. If Web 2.0 is not only about HTML pages, then the crawlers of search engines had better adapt to this. The idea of PageRank was great, but links were the focal point of the technique. Unfortunately, the Web 2.0 paradigm refuses to confine HTML documents to pages with links.
It is about interactivity and usability, and since it is about information loading rather than page loading, conventional crawlers may well miss all the information that becomes available only through user actions. Though I am not trying to propose a theoretically complete crawler model, I am definitely suggesting in this blog a crawler that understands rich internet applications and ranks pages including the information obtained through user interaction.
A crawler of this genre could mimic a screen reader, performing user actions (clicks, mouse-overs for help, etc.). It could rate pertinence based on the depth of information, i.e. the number of clicks or user actions required to reach that information. Other heuristics could be developed to assess the importance that a real user would attach to such dynamic divs and information obtained through AJAX.
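To make the idea concrete, here is a minimal sketch of such a traversal. Everything in it is illustrative: the toy page model, the state names, and the decay factor are my own assumptions, not an existing crawler API. In a real implementation the action graph would come from actually executing the page's JavaScript.

```python
from collections import deque

# Toy model of a rich page: each "state" exposes some text plus the
# user actions (clicks, hovers) that lead to further states. These
# names are hypothetical placeholders for content a real crawler
# would discover by driving the page's JavaScript.
PAGE_STATES = {
    "home": {"text": "welcome", "actions": ["open_menu"]},
    "open_menu": {"text": "products pricing", "actions": ["show_pricing"]},
    "show_pricing": {"text": "pro plan details", "actions": []},
}

def crawl_with_actions(start, decay=0.5):
    """Breadth-first walk over user actions.

    Content found after more actions gets a lower relevance weight:
    depth 0 scores 1.0, depth 1 scores `decay`, depth 2 scores
    `decay` squared, and so on -- a direct encoding of the idea that
    one click is better than two.
    """
    scores = {}
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, depth = queue.popleft()
        scores[state] = decay ** depth  # weight by action depth
        for action in PAGE_STATES[state]["actions"]:
            if action not in seen:
                seen.add(action)
                queue.append((action, depth + 1))
    return scores
```

Running `crawl_with_actions("home")` weights the landing content at 1.0, content one click away at 0.5, and content two clicks away at 0.25. The decay factor is exactly the kind of fuzzy parameter the heuristics above would need to tune.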
The crawler could also build a data model to represent the data available at the site, and assign relevance based on the user actions required to fetch the data, or on the availability of the data. People may argue that the main reason JavaScript is not supported is to fight search engine spam, but a crawler designed around such a data model of the page, with user interactivity as a dimension, could automatically have logic to push spam lower in the rankings. The problems of redirects and JavaScript iframes could then be addressed through object visibility, which would be a parameter in the user-interaction dimension.
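The visibility parameter could be sketched as a simple weighting rule. This is my own hypothetical heuristic, not an existing ranking formula; the point is only that once visibility is a dimension in the data model, the classic spam pattern falls out naturally:

```python
def score_content(text_weight, visible, revealed_by_action):
    """Weight a piece of page content by its visibility status.

    Hypothetical heuristic: content that is hidden and that no user
    action can reveal is the classic keyword-stuffing spam pattern,
    so it earns nothing. Content hidden initially but revealed by a
    user action (tabs, accordions, AJAX loads) still counts, just at
    a discount. Plainly visible content counts in full.
    """
    if visible:
        return text_weight
    if revealed_by_action:
        return text_weight * 0.5   # discounted, not discarded
    return 0.0                     # invisible keyword stuffing: no credit
```

With this rule, keywords stuffed into a permanently hidden div contribute zero to the page's score, while content behind a legitimate tab or accordion is still indexed, which is exactly the behavior the data model is meant to encourage.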
This model is not perfect and still has problems. Captchas may block the flow, but that is a problem for HTML crawlers as well. Privacy could be a concern, but crawlers could obey robots.txt, as they do now. Finally, the biggest hurdle to be solved is mapping user interactivity to ranking values. This mapping involves a certain fuzziness, but it is not an impossible problem to solve. There is some low-hanging fruit [e.g. one click is better than two clicks] that could be leveraged, to start with.
I tried investigating efforts in this field, but Google did not help me much. I only found an Adobe initiative that allowed crawling Flash-based rich internet applications, and crawlers that look for JavaScript code, but nothing seemed satisfying. They were still talking in terms of links. I would appreciate it if you let me know of any such initiative you are aware of.
To conclude, I strongly feel that it is time JavaScript is given its due in the search engine world. Is this a Google killer? No; this would only serve to complement the current technologies.
Search engines should stop forcing people to play cheap tricks like stuffing keywords into invisible <div> and <noscript> tags, and should enhance credibility by looking at a page for information just as a real human would, and ranking it accordingly.