Screen Scraping with JavaScript, Firebug, Greasemonkey and Google Spreadsheets

Most of the web page scrapers I have seen are written in Perl, mostly because of the power of its regular expressions. However, when it comes to parsing web pages, nothing is friendlier than JavaScript.
The pages to scrape are usually like this or this.
With the power of regular expressions inherited from Perl, and DOM parsing from jQuery and its likes, scraping data out of web pages becomes a lot easier. This article outlines the techniques that make screen scraping with JavaScript easy.
Writing a web page scraper usually involves the following steps.
  1. Identification : identifying the page elements to scrape
  2. Selection : getting data out of the selected nodes
  3. Execution : running the code in the context of the page
  4. Submission : saving the parsed data so that it can be used later
  5. Timeouts : introducing delays so that the server is not overwhelmed
Identifying nodes can get tricky, but with Firebug it's a simple point-and-click exercise. Firebug gives us the entire XPath to the node in question. Web scraping usually involves picking up data from a structured document that has elements repeated in some pattern. "Inspecting" elements reveals that pattern, usually a common class name or hierarchy.
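For example, the XPath that Firebug reports can be pasted into document.evaluate in the console to confirm it really matches the nodes you expect. A minimal sketch, assuming a hypothetical listing table:

```javascript
// Hypothetical XPath copied from Firebug's "Copy XPath" for the rows of a listing.
var xpath = "//table[@id='results']/tbody/tr";

// Evaluate it against the current document and count the matches.
var result = document.evaluate(xpath, document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
console.log("Matched " + result.snapshotLength + " rows");
```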

Once identified, we can write the code to select all the elements interactively in the Firebug console. This is usually a combination of XPath expressions, getElementsByTagName, getElementsByClassName, and so on. Once you have all the elements (usually as rows), you can dive deeper into each element until you extract the innerHTML or href. This is what goes into the code.
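A rough sketch of that selection step, assuming a hypothetical class name of result-row on each record:

```javascript
// Hypothetical markup: each record is an element with class "result-row" containing a link.
var rows = document.getElementsByClassName("result-row");
var records = [];

for (var i = 0; i < rows.length; i++) {
    var link = rows[i].getElementsByTagName("a")[0];   // first anchor in the row
    records.push({
        title: link ? link.innerHTML : "",             // text/markup of the link
        href:  link ? link.href      : ""              // absolute URL of the link
    });
}
console.log(records);
```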

Once you have the code returning useful data (checked using the Firebug console), you need a way to run it on the page, and to load the next page for parsing once the current page is done. The JavaScript could be injected with a bookmarklet, but that would require the user to click the bookmarklet on every page. I chose to add the small Greasemonkey header and convert these scripts into Greasemonkey scripts. This ensures that we parse the current page and move to the next automatically.
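The Greasemonkey metadata block is just a comment at the top of the script; the name and @include pattern below are placeholders:

```javascript
// ==UserScript==
// @name        Listing scraper
// @namespace   http://example.com/scraper
// @description Scrapes the listing and posts it to a Google Spreadsheets form
// @include     http://www.example.com/listing*
// ==/UserScript==

// The scraping code goes here. It runs automatically on every page that
// matches the @include pattern, so pagination works without any clicking
// once the script navigates to the next page.
```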

One reason why most people don't use JavaScript for parsing is its inability to store the parsed data. This is where Google Spreadsheets comes to the rescue. Google Spreadsheets lets us create forms to which we can POST data. The script needs to create a form with its action set to a URL that resembles "http://spreadsheets.google.com/formResponse?formkey=yourkey". You also have to create input elements with names like entry.0.single, entry.1.single and so on. You can check the actual data submitted to the spreadsheet using Tamper Data. We now have all our data in a spreadsheet, which gives us sorting and filtering for free.
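A sketch of that submission step, assuming the record is already an array of values; the form key and the number of entry.N.single fields are placeholders you get from your own form, and posting into a hidden iframe is one way to keep the page being scraped from navigating away:

```javascript
// Submit one parsed record to the Google Spreadsheets form.
// "yourkey" and the entry.N.single names must match your own form.
function postToSpreadsheet(values) {
    var iframe = document.createElement("iframe");
    iframe.name = "gs_target";
    iframe.style.display = "none";
    document.body.appendChild(iframe);

    var form = document.createElement("form");
    form.method = "POST";
    form.action = "http://spreadsheets.google.com/formResponse?formkey=yourkey";
    form.target = "gs_target";   // post into the hidden iframe, not the current page

    for (var i = 0; i < values.length; i++) {
        var input = document.createElement("input");
        input.type = "hidden";
        input.name = "entry." + i + ".single";   // entry.0.single, entry.1.single, ...
        input.value = values[i];
        form.appendChild(input);
    }

    document.body.appendChild(form);
    form.submit();
}
```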

The last point is about avoiding tight loops and switching to timeouts instead. This ensures that you don't overload the server with requests. A good feedback mechanism, such as coloring the background of rows that are parsed successfully, is an added bonus.
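A rough sketch of that pacing, where extractAndSubmit is a placeholder for the selection and submission code above and the one-second delay is arbitrary:

```javascript
// Process rows one at a time, a second apart, instead of in a tight loop.
function processRows(rows, index) {
    if (index >= rows.length) {
        return;   // all rows done; this is where you would load the next page
    }

    extractAndSubmit(rows[index]);                // placeholder: select + POST this row
    rows[index].style.backgroundColor = "#cfc";   // feedback: mark the row as parsed

    setTimeout(function () {
        processRows(rows, index + 1);
    }, 1000);     // arbitrary one-second pause so the server isn't hammered
}

processRows(document.getElementsByClassName("result-row"), 0);
```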

To conclude, scraping data this way may require your browser to be open all the time, but some of the benefits over the command line way (that I could think of) are:

  1. Easiest way to handle the DOM: no regex, just DOM traversal
  2. Visual indication of what data is parsed, in real time
  3. Proxy and Tor configuration out of the box if your IP is blocked :)
  4. With Web Workers, complex parsing could be done a lot more easily
Writing all this from scratch is hard, so maybe you could use this template. The template is littered with placeholders where you can insert your code.