The pages to scrape usually look something like this or this.
With the power of regular expressions inherited from Perl, and DOM parsing in the style of jQuery and its likes, scraping data from web pages is a lot easier. This article outlines the various techniques that make screen scraping web pages easy.
Writing a web page scraper usually involves the following steps.
- Identification : identifying the page elements that hold the data
- Selection : getting data out of the selected nodes
- Execution : running the code in the context of the page
- Submission : saving the parsed data so that it can be used later (see the sketch after this list)
- Timeouts : introducing delays so that the server is not overwhelmed
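As an illustration of the Submission step, here is a minimal sketch. The endpoint URL and the shape of the records are assumptions, not anything prescribed by the page being scraped; stashing the data in localStorage works just as well.

```javascript
// A minimal sketch of the "Submission" step, run from the page being scraped.
// Assumptions: `records` is an array of plain objects built during the Selection
// step, and http://localhost:3000/save is a hypothetical endpoint you control
// (it would need to allow cross-origin requests from the scraped page).
function submit(records) {
  var xhr = new XMLHttpRequest();
  xhr.open('POST', 'http://localhost:3000/save', true);
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify(records));
}

// Alternatively, keep the data in the browser itself and export it later:
// localStorage.setItem('scraped', JSON.stringify(records));
```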
Once identified, we can work out the code to select all the elements interactively using the Firebug console. This is usually a combination of XPath expressions, getElementsByTagName, getElementsByClassName, etc. Once you have all the elements (usually as rows), you can dive deeper into each element till you extract the innerHTML or href. This is what goes into the code.
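For example, a first pass in the console might look something like the sketch below. The tag names and the fields pulled out (title, url) are assumptions about the page layout; real pages will need their own selectors or XPath expressions.

```javascript
// A minimal sketch of the Selection step, typed into the Firebug console.
var rows = document.getElementsByTagName('tr');   // or getElementsByClassName('result')
var records = [];
for (var i = 0; i < rows.length; i++) {
  var link = rows[i].getElementsByTagName('a')[0];
  if (!link) continue;                            // skip header rows and rows without links
  records.push({
    title: link.innerHTML,                        // the visible text of the link
    url: link.href                                // the target address we want to keep
  });
}
console.log(records);                             // inspect the result interactively
```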
The last point is about avoiding tight loops and switching to timeouts instead, which ensures that you don't overload the server with requests. A good feedback mechanism, something like coloring the background of fields that are parsed successfully, is an added bonus.
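A rough sketch of that pattern is below; it assumes the `rows` collection from the earlier snippet, and the 2000 ms delay is an arbitrary, polite default.

```javascript
// Process one row per tick instead of looping over everything at once.
var index = 0;
function processNext() {
  if (index >= rows.length) return;               // done
  var row = rows[index++];
  var link = row.getElementsByTagName('a')[0];
  if (link) {
    // ... extract and submit the data here ...
    row.style.backgroundColor = '#cfc';           // visual feedback: mark parsed rows green
  }
  setTimeout(processNext, 2000);                  // wait before touching the next row
}
processNext();
```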
To conclude, scraping data this way may require your browser to be open all the time, but some of the benefits over the command line approach (that I could think of) are
- Easiest way to handle the DOM, no regex, just DOM traversal
- Visual indication of what data is parsed, in real time
- Proxy and Tor configuration out of the box if your IP is blocked :)
- With web workers, complex parsing could be made a lot easier
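As a rough illustration of the last point, here is a sketch only: workers cannot touch the DOM, so the markup has to be handed over as a string, and the regex used here is purely illustrative. Blob-backed workers also need a reasonably modern browser.

```javascript
// Hand raw HTML to a worker so heavy parsing stays off the UI thread.
var workerSource =
  'onmessage = function (e) {' +
  '  var links = e.data.match(/href="[^"]+"/g) || [];' +  // crude example extraction
  '  postMessage(links);' +
  '};';
var worker = new Worker(URL.createObjectURL(new Blob([workerSource])));

worker.onmessage = function (e) {
  console.log('parsed in worker:', e.data);       // results arrive asynchronously
};

worker.postMessage(document.body.innerHTML);      // e.g. the markup of a results table
```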