ScraperWiki
ScraperWiki is a platform for writing and scheduling screen scrapers, and for storing the data they generate.
ScraperWiki lets you write your scraper in Python, PHP or Ruby, schedule scrapes, and download the scraped data as CSV, as an SQLite3 database, or via the API. You can fork existing scrapers too, GitHub-style.
Let’s say you wanted to grab all the URLs from link posts on the front page of One Thing Well and store them. Here’s how, in Python:
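import scraperwiki
import lxml.html

# Fetch the front page and parse it into an element tree
root = scraperwiki.scrape('http://onethingwell.org/')
content = lxml.html.etree.HTML(root)

# Each link post's title is an <a> inside <article><h2>
linkage = content.xpath("/html/body/div/article/h2/a/@href")

# Save one record per URL, using the URL itself as the unique key
for link in linkage:
    record = { "link" : link }
    scraperwiki.datastore.save(["link"], record)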
You can see the results on ScraperWiki.
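If you're wondering what passing ["link"] as the unique key buys you, here's a rough local sketch using Python's built-in sqlite3 module. It assumes the datastore treats a unique key as "overwrite the existing row rather than add a duplicate"; the table name `links` and the helpers `save_links` and `load_links` are made up for illustration and aren't part of ScraperWiki.

```python
import sqlite3


def save_links(conn, links):
    """Store each URL once; saving the same URL again replaces the old row."""
    # Hypothetical table: the PRIMARY KEY plays the role of the unique key.
    conn.execute("CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR REPLACE INTO links (link) VALUES (?)",
        [(link,) for link in links],
    )
    conn.commit()


def load_links(conn):
    """Return every stored URL, sorted alphabetically."""
    return [row[0] for row in conn.execute("SELECT link FROM links ORDER BY link")]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_links(conn, ["http://example.com/a", "http://example.com/b"])
    save_links(conn, ["http://example.com/a"])  # a second run adds no duplicate
    print(load_links(conn))
```

The upshot is that a scheduled scraper can run over the same page again and again without piling up duplicate rows.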
And here’s another scraper—a bit more complex at 26 lines of code—which grabs the URLs from every link post on OTW by following the ‘Older’ link at the bottom of the page and parsing the next page until it reaches the last one.
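The follow-the-'Older'-link idea can be sketched roughly like this. It's not the 26-line scraper itself, just a minimal illustration that sticks to the standard library so it runs outside ScraperWiki. The names `LinkPostParser` and `crawl` are made up, and it assumes the pagination link sits outside the articles and has the word "Older" in its text. The `fetch` argument is any function that takes a URL and returns that page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkPostParser(HTMLParser):
    """Collect hrefs of <a> tags inside <article><h2>, plus the 'Older' link."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.in_h2 = False
        self.current_href = None  # href of the <a> currently open, if any
        self.current_text = []    # text seen inside that <a>
        self.links = []           # link-post URLs found on this page
        self.older = None         # href of the 'Older' link, if present

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True
        elif tag == "h2" and self.in_article:
            self.in_h2 = True
        elif tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []
            if self.in_h2 and self.current_href:
                self.links.append(self.current_href)

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
            self.in_h2 = False
        elif tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            text = "".join(self.current_text)
            # Assumption: the pagination link lives outside any <article>.
            if self.current_href and not self.in_article and "Older" in text:
                self.older = self.current_href
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def crawl(start_url, fetch, max_pages=100):
    """Follow 'Older' links from start_url, returning every link-post URL."""
    url, seen, links = start_url, set(), []
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkPostParser()
        parser.feed(fetch(url))
        links.extend(parser.links)
        # Stop when a page has no 'Older' link: that's the last page.
        url = urljoin(url, parser.older) if parser.older else None
    return links
```

The `seen` set and `max_pages` cap are there so a page that links back to itself can't send the scraper round in circles forever.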
I only know the very, very basics of Python, but the above scrapers were really easy to cobble together by nicking code from the ScraperWiki tutorials and a bit of trial and error.
Have a go, and let me know if you make a cool scraper.