scraper object, objects keep their own cache/etc.

    import scrapelib

    s = scrapelib.Scraper(requests_per_minute=10,
                          allow_cookies=True,
                          follow_robots=True,
                          retry_attempts=2,
                          retry_wait_seconds=5)

    # use urlopen as you would use urllib2's
    # useable as context manager to deal with scrape errors
    with s.urlopen('http://example.com') as html:
        # result object is string-like
        html[0:6] == '<html>'
        # also has a response attribute with more info
        html.response.code == 200
        html.response.headers['server'] == 'Apache 2.0.63'

https://github.com/sunlightlabs/scrapelib
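To make the retry_attempts and retry_wait_seconds settings concrete, here is a rough standard-library sketch of the retry idea. It is not scrapelib's implementation: fetch_with_retries, the fetch callable, and the fixed wait between attempts are all assumptions made for illustration, and the library's real backoff behaviour may differ.

```python
import time


def fetch_with_retries(fetch, url, retry_attempts=2, retry_wait_seconds=5,
                       sleep=time.sleep):
    """Call fetch(url); on IOError, wait and retry up to retry_attempts times.

    Hypothetical helper: `fetch` is any callable that takes a URL and returns
    the page body, and `sleep` is injectable so the wait can be skipped in tests.
    """
    attempt = 0
    while True:
        try:
            return fetch(url)
        except IOError:
            if attempt >= retry_attempts:
                # out of retries, let the caller see the error
                raise
            attempt += 1
            sleep(retry_wait_seconds)


# Usage with a fake fetcher that fails twice and then succeeds
calls = []
waits = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "http://example.com", sleep=waits.append)
print(result, len(calls), waits)
```

With retry_attempts=2 the helper makes at most three requests in total: the first try plus two retries.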
parse HTML into an lxml tree

    >>> import lxml.html
    >>> doc = lxml.html.fromstring(html)
    # possible to use css selectors (all <a> children of <p> tags)
    >>> links = doc.cssselect('p > a')
    >>> links[0].get('href')
    'http://example.com/link/0'
    # can also use xpath to dive deep
    >>> doc.xpath('//h4[text()="TEST"]/following-sibling::ul[1]/li/a[1]')

much less code compared to fragile BeautifulSoup
quite a bit faster in most cases

http://lxml.de/
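A small self-contained run of the XPath query from this slide may help show what it selects. The SAMPLE page below is made up for this sketch, and '//p/a' is used as an XPath stand-in for the 'p > a' CSS selector so the example only needs lxml itself.

```python
import lxml.html

# Hypothetical sample page, invented for this sketch
SAMPLE = """
<html><body>
  <p><a href="http://example.com/link/0">first link</a></p>
  <h4>TEST</h4>
  <ul>
    <li><a href="/a">A</a> <a href="/a-extra">A extra</a></li>
    <li><a href="/b">B</a></li>
  </ul>
  <h4>OTHER</h4>
  <ul><li><a href="/c">C</a></li></ul>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE)

# XPath equivalent of the CSS selector 'p > a': <a> elements directly inside a <p>
p_links = doc.xpath('//p/a')
first_href = p_links[0].get('href')

# first <a> inside each <li> of the first <ul> that follows the "TEST" heading
test_links = doc.xpath('//h4[text()="TEST"]/following-sibling::ul[1]/li/a[1]')
test_hrefs = [a.get('href') for a in test_links]

print(first_href)
print(test_hrefs)
```

The second query skips both the extra link in the first list item and the list under the "OTHER" heading, which is the kind of targeted selection that takes noticeably more code by hand.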
• 10 states "ready" and 11 "experimental"
  ◦ MD, LA, CA, TX, WI, NJ, VT, MN, UT, NC
  ◦ NV, SD, WA, AK, AZ, MS, FL, VA, PA, OH, DC
• Lots of data
  ◦ 3,300 legislators
  ◦ 110,000 bills
  ◦ 900,000 actions
  ◦ 91,000 votes
• new states every month
• API Docs: http://openstates.sunlightlabs.com/api/
• RESTful JSON-based API
  ◦ state metadata
  ◦ bills, legislators, committees, events
    ▪ search by attribute or lookup by ID
  ◦ legislator lookup by lat+long
• python-openstates
  http://github.com/sunlightlabs/python-openstates

    openstates.Legislator.search(state='ca', first_name='Mike')
    openstates.Bill.search('agriculture', state='vt')
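Since the API is plain REST plus JSON, a search boils down to a GET request on a URL with query parameters. The sketch below only builds such URLs and makes no network calls. The /v1/ path segment, the resource names in the URL, the q parameter, and the apikey parameter name are assumptions for illustration; check the API docs for the real ones. build_search_url is a hypothetical helper, not part of python-openstates.

```python
from urllib.parse import urlencode

# Assumed base URL: the docs address from the slide plus a guessed version segment
BASE_URL = "http://openstates.sunlightlabs.com/api/v1"


def build_search_url(resource, api_key, **filters):
    """Build a search URL for a resource such as 'bills' or 'legislators'.

    Hypothetical helper; the parameter names are passed through unchanged.
    """
    params = dict(filters)
    params["apikey"] = api_key  # assumed name of the key parameter
    # sort so the resulting URL is stable regardless of keyword order
    query = urlencode(sorted(params.items()))
    return "{0}/{1}/?{2}".format(BASE_URL, resource, query)


legislator_url = build_search_url("legislators", "YOUR_KEY",
                                  state="ca", first_name="Mike")
bill_url = build_search_url("bills", "YOUR_KEY", q="agriculture", state="vt")

print(legislator_url)
print(bill_url)
```

The two calls mirror the python-openstates searches on the slide; the client library presumably wraps this kind of request and decodes the JSON response for you.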
  the Google Group
    ▪ http://goo.gl/M3An
  ◦ Code on GitHub:
    ▪ http://github.com/sunlightlabs/openstates

We're a great project to get started with if you want to brush up on web scraping.
  ◦ Sunlight Congress API - info on all members of congress
  ◦ TransparencyData API - campaign finance information
  ◦ Real Time Congress API - bills, real time updates, etc.
  ◦ and of course, Open States API
• python libraries for all of these except RTC
• same key works across all APIs
• virtually no limits on usage ("be reasonable")
• API users mailing list: http://goo.gl/KYlKy