
Opening Government with Python

jamesturk
September 26, 2011

OpenStates technical presentation


Transcript

  1. Scraping the 50 States
     • Large volunteer effort
       ◦ 3500 commits
       ◦ 34 contributors
       ◦ 20,000 LoC
         ▪ (Python)
  2. Start to Finish
     • Page Retrieval: scrapelib
     • Parsing: lxml.html
     • Validation: validictory
     • Storage: pymongo (MongoDB)
     • Clean up: name_tools, jellyfish
     • API: Django, django-piston, boundaryservice
  3. Scraping: scrapelib
     • retrieving HTML from pages

         # instantiate a scraper object; each object keeps its own cache, etc.
         import scrapelib
         s = scrapelib.Scraper(requests_per_minute=10, allow_cookies=True,
                               follow_robots=True, retry_attempts=2,
                               retry_wait_seconds=5)

         # use urlopen as you would use urllib2's
         # usable as a context manager to deal with scrape errors
         with s.urlopen('http://example.com') as html:
             # result object is string-like
             html[0:6] == '<html>'
             # also has a response attribute with more info
             html.response.code == 200
             html.response.headers['server'] == 'Apache 2.0.63'

     https://github.com/sunlightlabs/scrapelib
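The throttling and retry options on this slide can be illustrated without scrapelib at all. What follows is a minimal standard-library sketch of the idea, not scrapelib's implementation: `ThrottledFetcher`, `fetch_with_urllib`, and the injectable `clock` and `sleep` parameters are hypothetical names introduced here, and only the option names `requests_per_minute`, `retry_attempts`, and `retry_wait_seconds` come from the slide.

```python
import time
import urllib.request


def fetch_with_urllib(url):
    """Hypothetical default fetcher: return the page body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class ThrottledFetcher:
    """Spaces out requests and retries failed ones (illustrative only)."""

    def __init__(self, fetch=fetch_with_urllib, requests_per_minute=10,
                 retry_attempts=2, retry_wait_seconds=5,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch
        self.min_interval = 60.0 / requests_per_minute
        self.retry_attempts = retry_attempts
        self.retry_wait_seconds = retry_wait_seconds
        self.clock = clock
        self.sleep = sleep
        self._last_request = None

    def _throttle(self):
        # Wait until at least min_interval has passed since the last request.
        if self._last_request is not None:
            wait = self.min_interval - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)
        self._last_request = self.clock()

    def get(self, url):
        # One initial try plus retry_attempts retries.
        for attempt in range(self.retry_attempts + 1):
            self._throttle()
            try:
                return self.fetch(url)
            except OSError:  # urllib.error.URLError is a subclass of OSError
                if attempt == self.retry_attempts:
                    raise
                self.sleep(self.retry_wait_seconds)
```

Passing the fetch function, clock, and sleep in as parameters is a design choice made for this sketch so that the timing logic can be exercised without touching the network.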
  4. Parsing: lxml.html
     • parsing and extracting information from HTML

         # parse HTML into an lxml tree
         >>> import lxml.html
         >>> doc = lxml.html.fromstring(html)
         # possible to use CSS selectors (all <a> children of <p> tags)
         >>> links = doc.cssselect('p > a')
         >>> links[0].get('href')
         'http://example.com/link/0'
         # can also use XPath to dive deep
         >>> doc.xpath('//h4[text()="TEST"]/following-sibling::ul[1]/li/a[1]')

     • much less code compared to fragile BeautifulSoup
     • quite a bit faster in most cases
     http://lxml.de/
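To make concrete what the `p > a` selector picks out, here is a rough standard-library equivalent. It is a sketch only and is not how lxml works internally: `ChildLinkCollector` and `links_in_paragraphs` are hypothetical names, and the parser assumes reasonably well-formed markup (for example, it does not handle implicitly closed `<p>` tags the way a real HTML parser does).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ChildLinkCollector(HTMLParser):
    """Collects href values of <a> tags that are direct children of <p>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # "p > a" means the immediate parent of the <a> is a <p>.
        if tag == "a" and self.open_tags and self.open_tags[-1] == "p":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def links_in_paragraphs(html):
    parser = ChildLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

The amount of bookkeeping needed for a single selector is a fair illustration of why the slide recommends lxml for this step.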
  5. Validation: validictory
     • validation of Python data structures
       ◦ based on the http://json-schema.org spec

         import validictory

         schema = {"type": "object", "properties": {
             "state": {"type": "string", "minLength": 2, "maxLength": 2},
             "session": {"type": "string"},
             "chamber": {"type": "string", "enum": ["upper", "lower"]},
             "bill_id": {"type": "string"},
             "type": {"type": "array", "items": {"type": "string"}}
         }}

         bill = {'state': 'nc', 'session': '2010', 'chamber': 'senate',
                 'bill_id': 4, 'type': ['bill']}

         # would raise an exception: 'senate' is not in the chamber enum
         # and bill_id is an int rather than a string
         validictory.validate(bill, schema)

     http://github.com/sunlightlabs/validictory
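To see why the example bill is rejected, the checks involved can be spelled out in plain Python. This is a minimal sketch covering only the schema keywords used on the slide (`type`, `enum`, `minLength`, `maxLength`, `properties`, `items`); `collect_errors` is a hypothetical helper, it ignores missing properties and most of the JSON Schema spec, and unlike `validictory.validate` it gathers messages into a list instead of raising.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def collect_errors(value, schema, path="bill"):
    """Return a list of human-readable problems found in `value`."""
    errors = []
    expected = schema.get("type")
    if expected in TYPE_CHECKS and not TYPE_CHECKS[expected](value):
        errors.append(f"{path}: expected {expected}, got {type(value).__name__}")
        return errors  # the remaining checks assume the right type
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} is not one of {schema['enum']}")
    if isinstance(value, str):
        if len(value) < schema.get("minLength", 0):
            errors.append(f"{path}: shorter than {schema['minLength']}")
        if "maxLength" in schema and len(value) > schema["maxLength"]:
            errors.append(f"{path}: longer than {schema['maxLength']}")
    if isinstance(value, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(
                    collect_errors(value[name], subschema, f"{path}.{name}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(
                collect_errors(item, schema["items"], f"{path}[{index}]"))
    return errors


schema = {"type": "object", "properties": {
    "state": {"type": "string", "minLength": 2, "maxLength": 2},
    "session": {"type": "string"},
    "chamber": {"type": "string", "enum": ["upper", "lower"]},
    "bill_id": {"type": "string"},
    "type": {"type": "array", "items": {"type": "string"}},
}}

bill = {"state": "nc", "session": "2010", "chamber": "senate",
        "bill_id": 4, "type": ["bill"]}

for problem in collect_errors(bill, schema):
    print(problem)
```

Changing `chamber` to `"upper"` and `bill_id` to a string such as `"SB 4"` clears both problems in this sketch.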
  6. Name Matching: name_tools

         >>> import name_tools
         # split names into (title, first, last, suffix)
         >>> name_tools.split("President Barack Hussein Obama II")
         ('President', 'Barack Hussein', 'Obama', 'II')
         >>> name_tools.split("Obama, President Barack H., II")
         ('President', 'Barack H', 'Obama', 'II')
         >>> name_tools.split("Fleet Admiral William F. Halsey, Jr., USN")
         ('Fleet Admiral', 'William F', 'Halsey', 'Jr., USN')
         # score name similarity
         >>> name_tools.match('Eric Schmidt', 'Eric Schmidt')
         1.0
         >>> name_tools.match('Bob Dole', 'Dole, Bob')
         0.97999999999999998
         >>> name_tools.match('Jeff Tweedy', 'J Tweedy')
         0.90000000000000002
         >>> name_tools.match('Ferris Bueller', 'Bueller')
         0.80000000000000002

     http://github.com/mikejs/name_tools
  7. Name Matching: jellyfish
     • String comparison
       ◦ Levenshtein & Damerau-Levenshtein distance
       ◦ Jaro & Jaro-Winkler distance
       ◦ Match Rating Approach
       ◦ Hamming distance
     • Phonetic encoding
       ◦ American Soundex
       ◦ Metaphone
       ◦ NYSIIS
       ◦ Match Rating Codex
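As an illustration of the first comparison in the list, here is a plain-Python sketch of Levenshtein distance using the standard dynamic-programming table, kept to two rows. It is not jellyfish's code; the function below is a standalone example that only happens to share the name of the jellyfish call.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character inserts, deletes, or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("barak", "barack"))  # one insertion apart
```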
  8. Name Matching: jellyfish

         >>> import jellyfish
         # string comparison
         >>> jellyfish.jaro_distance('barak', 'barack')
         0.94444444
         >>> jellyfish.levenshtein_distance('barak', 'barack')
         1
         # soundalikes
         >>> jellyfish.soundex('payne')
         P500
         >>> jellyfish.soundex('pain')
         P500
         >>> jellyfish.metaphone('jellyfish')
         JLFX
         >>> jellyfish.metaphone('gellyfish')
         JLFX

     http://github.com/sunlightlabs/jellyfish
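The reason 'payne' and 'pain' collapse to the same code is easier to see with the encoding rules written out. Below is a sketch of American Soundex in plain Python, written from the commonly published rules rather than from jellyfish's source, so edge-case behaviour may differ from the library.

```python
# Consonant groups that share a Soundex digit; vowels, h, w, y have no digit.
CODES = {}
for digit, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Encode a name as its first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets this, so a repeat after it counts
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("payne"), soundex("pain"))
```

Only the consonant skeleton survives the encoding, which is what makes it useful for matching legislator names that are spelled inconsistently across sources.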
  9. Project Status
     • Public beta of our API
       ◦ http://openstates.sunlightlabs.com/api/
     • 10 states "ready" and 11 "experimental"
       ◦ ready: MD, LA, CA, TX, WI, NJ, VT, MN, UT, NC
       ◦ experimental: NV, SD, WA, AK, AZ, MS, FL, VA, PA, OH, DC
     • Lots of data
       ◦ 3,300 legislators
       ◦ 110,000 bills
       ◦ 900,000 actions
       ◦ 91,000 votes
     • New states every month
  10. Use the data
     • Get a Sunlight API key @ http://services.sunlightlabs.com/
     • API Docs: http://openstates.sunlightlabs.com/api/
     • RESTful JSON-based API
       ◦ state metadata
       ◦ bills, legislators, committees, events
         ▪ search by attribute or lookup by ID
       ◦ legislator lookup by lat+long
     • python-openstates
       http://github.com/sunlightlabs/python-openstates

         openstates.Legislator.search(state='ca', first_name='Mike')
         openstates.Bill.search('agriculture', state='vt')
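For readers not using python-openstates, a search like the ones above comes down to an HTTP GET with query parameters. The sketch below only builds the URL and makes no request. The `/api/v1/` path, the `legislators` resource name, and the `apikey` parameter are assumptions not stated on the slide, and `build_search_url` is a hypothetical helper; check the API docs linked above for the actual endpoints.

```python
from urllib.parse import urlencode

# Assumed versioned root; the slide only gives .../api/ as the docs URL.
API_ROOT = "http://openstates.sunlightlabs.com/api/v1/"


def build_search_url(resource, api_key, **filters):
    """Build a search URL for a resource such as 'legislators' (assumed name)."""
    params = dict(sorted(filters.items()))  # stable parameter order
    params["apikey"] = api_key              # assumed parameter name
    return f"{API_ROOT}{resource}/?{urlencode(params)}"


print(build_search_url("legislators", "YOUR_KEY", state="ca", first_name="Mike"))
```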
  11. Adopt a State
       ◦ Contributor Guide
         ▪ http://openstates.sunlightlabs.com/contributing/
       ◦ Join the Google Group
         ▪ http://goo.gl/M3An
       ◦ Code on GitHub:
         ▪ http://github.com/sunlightlabs/openstates
     We're a great project to get started with if you want to brush up on web scraping.
  12. Bonus Content: Other Sunlight APIs
     • Sunlight Labs APIs: http://services.sunlightlabs.com/
       ◦ Sunlight Congress API - info on all members of Congress
       ◦ TransparencyData API - campaign finance information
       ◦ Real Time Congress API - bills, real time updates, etc.
       ◦ and of course, Open States API
     • Python libraries for all of these except RTC
     • same key works across all APIs
     • virtually no limits on usage ("be reasonable")
     • API users mailing list: http://goo.gl/KYlKy