Opening Government with Python

http://openstates.sunlightlabs.com
these slides: http://goo.gl/c7U94

& lots more... Federal Open Gov Ecosystem

State Open Gov Ecosystem

PyCon 2009 OpenGov Hackathon

Scraping the 50 States
• Large volunteer effort
  ◦ 3,500 commits
  ◦ 34 contributors
  ◦ 20,000 LoC
    ▪ (Python)

Start to Finish
• Page Retrieval: scrapelib
• Parsing: lxml.html
• Validation: validictory
• Storage: pymongo (mongoDB)
• Clean up: name_tools, jellyfish
• API: Django, django-piston, boundaryservice

Scraping: scrapelib
• retrieving HTML from pages

    # instantiate a scraper object, objects keep their own cache/etc.
    import scrapelib
    s = scrapelib.Scraper(requests_per_minute=10, allow_cookies=True,
                          follow_robots=True, retry_attempts=2,
                          retry_wait_seconds=5)

    # use urlopen as you would use urllib2's
    # useable as context manager to deal with scrape errors
    with s.urlopen('http://example.com') as html:
        # result object is string-like
        html[0:6] == '<html>'
        # also has a response attribute with more info
        html.response.code == 200
        html.response.headers['server'] == 'Apache 2.0.63'

https://github.com/sunlightlabs/scrapelib
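The throttling and retry options shown on this slide can be sketched with the standard library alone. This is not scrapelib's implementation: `ThrottledFetcher` and its `fetch`, `sleep` and `clock` parameters are hypothetical names chosen for illustration, and only the option names `requests_per_minute`, `retry_attempts` and `retry_wait_seconds` come from the slide.

```python
import time


class ThrottledFetcher:
    """Hypothetical helper: spaces out requests and retries failures."""

    def __init__(self, fetch, requests_per_minute=10, retry_attempts=2,
                 retry_wait_seconds=5, sleep=time.sleep,
                 clock=time.monotonic):
        self.fetch = fetch                    # callable: url -> page body
        self.min_interval = 60.0 / requests_per_minute
        self.retry_attempts = retry_attempts
        self.retry_wait_seconds = retry_wait_seconds
        self.sleep = sleep                    # injectable for testing
        self.clock = clock                    # injectable for testing
        self._last_request = None

    def _throttle(self):
        # Wait until at least min_interval has passed since the last request.
        if self._last_request is not None:
            elapsed = self.clock() - self._last_request
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self._last_request = self.clock()

    def urlopen(self, url):
        # One initial attempt plus retry_attempts retries.
        attempts = self.retry_attempts + 1
        for attempt in range(attempts):
            self._throttle()
            try:
                return self.fetch(url)
            except OSError:                   # urllib.error.URLError is an OSError
                if attempt == attempts - 1:
                    raise
                self.sleep(self.retry_wait_seconds)
```

For real use, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`; passing the fetch function in keeps the throttling logic testable without network access.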

Parsing: lxml.html
• parsing and extracting information from HTML

    >>> import lxml.html
    # parse HTML into a lxml tree
    >>> doc = lxml.html.fromstring(html)
    # possible to use css selectors (all <a> children of <p> tags)
    >>> links = doc.cssselect('p > a')
    >>> links[0].get('href')
    'http://example.com/link/0'
    # can also use xpath to dive deep
    >>> doc.xpath('//h4[text()="TEST"]/following-sibling::ul[1]/li/a[1]')

• much less code compared to fragile BeautifulSoup
• quite a bit faster in most cases

http://lxml.de/
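To show what the `p > a` selector is doing, here is a rough standard-library equivalent built on `html.parser`. It is a swapped-in technique, not lxml, and `ParagraphLinkParser` and `paragraph_links` are hypothetical names. It assumes reasonably well-formed markup, and it is a good illustration of how much code the one-line selector saves.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ParagraphLinkParser(HTMLParser):
    """Collect href values of <a> tags that are direct children of <p>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.stack and self.stack[-1] == "p":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def paragraph_links(html):
    parser = ParagraphLinkParser()
    parser.feed(html)
    return parser.links
```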

Validation: validictory
• validation of python data structures
  ◦ based on http://json-schema.org spec

    import validictory

    schema = {"type": "object", "properties": {
        "state": {"type": "string", "minLength": 2, "maxLength": 2},
        "session": {"type": "string"},
        "chamber": {"type": "string", "enum": ["upper", "lower"]},
        "bill_id": {"type": "string"},
        "type": {"type": "array", "items": {"type": "string"}}
    }}

    # 'senate' is not in the chamber enum and bill_id is an int, not a string
    bill = {'state': 'nc', 'session': '2010', 'chamber': 'senate',
            'bill_id': 4, 'type': ['bill']}

    validictory.validate(bill, schema)  # would raise exception

http://github.com/sunlightlabs/validictory
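To make the failure on this slide concrete, the following is a minimal sketch of a validator for just the schema keywords used above (`type`, `properties`, `enum`, `minLength`, `maxLength`, `items`). It is not validictory's code: `find_errors` is a hypothetical function that collects problems in a list instead of raising, and it only checks properties that are present.

```python
TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
}


def find_errors(value, schema, path="bill"):
    """Return a list of human-readable problems; an empty list means valid."""
    expected = schema.get("type")
    if expected in TYPE_CHECKS and not TYPE_CHECKS[expected](value):
        # Wrong type: the remaining checks would not be meaningful.
        return ["%s: expected %s" % (path, expected)]

    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append("%s: %r not in %r" % (path, value, schema["enum"]))
    if isinstance(value, str):
        if len(value) < schema.get("minLength", 0):
            errors.append("%s: shorter than minLength" % path)
        if "maxLength" in schema and len(value) > schema["maxLength"]:
            errors.append("%s: longer than maxLength" % path)
    if isinstance(value, dict):
        for key, subschema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(
                    find_errors(value[key], subschema, path + "." + key))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(
                find_errors(item, schema["items"], "%s[%d]" % (path, index)))
    return errors


schema = {"type": "object", "properties": {
    "state": {"type": "string", "minLength": 2, "maxLength": 2},
    "session": {"type": "string"},
    "chamber": {"type": "string", "enum": ["upper", "lower"]},
    "bill_id": {"type": "string"},
    "type": {"type": "array", "items": {"type": "string"}},
}}

bill = {'state': 'nc', 'session': '2010', 'chamber': 'senate',
        'bill_id': 4, 'type': ['bill']}

errors = find_errors(bill, schema)   # one problem each for chamber and bill_id
```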

Name Matching: name_tools

    # split names into (title, first, last, suffix)
    >>> name_tools.split("President Barack Hussein Obama II")
    ('President', 'Barack Hussein', 'Obama', 'II')
    >>> name_tools.split("Obama, President Barack H., II")
    ('President', 'Barack H', 'Obama', 'II')
    >>> name_tools.split("Fleet Admiral William F. Halsey, Jr., USN")
    ('Fleet Admiral', 'William F', 'Halsey', 'Jr., USN')

    # score name similarity
    >>> name_tools.match('Eric Schmidt', 'Eric Schmidt')
    1.0
    >>> name_tools.match('Bob Dole', 'Dole, Bob')
    0.97999999999999998
    >>> name_tools.match('Jeff Tweedy', 'J Tweedy')
    0.90000000000000002
    >>> name_tools.match('Ferris Bueller', 'Bueller')
    0.80000000000000002

http://github.com/mikejs/name_tools
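The idea behind order-insensitive name scoring can be sketched with `difflib` from the standard library. This is not name_tools' algorithm and will not reproduce the scores above. `normalize` and `similarity` are hypothetical helpers that only handle a single "Last, First" comma and ignore titles and suffixes.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and turn 'Last, First' into 'first last'."""
    name = name.strip().lower()
    if "," in name:
        last, _, rest = name.partition(",")
        name = rest.strip() + " " + last.strip()
    return " ".join(name.replace(".", "").split())


def similarity(a, b):
    """Score two names between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With this sketch `similarity('Bob Dole', 'Dole, Bob')` is exactly 1.0, because both sides normalize to the same string; a real matcher like name_tools weighs name parts separately, which is why it can score a reordered name slightly below a literal match.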

Name Matching: jellyfish
• string comparison
  ◦ Levenshtein & Damerau-Levenshtein distance
  ◦ Jaro & Jaro-Winkler distance
  ◦ Match Rating Approach
  ◦ Hamming Distance
• Phonetic encoding
  ◦ American soundex
  ◦ Metaphone
  ◦ NYSIIS
  ◦ Match Rating Codex
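For readers new to these measures, two of the string comparisons listed above are short enough to sketch in pure Python. These are the textbook definitions, not jellyfish's code, and `levenshtein` and `hamming` are illustrative names. Note that this `hamming` follows the classic definition and rejects strings of different lengths.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))   # distances from '' to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))
```

Levenshtein distance suits misspellings with a dropped or added letter ('barak' vs 'barack'), while Hamming distance only counts substitutions at matching positions.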

Name Matching: jellyfish

    # string comparison
    >>> jellyfish.jaro_distance('barak', 'barack')
    0.94444444
    >>> jellyfish.levenshtein_distance('barak', 'barack')
    1

    # soundalikes
    >>> jellyfish.soundex('payne')
    P500
    >>> jellyfish.soundex('pain')
    P500
    >>> jellyfish.metaphone('jellyfish')
    JLFX
    >>> jellyfish.metaphone('gellyfish')
    JLFX

http://github.com/sunlightlabs/jellyfish
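Why 'payne' and 'pain' both come out as P500 is easier to see with the encoding written out. Below is a sketch of American Soundex following the commonly published rules; it is not jellyfish's implementation, and `soundex` here is a standalone function written for illustration.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Encode a word as its first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        # Skip uncoded letters and repeats of the previous code.
        if code and code != last:
            result += code
        # h and w do not separate repeated codes; vowels do.
        if c not in "hw":
            last = code
    return (result + "000")[:4]   # pad or truncate to four characters
```

Both names keep their initial P, drop the vowels and y, and encode the single n as 5, so they collapse to the same key.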

Project Status
• Public beta of our API
  ◦ http://openstates.sunlightlabs.com/api/
• 10 states "ready" and 11 "experimental"
  ◦ MD, LA, CA, TX, WI, NJ, VT, MN, UT, NC
  ◦ NV, SD, WA, AK, AZ, MS, FL, VA, PA, OH, DC
• Lots of data
  ◦ 3,300 legislators
  ◦ 110,000 bills
  ◦ 900,000 actions
  ◦ 91,000 votes
• new states every month

photo: http://www.flickr.com/photos/gottgraphicsdesign/4543701893/ (CC-BY)

Use the data
• Get Sunlight API Key @ http://services.sunlightlabs.com/
• API Docs: http://openstates.sunlightlabs.com/api/
• RESTful JSON-based API
  ◦ state metadata
  ◦ bills, legislators, committees, events
    ▪ search by attribute or lookup by ID
  ◦ legislator lookup by lat+long
• python-openstates http://github.com/sunlightlabs/python-openstates

    openstates.Legislator.search(state='ca', first_name='Mike')
    openstates.Bill.search('agriculture', state='vt')
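A "search by attribute" request against a RESTful JSON API usually amounts to a URL with the filters and the API key in the query string. The sketch below only builds such a URL. The `legislators` resource path and the `apikey` parameter name are assumptions made for illustration, as is the `build_search_url` helper; the API docs linked above are the authority on the real endpoints.

```python
from urllib.parse import urlencode


def build_search_url(base_url, resource, api_key, **filters):
    """Hypothetical helper: compose a search URL from attribute filters."""
    params = dict(filters, apikey=api_key)   # 'apikey' is an assumed name
    query = urlencode(sorted(params.items()))
    return "%s/%s/?%s" % (base_url.rstrip("/"), resource.strip("/"), query)


# 'legislators' is an assumed resource path, not confirmed by the slides.
url = build_search_url("http://openstates.sunlightlabs.com/api/",
                       "legislators", "YOUR_KEY",
                       state="ca", first_name="Mike")
```

Fetching that URL and decoding the body with `json.loads` would be the remaining step; python-openstates wraps all of this behind calls like `Legislator.search`.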

Adopt a State
  ◦ Contributor Guide
    ▪ http://openstates.sunlightlabs.com/contributing/
  ◦ Join the Google Group
    ▪ http://goo.gl/M3An
  ◦ Code on GitHub:
    ▪ http://github.com/sunlightlabs/openstates

We're a great project to get started with if you want to brush up on web scraping.

Bonus Content: Other Sunlight APIs
• Sunlight Labs APIs: http://services.sunlightlabs.com/
  ◦ Sunlight Congress API - info on all members of congress
  ◦ TransparencyData API - campaign finance information
  ◦ Real Time Congress API - bills, real time updates, etc.
  ◦ and of course, Open States API
• python libraries for all of these except RTC
• same key works across all APIs
• virtually no limits on usage ("be reasonable")
• API users mailing list: http://goo.gl/KYlKy

Thanks!
http://openstates.sunlightlabs.com/
Additional Questions?
Google Group: http://goo.gl/M3An
[email protected]
@jamesturk
this presentation: http://goo.gl/c7U94

jamesturk
