Guerilla APIs: integrating web systems that weren't designed to be integrated

Guerrilla APIs: Integrating web systems that weren’t designed to be integrated
Dr Russell Keith-Magee
DjangoCon US 2013

http://en.wikipedia.org/wiki/File:Male_gorilla_in_SF_zoo.jpg

http://en.wikipedia.org/wiki/File:Afrikaner_Commandos2.JPG

Why Guerrilla APIs?

Lessons learned

Most of the world hasn’t embraced API-centric development

Most of the world’s interesting data isn’t API accessible

If you want to use this data, you need to use unconventional tactics.

We can build a user-facing API that works the way we want it to

Tips and tricks for dealing with non-API systems

Not productized (but could be)

Data In
• RSS
• Email
• MS Word Documents
• PDF documents
• Web scraping

RSS handling
• FeedParser
  • http://code.google.com/p/feedparser/
• Timed tasks/cron to retrieve content
• Pay attention to RSS update frequency
• Pay attention to server failures
• Pay attention to object UUIDs
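
The UUID and server-failure advice can be sketched as a polling helper that remembers which entries it has already handled. This is only a sketch: `feedparser.parse()` and its `entries` and `status` fields come from FeedParser, while the helper names `entry_key`, `new_entries` and `poll`, and the fallback to `link` when a feed carries no ID, are assumptions made for illustration.

```python
def entry_key(entry):
    """Pick a stable identifier; fall back to the link if the feed has no ID."""
    return entry.get("id") or entry.get("link")


def new_entries(entries, seen):
    """Return entries not processed before, recording their keys in `seen`."""
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key is None or key in seen:
            continue  # no usable identifier, or already handled
        seen.add(key)
        fresh.append(entry)
    return fresh


def poll(url, seen):
    """Fetch a feed once; meant to be called from a timed task or cron job."""
    import feedparser  # third-party; imported here so the helpers above need only the stdlib

    parsed = feedparser.parse(url)
    # `status` is only present for HTTP fetches; on a server error, try again later.
    if parsed.get("status", 200) >= 400:
        return []
    return new_entries(parsed.entries, seen)
```

In a real deployment `seen` would be persisted between runs (a database table, for example) rather than held in memory.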

Email handling
• Low volume
  • Timed task/cron job + Python script
• High volume
  • Mail server filter scripts
  • Mailgun et al.

Email content
• Text content - Regular Expressions
• HTML content - lxml
• Attachments - depends on content type
• Emails may be recursive - “message/rfc822”
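
The recursive case can be handled with the standard library's `email` package, whose `walk()` method descends into attached `message/rfc822` messages as well as ordinary multipart containers. A minimal sketch, in which the helper name `leaf_parts` and the sample messages are made up for illustration:

```python
from email.message import EmailMessage


def leaf_parts(msg):
    """Yield (content_type, payload_bytes) for every non-container part.

    Message.walk() visits multipart/* containers and attached
    message/rfc822 messages, so forwarded emails are covered too.
    """
    for part in msg.walk():
        if part.is_multipart():
            continue  # a container: its children are visited separately
        yield part.get_content_type(), part.get_payload(decode=True)


# Sample data: an outer email carrying another email as an attachment.
inner = EmailMessage()
inner["Subject"] = "Original report"
inner.set_content("inner body")

outer = EmailMessage()
outer["Subject"] = "Fwd: Original report"
outer.set_content("outer body")
outer.add_attachment(inner)  # attached as message/rfc822

parts = list(leaf_parts(outer))
```

Each leaf can then be dispatched on its content type: regular expressions for text, lxml for HTML, and so on.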

MIME types
• Email describes its MIME type
• Each attachment is MIME typed
• Use MIME type to determine processing
• Unless sent by Microsoft tools!
• Use mimetypes.guess_type()
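
One possible reading of this advice: trust the declared type unless it is the generic catch-all, and otherwise guess from the filename. In this sketch the function name `effective_mime_type` and the choice of `application/octet-stream` as the only distrusted value are assumptions; a real system may need a longer list of types it refuses to believe.

```python
import mimetypes

GENERIC_TYPE = "application/octet-stream"


def effective_mime_type(declared, filename):
    """Return the declared type, or a filename-based guess if it is unhelpful."""
    if declared and declared != GENERIC_TYPE:
        return declared
    # guess_type() returns (type, encoding); type is None when nothing matches.
    guessed, _encoding = mimetypes.guess_type(filename or "")
    return guessed or declared or GENERIC_TYPE
```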

Word Documents
• Two different formats
• DOCX
  • “Office Open XML” - easy to parse
  • github.com/mikemaccana/python-docx
• DOC
  • Convert to PDF using OpenOffice
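
To show why DOCX is easy to parse: the file is a ZIP archive whose main body lives in `word/document.xml`. The sketch below reads paragraph text with the standard library only, rather than the python-docx package named on the slide; the function name `docx_paragraphs` and the in-memory sample document are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(file_obj):
    """Return the text of each non-empty paragraph in a DOCX file."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


# Build a minimal stand-in for a DOCX file to exercise the function.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> World</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", SAMPLE_XML)
sample.seek(0)

paragraphs = docx_paragraphs(sample)
```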

PDF Processing
• PDF is a printing format
• Internally:
  • Vector-based drawing instructions
• May contain attachments

PDF2HTML

PDFMiner
• Available on PyPI
• Not actively maintained... :-(
• ... but it works
• Can be used to extract raw drawing info

How to use PDFMiner

Using PDFMiner (1)

    from pdfminer.pdfparser import PDFParser, PDFDocument

    password = ''  # empty for unencrypted documents

    # PDF is a binary format, so open the file in binary mode
    with open('myfile.pdf', 'rb') as file_obj:
        parser = PDFParser(file_obj)
        doc = PDFDocument()
        # connect the parser and document objects
        parser.set_document(doc)
        doc.set_parser(parser)
        # supply the password for initialization
        doc.initialize(password)
        if doc.is_extractable:
            # apply the function and return the result
            data = parse_pages(doc)

Using PDFMiner (2)

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import (
        LAParams, LTFigure, LTImage, LTTextBox, LTTextLine)

    def parse_pages(doc):
        data = {}
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for i, page in enumerate(doc.get_pages()):
            # render the page, then collect the layout objects found on it
            interpreter.process_page(page)
            layout = device.get_result()
            parse_layout_objs(layout, (i + 1), data)
        return data

Using PDFMiner (3)

    def parse_layout_objs(lt_objs, page_number, data):
        for lt_obj in lt_objs:
            if isinstance(lt_obj, LTTextLine):
                # a line of text: record its bounding box and content
                # (later pdfminer releases expose the text as lt_obj.get_text())
                data.setdefault(page_number, []).append(
                    (lt_obj.bbox, lt_obj.text.strip()))
            elif isinstance(lt_obj, LTImage):
                # an image: record where it is, with no text
                data.setdefault(page_number, []).append(
                    (lt_obj.bbox, None))
            elif isinstance(lt_obj, (LTTextBox, LTFigure)):
                # containers: recurse into their children
                parse_layout_objs(lt_obj, page_number, data)

Using PDFMiner (4)

    data = {
        1: [
            ((1.0, 1.0, 2.0, 0.25), 'Hello'),
            ((3.25, 1.0, 2.0, 0.25), 'World'),
            ((1.0, 1.3, 3.5, 0.25), 'This is'),
            ((4.75, 1.3, 4.0, 0.25), 'your life.'),
        ],
        2: [...],
        ...
    }

Hello your life. This is World
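
Because the extracted fragments arrive in drawing order rather than reading order, one way to recover sentences is to group fragments into lines by vertical position and then sort each line left to right. This sketch assumes the coordinate layout of the sample data above, (x, y, width, height) with y growing down the page; real PDFMiner bounding boxes are (x0, y0, x1, y1) with the origin at the bottom-left, so the vertical sort would need reversing. The function name `reading_order` and the `line_tolerance` value are made up for illustration.

```python
def reading_order(items, line_tolerance=0.1):
    """Turn (bbox, text) pairs into lines of text in reading order."""
    lines = []  # each entry: (y of the line, [(x, text), ...])
    for bbox, text in sorted(items, key=lambda item: (item[0][1], item[0][0])):
        if text is None:
            continue  # images carry a bounding box but no text
        x, y = bbox[0], bbox[1]
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))  # same line as the previous fragment
        else:
            lines.append((y, [(x, text)]))  # start a new line
    return [" ".join(text for _, text in sorted(chunks)) for _, chunks in lines]


page = [
    ((4.75, 1.3, 4.0, 0.25), "your life."),
    ((1.0, 1.0, 2.0, 0.25), "Hello"),
    ((1.0, 1.3, 3.5, 0.25), "This is"),
    ((3.25, 1.0, 2.0, 0.25), "World"),
]
ordered = reading_order(page)
```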

Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase
Magee

Ad hoc parsers

A DSL for parsing?

Web scraping
• Tools
  • LXML
  • BeautifulSoup
  • Scrapy
  • requests
• Plenty of tutorials out there
  • http://newcoder.io/Intro-Scrape/
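
As a taste of what scraping involves, here is a sketch that pulls the rows out of an HTML table. It uses the standard library's `html.parser` so that it runs without extra packages; lxml or BeautifulSoup, as listed above, would do the same job in fewer lines. The class and function names and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of every table row into `rows`."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text chunks of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows


rows = scrape_table(
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Russell</td><td>37</td></tr></table>"
)
```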

Guerrilla RSS
• Some websites update data regularly
  • Should be RSS, but isn’t
• Many of the same problems apply
  • UUIDs are needed, but often won’t be provided
  • Frequency needs to be monitored
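
When a site provides no identifiers, one workaround is to derive your own from the fields that identify a record. The sketch below hashes a normalized form of those fields; the function names, the field names in the usage note, and the choice of SHA-256 are assumptions rather than anything prescribed by the talk.

```python
import hashlib


def synthetic_id(record, key_fields):
    """Derive a repeatable ID from the fields that identify a record."""
    # Collapse whitespace and case so cosmetic edits do not create "new" records.
    normalized = [
        " ".join(str(record.get(field, "")).split()).lower()
        for field in key_fields
    ]
    joined = "\x1f".join(normalized)  # unit separator keeps fields distinct
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def unseen_records(records, seen_ids, key_fields):
    """Return records whose synthetic ID has not been seen, updating `seen_ids`."""
    fresh = []
    for record in records:
        record_id = synthetic_id(record, key_fields)
        if record_id not in seen_ids:
            seen_ids.add(record_id)
            fresh.append(record)
    return fresh
```

For example, `unseen_records(scraped_rows, seen_ids, ["title", "date"])` would return only the rows not met on an earlier scrape. The weak point is the choice of key fields: if the site edits one of them, the record looks new.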

http://source.mozillaopennews.org/en-US/learning/sane-data-updates-are-harder-you-think-part-3/

Data out

Web forms

Just use requests
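
A sketch of what that looks like for a simple form. The `get`, `post` and `raise_for_status` calls are the requests library's own; the function name `submit_form` and the assumption that the form posts back to the URL it was served from are illustrative only.

```python
def submit_form(session, form_url, fields, timeout=30):
    """Load the form page, then POST the fields using the same session."""
    # Fetching the page first lets the session pick up any cookies the server sets.
    session.get(form_url, timeout=timeout).raise_for_status()
    response = session.post(form_url, data=fields, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of ignoring them
    return response.text
```

Called as `submit_form(requests.Session(), url, {"name": "Alex"})`, this behaves like a browser filling in and submitting the form. Passing the session in keeps the function easy to exercise with a stand-in object.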

Portals

Not just about web forms

What’s the problem?
• Server-side session state
• Dynamic page manipulation
• Dynamic form data manipulation

You need to re-implement a browser to get at this data.

How to poke a portal
• User makes an API call (“Submit invoice”)
• Queue a job
• Selenium opens a browser session
• Firefox opens a browser
• Xvfb captures the visual output
• Script pushes the same buttons as a human
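
The steps above can be sketched as a queue worker that drives a browser through Selenium. The `get`, `find_element`, `send_keys`, `click` and `quit` calls are Selenium WebDriver methods; everything else is hypothetical: the form field names, the `submit` button name, and the helper names. Starting Xvfb is left to the environment that runs the worker.

```python
import queue


def submit_invoice(driver, portal_url, invoice):
    """Drive the portal form the way a person would: load, type, click."""
    driver.get(portal_url)
    for field_name, value in invoice.items():
        # "name" is the string behind Selenium's By.NAME locator strategy.
        driver.find_element("name", field_name).send_keys(str(value))
    driver.find_element("name", "submit").click()  # hypothetical button name


def run_worker(jobs, make_driver, portal_url):
    """Process queued invoices one at a time, each in a fresh browser session."""
    processed = 0
    while True:
        try:
            invoice = jobs.get_nowait()
        except queue.Empty:
            return processed
        driver = make_driver()
        try:
            submit_invoice(driver, portal_url, invoice)
            processed += 1
        finally:
            driver.quit()  # always release the browser, even on failure


def make_firefox():
    """Open a real Firefox session (needs Selenium and Firefox installed)."""
    from selenium import webdriver  # third-party

    return webdriver.Firefox()
```

The API call that the user makes only needs to put a dictionary on the queue; `run_worker(jobs, make_firefox, portal_url)` does the slow browser work later.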

Emerging alternatives
• PhantomJS
  • http://phantomjs.org
• Ghost.py
  • http://jeanphix.me/Ghost.py/

Huge market opportunity

Assuming they even have a web form...

“Submit your completed forms via email”

“Submit your completed forms via email fax”

Reportlab

http://en.wikipedia.org/wiki/File:Afrikaner_Commandos2.JPG

Questions? http://cecinestpasun.com [email protected] @freakboy3742
