Guerrilla APIs: integrating web systems that weren't designed to be integrated

Slide 1

Slide 1 text

Guerrilla APIs: Integrating web systems that weren’t designed to be integrated
Dr Russell Keith-Magee
DjangoCon US 2013

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

http://en.wikipedia.org/wiki/File:Male_gorilla_in_SF_zoo.jpg

Slide 5

Slide 5 text

http://en.wikipedia.org/wiki/File:Afrikaner_Commandos2.JPG

Slide 6

Slide 6 text

Why Guerrilla APIs?

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Lessons learned

Slide 14

Slide 14 text

Most of the world hasn’t embraced API-centric development

Slide 15

Slide 15 text

Most of the world’s interesting data isn’t API accessible

Slide 16

Slide 16 text

If you want to use this data, you need to use unconventional tactics.

Slide 17

Slide 17 text

We can build a user-facing API that works the way we want it to

Slide 18

Slide 18 text

Tips and tricks for dealing with non-API systems

Slide 19

Slide 19 text

Not productized (but could be)

Slide 20

Slide 20 text

Data In
• RSS
• Email
• MS Word Documents
• PDF documents
• Web scraping

Slide 21

Slide 21 text

RSS handling
• FeedParser
  • http://code.google.com/p/feedparser/
• Timed tasks/cron to retrieve content
• Pay attention to RSS update frequency
• Pay attention to server failures
• Pay attention to object UUIDs
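The point about object UUIDs can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree` rather than FeedParser so that it runs without extra dependencies; the function name `new_items`, the sample feed, and the fallback from `<guid>` to `<link>` are assumptions made for illustration, not something taken from the talk.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, seen_ids):
    """Return (uid, title) pairs for items not seen before and record them.

    `seen_ids` is a set that a real poller would persist between cron runs.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer the feed's own <guid>; fall back to <link> when it is missing
        # (an assumption: some feeds provide neither, and need another key).
        uid = item.findtext("guid") or item.findtext("link")
        if uid is None or uid in seen_ids:
            continue
        seen_ids.add(uid)
        fresh.append((uid, item.findtext("title", default="")))
    return fresh


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.com/1</link><guid>urn:example:1</guid></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""
```

Calling `new_items(SAMPLE_FEED, seen)` twice with the same `seen` set yields both items the first time and nothing the second, which is the behaviour a timed task needs to avoid re-importing old entries.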

Slide 22

Slide 22 text

Email handling
• Low volume
  • Timed task/cron job + Python script
• High volume
  • Mail server filter scripts
  • Mailgun et al.

Slide 23

Slide 23 text

Email content
• Text content - Regular Expressions
• HTML content - lxml
• Attachments - depends on content type
• Emails may be recursive - “message/rfc822”
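A rough sketch of the dispatch described above, using only the standard library's `email` package. The function name `classify_parts` and the handler labels ("regex", "lxml", "attachment") are hypothetical placeholders for whatever processing each part would actually receive. `Message.walk()` descends into nested `message/rfc822` parts, so forwarded emails are covered without special handling.

```python
from email.message import EmailMessage


def classify_parts(msg):
    """Return (content_type, handler_label) for every leaf part of a message."""
    found = []
    for part in msg.walk():
        if part.is_multipart():
            # Containers (multipart/*, message/rfc822) only hold other parts.
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            handler = "regex"        # text content: regular expressions
        elif ctype == "text/html":
            handler = "lxml"         # HTML content: an HTML parser such as lxml
        else:
            handler = "attachment"   # anything else: decide by content type
        found.append((ctype, handler))
    return found


def build_sample_message():
    """Build a hypothetical email that carries another email as an attachment."""
    inner = EmailMessage()
    inner["Subject"] = "Inner"
    inner.set_content("inner text")
    inner.add_alternative("<p>inner html</p>", subtype="html")

    outer = EmailMessage()
    outer["Subject"] = "Outer"
    outer.set_content("outer text")
    outer.add_attachment(inner)  # attached as message/rfc822
    return outer
```

For the sample message, the walk visits the outer text body first and then the text and HTML alternatives of the attached email.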

Slide 24

Slide 24 text

MIME types
• Email describes its MIME type
• Each attachment is MIME typed
• Use MIME type to determine processing
• Unless sent by Microsoft tools!
• Use mimetypes.guess_type()
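One way to apply the `mimetypes.guess_type()` advice is sketched below. It assumes the problem shows up as a generic declared type such as `application/octet-stream`; that assumption, the `GENERIC_TYPES` set, and the function name `effective_type` are illustrative and not from the talk.

```python
import mimetypes

# Declared types treated as "tells us nothing" (an assumption for this sketch).
GENERIC_TYPES = {"application/octet-stream"}


def effective_type(declared_type, filename):
    """Pick a MIME type, preferring the declared one unless it is generic."""
    declared = (declared_type or "").lower()
    if declared and declared not in GENERIC_TYPES:
        return declared
    # Fall back to guessing from the attachment's file name.
    guessed, _encoding = mimetypes.guess_type(filename or "")
    return guessed or "application/octet-stream"
```

A caller would pass the part's declared content type and its attachment file name, then dispatch on the returned value.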

Slide 25

Slide 25 text

Word Documents
• Two different formats
• DOCX
  • “Office Open XML” - easy to parse
  • github.com/mikemaccana/python-docx
• DOC
  • Convert to PDF using OpenOffice
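To show why DOCX is easy to parse, here is a minimal sketch that reads paragraph text with the standard library only, instead of python-docx. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`. The function names and the tiny in-memory sample document are assumptions for illustration; real documents carry many more parts (styles, tables, headers) that this ignores.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(file_obj):
    """Return the plain text of each paragraph in a DOCX file object."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph is split into runs; each <w:t> holds a text fragment.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs


def build_sample_docx(paragraphs):
    """Build a stripped-down, hypothetical DOCX in memory for demonstration."""
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % escape(text) for text in paragraphs
    )
    document = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer
```

The sample builder exists only so the reader can try `docx_paragraphs` without a Word file at hand; it does not produce a document that Word itself would open.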

Slide 26

Slide 26 text

PDF Processing
• PDF is a printing format
• Internally:
  • Vector-based drawing instructions
  • May contain attachments

Slide 27

Slide 27 text

PDF2HTML

Slide 28

Slide 28 text

PDFMiner
• Available on PyPI
• Not actively maintained... :-(
• ... but it works
• Can be used to extract raw drawing info

Slide 29

Slide 29 text

How to use PDFMiner

Slide 30

Slide 30 text

Using PDFMiner (1)

from pdfminer.pdfparser import PDFParser, PDFDocument

# Assumption: an empty password, as used for unencrypted documents
password = ''

# PDF is a binary format, so open the file in binary mode
with open('myfile.pdf', 'rb') as file_obj:
    parser = PDFParser(file_obj)
    doc = PDFDocument()
    # connect the parser and document objects
    parser.set_document(doc)
    doc.set_parser(parser)
    # supply the password for initialization
    doc.initialize(password)
    if doc.is_extractable:
        # extract the layout data from every page
        data = parse_pages(doc)

Slide 31

Slide 31 text

Using PDFMiner (2)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextLine, LTTextBox, LTImage, LTFigure

def parse_pages(doc):
    # maps page number -> list of (bounding box, text) entries
    data = {}
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for i, page in enumerate(doc.get_pages()):
        interpreter.process_page(page)
        # the aggregator collects the layout objects drawn on this page
        layout = device.get_result()
        parse_layout_objs(layout, (i + 1), data)
    return data

Slide 32

Slide 32 text

Using PDFMiner (3)

def parse_layout_objs(lt_objs, page_number, data):
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTTextLine):
            # a line of text: record its position and content
            data.setdefault(page_number, []).append(
                (lt_obj.bbox, lt_obj.text.strip()))
        elif isinstance(lt_obj, LTImage):
            # an image: record its position only
            data.setdefault(page_number, []).append(
                (lt_obj.bbox, None))
        elif isinstance(lt_obj, (LTTextBox, LTFigure)):
            # containers: recurse into the objects they hold
            parse_layout_objs(lt_obj, page_number, data)

Slide 33

Slide 33 text

Using PDFMiner (4)

data = {
    1: [
        ((1.0, 1.0, 2.0, 0.25), 'Hello'),
        ((3.25, 1.0, 2.0, 0.25), 'World'),
        ((1.0, 1.3, 3.5, 0.25), 'This is'),
        ((4.75, 1.3, 4.0, 0.25), 'your life.'),
    ],
    2: [...],
    ...
}
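Because the extracted fragments carry positions rather than a reading order, some post-processing is usually needed. The sketch below groups fragments into lines and sorts each line left to right. It treats each box as (x, y, width, height) with y growing down the page, which matches the illustrative numbers on the slide but is an assumption; PDFMiner's real `bbox` values use a different convention, so the sort key would need adapting. The name `reading_order` and the `line_tolerance` parameter are hypothetical.

```python
def reading_order(fragments, line_tolerance=0.1):
    """Join ((x, y, w, h), text) fragments into lines of text, top to bottom."""
    # Drop entries without text (images were stored with None).
    textual = [(bbox, text) for bbox, text in fragments if text is not None]
    # Sort by vertical position first, then horizontal position.
    ordered = sorted(textual, key=lambda frag: (frag[0][1], frag[0][0]))

    lines = []  # each entry: (y of the line, [(x, text), ...])
    for bbox, text in ordered:
        x, y = bbox[0], bbox[1]
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            # Close enough vertically: same line as the previous fragment.
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(text for _x, text in sorted(items)) for _y, items in lines]


# The slide's sample fragments, deliberately shuffled.
SHUFFLED = [
    ((1.0, 1.0, 2.0, 0.25), 'Hello'),
    ((4.75, 1.3, 4.0, 0.25), 'your life.'),
    ((1.0, 1.3, 3.5, 0.25), 'This is'),
    ((3.25, 1.0, 2.0, 0.25), 'World'),
]
```

With the shuffled sample, `reading_order(SHUFFLED)` returns `['Hello World', 'This is your life.']`. Tabular layouts need more than this, since columns have to be detected as well as rows.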

Slide 34

Slide 34 text

Hello your life. This is World

Slide 35

Slide 35 text

Hello your life. This is World

Slide 36

Slide 36 text

Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase Magee

Slide 37

Slide 37 text

Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase Magee

Slide 38

Slide 38 text

Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase Magee

Slide 39

Slide 39 text

Ad hoc parsers

Slide 40

Slide 40 text

A DSL for parsing?

Slide 41

Slide 41 text

Web scraping
• Tools
  • lxml
  • BeautifulSoup
  • Scrapy
  • requests
• Plenty of tutorials out there
  • http://newcoder.io/Intro-Scrape/

Slide 42

Slide 42 text

Guerrilla RSS
• Some websites update data regularly
• Should be RSS, but isn’t
• Many of the same problems apply
• UUIDs are needed, but often aren’t provided
• Frequency needs to be monitored
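When a scraped page provides no identifiers, one option is to synthesize a stable one from the content itself. This is a sketch under assumptions: the choice of fields (source URL, title, published date), the whitespace and case normalization, and the name `synthetic_uid` are all illustrative. An item whose title is later edited will get a new identifier, which may or may not be acceptable.

```python
import hashlib


def synthetic_uid(source_url, title, published):
    """Derive a repeatable identifier for a scraped item that has no UUID."""
    # Collapse whitespace and case so cosmetic markup changes keep the same id.
    normalized = "\n".join(
        " ".join(part.split()).lower() for part in (source_url, title, published)
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

Storing these identifiers lets a scraper recognize items it has already imported, in the same way a feed reader uses item GUIDs.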

Slide 43

Slide 43 text

http://source.mozillaopennews.org/en-US/learning/sane-data-updates-are-harder-you-think-part-3/

Slide 44

Slide 44 text

Data out

Slide 45

Slide 45 text

Web forms

Slide 46

Slide 46 text

Just use requests
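Posting to a web form often means echoing back hidden fields (such as CSRF tokens) that the server put in the page. The sketch below covers only that preparation step, using the standard library's `html.parser`; the class and function names and the sample form are hypothetical. The resulting dictionary is the kind of value that could then be passed as `data=` to `requests.post`, ideally on the same session that fetched the form.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without "=value".
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_payload(form_html, user_values):
    """Merge the form's hidden fields with the values we want to submit."""
    parser = HiddenFieldParser()
    parser.feed(form_html)
    parser.close()
    payload = dict(parser.fields)
    payload.update(user_values)
    return payload


# Hypothetical form markup used only to exercise the functions.
SAMPLE_FORM = """<form action="/invoices" method="post">
<input type="hidden" name="csrf_token" value="abc123">
<input type="text" name="amount">
<input type="submit" value="Send">
</form>"""
```

This works for static forms; pages that build or alter their fields with JavaScript need the browser-driven approach instead.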

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Portals

Slide 49

Slide 49 text

Not just about web forms

Slide 50

Slide 50 text

What’s the problem?
• Server-side session state
• Dynamic page manipulation
• Dynamic form data manipulation

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

You need to re-implement a browser to get at this data.

Slide 55

Slide 55 text

How to poke a portal
• User makes an API call (“Submit invoice”)
• Queue a job
• Selenium opens a browser session
• Firefox opens a browser
• Xvfb captures the visual output
• Script pushes the same buttons as a human
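The "queue a job" half of the steps above can be sketched with the standard library alone. The browser-driving step is deliberately left as a callable, `push_buttons`, which is a hypothetical placeholder for the Selenium script; everything here (function names, job fields, the in-process queue) is an assumption for illustration, and a production system would more likely use a persistent task queue.

```python
import queue
import threading


def run_worker(jobs, results, push_buttons):
    """Take jobs off the queue and hand each one to the browser script."""
    while True:
        job = jobs.get()
        if job is None:
            # Sentinel value: no more work, shut the worker down.
            jobs.task_done()
            break
        try:
            results[job["id"]] = ("done", push_buttons(job))
        except Exception as exc:
            # Portals fail in many ways; record the failure for the API caller.
            results[job["id"]] = ("failed", str(exc))
        finally:
            jobs.task_done()


def fake_push_buttons(job):
    """Stand-in for the script that would drive a real browser session."""
    if not job.get("invoice"):
        raise ValueError("missing invoice number")
    return "submitted %s" % job["invoice"]


def submit_all(job_list):
    """Queue the given jobs, run one worker thread, and return the results."""
    jobs, results = queue.Queue(), {}
    worker = threading.Thread(target=run_worker, args=(jobs, results, fake_push_buttons))
    worker.start()
    for job in job_list:
        jobs.put(job)
    jobs.put(None)
    worker.join()
    return results
```

The API call only enqueues the job and returns; the slow browser session happens in the worker, and the caller checks the recorded result later.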

Slide 56

Slide 56 text

Emerging alternatives
• PhantomJS
  • http://phantomjs.org
• Ghost.py
  • http://jeanphix.me/Ghost.py/