Most of the world hasn’t
embraced API-centric
development
Slide 15
Slide 15 text
Most of the world’s
interesting data isn’t
API accessible
Slide 16
Slide 16 text
If you want to use this
data, you need to use
unconventional tactics.
Slide 17
Slide 17 text
We can build a user-
facing API that works
the way we want it to
Slide 18
Slide 18 text
Tips and tricks for
dealing with
non-API systems
Slide 19
Slide 19 text
Not productized
(but could be)
Slide 20
Slide 20 text
Data In
• RSS
• Email
• MS Word Documents
• PDF documents
• Web scraping
Slide 21
Slide 21 text
RSS handling
• FeedParser
• http://code.google.com/p/feedparser/
• Timed tasks/cron to retrieve content
• Pay attention to RSS update frequency
• Pay attention to server failures
• Pay attention to object UUIDs
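A minimal sketch of the UUID bookkeeping described above. To stay dependency-free it parses RSS 2.0 with the standard library's ElementTree rather than FeedParser; the function name `new_entries`, the fallback from `guid` to `link`, and the sample feed are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the second item has no <guid>.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.com/1</link><guid>urn:1</guid></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""


def new_entries(rss_xml, seen_ids):
    """Return items not seen on earlier runs; seen_ids is updated in place."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer the feed's own guid; fall back to the link when it is missing.
        uid = item.findtext("guid") or item.findtext("link")
        if not uid or uid in seen_ids:
            continue
        seen_ids.add(uid)
        fresh.append({"id": uid, "title": item.findtext("title", default="")})
    return fresh
```

A timed task would fetch the feed, call `new_entries` with a persisted set of IDs, and only process what comes back.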
Slide 22
Slide 22 text
Email handling
• Low volume
• Timed task/cron job + Python script
• High volume
• Mail server filter scripts
• Mailgun et al.
Slide 23
Slide 23 text
Email content
• Text content - Regular Expressions
• HTML content - lxml
• Attachments - depends on content type
• Emails may be recursive - “message/rfc822”
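One possible way to handle nested messages with the standard library's `email` package: `walk()` descends into "message/rfc822" parts, so text inside a forwarded email is visited too. The regular expression, the function name `extract_totals`, and the sample message are hypothetical.

```python
import email
import re

# Hypothetical pattern; a real one depends on the emails being processed.
TOTAL_RE = re.compile(r"Invoice total: \$(\d+\.\d{2})")

# Hypothetical message: a plain-text part plus an attached (forwarded) email.
SAMPLE = (
    "From: a@example.com\n"
    "Subject: outer\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "Invoice total: $120.50\n"
    "--XX\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: inner\n"
    "Content-Type: text/plain\n"
    "\n"
    "Invoice total: $7.25\n"
    "--XX--\n"
)


def extract_totals(raw_message):
    """Collect regex matches from every text/plain part, nested ones included."""
    message = email.message_from_string(raw_message)
    totals = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue  # containers and other attachment types are skipped here
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        totals.extend(TOTAL_RE.findall(payload.decode(charset, errors="replace")))
    return totals
```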
Slide 24
Slide 24 text
MIME types
• Email describes its MIME type
• Each attachment is MIME typed
• Use MIME type to determine processing
• Unless sent by Microsoft tools!
• Use mimetypes.guess_type()
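A small sketch of the fallback suggested above: trust the declared MIME type unless it is missing or generic, in which case guess from the filename. The function name `effective_type` and the choice of which declared types count as "untrustworthy" are assumptions.

```python
import mimetypes

# Declared types treated as uninformative (an assumption for this sketch).
GENERIC_TYPES = {None, "", "application/octet-stream"}


def effective_type(declared, filename):
    """Return the declared MIME type, or a filename-based guess if it is generic."""
    if declared not in GENERIC_TYPES:
        return declared
    guessed, _encoding = mimetypes.guess_type(filename or "")
    # Keep the declared value when the extension is unknown.
    return guessed or declared
```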
Slide 25
Slide 25 text
Word Documents
• Two different formats
• DOCX
• “Office Open XML” - easy to parse
• github.com/mikemaccana/python-docx
• DOC
• Convert to PDF using OpenOffice
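To illustrate why DOCX is easy to parse: the file is a zip archive whose main body lives in `word/document.xml`. The sketch below uses only the standard library instead of python-docx; `docx_paragraphs` and `make_sample_docx` are hypothetical helper names, and the sample archive contains only the one member the reader needs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(file_obj):
    """Return the text of each non-empty paragraph in a DOCX file object."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory archive for demonstration (not a complete DOCX)."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p/></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    buf.seek(0)
    return buf
```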
Slide 26
Slide 26 text
PDF Processing
• PDF is a printing format
• Internally:
• Vector-based drawing instructions
• May contain attachments
Slide 27
Slide 27 text
PDF2HTML
Slide 28
Slide 28 text
PDFMiner
• Available on PyPI
• Not actively maintained... :-(
• ... but it works
• Can be used to extract raw drawing info
Slide 29
Slide 29 text
How to use PDFMiner
Slide 30
Slide 30 text
Using PDFMiner (1)
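# Note: this is the original pdfminer API; pdfminer.six moved PDFDocument
from pdfminer.pdfparser import PDFParser, PDFDocument

password = ''  # empty string for documents without a password

# PDFs are binary files, so open in 'rb' mode
with open('myfile.pdf', 'rb') as file_obj:
    parser = PDFParser(file_obj)
    doc = PDFDocument()
    # connect the parser and document objects
    parser.set_document(doc)
    doc.set_parser(parser)
    # supply the password for initialization
    doc.initialize(password)
    if doc.is_extractable:
        # extract positioned text and images from every page
        data = parse_pages(doc)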
Slide 31
Slide 31 text
Using PDFMiner (2)
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextLine, LTImage, LTTextBox, LTFigure

def parse_pages(doc):
    data = {}
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for i, page in enumerate(doc.get_pages()):
        interpreter.process_page(page)
        layout = device.get_result()
        parse_layout_objs(layout, (i + 1), data)
    return data
Slide 32
Slide 32 text
Using PDFMiner (3)
def parse_layout_objs(lt_objs, page_number, data):
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTTextLine):
            # record the position and text of each line
            data.setdefault(page_number, []).append(
                (lt_obj.bbox, lt_obj.text.strip()))
        elif isinstance(lt_obj, LTImage):
            # images have a position but no text
            data.setdefault(page_number, []).append(
                (lt_obj.bbox, None))
        elif isinstance(lt_obj, (LTTextBox, LTFigure)):
            # containers: recurse into their children
            parse_layout_objs(lt_obj, page_number, data)
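The extraction code yields `(bbox, text)` tuples in drawing order, not reading order. One possible post-processing step, sketched here under assumptions, groups items whose top edges are within a tolerance into rows and sorts each row left to right. The function name `rows_from_items`, the tolerance value, and the coordinates in the usage example are all hypothetical.

```python
def rows_from_items(items, tolerance=2.0):
    """Group (bbox, text) tuples into rows of text, top of page first.

    bbox is (x0, y0, x1, y1) with the origin at the bottom-left of the page,
    so a larger y1 means the item sits higher up.
    """
    # Images are stored with text None; keep only items that carry text.
    text_items = [(bbox, text) for bbox, text in items if text]
    text_items.sort(key=lambda item: -item[0][3])
    rows = []
    for bbox, text in text_items:
        if rows and abs(rows[-1]["top"] - bbox[3]) <= tolerance:
            rows[-1]["cells"].append((bbox[0], text))
        else:
            rows.append({"top": bbox[3], "cells": [(bbox[0], text)]})
    # Sort each row by x0 so cells read left to right.
    return [[text for _x0, text in sorted(row["cells"])] for row in rows]


# Hypothetical coordinates for a two-column table.
sample_items = [
    ((200, 700, 230, 712), "Age"),
    ((50, 700, 90, 712), "Name"),
    ((50, 680, 120, 692), "Russell"),
    ((200, 681, 215, 693), "37"),
    ((10, 10, 20, 20), None),
]
```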
Name            Age
Russell Magee   37
Alex Gaynor     22
Zach Voase      20
Slide 39
Slide 39 text
Ad hoc parsers
Slide 40
Slide 40 text
A DSL for parsing?
Slide 41
Slide 41 text
Web scraping
• Tools
• LXML
• BeautifulSoup
• Scrapy
• requests
• Plenty of tutorials out there
• http://newcoder.io/Intro-Scrape/
Slide 42
Slide 42 text
Guerrilla RSS
• Some websites update data regularly
• Should be RSS, but isn’t
• Many of the same problems apply
• UUIDs are needed, but often won’t be provided
• Frequency needs to be monitored
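When a scraped page provides no identifiers, one option is to derive a stable ID from the fields that define an item. This is a sketch under assumptions: the choice of fields (page URL, title, date text), the normalization, and the names `synthetic_id` and `unseen` are illustrative, not a prescribed scheme.

```python
import hashlib


def synthetic_id(page_url, title, date_text):
    """Derive a repeatable ID from the fields that identify a scraped item."""
    # Collapse whitespace and case so cosmetic changes do not create new IDs.
    normalized = [
        page_url.strip(),
        " ".join(title.split()).lower(),
        date_text.strip(),
    ]
    return hashlib.sha256("\x1f".join(normalized).encode("utf-8")).hexdigest()


def unseen(rows, page_url, seen_ids):
    """Return (id, title, date) for rows not seen before; updates seen_ids."""
    fresh = []
    for title, date_text in rows:
        uid = synthetic_id(page_url, title, date_text)
        if uid not in seen_ids:
            seen_ids.add(uid)
            fresh.append((uid, title, date_text))
    return fresh
```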
What’s the problem?
• Server-side session state
• Dynamic page manipulation
• Dynamic form data manipulation
Slide 54
Slide 54 text
You need to re-
implement a browser
to get at this data.
Slide 55
Slide 55 text
How to poke a portal
• User makes an API call (“Submit invoice”)
• Queue a job
• Selenium opens a browser session
• Firefox opens a browser
• Xvfb captures the visual output
• Script pushes the same buttons as a human
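The flow above can be sketched as a queue between the API call and a worker. This is only an outline under assumptions: the browser step is a stand-in function, since the real one would drive Firefox through Selenium under Xvfb, and every name here (`submit_invoice`, `run_worker`, `fake_portal_session`) is hypothetical.

```python
import queue

jobs = queue.Queue()
results = {}


def submit_invoice(invoice_id, amount):
    """User-facing API call: queue the work and return immediately."""
    jobs.put({"invoice_id": invoice_id, "amount": amount})
    return {"status": "queued", "invoice_id": invoice_id}


def fake_portal_session(job):
    """Stand-in for the browser script that clicks through the portal."""
    return "submitted {invoice_id} for {amount}".format(**job)


def run_worker(drive_portal):
    """Process queued jobs until the queue is empty, recording each outcome."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            results[job["invoice_id"]] = drive_portal(job)
        except Exception as exc:
            # Portals fail in many ways; record the failure for the API to report.
            results[job["invoice_id"]] = "failed: %s" % exc
```

In practice the worker would run in a separate process from the API, and `drive_portal` would be the function that opens the browser session.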