Guerrilla APIs: integrating web systems that weren't designed to be integrated

In an ideal world, every web system would provide a well-designed REST API with OAuth authentication. But what do you do when that doesn't exist?

In this talk, Russell Keith-Magee will explore some unorthodox techniques for extracting information from, and inserting information into, the sort of systems you see in the real world outside Silicon Valley -- systems that don't provide a nice RESTful API.

Russell Keith-Magee

September 05, 2013
Transcript

  1. Guerrilla APIs: Integrating web systems that weren’t designed to be integrated
    Dr Russell Keith-Magee, DjangoCon US 2013
  2. If you want to use this data, you need to use unconventional tactics.
  3. Data In
    • RSS
    • Email
    • MS Word documents
    • PDF documents
    • Web scraping
  4. RSS handling
    • FeedParser
      • http://code.google.com/p/feedparser/
    • Timed tasks/cron to retrieve content
    • Pay attention to RSS update frequency
    • Pay attention to server failures
    • Pay attention to object UUIDs
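The point about object UUIDs can be sketched as follows. This is a minimal illustration, not the approach from the talk: it parses an inline sample feed with the standard library's `xml.etree.ElementTree` instead of FeedParser so that it runs without third-party packages, and the names `new_items`, `SAMPLE_FEED` and `seen_guids` are hypothetical. A real timed task would fetch the feed over HTTP and persist the set of seen identifiers between runs.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for a feed that a cron job would normally download.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><guid>urn:uuid:0001</guid><title>First</title></item>
  <item><guid>urn:uuid:0002</guid><title>Second</title></item>
</channel></rss>"""


def new_items(feed_xml, seen):
    """Return (guid, title) pairs for items whose guid is not yet in `seen`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        title = item.findtext("title", default="")
        if guid is None or guid in seen:
            continue  # no identifier, or already processed on an earlier run
        seen.add(guid)
        fresh.append((guid, title))
    return fresh


seen_guids = set()  # assumption: a real job would store this in a database
first_run = new_items(SAMPLE_FEED, seen_guids)
second_run = new_items(SAMPLE_FEED, seen_guids)  # same feed, nothing new
```

Polling the same feed twice yields the items only once, which is why stable identifiers matter when the retrieval frequency is higher than the update frequency.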
  5. Email handling
    • Low volume
      • Timed task/cron job + Python script
    • High volume
      • Mail server filter scripts
      • Mailgun et al.
  6. Email content
    • Text content - regular expressions
    • HTML content - lxml
    • Attachments - depends on content type
    • Emails may be recursive - “message/rfc822”
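A minimal sketch of the text-content and recursive-message points, using only the standard library `email` package. The invoice wording, the `AMOUNT_RE` pattern and the `extract_totals` helper are assumptions made for illustration. The sketch relies on `Message.walk()` descending into attached “message/rfc822” parts, so a forwarded email is searched along with the outer one.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical pattern for the value we want to pull out of the text.
AMOUNT_RE = re.compile(r"Total:\s*\$(\d+\.\d{2})")


def extract_totals(msg):
    """Search every text/plain part, including those inside forwarded emails."""
    totals = []
    for part in msg.walk():  # walk() also visits parts nested in message/rfc822
        if part.get_content_type() == "text/plain":
            match = AMOUNT_RE.search(part.get_content())
            if match:
                totals.append(match.group(1))
    return totals


# Build a forwarded email: the interesting text lives in the inner message.
inner = EmailMessage()
inner["Subject"] = "Invoice 42"
inner.set_content("Total: $120.50\n")

outer = EmailMessage()
outer["Subject"] = "Fwd: Invoice 42"
outer.set_content("See the forwarded message.\n")
outer.add_attachment(inner)  # attached as message/rfc822

# Serialize and re-parse to mimic a message arriving from a mail server.
received = message_from_bytes(outer.as_bytes(), policy=policy.default)
totals = extract_totals(received)
```

HTML parts would be handled in the same loop by checking for `text/html` and handing the content to an HTML parser.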
  7. MIME types
    • Email describes its MIME type
    • Each attachment is MIME typed
    • Use MIME type to determine processing
    • Unless sent by Microsoft tools!
    • Use mimetypes.guess_type()
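One way to combine the declared type with `mimetypes.guess_type()` is sketched below. The rule used here, treating `application/octet-stream` or a missing type as untrustworthy and guessing from the filename instead, is an assumption for illustration; the `effective_mime_type` helper is hypothetical.

```python
import mimetypes

# Assumption: a generic declared type tells us nothing, so we guess instead.
GENERIC_TYPES = {"application/octet-stream"}


def effective_mime_type(declared_type, filename):
    """Prefer the declared MIME type; fall back to a filename-based guess."""
    if declared_type and declared_type.lower() not in GENERIC_TYPES:
        return declared_type.lower()
    guessed, _encoding = mimetypes.guess_type(filename or "")
    # guess_type() returns (None, None) when the extension is unknown.
    return guessed or "application/octet-stream"


pdf_type = effective_mime_type("application/octet-stream", "report.pdf")
text_type = effective_mime_type("text/plain", "notes.txt")
unknown_type = effective_mime_type(None, "mystery")
```

The result can then be used as the key for choosing a processing function, for example a dictionary mapping MIME types to handlers.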
  8. Word Documents
    • Two different formats
    • DOCX
      • “Office Open XML” - easy to parse
      • github.com/mikemaccana/python-docx
    • DOC
      • Convert to PDF using OpenOffice
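To show why DOCX is easy to parse: a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below reads paragraph text with the standard library only, rather than with python-docx as the slide suggests. The `docx_paragraphs` helper is hypothetical, and the in-memory archive is a stripped-down stand-in that contains just the one XML part, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(file_obj):
    """Return the text of each non-empty paragraph in a DOCX file object."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


# Minimal stand-in archive so the sketch runs without a real document.
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    "<w:p><w:r><w:t>Invoice </w:t></w:r><w:r><w:t>42</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Total: $120.50</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
```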
  9. PDF Processing
    • PDF is a printing format
    • Internally:
      • Vector-based drawing instructions
      • May contain attachments
  10. PDFMiner
    • Available on PyPI
    • Not actively maintained... :-(
    • ... but it works
    • Can be used to extract raw drawing info
  11. Using PDFMiner (1)

        from pdfminer.pdfparser import PDFParser, PDFDocument

        password = ''  # empty string for documents without a password

        # PDFs are binary files, so open them in 'rb' mode
        with open('myfile.pdf', 'rb') as file_obj:
            parser = PDFParser(file_obj)
            doc = PDFDocument()
            # connect the parser and document objects
            parser.set_document(doc)
            doc.set_parser(parser)
            # supply the password for initialization
            doc.initialize(password)
            if doc.is_extractable:
                # apply the function and return the result
                data = parse_pages(doc)
  12. Using PDFMiner (2)

        from pdfminer.pdfinterp import *
        from pdfminer.converter import PDFPageAggregator
        from pdfminer.layout import *

        def parse_pages(doc):
            data = {}
            rsrcmgr = PDFResourceManager()
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for i, page in enumerate(doc.get_pages()):
                interpreter.process_page(page)
                layout = device.get_result()
                # page numbers are 1-based in the result
                parse_layout_objs(layout, (i + 1), data)
            return data
  13. Using PDFMiner (3)

        def parse_layout_objs(lt_objs, page_number, data):
            for lt_obj in lt_objs:
                if isinstance(lt_obj, LTTextLine):
                    # record the position and text of each line
                    data.setdefault(page_number, []).append(
                        (lt_obj.bbox, lt_obj.text.strip()))
                elif isinstance(lt_obj, LTImage):
                    # record the position of images, with no text
                    data.setdefault(page_number, []).append(
                        (lt_obj.bbox, None))
                elif isinstance(lt_obj, LTTextBox):
                    # containers: recurse into their children
                    parse_layout_objs(lt_obj, page_number, data)
                elif isinstance(lt_obj, LTFigure):
                    parse_layout_objs(lt_obj, page_number, data)
  14. Using PDFMiner (4)

        data = {
            1: [
                ((1.0, 1.0, 2.0, 0.25), 'Hello'),
                ((3.25, 1.0, 2.0, 0.25), 'World'),
                ((1.0, 1.3, 3.5, 0.25), 'This is'),
                ((4.75, 1.3, 4.0, 0.25), 'your life.'),
            ],
            2: [...]
            ...
        }
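Once the text fragments and their positions have been extracted, they usually need to be reassembled into lines. The sketch below is one possible way to do that for the data shape shown on the slide. It assumes the first two bbox values are the x and y position of each fragment and that smaller y values come first on the page, as the sample values suggest. Real PDF coordinates start at the bottom-left corner, so real output would normally need `top_first=True`. The `page_lines` helper is hypothetical.

```python
from collections import defaultdict

# Sample data in the shape shown on the slide (page 1 only).
data = {
    1: [
        ((1.0, 1.0, 2.0, 0.25), "Hello"),
        ((3.25, 1.0, 2.0, 0.25), "World"),
        ((1.0, 1.3, 3.5, 0.25), "This is"),
        ((4.75, 1.3, 4.0, 0.25), "your life."),
    ],
}


def page_lines(items, top_first=False):
    """Group fragments that share a y position and join them left to right."""
    rows = defaultdict(list)
    for bbox, text in items:
        if text is None:
            continue  # image placeholder, nothing to join
        x, y = bbox[0], bbox[1]
        rows[round(y, 1)].append((x, text))  # rounding tolerates small offsets
    ordered = sorted(rows, reverse=top_first)
    return [" ".join(text for _x, text in sorted(rows[y])) for y in ordered]


lines = page_lines(data[1])
```

With the lines rebuilt, the regular-expression techniques used for email text apply to PDF content as well.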
  15. Web scraping
    • Tools
      • LXML
      • BeautifulSoup
      • Scrapy
      • requests
    • Plenty of tutorials out there
      • http://newcoder.io/Intro-Scrape/
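As a small illustration of what a scraper does, the sketch below pulls the rows out of an HTML table. It deliberately uses the standard library's `html.parser` rather than any of the tools listed on the slide, so that it runs without extra packages; with lxml or BeautifulSoup the same job takes a few lines. The `TableScraper` class and the inline `PAGE` string are hypothetical; in practice the HTML would come from an HTTP request.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


# Inline stand-in for a page that would normally be downloaded.
PAGE = """
<table>
  <tr><th>Invoice</th><th>Total</th></tr>
  <tr><td>42</td><td>$120.50</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(PAGE)
rows = scraper.rows
```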
  16. Guerrilla RSS
    • Some websites update data regularly
    • Should be RSS, but isn’t
    • Many of the same problems apply
      • UUIDs are needed, but often aren’t provided
      • Frequency needs to be monitored
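When a scraped page provides no identifiers, one option is to derive them. The sketch below hashes fields that are expected to stay stable between polls. Which fields are stable depends on the site, so the choice of URL, title and publication date is an assumption, as are the `synthetic_id` and `unseen` helper names and the dictionary shape of each entry.

```python
import hashlib


def synthetic_id(source_url, title, published):
    """Derive a stable identifier from fields that should not change."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    key = "\x1f".join([source_url, title.strip(), published.strip()])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def unseen(entries, seen):
    """Return entries whose synthetic id has not been recorded in `seen`."""
    fresh = []
    for entry in entries:
        entry_id = synthetic_id(entry["url"], entry["title"], entry["published"])
        if entry_id not in seen:
            seen.add(entry_id)
            fresh.append(entry)
    return fresh


seen_ids = set()  # assumption: persisted between scraping runs in practice
first_poll = [
    {"url": "http://example.com/news", "title": "Road closed", "published": "2013-09-01"},
]
second_poll = first_poll + [
    {"url": "http://example.com/news", "title": "Road reopened", "published": "2013-09-03"},
]

new_on_first = unseen(first_poll, seen_ids)
new_on_second = unseen(second_poll, seen_ids)
```

If the site edits an entry's title after publication, the hash changes and the entry is treated as new, so the fields should be chosen with that trade-off in mind.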
  17. What’s the problem?
    • Server-side session state
    • Dynamic page manipulation
    • Dynamic form data manipulation
  18. How to poke a portal
    • User makes an API call (“Submit invoice”)
    • Queue a job
    • Selenium opens a browser session
    • Firefox opens a browser
    • Xvfb captures the visual output
    • Script pushes the same buttons as a human
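The queueing step in this flow can be sketched with the standard library alone. In the sketch below the API call only enqueues a job, and a background worker runs the browser script. The `push_buttons` function is a hypothetical stand-in: in the setup described on the slide it would drive Firefox through Selenium under Xvfb, which is not shown here. The job dictionary shape is also an assumption.

```python
import queue
import threading


def push_buttons(job):
    """Hypothetical stand-in for the Selenium script that drives the portal."""
    return {"invoice": job["invoice"], "status": "submitted"}


def worker(jobs, results, run_browser_script):
    """Process queued jobs one at a time until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        try:
            results.append(run_browser_script(job))
        except Exception as exc:  # a flaky portal should not kill the worker
            results.append(
                {"invoice": job["invoice"], "status": "failed", "error": str(exc)}
            )
        finally:
            jobs.task_done()


jobs = queue.Queue()
results = []
thread = threading.Thread(target=worker, args=(jobs, results, push_buttons))
thread.start()

jobs.put({"invoice": "INV-42"})  # what the "Submit invoice" API call would do
jobs.put(None)                   # sentinel: stop the worker for this demo
thread.join()
```

Running the browser work behind a queue keeps the API call fast and lets slow or failed portal sessions be retried without the caller waiting.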