Guerrilla APIs: integrating web systems that weren't designed to be integrated

In an ideal world, every web system would provide a well-designed REST API with OAuth authentication. But what do you do when that doesn't exist?

In this talk, Russell Keith-Magee will explore some unorthodox techniques for extracting information from, and inserting information into, the sort of systems you see in the real world outside Silicon Valley -- systems that don't provide a nice RESTful API.

Russell Keith-Magee

September 05, 2013

Transcript

  1. Guerrilla APIs: Integrating web systems that weren’t designed to be integrated • Dr Russell Keith-Magee • DjangoCon US 2013
  2. None
  3. None
  4. http://en.wikipedia.org/wiki/File:Male_gorilla_in_SF_zoo.jpg

  5. http://en.wikipedia.org/wiki/File:Afrikaner_Commandos2.JPG

  6. Why Guerrilla APIs?

  7. None
  8. None
  9. None
  10. None
  11. None
  12. None
  13. Lessons learned

  14. Most of the world hasn’t embraced API-centric development

  15. Most of the world’s interesting data isn’t API accessible

  16. If you want to use this data, you need to use unconventional tactics.
  17. We can build a user-facing API that works the way we want it to
  18. Tips and tricks for dealing with non-API systems

  19. Not productized (but could be)

  20. Data In • RSS • Email • MS Word Documents

    • PDF documents • Web scraping
  21. RSS handling • FeedParser • http://code.google.com/p/feedparser/ • Timed tasks/cron to

    retrieve content • Pay attention to RSS update frequency • Pay attention to server failures • Pay attention to object UUIDs
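
The polling advice on slide 21 can be sketched roughly as follows. This is a minimal sketch, not the speaker's code: `new_entries` and `poll` are hypothetical helper names, and falling back to the entry link when a feed supplies no ID is an assumption. Only `feedparser.parse()`, its `entries` list and its `status` field come from the library.

```python
def new_entries(entries, seen_ids):
    """Return entries not seen before, recording their IDs in seen_ids."""
    fresh = []
    for entry in entries:
        # Prefer the feed's own identifier; fall back to the link (assumption).
        entry_id = entry.get("id") or entry.get("link")
        if entry_id is None or entry_id in seen_ids:
            continue
        seen_ids.add(entry_id)
        fresh.append(entry)
    return fresh


def poll(url, seen_ids):
    """Fetch a feed once; return new entries, or [] if the server failed."""
    import feedparser  # third-party: pip install feedparser

    feed = feedparser.parse(url)
    # feedparser records the HTTP status when the feed was fetched over HTTP.
    if feed.get("status", 200) >= 400:
        return []
    return new_entries(feed.entries, seen_ids)
```

A cron job or timed task would call `poll()` at an interval matched to how often the feed actually changes, persisting `seen_ids` between runs.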
  22. Email handling • Low volume • Timed task/cron job +

    Python script • High volume • Mail server filter scripts • Mailgun et al.
  23. Email content • Text content - Regular Expressions • HTML

    content - lxml • Attachments - depends on content type • Emails may be recursive - “message/rfc822”
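
As an illustration of slide 23, the standard library `email` package can walk a message, including a forwarded “message/rfc822” part, and apply a regular expression to each plain-text body. The `INV-` reference format and the function name are hypothetical, chosen only for this sketch.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage

INVOICE_RE = re.compile(r"\bINV-\d+\b")  # hypothetical reference format


def invoice_numbers(raw_bytes):
    """Collect invoice references from every text/plain part of a message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    found = []
    # walk() descends into multipart containers and message/rfc822 parts.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            found.extend(INVOICE_RE.findall(part.get_content()))
    return found


# Build a forwarded message to exercise the recursion.
inner = EmailMessage()
inner["Subject"] = "Original"
inner.set_content("Please pay INV-2002.")

outer = EmailMessage()
outer["Subject"] = "Fwd: Original"
outer.set_content("Forwarding INV-1001 and the original mail.")
outer.add_attachment(inner)  # attached as message/rfc822

print(invoice_numbers(outer.as_bytes()))
```

HTML parts would need an HTML parser such as lxml instead of a regular expression, as the slide suggests.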
  24. MIME types • Email describes its MIME type • Each attachment is MIME typed • Use MIME type to determine processing • Unless sent by Microsoft tools! • Use mimetypes.guess_type()
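
One possible reading of slide 24, sketched under assumptions: trust the declared MIME type unless it is missing or generic, and otherwise guess from the attachment's filename. The `GENERIC_TYPES` set and the `effective_type` name are hypothetical; `mimetypes.guess_type()` is the standard library call named on the slide.

```python
import mimetypes

# Declared types that say nothing useful about the content (assumption).
GENERIC_TYPES = {"application/octet-stream", None}


def effective_type(declared, filename):
    """Pick a MIME type to dispatch on, guessing from the name if needed."""
    if declared not in GENERIC_TYPES:
        return declared
    guessed, _encoding = mimetypes.guess_type(filename or "")
    return guessed or "application/octet-stream"
```

For example, `effective_type("application/octet-stream", "report.pdf")` falls back to the filename and yields `application/pdf`.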
  25. Word Documents • Two different formats • DOCX • “Office Open XML” - easy to parse • github.com/mikemaccana/python-docx • DOC • Convert to PDF using OpenOffice
  26. PDF Processing • PDF is a printing format • Internally:

    • Vector-based drawing instructions • May contain attachments
  27. PDF2HTML

  28. PDFMiner • Available on PyPI • Not actively maintained... :-(

    • ... but it works • Can be used to extract raw drawing info
  29. How to use PDFMiner

  30. Using PDFMiner (1)

        from pdfminer.pdfparser import PDFParser, PDFDocument

        password = ''  # empty string for unencrypted PDFs

        # PDFs are binary files, so open in 'rb' mode
        with open('myfile.pdf', 'rb') as file_obj:
            parser = PDFParser(file_obj)
            doc = PDFDocument()
            # connect the parser and document objects
            parser.set_document(doc)
            doc.set_parser(parser)
            # supply the password for initialization
            doc.initialize(password)
            if doc.is_extractable:
                # apply the function and return the result
                data = parse_pages(doc)
  31. Using PDFMiner (2)

        from pdfminer.pdfinterp import *
        from pdfminer.converter import PDFPageAggregator
        from pdfminer.layout import *

        def parse_pages(doc):
            data = {}
            rsrcmgr = PDFResourceManager()
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for i, page in enumerate(doc.get_pages()):
                interpreter.process_page(page)
                layout = device.get_result()
                parse_layout_objs(layout, (i + 1), data)
            return data
  32. Using PDFMiner (3)

        def parse_layout_objs(lt_objs, page_number, data):
            for lt_obj in lt_objs:
                if isinstance(lt_obj, LTTextLine):
                    # record the position and text of each line
                    data.setdefault(page_number, []).append(
                        (lt_obj.bbox, lt_obj.text.strip()))
                elif isinstance(lt_obj, LTImage):
                    # record the position of images, with no text
                    data.setdefault(page_number, []).append(
                        (lt_obj.bbox, None))
                elif isinstance(lt_obj, LTTextBox):
                    # text boxes contain lines; recurse into them
                    parse_layout_objs(lt_obj, page_number, data)
                elif isinstance(lt_obj, LTFigure):
                    # figures can contain nested objects; recurse
                    parse_layout_objs(lt_obj, page_number, data)
  33. Using PDFMiner (4)

        data = {
            1: [
                ((1.0, 1.0, 2.0, 0.25), 'Hello'),
                ((3.25, 1.0, 2.0, 0.25), 'World'),
                ((1.0, 1.3, 3.5, 0.25), 'This is'),
                ((4.75, 1.3, 4.0, 0.25), 'your life.'),
            ],
            2: [...],
            ...
        }
  34. Hello your life. This is World

  35. Hello your life. This is World
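
Slides 34 and 35 show text fragments arriving out of reading order. A minimal sketch of one way to restore it, assuming the illustrative layout of slide 33 where each box is `(x, y, width, height)` and `y` grows down the page (real PDFMiner bounding boxes use a different convention). The function name and the tolerance value are hypothetical.

```python
def reading_order(items, y_tolerance=0.05):
    """Group (bbox, text) pairs into lines by y, then order each line by x."""
    rows = []  # list of (y, [(x, text), ...])
    ordered = sorted(items, key=lambda item: (item[0][1], item[0][0]))
    for (x, y, _width, _height), text in ordered:
        if text is None:  # images carry no text
            continue
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # same line as the previous item
        else:
            rows.append((y, [(x, text)]))  # start a new line
    return [" ".join(t for _x, t in sorted(cells)) for _y, cells in rows]


scrambled = [
    ((1.0, 1.0, 2.0, 0.25), "Hello"),
    ((4.75, 1.3, 4.0, 0.25), "your life."),
    ((1.0, 1.3, 3.5, 0.25), "This is"),
    ((3.25, 1.0, 2.0, 0.25), "World"),
]
print(reading_order(scrambled))
```

Tabular layouts like the one on the following slides need column grouping by `x` as well, which is where ad hoc parsers come in.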

  36. Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase

    Magee
  37. Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase

    Magee
  38. Name Russell 37 Age 22 Alex 20 Zach Gaynor Voase

    Magee
  39. Ad hoc parsers

  40. A DSL for parsing?

  41. Web scraping • Tools • LXML • BeautifulSoup • Scrapy

    • requests • Plenty of tutorials out there • http://newcoder.io/Intro-Scrape/
  42. Guerrilla RSS • Some websites update data regularly • Should be RSS, but isn’t • Many of the same problems apply • UUIDs needed, but often not provided • Frequency needs to be monitored
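
When a page offers no identifiers, one option is to derive a stable ID from something that does not change, such as the item URL. This is a sketch under that assumption, using only the standard library; `LinkCollector` and `scrape_items` are hypothetical names, and a real page would need a more selective parser (lxml or BeautifulSoup, as on slide 41).

```python
import uuid
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_items(html, base_url):
    """Return items with an ID derived deterministically from each URL."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [
        {"id": str(uuid.uuid5(uuid.NAMESPACE_URL, url)), "url": url, "title": title}
        for url, title in parser.items
    ]
```

Because `uuid.uuid5` is deterministic, scraping the same page twice yields the same IDs, so previously seen items can be skipped.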
  43. http://source.mozillaopennews.org/en-US/learning/sane-data-updates-are-harder-you-think-part-3/

  44. Data out

  45. Web forms

  46. Just use requests
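
A minimal sketch of submitting a web form with requests. It assumes the form posts back to the URL it was served from and carries its session or CSRF state in hidden inputs; both are assumptions about the target site, and `hidden_fields` and `submit_form` are hypothetical names.

```python
from html.parser import HTMLParser


class HiddenFields(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    parser = HiddenFields()
    parser.feed(html)
    return parser.fields


def submit_form(form_url, values):
    """Fetch the form, echo back its hidden fields, and post our values."""
    import requests  # third-party: pip install requests

    # A Session keeps cookies between the GET and the POST.
    with requests.Session() as session:
        page = session.get(form_url)
        page.raise_for_status()
        payload = hidden_fields(page.text)
        payload.update(values)
        response = session.post(form_url, data=payload)
        response.raise_for_status()
        return response
```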

  47. None
  48. Portals

  49. Not just about web forms

  50. What’s the problem? • Server-side session state • Dynamic page

    manipulation • Dynamic form data manipulation
  51. None
  52. None
  53. None
  54. You need to re-implement a browser to get at this data.
  55. How to poke a portal • User makes an API

    call (“Submit invoice”) • Queue a job • Selenium opens a browser session • Firefox opens a browser • Xvfb captures the visual output • Script pushes the same buttons as a human
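
The pipeline on slide 55 might be sketched like this. The field names, the `submit` button name and the job format are hypothetical; `driver.get()`, `find_element()`, `send_keys()`, `click()` and `quit()` are Selenium WebDriver calls. Running the worker under a tool such as `xvfb-run` is one way to give Firefox a virtual display.

```python
import queue

jobs = queue.Queue()  # the API call enqueues a job and returns immediately


def submit_invoice(driver, portal_url, fields):
    """Push the same buttons a human would."""
    driver.get(portal_url)
    for name, value in fields.items():
        # "name" is the locator strategy Selenium exposes as By.NAME.
        driver.find_element("name", name).send_keys(value)
    driver.find_element("name", "submit").click()


def worker(make_driver):
    """Process queued jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        driver = make_driver()
        try:
            submit_invoice(driver, job["url"], job["fields"])
        finally:
            driver.quit()  # always close the browser session


def firefox():
    """Driver factory for real use; needs selenium and Firefox installed."""
    from selenium import webdriver

    return webdriver.Firefox()
```

In production the worker would be started as `worker(firefox)`; passing the factory in keeps the button-pushing logic testable without a browser.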
  56. Emerging alternatives • PhantomJS • http://phantomjs.org • Ghost.py • http://jeanphix.me/Ghost.py/

  57. Huge market opportunity

  58. Assuming they even have a web form...

  59. “Submit your completed forms via email”

  60. “Submit your completed forms via email fax”

  61. Reportlab

  62. http://en.wikipedia.org/wiki/File:Afrikaner_Commandos2.JPG

  63. Questions? http://cecinestpasun.com russell@keith-magee.com @freakboy3742