
Scraping PDFs with Tabula


Manuel Aristarán

November 01, 2013


Transcript

  1. Scraping PDFs with Tabula
    @manuelaristaran, Knight-Mozilla OpenNews Fellow 2013 @ La Nación, Buenos Aires, Argentina
    tabula.nerdpower.org, @TabulaPDF
  2. • PDF is the worst possible format for information exchange.
    • Electronic paper: it’s meant to be rendered the same way regardless of the device.
    • But we care about content, not form.
  3. “The crime stats are subject to being corruptible in an excel sheet. They have been changed in the past by persons unknown and this affects the veracity of the original data posted. If stats are posted on-line in a PDF format, this reduces the risk of contamination. Note if data was kept on a SQL, the data could be viewed, manipulated and accessed by many and yet keep original and intact. This is cost prohibited and will not be pursued. Effective immediately the stats should just be posted in a PDF format.”
    http://www.minnpost.com/data/2013/09/update-minneapolis-police-department-restores-accessible-data-format
  4. • This data is trapped in a PDF.
    • To produce information, we need to process and analyze it.
  5. With the help of Mike Tigas (OpenNews fellow @ ProPublica) and Jeremy B. Merrill (News Apps fellow @ ProPublica).
    Free as in freedom, free as in beer: http://github.com/jazzido/tabula
  6. • Tabula is making an ambitious claim.
    • Tabular information can be represented in many different ways. Very heuristic problem, lots of edge cases. Neverending story.
  7. • Holy Grail: scanned documents
    • Tabula can’t do anything with them. Yet. ¡We want your patches, amigo!
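
Slide 6 calls table extraction a heuristic problem with many edge cases. To give a rough feel for what such a heuristic can look like, here is a minimal sketch in Python. It is not Tabula's actual algorithm: the input format (text fragments with left, right, and top coordinates), the function names, and the tolerance values are all assumptions made for illustration. It groups fragments into rows by vertical position and infers columns from horizontal gaps that no fragment covers.

```python
from typing import List, Tuple

# Hypothetical input: (x_left, x_right, y_top, text) for each text fragment,
# roughly what a PDF text extractor might report. Not Tabula's data model.
Fragment = Tuple[float, float, float, str]


def group_rows(fragments: List[Fragment], y_tol: float = 2.0) -> List[List[Fragment]]:
    """Group fragments whose top coordinate is within y_tol of the row's first fragment."""
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f[2]):
        if rows and abs(rows[-1][0][2] - frag[2]) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return rows


def column_spans(fragments: List[Fragment], min_gap: float = 5.0) -> List[List[float]]:
    """Merge the horizontal extents of all fragments; a gap of at least min_gap starts a new column."""
    spans: List[List[float]] = []
    for left, right, _, _ in sorted(fragments, key=lambda f: f[0]):
        if spans and left - spans[-1][1] < min_gap:
            spans[-1][1] = max(spans[-1][1], right)
        else:
            spans.append([left, right])
    return spans


def to_table(fragments: List[Fragment], y_tol: float = 2.0, min_gap: float = 5.0) -> List[List[str]]:
    """Assign each fragment to the column span containing its horizontal center."""
    spans = column_spans(fragments, min_gap)
    table: List[List[str]] = []
    for row in group_rows(fragments, y_tol):
        cells = [""] * len(spans)
        for left, right, _, text in sorted(row, key=lambda f: f[0]):
            center = (left + right) / 2
            for i, (lo, hi) in enumerate(spans):
                if lo <= center <= hi:
                    cells[i] = (cells[i] + " " + text).strip()
                    break
        table.append(cells)
    return table


# Made-up sample fragments for a two-column table.
sample = [
    (10, 40, 100, "Name"), (80, 110, 100, "Count"),
    (10, 35, 115, "Robbery"), (85, 95, 115, "12"),
    (10, 30, 130, "Theft"), (85, 95, 130, "7"),
]
table = to_table(sample)
```

Even this toy version shows where the edge cases come from: a cell that spans two columns, a header wider than the gap between columns, or slightly misaligned baselines will each defeat the fixed `y_tol` and `min_gap` thresholds.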