Scraping PDFs with Tabula

@manuelaristaran
Knight-Mozilla OpenNews Fellow 2013 @ La Nación
Buenos Aires, Argentina
tabula.nerdpower.org, @TabulaPDF

• PDF is the worst possible format for information exchange.
• Electronic paper: it’s meant to be rendered the same way regardless of the device.
• But we care about content, not form.

Still, they’re regularly used for publishing important information.

Ignorance

Malice

“The crime stats are subject to being corruptible in an excel sheet. They have been changed in the past by persons unknown and this affects the veracity of the original data posted. If stats are posted on-line in a PDF format, this reduces the risk of contamination. Note if data was kept on a SQL, the data could be viewed, manipulated and accessed by many and yet keep original and intact. This is cost prohibited and will not be pursued. Effective immediately the stats should just be posted in a PDF format.”
http://www.minnpost.com/data/2013/09/update-minneapolis-police-department-restores-accessible-data-format

Extracting tables from PDF files is a pervasive problem in data journalism.

• This data is trapped in a PDF.
• To produce information, we need to process and analyze it.

tabula.nerdpower.org

With the help of Mike Tigas (OpenNews fellow @ ProPublica) and Jeremy B. Merrill (News Apps fellow @ ProPublica)
Free as in freedom, free as in beer: http://github.com/jazzido/tabula

Demo time

CC-by-sa Flip Federowicz

Still lots to do...

• Tabula is making an ambitious claim.
• Tabular information can be represented in many different ways.
• Very heuristic problem, lots of edge cases. Neverending story.
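
To make "very heuristic" concrete, here is a minimal Python sketch of one such heuristic: grouping positioned text fragments into rows when their vertical coordinates are close enough. This is an illustration only, not Tabula's actual algorithm. The `Fragment` type, the `group_into_rows` function, the `y_tolerance` parameter and the sample values are all hypothetical, made up for this example.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text and the position of its corner (hypothetical type)."""
    x: float
    y: float
    text: str


def group_into_rows(fragments, y_tolerance=2.0):
    """Treat fragments as one row when their y is within y_tolerance
    of the first fragment of the current row (an assumed heuristic)."""
    rows = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, read the cells from left to right.
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Invented sample: a two-row table whose baselines are slightly misaligned.
fragments = [
    Fragment(120.0, 50.4, "Count"),
    Fragment(10.0, 50.0, "Name"),
    Fragment(10.0, 70.1, "apples"),
    Fragment(120.0, 69.8, "3"),
]
table = group_into_rows(fragments)
print(table)
```

The result depends entirely on the tolerance: with the default the sample comes out as two rows of two cells, but a tighter value such as `y_tolerance=0.1` splits it into four one-cell rows. Every choice like this is an edge case waiting to happen.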

• Holy Grail: scanned documents
• Tabula can’t do anything with them. Yet.
¡We want your patches, amigo!

Thank you very much!
@manuelaristaran
Hacks/Hackers Philly 2013

Manuel Aristarán
