Extracting tabular data from PDFs using Camelot & Excalibur - PyCon US 2019

Extracting tabular data from PDFs using Camelot & Excalibur
Vinayak Mehta

@vortex_ape

What to expect from this talk?

Why is Python called Python?

Portable Document Format: History

http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf

PDF: History
• Documents should be viewable on any display and printable on any modern printer
• Hence, Portable Document Format
• Built on top of PostScript
• Packages components required to build a document

PDF: History https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm

PDF: History https://euske.github.io/pdfminer/

Is that a table?

Error 404: Table not found

Unlike

whoami
• Joined SocialCops as an intern in Jan. 2016
• Scraped tabular data from open data sources
• Helped analysts track key metrics in various projects

Existing PDF table extraction tools
• Open-source
  ◦ Tabula, pdfplumber, ...
• Closed-source
  ◦ Smallpdf, PDFtables, ...

Problems with existing tools

A Solution

pdftotext

Problems with this solution
• Output is a text file
• Ad hoc code for each different type of table structure
• Expensive and time-consuming
• Not scalable, not maintainable
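To make the "ad hoc code" point concrete, here is a minimal sketch of the kind of parser this approach tends to produce. It assumes the text came from `pdftotext -layout`, so that columns are separated by runs of two or more spaces. The sample text and the name `parse_layout_text` are made up for illustration.

```python
import re

# Hypothetical sample of what `pdftotext -layout` might emit for a small table.
SAMPLE_TEXT = """\
State            Year    Enrolment
Andhra Pradesh   2014    1200
Bihar            2015    950
"""


def parse_layout_text(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


rows = parse_layout_text(SAMPLE_TEXT)
for row in rows:
    print(row)
```

A rule like this stops working as soon as a cell is empty or two columns are separated by a single space, which is why each new table layout tends to need its own code.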

The Solution

Portable Document Format: History

There is a table!

Why Camelot?

You are in control
• Complete control over table extraction with some tweakable parameters
• Override table areas, columns
• Tweak line recognition
• “Some other things”
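As a rough sketch, the bullets above correspond to keyword arguments of `camelot.read_pdf`: `table_areas` and `columns` for overriding areas and columns, and `line_scale` for line recognition. The coordinates below are placeholders, and `area_to_str` and `extract_with_overrides` are hypothetical helper names, not part of Camelot.

```python
def area_to_str(x1, y1, x2, y2):
    """Format a table area as the "x1,y1,x2,y2" string Camelot expects.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF coordinate space (origin at the bottom-left of the page).
    """
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def extract_with_overrides(pdf_path):
    # Imported lazily so the helper above works without Camelot installed.
    import camelot

    # Stream: restrict parsing to one area and give explicit column separators
    # (x-coordinates, one comma-separated string per table area).
    stream_tables = camelot.read_pdf(
        pdf_path,
        flavor="stream",
        table_areas=[area_to_str(50, 700, 550, 400)],  # placeholder coordinates
        columns=["150,300,420"],                       # placeholder separators
    )

    # Lattice: a larger line_scale (default 15) helps detect shorter lines.
    lattice_tables = camelot.read_pdf(pdf_path, flavor="lattice", line_scale=40)
    return stream_tables, lattice_tables


print(area_to_str(50, 700, 550, 400))
```

Real coordinates have to be read off the page being parsed; the values here only show the expected shape of the arguments.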

Dataframes!

Parsing report
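Each extracted table exposes a pandas DataFrame as `.df` and a dictionary of quality metrics as `.parsing_report`. A minimal sketch of using the report to discard doubtful tables follows. The thresholds are arbitrary, and `looks_reliable` and `reliable_dataframes` are hypothetical names chosen for this example.

```python
def looks_reliable(report, min_accuracy=90.0, max_whitespace=50.0):
    """Return True if a parsing report passes the (arbitrary) thresholds."""
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


def reliable_dataframes(pdf_path):
    import camelot  # lazy import; requires camelot-py

    tables = camelot.read_pdf(pdf_path)
    # table.df is a pandas DataFrame; table.parsing_report is a dict.
    return [t.df for t in tables if looks_reliable(t.parsing_report)]


# A report in the shape Camelot returns; the values are illustrative only.
example_report = {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}
print(looks_reliable(example_report))
```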

“Some other things”

Flag superscripts and subscripts

Strip unnecessary characters

Shift text in cells that span multiple rows/columns

Copy text in cells that span multiple rows/columns
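The four options above map to `read_pdf` keyword arguments: `flag_size`, `strip_text`, `shift_text` and `copy_text`. The sketch below gathers them in one place. The specific values are examples only, and `other_options` and `extract_with_other_options` are hypothetical names. `shift_text` and `copy_text` apply to the Lattice flavor.

```python
def other_options():
    """Example keyword arguments for the four options named above."""
    return {
        "flag_size": True,         # wrap super/subscripts in <s></s>
        "strip_text": " .\n",      # characters to strip from every cell
        "shift_text": ["r", "b"],  # move spanning-cell text right and bottom
        "copy_text": ["v"],        # copy spanning-cell text vertically
    }


def extract_with_other_options(pdf_path):
    import camelot  # lazy import; requires camelot-py

    return camelot.read_pdf(pdf_path, flavor="lattice", **other_options())


print(sorted(other_options()))
```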

Multiple output formats
Replace csv with json, html or excel file
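A small sketch of switching output formats through `TableList.export`, whose `f` argument selects the format. The wrapper `export_tables` and the `SUPPORTED_FORMATS` tuple are hypothetical; the tuple lists only the four formats this slide names.

```python
SUPPORTED_FORMATS = ("csv", "json", "excel", "html")


def export_tables(pdf_path, output_path, fmt="csv"):
    """Extract every table from pdf_path and export it in the given format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")

    import camelot  # lazy import; requires camelot-py

    tables = camelot.read_pdf(pdf_path)
    # Camelot writes one output file per table, derived from output_path.
    tables.export(output_path, f=fmt)
    return len(tables)
```

A single table can be written the same way with `tables[0].to_csv(...)`, `to_json(...)`, `to_html(...)` or `to_excel(...)`.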

Command-line interface

“What’s in a name?”
• As you can already guess, this library is named after The Camelot Project.

Another fun fact
“You... do have some cheese, don't you?”

Installation
$ pip install camelot-py

Comparison with open-source PDF table extraction libraries and tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools

How it works
• Two parsing flavors, Lattice and Stream.
• Lattice looks for lines on a page to identify a table.
• Stream looks for whitespace between words to identify a table.
More details here: https://camelot-py.readthedocs.io/en/master/user/how-it-works.html
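The Stream idea can be illustrated with a toy example on plain text: character positions that are blank in every row are treated as column gaps. This is only an illustration of the principle under that simplifying assumption, not Camelot's implementation, which works with text positions on the PDF page. The function names and sample rows are made up.

```python
def find_column_gaps(lines):
    """Return (start, end) character ranges that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(line[i] == " " for line in padded) for i in range(width)]

    gaps, start = [], None
    for i, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = i                 # a blank run begins
        elif not is_blank and start is not None:
            gaps.append((start, i))   # the blank run ended before i
            start = None
    return gaps                       # a trailing blank run is ignored


def split_on_gaps(lines, gaps):
    """Cut every line at the shared gaps and strip each cell."""
    rows = []
    for line in lines:
        cells, prev = [], 0
        for start, end in gaps:
            cells.append(line[prev:start].strip())
            prev = end
        cells.append(line[prev:].strip())
        rows.append(cells)
    return rows


lines = [
    "Name   Qty  Price",
    "Apple  3    1.20",
    "Kiwi   12   0.50",
]
gaps = find_column_gaps(lines)
table = split_on_gaps(lines, gaps)
print(gaps)
print(table)
```

Lattice takes the opposite approach and relies on ruling lines drawn on the page, so it has no comparable plain-text analogue.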

“But I don’t want to write code” :(

You can use the web interface!

Excalibur
$ excalibur webserver
Go to localhost:5000

Upload a PDF

Autodetect tables

Or draw table areas/columns

Download extracted tables in your favorite format!

Why Excalibur?
• Web interface
• Save once, apply anywhere
• Your data is safe on your machine
• MySQL and Celery for parallel and distributed workloads

“What’s in a name?”

Another fun fact
“Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”

Installation
$ pip install excalibur-py

The road ahead
• Autodetect parsing flavor
• OCR support
• “Make it fast!”
• Web interface enhancements

github.com/vinayak-mehta
github.com/socialcopsdev/camelot
github.com/camelot-dev/excalibur

Questions? vinayakmehta.com
