Extracting tabular data from PDFs using Camelot & Excalibur - PyCon AU 2019

Extracting tabular data from PDFs using Camelot & Excalibur
Vinayak Mehta

Bangalore github.com/pydatabangalore/talks

High-level overview
• Portable Document Format: History
• Table extraction problems
• Demo!
• Project roadmap
• Q&A
And some Python fun facts...

Why is Python called Python?

Portable Document Format: History

http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf

PDF: History
• Created in the early 1990s by Adobe
• Predates the World Wide Web and HTML
• Proprietary format initially, ISO standard as of v1.7 (2008)
• 13 versions released

PDF: History
• Documents should be viewable on any display and printable on any modern printer
• Hence, Portable Document Format
• Subset of Adobe PostScript
• Encapsulates components required to build a document

PDF: History https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm

PDF: History https://euske.github.io/pdfminer/

Is that a table?

Error 404: Table not found

Text selection & PDF “tables”

PDF Table Extraction Tools
• Tabula: Java-based, open-source
• pdfplumber: Python, open-source
• pdftables: Python, proprietary and paid
• pdf-table-extract: Python, open-source
• Smallpdf: Free and paid online service

Problems with existing tools

A Solution

pdftotext

Problems with this solution
• Output is a text file
• Ad hoc code for each different type of table structure
• Not scalable
• Not maintainable
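To make the maintainability problem concrete, here is a minimal sketch of the kind of ad hoc parser this approach tends to produce. It assumes the text came from `pdftotext -layout` and that columns are separated by two or more spaces. The sample text and the function name `parse_layout_text` are made up for illustration.

```python
import re

# Hypothetical sample of what `pdftotext -layout` might print for a small table.
SAMPLE = """\
State        Cases    Deaths
Kerala          12         1
Tamil Nadu       7         0
"""


def parse_layout_text(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


rows = parse_layout_text(SAMPLE)

# A row with an empty middle cell silently loses a column, so the
# remaining value shifts left. Handling this needs table-specific code.
broken = parse_layout_text("Goa                        2\n")
```

The rule works for the sample, but `broken` ends up with two cells instead of three. Each new table layout needs another special case like this.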

The Solution

Camelot & Excalibur
PDF Table Extraction for Humans
Started at SocialCops in 2016, open-sourced in 2018.

Why Camelot?
• Works well out-of-the-box, but very configurable
• Visual debugging and plotting using matplotlib
• Export to multiple formats, including CSV, JSON, Excel, HTML or pandas DataFrames
• Python-based, MIT licensed
• Excellent documentation :)
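A minimal sketch of what this looks like from Python, using calls described in Camelot's documentation (`camelot.read_pdf`, `TableList.export`, `Table.df`, `Table.parsing_report`). The helper name `extract_tables` and any file names passed to it are placeholders, and camelot-py must be installed for the function to run.

```python
def extract_tables(pdf_path, pages="1", flavor="lattice", out_path=None):
    """Read tables from a text-based PDF and optionally export them as CSV."""
    # Imported lazily so this sketch can be loaded without camelot-py installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    for table in tables:
        # parsing_report holds per-table metrics such as accuracy and whitespace.
        print(table.parsing_report)
    if out_path is not None:
        # Writes one CSV file per extracted table, named after out_path.
        tables.export(out_path, f="csv")
    # Each table exposes its cells as a pandas DataFrame.
    return [table.df for table in tables]
```

For example, `extract_tables("foo.pdf", out_path="foo.csv")` would read page 1 of a hypothetical `foo.pdf` and write the result as CSV.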

The table!

Command-line interface

Installation
Using conda (easiest way)
Using pip (after installing tk and ghostscript)

How it works
• Built on top of pdfminer
• Two parsing flavors, Lattice and Stream
• Lattice looks for lines on a page to identify a table.
• Stream looks for whitespace between words to identify a table.
• Disclaimer: Works only with text-based PDFs, not scanned documents.
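The Stream idea can be illustrated with a toy sketch. This is not Camelot's implementation. It only shows how gaps between word positions can suggest column boundaries. The `(x0, x1, text)` word tuples, the `min_gap` threshold, and the function names are assumptions made for the example.

```python
def column_spans(words, min_gap=5.0):
    """Merge horizontal word extents; a gap of at least min_gap starts a new column."""
    spans = []
    for x0, x1, _ in sorted(words):
        if spans and x0 - spans[-1][1] < min_gap:
            # Overlaps or nearly touches the current span: extend it.
            spans[-1][1] = max(spans[-1][1], x1)
        else:
            spans.append([x0, x1])
    return [tuple(span) for span in spans]


def build_rows(rows_of_words, min_gap=5.0):
    """Place every word into the column span that contains it."""
    all_words = [word for row in rows_of_words for word in row]
    spans = column_spans(all_words, min_gap)
    table = []
    for row in rows_of_words:
        cells = [""] * len(spans)
        for x0, x1, text in sorted(row):
            for i, (s0, s1) in enumerate(spans):
                if s0 <= x0 and x1 <= s1:
                    cells[i] = (cells[i] + " " + text).strip()
                    break
        table.append(cells)
    return table


# Hypothetical word boxes for three text rows, already grouped by row.
WORDS = [
    [(10, 40, "State"), (100, 130, "Cases"), (160, 195, "Deaths")],
    [(10, 35, "Tamil"), (38, 60, "Nadu"), (110, 118, "7"), (170, 178, "0")],
    [(10, 28, "Goa"), (170, 178, "2")],
]

table = build_rows(WORDS)
```

Because columns are inferred from positions across all rows, the last row keeps an empty middle cell instead of shifting its value left.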

FUN FACTS AHEAD!

“What’s in a name?”
• As you can already guess, this library is named after The Camelot Project.

Another fun fact “You... do have some cheese, don't you?”

“But I don’t want to write code” :(

You can use the web interface!

Excalibur
$ excalibur webserver
Go to localhost:5000

Why Excalibur?
• Web interface
• Save once, apply anywhere
• Your data is safe on your machine
• MySQL and Celery for parallel and distributed workloads

Installation
Using pip (after installing tk and ghostscript)

FUN FACTS AHEAD!

“What’s in a name?”

Another fun fact
“Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”

Roadmap
• Removing ghostscript and opencv from requirements
• Performance enhancements
• Web interface enhancements
• OCR support
• <your-favorite-feature>?

github.com/vinayak-mehta github.com/camelot-dev/camelot github.com/camelot-dev/excalibur

Questions? vinayakmehta.com @vortex_ape
