Extracting tabular data from PDFs using Camelot & Excalibur - PyCon India 2019

Extracting tabular data from PDFs using Camelot & Excalibur
Vinayak Mehta
vinayakmehta.com | @vortex_ape

$ whoami
vinayak

Bangalore github.com/pydatabangalore/talks
Meetup on Oct 19!

What is this talk about?

High-level overview
• Portable Document Format: History
• Table extraction problems
• Demo!
• Where we’re headed
• Q&A
And some Python fun facts...

Why is Python called Python?

Portable Document Format: History

http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf

PDF: History
• Documents should be viewable on any display and printable on any modern printer
• Hence, Portable Document Format
• Subset of Adobe PostScript
• Encapsulates components required to build a document

PDF: History
• Created in the early 1990s by Adobe
• Predates the World Wide Web and HTML
• Proprietary format initially
• Standardized as ISO 32000-1 in 2008, based on PDF 1.7 (2006)

PDF: History
https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm

PDF: History
https://euske.github.io/pdfminer/

Is that a table?

Error 404: Table not found

CSV

JSON

Text selection & PDF “tables”

PDF Table Extraction Tools
• Tabula: Java-based, open-source
• pdfplumber: Python, open-source
• pdftables: Python, proprietary and paid
• pdf-table-extract: Python, open-source
• Smallpdf: Free and paid online service

Problems with existing tools

pdftotext

Problems with this solution
• Output is a text file
• Ad hoc code for each different type of table structure
• Not scalable or maintainable
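To make the maintainability problem concrete, here is a minimal sketch of the kind of ad hoc parser this approach tends to produce. The sample text and the `parse_layout_text` helper are hypothetical. The sketch assumes text that resembles layout-preserving `pdftotext` output, where columns are separated by runs of two or more spaces.

```python
import re

# Hypothetical sample of layout-preserving text for one small table.
SAMPLE_TEXT = """\
Name         Qty   Price
Green tea    2     4.50
Rice         10    1.20
"""


def parse_layout_text(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


rows = parse_layout_text(SAMPLE_TEXT)
print(rows)

# The rule breaks as soon as a cell is empty: the row silently loses a column.
print(parse_layout_text("Rice               1.20"))
```

A rule like this has to be rewritten for every table layout that behaves differently, for example tables with empty cells, wrapped text, or single-space column gaps.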

Camelot: PDF Table Extraction for Humans
Started at SocialCops in 2016

Why Camelot?
• Works well out-of-the-box, but very configurable
• Visual debugging and plotting using matplotlib
• Export to multiple formats, including CSV, JSON, Excel, HTML or pandas DataFrames
• Python-based, MIT licensed
• Excellent documentation :)
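A minimal sketch of the Python API behind these points, using Camelot's documented `read_pdf`, `parsing_report`, `df` and `export` calls. The `extract_tables` wrapper and the file names used with it are placeholders, and the example assumes Camelot and its dependencies are installed.

```python
def extract_tables(pdf_path, csv_path, pages="1"):
    """Read tables from a text-based PDF and export them as CSV files."""
    # Imported here so this module can be loaded without Camelot installed.
    import camelot

    # Lattice is the default flavor; pass flavor="stream" for tables
    # that have no ruling lines.
    tables = camelot.read_pdf(pdf_path, pages=pages)

    for table in tables:
        # parsing_report is a dict of quality metrics for the table.
        print(table.parsing_report)
        # Each table exposes its contents as a pandas DataFrame.
        print(table.df.head())

    # Writes one CSV file per extracted table, with names derived from csv_path.
    tables.export(csv_path, f="csv")
    return tables
```

Calling `extract_tables("foo.pdf", "foo.csv")` on a text-based PDF would print a parsing report and a preview for each table found on page 1. Passing `f="json"`, `f="excel"` or `f="html"` to `export` selects the other output formats.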

The table!

Command-line interface

Installation
Using conda (easiest way)
Using pip (after installing tk and ghostscript)

How it works
• Built on top of pdfminer
• Two parsing flavors, Lattice and Stream
• Lattice looks for lines on a page to identify a table.
• Stream looks for whitespace between words to identify a table.
• Disclaimer: Works only with text-based PDFs and not scanned documents.
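The Stream idea can be illustrated with a toy sketch. This is not Camelot's implementation, which works on word coordinates from pdfminer. It is a simplified stand-in that treats character positions in fixed-width text as coordinates and looks for vertical bands of whitespace shared by every row. The helper names, the `min_gap` parameter and the sample rows are all hypothetical.

```python
def find_column_spans(lines, min_gap=2):
    """Return (start, end) character spans that are separated from each
    other by at least min_gap positions that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    spans = []
    for i in range(width):
        if all(line[i] == " " for line in padded):
            continue  # blank in every line: part of a gap
        if spans and i - spans[-1][1] < min_gap:
            spans[-1][1] = i + 1  # gap too narrow: extend the current column
        else:
            spans.append([i, i + 1])  # wide gap: start a new column
    return [tuple(span) for span in spans]


def split_rows(lines, spans):
    """Cut every line at the detected column spans."""
    width = max(len(line) for line in lines)
    return [[line.ljust(width)[s:e].strip() for s, e in spans] for line in lines]


# Hypothetical rows laid out at fixed positions, as text on a page would be.
ROWS = [("Name", "Qty", "Price"), ("Green tea", "2", "4.50"), ("Rice", "", "1.20")]
LINES = ["{:<12}{:>3}{:>7}".format(*row) for row in ROWS]

spans = find_column_spans(LINES)
print(spans)
print(split_rows(LINES, spans))
```

Because the column boundaries come from whitespace shared by all rows, the empty cell in the last row keeps its place and the single space inside "Green tea" does not split the cell.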

FUN FACTS AHEAD!

“What’s in a name?”
• As you can already guess, this library is named after The Camelot Project.

Fun fact
“You... do have some cheese, don't you?”

“But I don’t want to write code” :(

You can use the web interface!

Excalibur
$ excalibur webserver
Go to localhost:5000

Why Excalibur?
• Web interface
• Save once, apply anywhere
• Your data is safe on your machine
• MySQL and Celery for parallel and distributed workloads

Installation
Using pip (after installing tk and ghostscript)

FUN FACTS AHEAD!

“What’s in a name?”

Fun fact
“Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”

Roadmap
• Removing ghostscript and opencv from requirements
• Performance enhancements
• Web interface enhancements
• OCR support
• <your-favorite-feature>?

github.com/vinayak-mehta
github.com/camelot-dev/camelot
github.com/camelot-dev/excalibur

Questions?
vinayakmehta.com | @vortex_ape
Camelot & Excalibur devsprint on Oct 14 & Oct 15! #Hacktoberfest
