

Extracting tabular data from PDFs using Camelot & Excalibur - PyCon India 2019

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs with the current ecosystem of libraries and tools, and demonstrate how Camelot and Excalibur solve this problem better and at scale. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables as pandas DataFrames. You can also download them as CSV or Excel files.

Vinayak Mehta

October 12, 2019


Transcript

  1. High-level overview • Portable Document Format: History • Table extraction problems • Demo! • Where we’re headed • Q&A • And some Python fun facts... vinayakmehta.com | @vortex_ape
  2. PDF: History • Documents should be viewable on any display and printable on any modern printer • Hence, Portable Document Format • Subset of Adobe PostScript • Encapsulates components required to build a document
  3. PDF: History • Created in the early 1990s by Adobe • Predates the World Wide Web and HTML • Proprietary format initially • Standardized by ISO as ISO 32000-1 (2008), based on v1.7 (2006)
  4. PDF Table Extraction Tools • Tabula: Java-based, open-source • pdfplumber: Python, open-source • pdftables: Python, proprietary and paid • pdf-table-extract: Python, open-source • Smallpdf: Free and paid online service
  5. Problems with this solution • Output is a text file • Ad hoc code for each different type of table structure • Not scalable and maintainable
  6. Why Camelot? • Works well out-of-the-box, but very configurable • Visual debugging and plotting using matplotlib • Export to multiple formats, including CSV, JSON, Excel, HTML or pandas DataFrames • Python-based, MIT licensed • Excellent documentation :)
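
As a rough illustration of the "out-of-the-box" extraction and multi-format export mentioned on this slide, the sketch below wraps Camelot's documented `camelot.read_pdf` call and `TableList.export` method. The helper name `extract_and_export`, the `SUPPORTED_FORMATS` set, and the file names in the usage comment are assumptions made for this example and are not taken from the talk.

```python
# Formats this sketch accepts; a subset of what the slide lists (assumption for the example).
SUPPORTED_FORMATS = {"csv", "json", "excel", "html"}


def extract_and_export(pdf_path, out_path, fmt="csv", pages="1", flavor="lattice"):
    """Extract tables from a text-based PDF, export them, and return DataFrames.

    Hypothetical helper: only read_pdf() and export() are Camelot calls.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")

    # Imported lazily so the argument checks above run even where
    # Camelot is not installed.
    import camelot

    # read_pdf returns a TableList; each item wraps a pandas DataFrame in .df.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)

    # Writes one file per detected table, using out_path as the base name.
    tables.export(out_path, f=fmt)

    return [table.df for table in tables]


# Example call (file names are placeholders):
# frames = extract_and_export("foo.pdf", "foo.csv", fmt="csv", pages="1")
```

For the visual debugging bullet, Camelot's documentation also describes a `camelot.plot` function that draws a matplotlib figure for an extracted table; it is omitted here to keep the sketch short.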
  7. Installation • Using conda (easiest way) • Using pip (after installing tk and ghostscript)
  8. How it works • Built on top of pdfminer • Two parsing flavors, Lattice and Stream • Lattice looks for lines on a page to identify a table. • Stream looks for whitespace between words to identify a table. • Disclaimer: Works only with text-based PDFs and not scanned documents.
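
To make the Stream idea concrete, here is a toy sketch that finds columns from whitespace alone. It is not Camelot's implementation: Camelot works on word coordinates obtained through pdfminer, whereas this example assumes plain fixed-width text rows, and the function name `split_columns` and the sample data are invented for illustration.

```python
def split_columns(lines):
    """Split fixed-width text rows into cells.

    A character position counts as a column gap only if it is a space
    in every row, so single spaces inside a cell usually survive.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]

    # Mark positions that are blank in all rows.
    is_gap = [all(row[i] == " " for row in padded) for i in range(width)]

    # Each maximal run of non-gap positions is treated as one column.
    spans = []
    start = None
    for i, gap in enumerate(is_gap):
        if not gap and start is None:
            start = i
        elif gap and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))

    return [[row[a:b].strip() for a, b in spans] for row in padded]


# Made-up sample rows with whitespace-aligned columns.
rows = [
    "Name       Qty  Price",
    "Green tea  12   4.50",
    "Coffee     7    6.25",
]
table = split_columns(rows)
```

A Lattice-style parser would instead look for ruling lines drawn on the page, which is why the slide treats the two flavors as suited to different kinds of tables.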
  9. “What’s in a name?” • As you can already guess, this library is named after The Camelot Project.
  10. Why Excalibur? • Web interface • Save once, apply anywhere • Your data is safe on your machine • MySQL and Celery for parallel and distributed workloads
  11. Fun fact • “Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”
  12. Roadmap • Removing ghostscript and opencv from requirements • Performance enhancements • Web interface enhancements • OCR support • <your-favorite-feature>?