Extracting tabular data from PDFs using Camelot & Excalibur - PyCon AU 2019

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs using the current ecosystem of libraries and tools, and demonstrate how Camelot and Excalibur solve this problem better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables as pandas DataFrames. You can also download them as CSV or Excel files.
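As a rough illustration of that workflow, here is a minimal sketch using Camelot's documented `camelot.read_pdf` call, the `.df` attribute on each table, and the `to_csv` / `to_excel` export methods. The helper names `extract_tables` and `save_first_table` and all file paths are hypothetical, and the sketch assumes the `camelot-py` package and its dependencies are installed.

```python
def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return every table Camelot finds in pdf_path as a pandas DataFrame."""
    # Imported lazily so this module still loads where Camelot is not installed.
    import camelot

    # flavor is "lattice" (tables with ruling lines) or "stream"
    # (tables whose columns are separated only by whitespace).
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df for table in tables]


def save_first_table(pdf_path, csv_path, excel_path):
    """Write the first detected table to a CSV file and an Excel file."""
    import camelot

    tables = camelot.read_pdf(pdf_path)
    if len(tables) == 0:
        raise ValueError("no tables detected in " + pdf_path)
    tables[0].to_csv(csv_path)
    tables[0].to_excel(excel_path)
```

Calling `extract_tables("report.pdf", pages="1-3", flavor="stream")` (with a placeholder file name) would return one DataFrame per detected table, ready for the usual pandas analysis.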

Vinayak Mehta

August 02, 2019

Transcript

  1. Hi.

  2. High-level overview
     • Portable Document Format: History
     • Table extraction problems
     • Demo!
     • Project roadmap
     • Q&A
     And some Python fun facts...
  3. PDF: History
     • Created in early 1990s by Adobe
     • Predates the World Wide Web and HTML
     • Proprietary format initially, ISO standard as of v1.7 (2008)
     • 13 versions released
  4. PDF: History
     • Documents should be viewable on any display and printable on any modern printer
     • Hence, Portable Document Format
     • Subset of Adobe PostScript
     • Encapsulates components required to build a document
  5. CSV

  6. PDF Table Extraction Tools
     • Tabula — Java-based, open-source
     • pdfplumber — Python, open-source
     • pdftables — Python, proprietary and paid
     • pdf-table-extract — Python, open-source
     • Smallpdf — Free and paid online service
  7. Problems with this solution
     • Output is a text file
     • Ad hoc code for each different type of table structure
     • Not scalable
     • Not maintainable
  8. Camelot & Excalibur
     PDF Table Extraction for Humans
     Started at SocialCops in 2016, open-sourced in 2018.
  9. Why Camelot?
     • Works well out-of-the-box, but very configurable
     • Visual debugging and plotting using matplotlib
     • Export to multiple formats, including CSV, JSON, Excel, HTML or pandas DataFrames
     • Python-based, MIT licensed
     • Excellent documentation :)
  10. How it works
      • Built on top of pdfminer
      • Two parsing flavors, Lattice and Stream
      • Lattice looks for lines on a page to identify a table.
      • Stream looks for whitespace between words to identify a table.
      • Disclaimer: Works only with text-based PDFs and not scanned documents.
  11. “What’s in a name?”
      • As you can already guess, this library is named after The Camelot Project.
  12. Why Excalibur?
      • Web interface
      • Save once, apply anywhere
      • Your data is safe on your machine
      • MySQL and Celery for parallel and distributed workloads
  13. Another fun fact
      “Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”
  14. Roadmap
      • Removing ghostscript and opencv from requirements
      • Performance enhancements
      • Web interface enhancements
      • OCR support
      • <your-favorite-feature>?
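
Slide 10 describes the Stream flavor as looking for whitespace between words to identify a table. The toy sketch below illustrates that general idea on plain fixed-width text: it finds character columns that are blank in every row and splits the rows there. This is not Camelot's implementation (which works on word positions reported by pdfminer); the function name `split_on_shared_whitespace`, the `min_gap` parameter, and the sample rows are all made up for illustration.

```python
def split_on_shared_whitespace(lines, min_gap=2):
    """Split fixed-width text rows into cells at gaps that are blank in every row."""
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]

    # A character column is "blank" when every row has a space at that index.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans = []        # (start, end) character ranges of table columns, end exclusive
    start = None      # start index of the span currently being built
    last_text = None  # index of the most recent non-blank character column
    for i, is_blank in enumerate(blank):
        if is_blank:
            continue
        if start is None:
            start = i
        elif i - last_text - 1 >= min_gap:
            # The blank run just crossed is wide enough to separate two columns.
            spans.append((start, last_text + 1))
            start = i
        last_text = i
    if start is not None:
        spans.append((start, last_text + 1))

    return [[row[a:b].strip() for a, b in spans] for row in padded]


# Made-up sample: name left-aligned, quantity and price right-aligned.
layout = "{:<12}{:>3}{:>8}"
rows = [
    layout.format("Name", "Qty", "Price"),
    layout.format("Green tea", "2", "3.50"),
    layout.format("Oolong", "10", "12.00"),
]
table = split_on_shared_whitespace(rows)
```

Here `table` holds three rows of three cells each. The single space inside "Green tea" does not split the cell, because that character position is not blank in every row and, in any case, is narrower than `min_gap`.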