Extracting tabular data from PDFs using Camelot & Excalibur - PyCon US 2019

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs using the current ecosystem of libraries and tools and demonstrate how Camelot and Excalibur solve this problem better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables in pandas DataFrames. You can also download them as CSVs or Excel files.

Vinayak Mehta

May 03, 2019

Transcript

  1. Hi.

  2. PDF: History • Documents should be viewable on any display and printable on any modern printer • Hence, Portable Document Format • Built on top of PostScript • Packages components required to build a document
  3. CSV

  4. whoami • Joined SocialCops as an intern in Jan. 2016 • Scraped tabular data from open data sources • Helped analysts track key metrics in various projects
  5. Existing PDF table extraction tools • Open-source ◦ Tabula, pdfplumber, ... • Closed-source ◦ Smallpdf, PDFtables, ...
  6. Problems with this solution • Output is a text file • Ad hoc code for each different type of table structure • Expensive and time-consuming • Not scalable, not maintainable
  7. You are in control • Complete control over table extraction with some tweakable parameters • Override table areas, columns • Tweak line recognition • “Some other things”
  8. “What’s in a name?” • As you can already guess, this library is named after The Camelot Project.
  9. Installation • $ pip install camelot-py • Comparison with open-source PDF table extraction libraries and tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
  10. How it works • Two parsing flavors, Lattice and Stream. • Lattice looks for lines on a page to identify a table. • Stream looks for whitespace between words to identify a table. More details here: https://camelot-py.readthedocs.io/en/master/user/how-it-works.html
  11. Why Excalibur? • Web interface • Save once, apply anywhere • Your data is safe on your machine • MySQL and Celery for parallel and distributed workloads
  12. Another fun fact “Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”
  13. The road ahead • Autodetect parsing flavor • OCR support • “Make it fast!” • Web interface enhancements
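
Slide 10 says that Stream looks for whitespace between words to identify a table. The sketch below is a toy illustration of that idea on plain text lines, not Camelot's actual implementation: it treats any run of character positions that is blank in every line as a column separator. The function names `find_column_gaps` and `split_into_cells`, the `min_gap` parameter, and the sample rows are all hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def find_column_gaps(lines: List[str], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) character ranges that are blank in every line.

    Toy model of whitespace-based column detection. Assumes `lines` is
    non-empty; leading and trailing blank runs are ignored.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position counts as blank only if every line has a space there.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps = []
    start = None
    for i, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_gap and start > 0:
                gaps.append((start, i))
            start = None
    return gaps


def split_into_cells(lines: List[str], min_gap: int = 2) -> List[List[str]]:
    """Cut every line at the detected gaps and strip each cell."""
    width = max(len(line) for line in lines)
    gaps = find_column_gaps(lines, min_gap)
    starts = [0] + [end for _, end in gaps]
    ends = [start for start, _ in gaps] + [width]
    return [
        [line.ljust(width)[s:e].strip() for s, e in zip(starts, ends)]
        for line in lines
    ]


# Hypothetical sample: three text lines laid out in fixed-width columns.
rows = [("State", "Year", "Cases"), ("Kerala", "2016", "120"), ("Tamil Nadu", "2017", "95")]
page = [f"{a:<12}{b:<6}{c:>5}" for a, b, c in rows]
cells = split_into_cells(page)
```

Note that the single space inside "Tamil Nadu" does not split the cell, because that position is not blank in the other lines. In Camelot itself the parsing flavor is selected with the `flavor` argument of `camelot.read_pdf`, and the how-it-works page linked on slide 10 describes what the library really does.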