Extracting tabular data from PDFs using Camelot & Excalibur - PyCon AU 2019

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs with the current ecosystem of libraries and tools, and demonstrate how Camelot and Excalibur solve this problem better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables as pandas DataFrames. You can also download them as CSV or Excel files.

Vinayak Mehta

August 02, 2019

Transcript

  1. Extracting tabular data from PDFs using Camelot & Excalibur • Vinayak Mehta
  2. Hi.

  3. Bangalore github.com/pydatabangalore/talks

  4. None
  5. High-level overview

  6. High-level overview • Portable Document Format: History • Table extraction problems • Demo! • Project roadmap • Q&A • And some Python fun facts...
  7. Why is Python called Python?

  8. None
  9. Portable Document Format: History

  10. http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf

  11. PDF: History • Created in the early 1990s by Adobe • Predates the World Wide Web and HTML • Proprietary format initially, ISO standard as of v1.7 (2008) • 13 versions released
  12. PDF: History • Documents should be viewable on any display and printable on any modern printer • Hence, Portable Document Format • Subset of Adobe PostScript • Encapsulates components required to build a document
  13. PDF: History https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm

  14. PDF: History https://euske.github.io/pdfminer/

  15. Is that a table?

  16. Error 404: Table not found

  17. CSV

  18. JSON

  19. Text selection & PDF “tables”

  20. None
  21. PDF Table Extraction Tools • Tabula: Java-based, open-source • pdfplumber: Python, open-source • pdftables: Python, proprietary and paid • pdf-table-extract: Python, open-source • Smallpdf: free and paid online service
  22. Problems with existing tools

  23. None
  24. None
  25. None
  26. A Solution

  27. pdftotext

  28. pdftotext

  29. Problems with this solution • Output is a text file • Ad hoc code for each different type of table structure • Not scalable • Not maintainable
  30. The Solution

  31. Camelot & Excalibur: PDF Table Extraction for Humans • Started at SocialCops in 2016, open-sourced in 2018.
  32. Why Camelot? • Works well out-of-the-box, but very configurable • Visual debugging and plotting using matplotlib • Export to multiple formats, including CSV, JSON, Excel, HTML or pandas DataFrames • Python-based, MIT licensed • Excellent documentation :)
  33. None
  34. The table!

  35. Command-line interface

  36. Installation • Using conda (easiest way) • Using pip (after installing tk and ghostscript)
  37. How it works • Built on top of pdfminer • Two parsing flavors, Lattice and Stream • Lattice looks for lines on a page to identify a table. • Stream looks for whitespace between words to identify a table. • Disclaimer: Works only with text-based PDFs and not scanned documents.
  38. FUN FACTS AHEAD!

  39. “What’s in a name?” • As you can already guess, this library is named after The Camelot Project.
  40. Another fun fact “You... do have some cheese, don't you?”

  41. “But I don’t want to write code” :(

  42. You can use the web interface!

  43. Excalibur • $ excalibur webserver • Go to localhost:5000

  44. Why Excalibur? • Web interface • Save once, apply anywhere • Your data is safe on your machine • MySQL and Celery for parallel and distributed workloads
  45. Installation • Using pip (after installing tk and ghostscript)

  46. FUN FACTS AHEAD!

  47. “What’s in a name?”

  48. Another fun fact “Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”
  49. Roadmap • Removing ghostscript and opencv from requirements • Performance enhancements • Web interface enhancements • OCR support • <your-favorite-feature>?
  50. github.com/vinayak-mehta github.com/camelot-dev/camelot github.com/camelot-dev/excalibur

  51. Questions? vinayakmehta.com @vortex_ape
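
Slide 37 says the Stream flavor looks for whitespace between words to identify a table. The sketch below illustrates that idea only, in plain Python; it is not Camelot's implementation. The word boxes, the `min_gap` threshold, and the function names `column_gaps` and `to_rows` are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical input: one (text, x0, x1, y) tuple per word, roughly the kind
# of information a PDF text layer provides. All coordinates are made up.
WORDS = [
    ("Item", 10, 40, 100), ("Qty", 80, 110, 100), ("Price", 150, 185, 100),
    ("Apples", 10, 45, 80), ("120", 80, 98, 80), ("3", 150, 156, 80),
    ("Pears", 10, 30, 60), ("45", 80, 92, 60), ("1", 150, 156, 60),
]


def column_gaps(words, min_gap=15):
    """Return x positions of vertical whitespace bands that no word crosses.

    min_gap is an assumed threshold: narrower gaps are treated as ordinary
    spaces between words rather than column separators.
    """
    spans = sorted((x0, x1) for _, x0, x1, _ in words)
    gaps = []
    right_edge = spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - right_edge >= min_gap:
            # Place the column boundary in the middle of the empty band.
            gaps.append((right_edge + x0) / 2)
        right_edge = max(right_edge, x1)
    return gaps


def to_rows(words, min_gap=15):
    """Group words into rows (same y) and columns (between whitespace bands)."""
    gaps = column_gaps(words, min_gap)
    rows = defaultdict(lambda: [""] * (len(gaps) + 1))
    # Visit words left to right so multi-word cells keep reading order.
    for text, x0, _, y in sorted(words, key=lambda w: w[1]):
        col = sum(x0 > g for g in gaps)  # number of boundaries to the left
        rows[y][col] = f"{rows[y][col]} {text}".strip()
    # PDF page coordinates grow upward, so the largest y is the top row.
    return [rows[y] for y in sorted(rows, reverse=True)]


if __name__ == "__main__":
    for row in to_rows(WORDS):
        print(row)
```

With Camelot itself, none of this is written by hand: the flavor is chosen with the `flavor` argument, for example `camelot.read_pdf("foo.pdf", flavor="stream")` (where `foo.pdf` is a placeholder path), and each extracted table exposes a pandas DataFrame as `.df`.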