Extracting tabular data from PDFs using Camelot & Excalibur - PyCon India 2019

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.

This talk briefly touches on the history of the Portable Document Format, discusses problems that arise when extracting tabular data from PDFs with the current ecosystem of libraries and tools, and demonstrates how Camelot and Excalibur solve this problem better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables as pandas DataFrames. You can also download them as CSV or Excel files.

Vinayak Mehta

October 12, 2019


  1. Extracting tabular data from PDFs using Camelot & Excalibur
    Vinayak Mehta
    vinayakmehta.com | @vortex_ape

  2. $ whoami
    vinayak

  3. None

  4. Bangalore
    github.com/pydatabangalore/talks
    Meetup on Oct 19!

  5. What is this talk about?

  6. High-level overview
    • Portable Document Format: History
    • Table extraction problems
    • Demo!
    • Where we’re headed
    • Q&A
    And some Python fun facts...

  7. Why is Python called Python?

  8. None

  9. Portable Document Format: History

  10. http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf

  11. PDF: History
    • Documents should be viewable on any display and printable on any modern printer
    • Hence, Portable Document Format
    • Subset of Adobe PostScript
    • Encapsulates components required to build a document

  12. PDF: History
    • Created in the early 1990s by Adobe
    • Predates the World Wide Web and HTML
    • Proprietary format initially
    • Became an ISO standard (ISO 32000-1, 2008), based on v1.7 (2006)

  13. PDF: History
    https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm

  14. PDF: History
    https://euske.github.io/pdfminer/

  15. Is that a table?

  16. Error 404: Table not found

  17. CSV

  18. JSON

  19. Text selection & PDF “tables”

  20. None

  21. PDF Table Extraction Tools
    • Tabula: Java-based, open-source
    • pdfplumber: Python, open-source
    • pdftables: Python, proprietary and paid
    • pdf-table-extract: Python, open-source
    • Smallpdf: Free and paid online service

  22. Problems with existing tools

  23. None

  24. None

  25. None

  26. pdftotext

  27. None

  28. None

  29. Problems with this solution
    • Output is a text file
    • Ad hoc code for each different type of table structure
    • Not scalable or maintainable
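
To make the "ad hoc code" problem concrete, here is a minimal sketch of the kind of parser one might write for text produced by `pdftotext -layout`. The sample text, the `parse_layout_text` helper, and the split-on-two-spaces rule are all assumptions made up for illustration; none of them come from the talk.

```python
import re

# Hypothetical sample of layout-preserving text for a small table.
# The second data row has an empty "Qty" cell.
SAMPLE_LAYOUT_TEXT = """\
Item           Qty    Price
Green tea        2     4.50
Black coffee           3.00
"""


def parse_layout_text(text):
    """Split every non-empty line on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


rows = parse_layout_text(SAMPLE_LAYOUT_TEXT)
for row in rows:
    print(row)
# The last row comes back with only two cells, so "3.00" silently
# lands in the "Qty" position instead of "Price".
```

The rule works for the first two lines and breaks on the third, which is the pattern the slide describes: each new table layout needs another special case.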
  30. Camelot: PDF Table Extraction for Humans
    Started at SocialCops in 2016

  31. Why Camelot?
    • Works well out-of-the-box, but very configurable
    • Visual debugging and plotting using matplotlib
    • Export to multiple formats, including CSV, JSON, Excel, HTML or pandas DataFrames
    • Python-based, MIT licensed
    • Excellent documentation :)
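
A rough sketch of what extracting and exporting tables looks like through Camelot's Python API. The wrapper functions `extract_tables` and `export_tables` are hypothetical helpers written for this example, and the file names in the usage comments are placeholders; only `camelot.read_pdf`, the `.df` attribute, and `TableList.export` are Camelot's own.

```python
def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return every table Camelot finds in `pdf_path` as a pandas DataFrame."""
    # Imported lazily so this module can be loaded without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df for table in tables]


def export_tables(pdf_path, out_path, fmt="csv", pages="1"):
    """Write all tables found in `pdf_path` to `out_path` in the given format."""
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # `f` selects the output format, for example "csv", "json", "excel" or "html".
    tables.export(out_path, f=fmt)


# Usage (placeholder file names, assuming Camelot and its dependencies are installed):
# frames = extract_tables("report.pdf", pages="1-3")
# export_tables("report.pdf", "report.csv")
```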
  32. None

  33. The table!

  34. Command-line interface

  35. Installation
    Using conda (easiest way)
    Using pip (after installing tk and ghostscript)
  36. How it works
    • Built on top of pdfminer
    • Two parsing flavors, Lattice and Stream
    • Lattice looks for lines on a page to identify a table.
    • Stream looks for whitespace between words to identify a table.
    • Disclaimer: Works only with text-based PDFs and not scanned documents.
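
The whitespace idea behind Stream can be illustrated with a toy example on fixed-width text. This is not Camelot's implementation, which works on text coordinates obtained from pdfminer; the sample rows, the helper names, and the `min_gap` threshold below are assumptions chosen only to show the principle of using shared blank regions as column boundaries.

```python
def _fixed_width(item, qty, price):
    """Build one hypothetical fixed-width row (made-up sample data)."""
    return f"{item:<15}{qty:>3}{price:>9}"


LINES = [
    _fixed_width("Item", "Qty", "Price"),
    _fixed_width("Green tea", "2", "4.50"),
    _fixed_width("Black coffee", "", "3.00"),  # empty "Qty" cell
]


def column_spans(lines, min_gap=2):
    """Return (start, end) character ranges that hold text in at least one line.

    Ranges separated by fewer than `min_gap` blank positions are merged,
    so a single space inside a cell does not split a column.
    """
    width = max(len(line) for line in lines)
    occupied = [
        any(i < len(line) and line[i] != " " for line in lines)
        for i in range(width)
    ]

    # Collect runs of occupied positions.
    spans, start = [], None
    for i, used in enumerate(occupied):
        if used and start is None:
            start = i
        elif not used and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))

    # Merge runs that are separated by a gap narrower than `min_gap`.
    merged = []
    for span_start, span_end in spans:
        if merged and span_start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span_end)
        else:
            merged.append((span_start, span_end))
    return merged


def split_rows(lines, min_gap=2):
    """Cut every line at the shared column ranges and strip each cell."""
    spans = column_spans(lines, min_gap)
    return [[line[a:b].strip() for a, b in spans] for line in lines]


for row in split_rows(LINES):
    print(row)
```

Because the column ranges are derived from all lines together, the empty cell in the last row stays in its own column instead of shifting the price to the left. With the real library, the flavor is chosen through the `flavor` argument of `camelot.read_pdf` (`"lattice"` or `"stream"`).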
  37. FUN FACTS AHEAD!

  38. “What’s in a name?”
    • As you can already guess, this library is named after The Camelot Project.

  39. Fun fact
    “You... do have some cheese, don't you?”

  40. “But I don’t want to write code” :(

  41. You can use the web interface!

  42. Excalibur
    $ excalibur webserver
    Go to localhost:5000

  43. Why Excalibur?
    • Web interface
    • Save once, apply anywhere
    • Your data is safe on your machine
    • MySQL and Celery for parallel and distributed workloads

  44. Installation
    Using pip (after installing tk and ghostscript)

  45. FUN FACTS AHEAD!

  46. “What’s in a name?”

  47. Fun fact
    “Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”

  48. Roadmap
    • Removing ghostscript and opencv from requirements
    • Performance enhancements
    • Web interface enhancements
    • OCR support
    • <your-favorite-feature>?
  49. github.com/vinayak-mehta
    github.com/camelot-dev/camelot
    github.com/camelot-dev/excalibur

  50. Questions?
    vinayakmehta.com @vortex_ape
    Camelot & Excalibur devsprint on Oct 14 & Oct 15! #Hacktoberfest