Extracting tabular data from PDFs using Camelot & Excalibur - PyCon US 2019

Extracting tabular data from PDFs using Camelot & Excalibur - PyCon US 2019

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs using the current ecosystem of libraries and tools and demonstrate how Camelot and Excalibur solve this problem better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables in pandas DataFrames. You can also download them as CSVs or Excel files.

5c9f6b47dcb9445449eefb07bc719483?s=128

Vinayak Mehta

May 03, 2019
Tweet

Transcript

  1. Extracting tabular data from PDFs using Camelot & Excalibur Vinayak

    Mehta
  2. Hi.

  3. @vortex_ape

  4. None
  5. What to expect from this talk?

  6. Why is Python called Python?

  7. None
  8. Portable Document Format: History

  9. http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf

  10. PDF: History • Documents should be viewable on any display

    and printable on any modern printer • Hence, Portable Document Format • Built on top of PostScript • Packages components required to build a document
  11. PDF: History https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm

  12. PDF: History https://euske.github.io/pdfminer/

  13. Is that a table?

  14. Error 404: Table not found

  15. Unlike

  16. CSV

  17. JSON

  18. None
  19. • Joined SocialCops as an intern in Jan. 2016 •

    Scraped tabular data from open data sources • Helped analysts track key metrics in various projects whoami
  20. None
  21. Existing PDF table extraction tools • Open-source ◦ Tabula, pdfplumber,

    ... • Closed-source ◦ Smallpdf, PDFtables, ...
  22. Problems with existing tools

  23. None
  24. None
  25. None
  26. A Solution

  27. pdftotext

  28. pdftotext

  29. • Output is a text file • Ad hoc code

    for each different type of table structure • Expensive and time-consuming • Not scalable, not maintainable Problems with this solution
  30. The Solution

  31. Portable Document Format: History

  32. None
  33. There is a table!

  34. Why Camelot?

  35. You are in control • Complete control over table extraction

    with some tweakable parameters • Override table areas, columns • Tweak line recognition • “Some other things”
  36. Dataframes!

  37. Parsing report

  38. “Some other things”

  39. Flag superscripts and subscripts

  40. Flag superscripts and subscripts

  41. Strip unnecessary characters

  42. Strip unnecessary characters

  43. Shift text in cells that span multiple rows/columns

  44. Copy text in cells that span multiple rows/columns

  45. Multiple output formats Replace csv with json, html or excel

    file
  46. Command-line interface

  47. “What’s in a name?” • As you can already guess,

    this library is named after The Camelot Project.
  48. Another fun fact “You... do have some cheese, don't you?”

  49. Installation $ pip install camelot-py Comparison with open-source PDF table

    extraction libraries and tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
  50. How it works • Two parsing flavors, Lattice and Stream.

    • Lattice looks for lines on a page to identify a table. • Stream looks for whitespaces between words to identify a table. More details here: https://camelot-py.readthedocs.io/en/master/user/how-it-works.html
  51. “But I don’t want to write code” :(

  52. You can use the web interface!

  53. Excalibur $ excalibur webserver Go to localhost:5000

  54. Upload a PDF

  55. Autodetect tables

  56. Or draw table areas/columns

  57. Download extracted tables in your favorite format!

  58. Why Excalibur? • Web interface • Save once, apply anywhere

    • You data is safe on your machine • MySQL and Celery for parallel and distributed workloads
  59. “What’s in a name?”

  60. Another fun fact “Well, there's egg and bacon, egg sausage

    and bacon, egg and spam, egg bacon and spam, …”
  61. Installation $ pip install excalibur-py

  62. The road ahead • Autodetect parsing flavor • OCR support

    • “Make it fast!” • Web interface enhancements
  63. github.com/vinayak-mehta github.com/socialcopsdev/camelot github.com/camelot-dev/excalibur

  64. None
  65. Questions? vinayakmehta.com