PDF: History
● Documents should be viewable on any display and printable on any modern printer
● Hence, Portable Document Format
● Built on top of PostScript
● Packages components required to build a document
Slide 11
Slide 11 text
PDF: History
https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm
Slide 12
Slide 12 text
PDF: History
https://euske.github.io/pdfminer/
Slide 13
Slide 13 text
Is that a table?
Slide 14
Slide 14 text
Error 404: Table not found
Slide 15
Slide 15 text
Unlike
Slide 16
Slide 16 text
CSV
Slide 17
Slide 17 text
JSON
Slide 19
Slide 19 text
whoami
● Joined SocialCops as an intern in Jan. 2016
● Scraped tabular data from open data sources
● Helped analysts track key metrics in various projects
Problems with this solution
● Output is a text file
● Ad hoc code for each different type of table structure
● Expensive and time-consuming
● Not scalable, not maintainable
Slide 30
Slide 30 text
The Solution
Slide 31
Slide 31 text
Portable Document Format: History
Slide 33
Slide 33 text
There is a table!
Slide 34
Slide 34 text
Why Camelot?
Slide 35
Slide 35 text
You are in control
● Complete control over table extraction with some tweakable parameters
● Override table areas, columns
● Tweak line recognition
● “Some other things”
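A minimal sketch of how these overrides might be passed to Camelot. `build_read_options` and `extract_tables` are hypothetical helpers, not part of the library. The keyword names `table_areas`, `columns` and `line_scale` are assumed to match recent Camelot releases; older releases used different names for some of them.

```python
def build_read_options(flavor, table_areas=None, columns=None, line_scale=None):
    """Collect keyword overrides for camelot.read_pdf (hypothetical helper)."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    options = {"flavor": flavor}
    if table_areas is not None:
        # Each area is a string "x1,y1,x2,y2": the top-left and bottom-right
        # corners of the table in PDF coordinate space.
        options["table_areas"] = list(table_areas)
    if columns is not None:
        if flavor != "stream":
            raise ValueError("column separators apply to the stream flavor only")
        # Each entry is a comma-separated string of column separator x-coordinates.
        options["columns"] = list(columns)
    if line_scale is not None:
        if flavor != "lattice":
            raise ValueError("line recognition applies to the lattice flavor only")
        # A larger value lets shorter line segments count as table lines.
        options["line_scale"] = line_scale
    return options


def extract_tables(pdf_path, **options):
    """Run Camelot with the collected overrides (needs: pip install camelot-py)."""
    import camelot  # imported lazily so the option builder works without it

    return camelot.read_pdf(pdf_path, **options)
```

For example, `extract_tables("foo.pdf", **build_read_options("lattice", line_scale=40))` would ask Lattice to pick up shorter lines; the file name and the value 40 are placeholders.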
Slide 36
Slide 36 text
Dataframes!
Slide 37
Slide 37 text
Parsing report
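A hedged sketch of how the dataframe and the parsing report could be used together: keep only the tables whose report looks trustworthy. The report keys `accuracy` and `whitespace` follow Camelot's documented parsing report; the thresholds and the helper names `is_reliable` and `reliable_frames` are assumptions made for illustration.

```python
def is_reliable(report, min_accuracy=90.0, max_whitespace=50.0):
    """Decide whether a parsing report looks good enough (thresholds are arbitrary)."""
    return report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace


def reliable_frames(pdf_path):
    """Return the pandas DataFrames of tables that pass the check above."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path)
    # Each table exposes its data as .df and its metrics as .parsing_report.
    return [table.df for table in tables if is_reliable(table.parsing_report)]
```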
Slide 38
Slide 38 text
“Some other things”
Slide 39
Slide 39 text
Flag superscripts and subscripts
Slide 40
Slide 40 text
Flag superscripts and subscripts
Slide 41
Slide 41 text
Strip unnecessary characters
Slide 42
Slide 42 text
Strip unnecessary characters
Slide 43
Slide 43 text
Shift text in cells that span multiple rows/columns
Slide 44
Slide 44 text
Copy text in cells that span multiple rows/columns
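To illustrate what copying text into spanning cells means, here is a toy sketch in plain Python. It is not Camelot's implementation: Camelot uses the detected cell edges to know which cells span (exposed through the `copy_text` option of the Lattice flavor), while this hypothetical `copy_text_down` simply treats every empty cell as part of the cell above it.

```python
def copy_text_down(rows):
    """Fill each empty cell with the text of the cell directly above it."""
    filled = []
    for row in rows:
        new_row = list(row)
        if filled:
            above = filled[-1]
            for i, cell in enumerate(new_row):
                # An empty cell is assumed to belong to a cell spanning from above.
                if cell == "" and i < len(above):
                    new_row[i] = above[i]
        filled.append(new_row)
    return filled


rows = [
    ["State", "Year", "Value"],
    ["Kerala", "2015", "10"],
    ["", "2016", "12"],  # "Kerala" spans two rows in the PDF
]
for row in copy_text_down(rows):
    print(row)
```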
Slide 45
Slide 45 text
Multiple output formats
Replace csv with json, html or excel file
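A minimal sketch of switching output formats, assuming Camelot's `TableList.export(path, f=...)` call. The wrapper `export_tables` and the `SUPPORTED_FORMATS` tuple are hypothetical and list only the formats named on the slide.

```python
SUPPORTED_FORMATS = ("csv", "json", "html", "excel")


def export_tables(pdf_path, output_path, fmt="csv"):
    """Extract every table from a PDF and write it out in the chosen format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path)
    # Swapping formats only changes the f argument (and the file extension).
    tables.export(output_path, f=fmt)
    return len(tables)
```

Calling `export_tables("foo.pdf", "foo.json", fmt="json")` instead of the CSV default is the only change needed; the file names are placeholders.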
Slide 46
Slide 46 text
Command-line interface
Slide 47
Slide 47 text
“What’s in a name?”
● As you can already guess, this library is named after The Camelot Project.
Slide 48
Slide 48 text
Another fun fact
“You... do have some cheese, don't you?”
Slide 49
Slide 49 text
Installation
$ pip install camelot-py
Comparison with open-source PDF table extraction libraries and tools:
https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
Slide 50
Slide 50 text
How it works
● Two parsing flavors, Lattice and Stream.
● Lattice looks for lines on a page to identify a table.
● Stream looks for whitespace between words to identify a table.
More details here: https://camelot-py.readthedocs.io/en/master/user/how-it-works.html
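The Stream idea can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not Camelot's actual algorithm: words are given as hypothetical `(x0, x1, text)` tuples already grouped into rows, and any horizontal gap of at least `min_gap` units that no word crosses on any row is treated as a column boundary.

```python
def guess_columns(rows, min_gap=10):
    """Merge the x-ranges of all words; the gaps left over separate columns."""
    spans = sorted((x0, x1) for row in rows for x0, x1, _ in row)
    columns = []
    for start, end in spans:
        if columns and start - columns[-1][1] < min_gap:
            # Close enough to the previous range: same column.
            columns[-1][1] = max(columns[-1][1], end)
        else:
            columns.append([start, end])
    return [tuple(col) for col in columns]


def to_table(rows, min_gap=10):
    """Place every word in the column whose x-range contains it."""
    columns = guess_columns(rows, min_gap)
    table = []
    for row in rows:
        cells = [""] * len(columns)
        for x0, x1, text in sorted(row):
            for i, (c0, c1) in enumerate(columns):
                if c0 <= x0 and x1 <= c1:
                    cells[i] = (cells[i] + " " + text).strip()
                    break
        table.append(cells)
    return table


rows = [
    [(10, 40, "City"), (100, 160, "Population")],
    [(10, 30, "New"), (34, 60, "Delhi"), (100, 130, "21.7")],
    [(10, 50, "Mumbai"), (100, 130, "20.4")],
]
for cells in to_table(rows):
    print(cells)
```

The small gap between "New" and "Delhi" stays inside one column, while the wide gap before the numbers becomes a column boundary.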
Slide 51
Slide 51 text
“But I don’t want to write code” :(
Slide 52
Slide 52 text
You can use the web interface!
Slide 53
Slide 53 text
Excalibur
$ excalibur webserver
Go to localhost:5000
Slide 54
Slide 54 text
Upload a PDF
Slide 55
Slide 55 text
Autodetect tables
Slide 56
Slide 56 text
Or draw table areas/columns
Slide 57
Slide 57 text
Download extracted tables in your favorite format!
Slide 58
Slide 58 text
Why Excalibur?
● Web interface
● Save once, apply anywhere
● Your data is safe on your machine
● MySQL and Celery for parallel and distributed workloads
Slide 59
Slide 59 text
“What’s in a name?”
Slide 60
Slide 60 text
Another fun fact
“Well, there's egg and bacon, egg sausage and bacon, egg and spam, egg bacon and spam, …”
Slide 61
Slide 61 text
Installation
$ pip install excalibur-py
Slide 62
Slide 62 text
The road ahead
● Autodetect parsing flavor
● OCR support
● “Make it fast!”
● Web interface enhancements