Slide 1

Slide 1 text

Data doesn’t grow in tables
Dealing with large sets of documents

Slide 2

Slide 2 text

“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
– An investigative reporter
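A minimal sketch of what the reporter is asking for, assuming the documents have already been converted to plain text (for example via OCR). The function names `tag_documents` and `load_texts`, the sample documents, and the keywords are all hypothetical and only illustrate the idea; at 40 GB a real setup would more likely use a search index than in-memory substring matching.

```python
from pathlib import Path


def load_texts(folder):
    """Read every .txt file under `folder` into a {path: text} dict."""
    return {
        str(p): p.read_text(encoding="utf-8", errors="replace")
        for p in Path(folder).rglob("*.txt")
    }


def tag_documents(texts, keywords):
    """Return {doc_name: sorted keywords found}, matching case-insensitively."""
    tags = {}
    for name, text in texts.items():
        lowered = text.lower()
        hits = sorted(k for k in keywords if k.lower() in lowered)
        if hits:
            tags[name] = hits
    return tags


# Hypothetical in-memory documents standing in for files on disk.
docs = {
    "memo.txt": "Payment routed through an offshore account.",
    "note.txt": "Lunch at noon.",
}
print(tag_documents(docs, ["offshore", "invoice"]))
```

With files on disk, `tag_documents(load_texts("some_folder"), keywords)` would apply the same tagging to every `.txt` file in that folder.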

Slide 3

Slide 3 text

Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
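To make the last term concrete, here is a small sketch of regular expressions pulling structured values out of free text, using Python's standard `re` module. The two patterns and the sample sentence are illustrative assumptions; they are deliberately simplified and would miss many real-world email and date formats.

```python
import re

# Simplified patterns for illustration only; real documents need sturdier ones.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract(text):
    """Return the email addresses and ISO-style dates found in `text`."""
    return {"emails": EMAIL.findall(text), "dates": ISO_DATE.findall(text)}


sample = "Contact jane.doe@example.org before 2014-06-30 or on 2014-07-01."
print(extract(sample))
# {'emails': ['jane.doe@example.org'], 'dates': ['2014-06-30', '2014-07-01']}
```

NER addresses a similar goal (finding names of people, places, and organisations) but relies on trained language models rather than hand-written patterns.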

Slide 4

Slide 4 text

Cases

Slide 5

Slide 5 text

Exhibit A

Slide 6

Slide 6 text

Exhibit B

Slide 7

Slide 7 text

Exhibit C

Slide 8

Slide 8 text

Exhibit D

Slide 9

Slide 9 text

Tools

Slide 10

Slide 10 text

Tables in disguise http://tabula.nerdpower.org

Slide 11

Slide 11 text

Docs in a cloud http://documentcloud.org

Slide 12

Slide 12 text

Clustering, tagging, mining http://overview.ap.org

Slide 13

Slide 13 text

Let them eat PDF https://github.com/CrowData

Slide 14

Slide 14 text

All the visuals Jigsaw

Slide 15

Slide 15 text

Spoken word magic http://sayit.mysociety.org/

Slide 16

Slide 16 text

What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines

Slide 17

Slide 17 text

Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo
