Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables: Dealing with large sets of documents
An investigative reporter: “We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
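The request above (search documents for keywords, then tag the matches) can be sketched in a few lines of Python. This is only an illustration, assuming the documents have already been converted to plain text; the function name `tag_documents`, the file names, and the keywords are hypothetical and do not come from the talk.

```python
import re

def tag_documents(docs, keywords):
    """Map each document name to the set of keywords found in its text.

    docs: dict of {document name: plain text} (assumed already extracted/OCR'd)
    keywords: list of search terms, matched case-insensitively as whole words
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    tags = {}
    for name, text in docs.items():
        found = {kw for kw, pattern in patterns.items() if pattern.search(text)}
        if found:
            # Only documents with at least one hit receive tags.
            tags[name] = found
    return tags

# Hypothetical sample data for illustration.
docs = {
    "contract_01.txt": "Payment was routed through an offshore account.",
    "memo_02.txt": "Minutes of the weekly staff meeting.",
}
print(tag_documents(docs, ["offshore", "tender"]))
```

A linear scan like this is fine for a handful of files; for something on the order of 40 GB, a search index would likely be the more practical choice.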
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
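As a small illustration of the last term, a regular expression can pull patterned strings such as dates out of extracted text. The sketch below uses Python's standard `re` module; the pattern, the function name `find_dates`, and the sample sentence are assumptions made for this example, not material from the talk.

```python
import re

# Matches numeric dates such as 16.07.2014, 3/8/2015 or 01-02-1999.
# It checks only the shape of the string, not whether the date is valid.
DATE_PATTERN = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b")

def find_dates(text):
    """Return every date-like string in text, in order of appearance."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]

sample = "Signed on 16.07.2014 and amended 3/8/2015."
print(find_dates(sample))
```

Regular expressions suit strings with a predictable shape; names of people and organisations usually call for NER instead.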
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise: http://tabula.nerdpower.org
Docs in a cloud: http://documentcloud.org
Clustering, tagging, mining: http://overview.ap.org
Let them eat PDF: https://github.com/CrowData
All the visuals: Jigsaw
Spoken word magic: http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo