Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Technology
Transcript
Data doesn’t grow in tables: Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
– An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
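The reporter's request above (search documents for keywords, then tag them) can be roughed out with regular expressions alone. This is a minimal sketch, not a method from the talk: it assumes the documents are already plain text (for scans, after OCR), and the function name `tag_documents`, the file names, and the keywords are all hypothetical.

```python
import re


def tag_documents(docs, keywords):
    """Return {doc_name: sorted keywords found} for documents with at least one hit.

    `docs` maps a document name to its plain text (assumed to be extracted already).
    """
    # One case-insensitive pattern; each keyword is escaped and must match on word boundaries.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    tags = {}
    for name, text in docs.items():
        hits = {match.group(1).lower() for match in pattern.finditer(text)}
        if hits:
            tags[name] = sorted(hits)
    return tags


# Hypothetical sample documents, for illustration only.
docs = {
    "memo_001.txt": "The offshore account was opened in March.",
    "memo_002.txt": "Minutes of the annual meeting.",
    "memo_003.txt": "Payment routed via a shell company to an Offshore trust.",
}
tags = tag_documents(docs, ["offshore", "shell company"])
```

A scan like this reads every document on each query, so it only suits small collections. At the scale mentioned in the quote, an index such as ElasticSearch would be the more likely choice.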
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise: http://tabula.nerdpower.org
Docs in a cloud: http://documentcloud.org
Clustering, tagging, mining: http://overview.ap.org
Let them eat PDF: https://github.com/CrowData
All the visuals: Jigsaw
Spoken word magic: http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo