Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Friedrich Lindenberg
July 16, 2014
Technology
290
2
Share
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
430
Getting started with OCCRP Data
pudo
0
1.7k
#nr16: Recherche-Tools
pudo
1
120
data.occrp.org
pudo
0
180
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
250
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
300
Dr. Freezefile
pudo
2
450
Intro presentation for Naivasha
pudo
1
180
Other Decks in Technology
See All in Technology
2026年春から始めるOpenTelemetry | sogaoh's LT @ PHP Conference ODAWARA 2026
sogaoh
PRO
0
100
"まず試す"ためのDatabricks Apps活用法 / Databricks Apps for Early Experiments and Validation
nttcom
1
230
シン・リスコフの置換原則 〜現代風に考えるSOLIDの原則〜
jinwatanabe
0
170
「決め方」の渡し方 / How to hand over the "decision-making process"
pauli
8
1.3k
建設的な現実逃避のしかた / How to practice constructive escapism
pauli
4
300
プロダクトを育てるように生成AIによる開発プロセスを育てよう
kakehashi
PRO
1
920
Autonomous Database - Dedicated 技術詳細 / adb-d_technical_detail_jp
oracle4engineer
PRO
5
13k
プロダクトを触って語って理解する、チーム横断バグバッシュのすすめ / 20260411 Naoki Takahashi
shift_evolve
PRO
1
250
TanStack Start エコシステムの現在地 / TanStack Start Ecosystem 2026
iktakahiro
1
360
2026-04-02 IBM Bobオンボーディング入門
yutanonaka
0
260
チームで育てるAI自走環境_20260409
fuktig
0
990
さくらのクラウドでつくるCloudNative Daysのオブザーバビリティ基盤
b1gb4by
0
140
Featured
See All Featured
Principles of Awesome APIs and How to Build Them.
keavy
128
17k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.2k
Designing for Performance
lara
611
70k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
54k
Docker and Python
trallard
47
3.8k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
133
19k
The B2B funnel & how to create a winning content strategy
katarinadahlin
PRO
1
330
Data-driven link building: lessons from a $708K investment (BrightonSEO talk)
szymonslowik
1
1k
How STYLIGHT went responsive
nonsquared
100
6k
The Language of Interfaces
destraynor
162
26k
30 Presentation Tips
portentint
PRO
1
270
SEOcharity - Dark patterns in SEO and UX: How to avoid them and build a more ethical web
sarafernandez
0
160
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None