Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
– An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
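The reporter's request (search documents for keywords, then tag the matches) can be sketched with regular expressions, the last term in the list above. This is a minimal illustration and not something shown in the talk: it assumes the documents have already been converted to plain text (for scanned pages, by OCR), and the function name `tag_documents`, the file names, and the keywords are all hypothetical.

```python
import re


def tag_documents(docs, keywords):
    """Map each document name to the sorted keywords found in its text.

    docs: dict of {name: plain text}; keywords: list of literal search terms.
    Documents with no hits are left out of the result.
    """
    if not keywords:
        return {}
    # One case-insensitive pattern with word boundaries; re.escape makes
    # sure keywords are matched literally, not as regex syntax.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    tags = {}
    for name, text in docs.items():
        hits = {m.group(1).lower() for m in pattern.finditer(text)}
        if hits:
            tags[name] = sorted(hits)
    return tags


# Hypothetical sample data standing in for extracted document text.
docs = {
    "memo.txt": "The Contract was signed after the tender closed.",
    "note.txt": "Lunch at noon.",
}
print(tag_documents(docs, ["contract", "tender"]))
# prints {'memo.txt': ['contract', 'tender']}
```

A plain in-memory scan like this would likely be too slow for 40 GB of documents; at that scale a search index is the more usual choice.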
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo