Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
280
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
420
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
110
data.occrp.org
pudo
0
170
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
250
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
290
Dr. Freezefile
pudo
2
440
Intro presentation for Naivasha
pudo
1
170
Other Decks in Technology
See All in Technology
SREじゃなかった僕らがenablingを通じて「SRE実践者」になるまでのリアル / SRE Kaigi 2026
aeonpeople
6
2.3k
Red Hat OpenStack Services on OpenShift
tamemiya
0
110
レガシー共有バッチ基盤への挑戦 - SREドリブンなリアーキテクチャリングの取り組み
tatsukoni
0
210
変化するコーディングエージェントとの現実的な付き合い方 〜Cursor安定択説と、ツールに依存しない「資産」〜
empitsu
4
1.4k
プロダクト成長を支える開発基盤とスケールに伴う課題
yuu26
4
1.3k
日本の85%が使う公共SaaSは、どう育ったのか
taketakekaho
1
150
Cosmos World Foundation Model Platform for Physical AI
takmin
0
880
顧客との商談議事録をみんなで読んで顧客解像度を上げよう
shibayu36
0
230
Introduction to Sansan for Engineers / エンジニア向け会社紹介
sansan33
PRO
6
68k
広告の効果検証を題材にした因果推論の精度検証について
zozotech
PRO
0
170
OWASP Top 10:2025 リリースと 少しの日本語化にまつわる裏話
okdt
PRO
3
740
こんなところでも(地味に)活躍するImage Modeさんを知ってるかい?- Image Mode for OpenShift -
tsukaman
0
140
Featured
See All Featured
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.4k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
21
1.4k
Code Review Best Practice
trishagee
74
20k
Bioeconomy Workshop: Dr. Julius Ecuru, Opportunities for a Bioeconomy in West Africa
akademiya2063
PRO
1
54
Building a Scalable Design System with Sketch
lauravandoore
463
34k
Designing for Timeless Needs
cassininazir
0
130
Making Projects Easy
brettharned
120
6.6k
What Being in a Rock Band Can Teach Us About Real World SEO
427marketing
0
170
Getting science done with accelerated Python computing platforms
jacobtomlinson
2
110
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
55
3.2k
Breaking role norms: Why Content Design is so much more than writing copy - Taylor Woolridge
uxyall
0
170
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.3k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None