Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Friedrich Lindenberg
July 16, 2014
Technology
2
290
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
430
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
120
data.occrp.org
pudo
0
170
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
250
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
300
Dr. Freezefile
pudo
2
440
Intro presentation for Naivasha
pudo
1
180
Other Decks in Technology
See All in Technology
AWS CDK「読めるけど書けない」を脱却するファーストステップ
smt7174
3
210
フロントエンド刷新 4年間の軌跡
yotahada3
0
520
It’s “Time” to use Temporal
sajikix
3
240
スピンアウト講座06_認証系(API-OAuth-MCP)入門
overflowinc
0
260
生成AIで速度と品質を両立する、QAエンジニア・開発者連携のAI協調型テストプロセス
shota_kusaba
0
320
Kiroで見直す開発プロセスとAI-DLC
k_adachi_01
0
110
【社内勉強会】新年度からコーディングエージェントを使いこなす - 構造と制約で引き出すClaude Codeの実践知
nwiizo
7
4.6k
AgentCoreとLINEを使った飲食店おすすめアプリを作ってみた
yakumo
2
120
TinyTroupeで人狼ゲームやってみた!
ueponx
0
160
Goのerror型がシンプルであることの恩恵について理解する
yamatai1212
1
280
大規模ECサイトのあるバッチのパフォーマンスを改善するために僕たちのチームがしてきたこと
panda_program
1
320
WebアクセシビリティをCI/CDで担保する ― axe DevTools × Playwright C#実践ガイド
tomokusaba
2
200
Featured
See All Featured
技術選定の審美眼(2025年版) / Understanding the Spiral of Technologies 2025 edition
twada
PRO
118
110k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
The Invisible Side of Design
smashingmag
302
51k
Beyond borders and beyond the search box: How to win the global "messy middle" with AI-driven SEO
davidcarrasco
3
80
Navigating Team Friction
lara
192
16k
Tell your own story through comics
letsgokoyo
1
850
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
1
320
Breaking role norms: Why Content Design is so much more than writing copy - Taylor Woolridge
uxyall
0
210
Exploring anti-patterns in Rails
aemeredith
2
290
From Legacy to Launchpad: Building Startup-Ready Communities
dugsong
0
180
Designing Experiences People Love
moore
143
24k
Agile that works and the tools we love
rasmusluckow
331
21k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None