Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Technology
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
An investigative reporter: “We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs...”
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
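As a rough illustration of the last item, regular expressions can cover the reporter's request above: search document text for patterns and tag each document that matches. The sketch below is not from the talk. The function name `tag_documents`, the sample documents, and the two patterns are all hypothetical, and it assumes the text has already been extracted (for example by OCR).

```python
import re


def tag_documents(docs, patterns):
    """Return {document name: sorted list of tags whose pattern matched}.

    docs     -- mapping of document name to extracted plain text
    patterns -- mapping of tag name to a regular expression string
    """
    compiled = {tag: re.compile(rx, re.IGNORECASE) for tag, rx in patterns.items()}
    tagged = {}
    for name, text in docs.items():
        hits = sorted(tag for tag, rx in compiled.items() if rx.search(text))
        if hits:  # only keep documents with at least one match
            tagged[name] = hits
    return tagged


# Hypothetical sample data, standing in for text extracted from real files.
docs = {
    "memo.txt": "Payment of 10,000 USD was routed via Acme Holdings Ltd.",
    "letter.txt": "Nothing of interest here.",
}
patterns = {
    "company": r"\b[A-Z][a-z]+ Holdings Ltd\b",
    "amount": r"\b\d{1,3}(?:,\d{3})* USD\b",
}

result = tag_documents(docs, patterns)
print(result)
```

Plain pattern matching like this only finds what the patterns anticipate. Recognising names that nobody thought to write a pattern for is where NER comes in.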
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo