Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
An investigative reporter: “We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
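To make the last term concrete, here is a minimal Python sketch of the kind of keyword search and tagging the reporter asks for, using regular expressions. The patterns, tag names, and sample documents are hypothetical and chosen only for illustration. The slides do not say which keywords were involved, and a real 40 GB corpus would need OCR and a search index first.

```python
import re

# Hypothetical patterns: the slides redact the reporter's real keywords.
PATTERNS = {
    "offshore": re.compile(r"\boffshore\b", re.IGNORECASE),
    "amount": re.compile(r"\b\d{1,3}(?:,\d{3})+\s?(?:USD|EUR)\b"),
}

def tag_document(text):
    """Return the sorted tags whose pattern occurs somewhere in the text."""
    return sorted(tag for tag, pattern in PATTERNS.items() if pattern.search(text))

def tag_corpus(docs):
    """Map each document name to its tags, skipping documents with no match."""
    tagged = {}
    for name, text in docs.items():
        tags = tag_document(text)
        if tags:
            tagged[name] = tags
    return tagged

# Made-up sample documents standing in for OCR output.
docs = {
    "memo.txt": "Payment of 1,250,000 USD routed via an Offshore entity.",
    "note.txt": "Lunch meeting moved to Friday.",
}

print(tag_corpus(docs))
```

Only documents that match at least one pattern appear in the result, which keeps the output small enough to review by hand.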
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise: http://tabula.nerdpower.org
Docs in a cloud: http://documentcloud.org
Clustering, tagging, mining: http://overview.ap.org
Let them eat PDF: https://github.com/CrowData
All the visuals: Jigsaw
Spoken word magic: http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
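As a rough illustration of what a "configurable pipeline" could mean here, the Python sketch below runs a document through processing steps chosen by name from a config list. It is an assumption about the general idea, not a description of any tool named in the slides. The step names, the `run_pipeline` helper, and the sample text are all hypothetical.

```python
import re

def normalize_whitespace(doc):
    """Collapse runs of whitespace (e.g. from OCR line breaks) into single spaces."""
    doc["text"] = " ".join(doc["text"].split())
    return doc

def find_emails(doc):
    """Collect strings that look like e-mail addresses (a deliberately simple pattern)."""
    doc["emails"] = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", doc["text"])
    return doc

def count_words(doc):
    """Record a whitespace-delimited word count."""
    doc["word_count"] = len(doc["text"].split())
    return doc

# Hypothetical registry: maps config names to step functions.
STEPS = {
    "normalize_whitespace": normalize_whitespace,
    "find_emails": find_emails,
    "count_words": count_words,
}

def run_pipeline(step_names, doc):
    """Apply the named steps to the document in the order given by the config."""
    for name in step_names:
        doc = STEPS[name](doc)
    return doc

# The config is plain data, so steps can be reordered or dropped without code changes.
config = ["normalize_whitespace", "find_emails", "count_words"]
result = run_pipeline(config, {"text": "Contact   jane@example.org\nfor details."})
print(result)
```

Keeping the step order in data rather than code is the design choice that makes such a pipeline configurable: a different investigation could swap in an OCR or entity-extraction step under the same interface.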
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo