Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
270
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
420
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
100
data.occrp.org
pudo
0
160
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
280
Dr. Freezefile
pudo
2
410
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
AWSを利用する上で知っておきたい名前解決のはなし(10分版)
nagisa53
0
110
「魔法少女まどか☆マギカ Magia Exedra」のグローバル展開を支える、開発チームと翻訳チームの「意識しない協創」を実現するローカライズシステム
gree_tech
PRO
0
540
2025年にHCP Vaultを学び直して見えた景色 / Lessons and New Perspectives from Relearning HCP Vault in 2025
aeonpeople
0
200
テストを軸にした生き残り術
kworkdev
PRO
0
160
機械学習を扱うプラットフォーム開発と運用事例
lycorptech_jp
PRO
0
160
落ちる 落ちるよ サーバーは落ちる
suehiromasatoshi
0
140
なぜテストマネージャの視点が 必要なのか? 〜 一歩先へ進むために 〜
moritamasami
0
110
250905 大吉祥寺.pm 2025 前夜祭 「プログラミングに出会って20年、『今』が1番楽しい」
msykd
PRO
1
450
オブザーバビリティが広げる AIOps の世界 / The World of AIOps Expanded by Observability
aoto
PRO
0
300
Vault meets Kubernetes
mochizuki875
0
270
Obsidian応用活用術
onikun94
1
390
エラーとアクセシビリティ
schktjm
0
970
Featured
See All Featured
Designing for humans not robots
tammielis
253
25k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.4k
Agile that works and the tools we love
rasmusluckow
330
21k
GitHub's CSS Performance
jonrohan
1032
460k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
15k
Music & Morning Musume
bryan
46
6.8k
Building a Modern Day E-commerce SEO Strategy
aleyda
43
7.5k
Unsuck your backbone
ammeep
671
58k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
139
34k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
31
2.2k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None