Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
260
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.5k
#nr16: Recherche-Tools
pudo
1
96
data.occrp.org
pudo
0
150
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
280
Dr. Freezefile
pudo
2
390
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
ディスプレイ広告(Yahoo!広告・LINE広告)におけるバックエンド開発
lycorptech_jp
PRO
0
520
エンジニア主導の企画立案を可能にする組織とは?
recruitengineers
PRO
1
280
クラウド関連のインシデントケースを収集して見えてきたもの
lhazy
9
1.8k
【内製開発Summit 2025】イオンスマートテクノロジーの内製化組織の作り方/In-house-development-summit-AST
aeonpeople
2
1.1k
IAMのマニアックな話2025
nrinetcom
PRO
6
1.4k
LINE NEWSにおけるバックエンド開発
lycorptech_jp
PRO
0
340
リクルートのエンジニア組織を下支えする 新卒の育成の仕組み
recruitengineers
PRO
1
140
Exadata Database Service on Cloud@Customer セキュリティ、ネットワーク、および管理について
oracle4engineer
PRO
2
1.6k
AWSアカウントのセキュリティ自動化、どこまで進める? 最適な設計と実践ポイント
yuobayashi
7
1.1k
手を動かしてレベルアップしよう!
maruto
0
240
AWS Well-Architected Frameworkで学ぶAmazon ECSのセキュリティ対策
umekou
2
150
30→150人のエンジニア組織拡大に伴うアジャイル文化を醸成する役割と取り組みの変化
nagata03
0
270
Featured
See All Featured
For a Future-Friendly Web
brad_frost
176
9.6k
Navigating Team Friction
lara
183
15k
Done Done
chrislema
182
16k
YesSQL, Process and Tooling at Scale
rocio
172
14k
Bash Introduction
62gerente
611
210k
Raft: Consensus for Rubyists
vanstee
137
6.8k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
251
21k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
28
9.3k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.7k
The Art of Programming - Codeland 2020
erikaheidi
53
13k
Gamification - CAS2011
davidbonilla
80
5.2k
What’s in a name? Adding method to the madness
productmarketing
PRO
22
3.3k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None