Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
270
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
420
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
100
data.occrp.org
pudo
0
160
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
280
Dr. Freezefile
pudo
2
410
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
Modern Linux
oracle4engineer
PRO
0
160
未経験者・初心者に贈る!40分でわかるAndroidアプリ開発の今と大事なポイント
operando
6
750
バイブスに「型」を!Kent Beckに学ぶ、AI時代のテスト駆動開発
amixedcolor
2
590
S3アクセス制御の設計ポイント
tommy0124
3
210
react-callを使ってダイヤログをいろんなとこで再利用しよう!
shinaps
2
270
メルカリIBISの紹介
0gm
0
330
「何となくテストする」を卒業するためにプロダクトが動く仕組みを理解しよう
kawabeaver
0
440
使いやすいプラットフォームの作り方 ー LINEヤフーのKubernetes基盤に学ぶ理論と実践
lycorptech_jp
PRO
1
160
2つのフロントエンドと状態管理
mixi_engineers
PRO
3
160
【NoMapsTECH 2025】AI Edge Computing Workshop
akit37
0
230
AWSを利用する上で知っておきたい名前解決のはなし(10分版)
nagisa53
10
3.2k
RSCの時代にReactとフレームワークの境界を探る
uhyo
10
3.5k
Featured
See All Featured
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
Build your cross-platform service in a week with App Engine
jlugia
231
18k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
113
20k
Writing Fast Ruby
sferik
628
62k
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
The World Runs on Bad Software
bkeepers
PRO
70
11k
Speed Design
sergeychernyshev
32
1.1k
Optimising Largest Contentful Paint
csswizardry
37
3.4k
4 Signs Your Business is Dying
shpigford
184
22k
Code Reviewing Like a Champion
maltzj
525
40k
Music & Morning Musume
bryan
46
6.8k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None