Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
250
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.4k
#nr16: Recherche-Tools
pudo
1
93
data.occrp.org
pudo
0
140
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
230
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
270
Dr. Freezefile
pudo
2
370
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
ExaDB-D dbaascli で出来ること
oracle4engineer
PRO
0
3.8k
New Relicを活用したSREの最初のステップ / NRUG OKINAWA VOL.3
isaoshimizu
2
580
Making your applications cross-environment - OSCG 2024 NA
salaboy
0
180
【若手エンジニア応援LT会】ソフトウェアを学んできた私がインフラエンジニアを目指した理由
kazushi_ohata
0
150
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
2
3.2k
TanStack Routerに移行するのかい しないのかい、どっちなんだい! / Are you going to migrate to TanStack Router or not? Which one is it?
kaminashi
0
580
ドメインの本質を掴む / Get the essence of the domain
sinsoku
2
150
RubyのWebアプリケーションを50倍速くする方法 / How to Make a Ruby Web Application 50 Times Faster
hogelog
3
940
複雑なState管理からの脱却
sansantech
PRO
1
140
ISUCONに強くなるかもしれない日々の過ごしかた/Findy ISUCON 2024-11-14
fujiwara3
8
860
【Pycon mini 東海 2024】Google Colaboratoryで試すVLM
kazuhitotakahashi
2
490
AWS Media Services 最新サービスアップデート 2024
eijikominami
0
190
Featured
See All Featured
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
26
2.1k
The World Runs on Bad Software
bkeepers
PRO
65
11k
How To Stay Up To Date on Web Technology
chriscoyier
788
250k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
48k
Embracing the Ebb and Flow
colly
84
4.5k
The Language of Interfaces
destraynor
154
24k
10 Git Anti Patterns You Should be Aware of
lemiorhan
654
59k
Raft: Consensus for Rubyists
vanstee
136
6.6k
[RailsConf 2023] Rails as a piece of cake
palkan
52
4.9k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
1.9k
Adopting Sorbet at Scale
ufuk
73
9.1k
Building Applications with DynamoDB
mza
90
6.1k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None