Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
An investigative reporter: “We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
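As a rough illustration of the last item, regular expressions can cover the reporter's request to search documents for keywords and tag the hits. The sketch below is not from the talk: the sample texts, tag names, and the `tag_documents` helper are all hypothetical, and it assumes the document text has already been extracted (for example by OCR).

```python
import re

# Hypothetical sample documents; in practice the text would come from OCR output.
documents = {
    "memo-001.txt": "Payment of 5,000 USD approved by Acme Holdings on 2014-03-02.",
    "memo-002.txt": "Meeting notes: nothing to report.",
    "letter-003.txt": "ACME Holdings requested a second payment.",
}

# Hypothetical tag names mapped to case-insensitive keyword patterns.
TAG_PATTERNS = {
    "acme": re.compile(r"\bacme\s+holdings\b", re.IGNORECASE),
    "payment": re.compile(r"\bpayments?\b", re.IGNORECASE),
}


def tag_documents(docs, patterns):
    """Return {document name: sorted tags whose pattern matches the text}."""
    tags = {}
    for name, text in docs.items():
        matched = sorted(tag for tag, rx in patterns.items() if rx.search(text))
        if matched:  # skip documents with no hits
            tags[name] = matched
    return tags


result = tag_documents(documents, TAG_PATTERNS)
for name, found in result.items():
    print(name, found)
```

This only scales to collections that fit comfortably on one machine; for something like the 40 GB mentioned above, a search index would likely be the better fit.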
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use Elasticsearch
• Commercial-grade OCR
• Configurable pipelines
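The slides do not say what a configurable pipeline would look like. One possible reading, sketched here under that assumption, is a set of named processing steps that are chained according to a configuration list. Every name below (`STEPS`, `step`, `run_pipeline`, the two example steps, and the sample text) is hypothetical.

```python
import re

# Registry of available steps, keyed by function name (hypothetical design).
STEPS = {}


def step(func):
    """Register a processing step so a config can refer to it by name."""
    STEPS[func.__name__] = func
    return func


@step
def normalize_whitespace(doc):
    # Collapse runs of spaces and line breaks, as often left behind by OCR.
    doc["text"] = " ".join(doc["text"].split())
    return doc


@step
def find_dates(doc):
    # Collect ISO-style dates (YYYY-MM-DD) found in the text.
    doc["dates"] = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", doc["text"])
    return doc


def run_pipeline(doc, step_names):
    """Apply the named steps to the document in the configured order."""
    for name in step_names:
        doc = STEPS[name](doc)
    return doc


# The configuration is just an ordered list of step names.
config = ["normalize_whitespace", "find_dates"]
doc = run_pipeline({"text": "Signed   on\n2014-07-16  in Berlin."}, config)
print(doc)
```

Keeping the configuration as plain data means it could be loaded from a file and changed without touching the step code.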
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo