Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
An investigative reporter:
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
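To make the last item concrete, here is a minimal sketch of how regular expressions could cover the reporter's request: search a set of documents for keywords and tag the ones that match. The function name `tag_documents`, the file names, the sample texts, and the keywords are all hypothetical and only for illustration. It assumes the text has already been extracted (for example by OCR) and fits in memory, which would not hold for 40 GB of material.

```python
import re

def tag_documents(docs, keywords):
    """Return {document name: sorted list of matched keywords}.

    Hypothetical helper: `docs` maps a name to already-extracted text,
    `keywords` is a list of lowercase search terms.
    """
    # One case-insensitive pattern; \b keeps "offshore" from matching
    # inside a longer word, re.escape protects special characters.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    tags = {}
    for name, text in docs.items():
        found = {m.group(1).lower() for m in pattern.finditer(text)}
        if found:  # only documents with at least one hit get tagged
            tags[name] = sorted(found)
    return tags

# Made-up sample documents, not taken from the talk.
docs = {
    "memo_01.txt": "The offshore account was opened in March.",
    "memo_02.txt": "Lunch menu for the week.",
    "memo_03.txt": "Invoice for the Offshore holding, paid via shell company.",
}
tags = tag_documents(docs, ["offshore", "shell company"])
print(tags)
```

For a collection of real size, the same idea would more likely sit behind a search index than a loop over in-memory strings.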
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
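As a rough illustration of what "configurable pipelines" could mean, the sketch below treats a pipeline as a list of step names that is looked up in a registry and applied in order. This is an assumption about the idea, not a description of any tool named in the talk. The step functions, the `STEPS` registry, and `run_pipeline` are hypothetical, and real steps would be things like OCR, entity extraction, or indexing.

```python
def normalize_whitespace(doc):
    # Collapse runs of spaces and newlines into single spaces.
    doc["text"] = " ".join(doc["text"].split())
    return doc

def lowercase(doc):
    doc["text"] = doc["text"].lower()
    return doc

def count_words(doc):
    doc["word_count"] = len(doc["text"].split())
    return doc

# Hypothetical registry: the configuration refers to steps by name.
STEPS = {
    "normalize_whitespace": normalize_whitespace,
    "lowercase": lowercase,
    "count_words": count_words,
}

def run_pipeline(doc, step_names):
    """Apply the named steps to a document dict, in the order given."""
    for name in step_names:
        doc = STEPS[name](doc)
    return doc

# The "configuration" is just data, so it could live in a JSON or YAML file.
config = ["normalize_whitespace", "lowercase", "count_words"]
result = run_pipeline({"text": "  Data   doesn't grow\nin Tables "}, config)
print(result)
```

Because the step order is plain data, reordering or dropping a step means editing the configuration rather than the code.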
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo