$30 off During Our Annual Pro Sale. View Details »
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
280
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
420
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
110
data.occrp.org
pudo
0
170
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
250
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
290
Dr. Freezefile
pudo
2
420
Intro presentation for Naivasha
pudo
1
170
Other Decks in Technology
See All in Technology
IAMユーザーゼロの運用は果たして可能なのか
yama3133
1
430
SSO方式とJumpアカウント方式の比較と設計方針
yuobayashi
7
690
Lookerで実現するセキュアな外部データ提供
zozotech
PRO
0
150
ログ管理の新たな可能性?CloudWatchの新機能をご紹介
ikumi_ono
1
790
会社紹介資料 / Sansan Company Profile
sansan33
PRO
11
390k
Debugging Edge AI on Zephyr and Lessons Learned
iotengineer22
0
220
今年のデータ・ML系アップデートと気になるアプデのご紹介
nayuts
1
430
生成AI活用の型ハンズオン〜顧客課題起点で設計する7つのステップ
yushin_n
0
220
【AWS re:Invent 2025速報】AIビルダー向けアップデートをまとめて解説!
minorun365
4
530
[デモです] NotebookLM で作ったスライドの例
kongmingstrap
0
160
Database イノベーショントークを振り返る/reinvent-2025-database-innovation-talk-recap
emiki
0
210
re:Inventで気になったサービスを10分でいけるところまでお話しします
yama3133
1
120
Featured
See All Featured
4 Signs Your Business is Dying
shpigford
186
22k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.3k
The Art of Programming - Codeland 2020
erikaheidi
56
14k
Statistics for Hackers
jakevdp
799
230k
Automating Front-end Workflow
addyosmani
1371
200k
GraphQLとの向き合い方2022年版
quramy
50
14k
Building Adaptive Systems
keathley
44
2.9k
Java REST API Framework Comparison - PWX 2021
mraible
34
9k
Producing Creativity
orderedlist
PRO
348
40k
Build The Right Thing And Hit Your Dates
maggiecrowley
38
3k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
The Cult of Friendly URLs
andyhume
79
6.7k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None