Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
–An investigative reporter
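The request in the quote (search documents for keywords, then tag the matches) can be illustrated with a small in-memory sketch. This is not a tool from the talk: the function names `build_index` and `tag_documents` and the sample documents are hypothetical, and a 40 GB collection would need a real search engine rather than a Python dictionary.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def tag_documents(index, keywords):
    """Return {doc_id: set of keywords found in that document}."""
    tags = defaultdict(set)
    for keyword in keywords:
        for doc_id in index.get(keyword.lower(), set()):
            tags[doc_id].add(keyword)
    return dict(tags)

# Hypothetical sample documents, standing in for extracted document text.
docs = {
    "memo.txt": "The contract was signed by the minister.",
    "letter.txt": "Payment for the contract is overdue.",
    "note.txt": "Lunch at noon.",
}
index = build_index(docs)
tags = tag_documents(index, ["contract", "minister"])
```

Documents that match no keyword simply do not appear in `tags`, so the result doubles as a filtered reading list.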
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
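To make the last term concrete, here is a minimal sketch of a regular expression pulling structured values out of free text, using Python's standard `re` module. The pattern, the function name `find_emails` and the sample sentence are illustrative assumptions, and the pattern is deliberately simple rather than a complete email validator.

```python
import re

# Deliberately simple pattern: good enough for a demo, not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL.findall(text)

# Hypothetical sample text.
sample = "Contact jane.doe@example.org or press@example.com for details."
emails = find_emails(sample)
```

The same approach works for other regular patterns in documents, such as dates, case numbers or phone numbers, by swapping the pattern.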
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
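One possible reading of "configurable pipelines" is a chain of processing stages whose order is set by configuration rather than hard-coded. The sketch below assumes that reading; the stage names, the `STAGES` registry and `run_pipeline` are all hypothetical, and real stages would be steps such as OCR or entity extraction rather than these toy text functions.

```python
def normalize_whitespace(doc):
    """Collapse runs of whitespace into single spaces."""
    doc["text"] = " ".join(doc["text"].split())
    return doc

def lowercase(doc):
    """Lowercase the document text."""
    doc["text"] = doc["text"].lower()
    return doc

def count_words(doc):
    """Record the number of whitespace-separated words."""
    doc["word_count"] = len(doc["text"].split())
    return doc

# Hypothetical registry mapping stage names to functions.
STAGES = {
    "normalize_whitespace": normalize_whitespace,
    "lowercase": lowercase,
    "count_words": count_words,
}

def run_pipeline(doc, stage_names):
    """Apply the named stages to doc in the configured order."""
    for name in stage_names:
        doc = STAGES[name](doc)
    return doc

# The configuration is just a list of stage names, e.g. loaded from a file.
config = ["normalize_whitespace", "lowercase", "count_words"]
result = run_pipeline({"text": "  Data  doesn't grow\nin TABLES "}, config)
```

Because the pipeline is plain data, reordering or dropping a stage means editing the list, not the code.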
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo