Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
An investigative reporter: “We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
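The request in the quote (find keywords in a pile of documents, then tag the matches) can be sketched in a few lines of Python. This is only a minimal illustration, assuming the documents have already been converted to plain text and fit in memory. The file names, sample texts, and the `tag_documents` helper are all hypothetical, and a 40 GB collection would realistically need a search index rather than a loop.

```python
from collections import defaultdict


def tag_documents(docs, keywords):
    """Return {doc_name: sorted keywords found}, using case-insensitive substring matching."""
    tags = defaultdict(set)
    for name, text in docs.items():
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                tags[name].add(keyword)
    # Only documents with at least one hit appear in the result.
    return {name: sorted(found) for name, found in tags.items()}


# Hypothetical sample documents standing in for extracted text.
docs = {
    "memo_001.txt": "Payment routed through an offshore account.",
    "memo_002.txt": "Meeting notes, nothing unusual.",
    "memo_003.txt": "The offshore entity signed the contract.",
}

tags = tag_documents(docs, ["offshore", "contract"])
for name, found in tags.items():
    print(name, found)
```

Each keyword doubles as a tag here; a real workflow would probably let reporters add their own tags on top of the automatic ones.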
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
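To make the last term concrete, here is a rough sketch of regular expressions pulling dates and money amounts out of text that has already been through OCR. The patterns and the sample sentence are assumptions made for illustration. This is pattern matching, not NER: proper named entity recognition relies on a trained language model, while a regular expression only finds strings that fit a fixed shape.

```python
import re

# Assumed formats: "16 July 2014" style dates and "USD 1,250,000" style amounts.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(r"\b\d{1,2}\s+(?:" + MONTHS + r")\s+\d{4}\b")
AMOUNT_RE = re.compile(r"\b(?:USD|EUR)\s?\d+(?:,\d{3})*(?:\.\d+)?")


def extract(text):
    """Return every date and amount in the text that matches the patterns above."""
    return {
        "dates": DATE_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


# Hypothetical sample sentence.
sample = "On 16 July 2014 the firm transferred USD 1,250,000 and later EUR 300.50."
found = extract(sample)
print(found)
```

Patterns like these are brittle against OCR errors (a "1" read as "l", for example), which is one reason OCR quality matters so much for everything downstream.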
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
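As a rough idea of what a "configurable pipeline" could mean, the sketch below builds a document-processing chain from a list of step names. It is not taken from any of the tools above. The step names, the `STEPS` registry, and `build_pipeline` are hypothetical, and the steps are trivial text operations standing in for real stages such as OCR, entity extraction, or indexing.

```python
import re


def normalize_whitespace(doc):
    """Collapse runs of whitespace, as often left behind by text extraction."""
    doc["text"] = re.sub(r"\s+", " ", doc["text"]).strip()
    return doc


def lowercase(doc):
    doc["text"] = doc["text"].lower()
    return doc


def count_words(doc):
    doc["word_count"] = len(doc["text"].split())
    return doc


# Registry mapping configuration names to step functions (hypothetical).
STEPS = {
    "normalize_whitespace": normalize_whitespace,
    "lowercase": lowercase,
    "count_words": count_words,
}


def build_pipeline(step_names):
    """Turn a list of step names into a function that runs them in order."""
    steps = [STEPS[name] for name in step_names]

    def run(doc):
        for step in steps:
            doc = step(doc)
        return doc

    return run


# The configuration is plain data, so it could be loaded from a file.
config = ["normalize_whitespace", "lowercase", "count_words"]
pipeline = build_pipeline(config)
result = pipeline({"text": "  Scanned   PAGE\n one "})
print(result)
```

Keeping the configuration as a list of names means a newsroom could reorder or drop steps for each document set without touching the code.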
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo