Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables: Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
–An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
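To make the reporter's request concrete, here is a minimal sketch of keyword search and tagging with regular expressions. It assumes the documents have already been converted to plain text (for example by OCR). The tag names, patterns, file names, and the `tag_documents` helper are all hypothetical and chosen only for illustration; they do not come from the talk.

```python
import re

# Hypothetical tag -> pattern table; names and patterns are illustrative only.
TAG_PATTERNS = {
    "money": re.compile(r"\b\d[\d,]*(?:\.\d+)?\s?(?:USD|EUR)\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "offshore": re.compile(r"\boffshore\b", re.IGNORECASE),
}


def tag_documents(docs):
    """Return {document name: sorted list of tags whose pattern matches its text}."""
    tagged = {}
    for name, text in docs.items():
        tagged[name] = sorted(
            tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(text)
        )
    return tagged


# Made-up sample texts standing in for OCR output.
docs = {
    "memo.txt": "Transfer 1,200,000 USD to the Offshore account.",
    "note.txt": "Contact jane.doe@example.org for details.",
    "menu.txt": "Lunch is served at noon.",
}

result = tag_documents(docs)
for name, tags in result.items():
    print(name, tags)
```

Regular expressions only catch what the patterns spell out. Finding names of people or companies that are not known in advance is the job of NER, which this sketch does not attempt.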
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise: http://tabula.nerdpower.org
Docs in a cloud: http://documentcloud.org
Clustering, tagging, mining: http://overview.ap.org
Let them eat PDF: https://github.com/CrowData
All the visuals: Jigsaw
Spoken word magic: http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo