Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
260
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.5k
#nr16: Recherche-Tools
pudo
1
94
data.occrp.org
pudo
0
140
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
270
Dr. Freezefile
pudo
2
380
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
JuliaTokaiとJuliaLangJaの紹介 for NGK2025S
antimon2
1
110
2024年活動報告会(人材育成推進WG・ビジネスサブWG) / 20250114-OIDF-J-EduWG-BizSWG
oidfj
0
150
【NGK2025S】動物園(PINTO_model_zoo)に遊びに行こう
kazuhitotakahashi
0
220
コロプラのオンボーディングを採用から語りたい
colopl
5
950
Bring Your Own Container: When Containers Turn the Key to EDR Bypass/byoc-avtokyo2024
tkmru
0
840
ABWGのRe:Cap!
hm5ug
1
120
Building Scalable Backend Services with Firebase
wisdommatt
0
110
ゼロからわかる!!AWSの構成図を書いてみようワークショップ 問題&解答解説 #デッカイギ #羽田デッカイギおつ
_mossann_t
0
1.5k
チームが毎日小さな変化と適応を続けたら1年間でスケール可能なアジャイルチームができた話 / Building a Scalable Agile Team
kakehashi
2
230
Oracle Exadata Database Service(Dedicated Infrastructure):サービス概要のご紹介
oracle4engineer
PRO
0
12k
.NET 最新アップデート ~ AI とクラウド時代のアプリモダナイゼーション
chack411
0
200
AWS re:Invent 2024 re:Cap Taipei (for Developer): New Launches that facilitate Developer Workflow and Continuous Innovation
dwchiang
0
160
Featured
See All Featured
Documentation Writing (for coders)
carmenintech
67
4.5k
Designing Experiences People Love
moore
139
23k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
30
2.1k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
3
180
Facilitating Awesome Meetings
lara
51
6.2k
The Power of CSS Pseudo Elements
geoffreycrofte
74
5.4k
A Modern Web Designer's Workflow
chriscoyier
693
190k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
192
16k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
44
9.4k
Adopting Sorbet at Scale
ufuk
74
9.2k
Intergalactic Javascript Robots from Outer Space
tanoku
270
27k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None