Big data de andar por casa | Shirt-sleeve Big Data
jorgeleria
November 21, 2014
#codemotion_es #codemotion #2014 #codemotion2014
Jorge Lería - @jorgeleria
William Viana - @vianasw
Transcript
Shirt-sleeve big data
“Any collection of data sets so large and complex that
it becomes difficult to process them using traditional data processing applications”
Jorge Lería @jorgeleria
William Viana @vianasw
http://youtu.be/vWa9CUsNzdw
A Search Engine
All domains: ~700,000,000
Home only: ~60 kB
Metadata: libs, frameworks
~40 TB
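The ~40 TB figure follows from the two numbers on the slide. A quick check, assuming "~60 kB" is the average home page size and that kB and TB are decimal units (neither is stated on the slide):

```python
# Back-of-the-envelope corpus size, using the figures from the slide.
DOMAINS = 700_000_000          # ~700,000,000 domains
HOME_PAGE_BYTES = 60 * 1000    # ~60 kB per home page (decimal kB is an assumption)

total_bytes = DOMAINS * HOME_PAGE_BYTES
total_tb = total_bytes / 1000**4   # decimal terabytes

print(f"{total_tb:.0f} TB")  # prints "42 TB"
```

42 TB is consistent with the rounded ~40 TB on the slide.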
40 TB of small documents: $ $ $ $
Considerations: cheap, fixed problem, fresh data
Single machine approach: cheap, fixed problem, fresh data (10 GB). Drawbacks: single point of failure, waste of resources, less fun?
Multi machine approach: cheap? Fixed problem, fresh data (10 GB)
Multi machine approach: main server plus workers W1, W2, ..., Wn
AWS Simple Monthly Calculator http://calculator.s3.amazonaws.com/index.html
Some Numbers [on a napkin]
700,000,000 pages / 2 days = 4050 reqs/sec
700,000,000 rows * 500 bytes/row = 325 GB
325 GB every two days from AWS = $584 a month
4050 reqs/sec / 40 instances = ~100 reqs/sec per instance
40 c3.large * 30 days = $3074 per month
Total: 4.9 TB and $3658 per month
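The napkin figures can be reproduced in a few lines. This is only a sketch: the two dollar amounts are copied from the slide rather than derived, and the units are assumptions (the slide's "GB" matches binary gigabytes, and its monthly "TB" matches dividing that by 1000).

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages = 700_000_000
crawl_days = 2
reqs_per_sec = pages / (crawl_days * SECONDS_PER_DAY)   # about 4050

instances = 40
reqs_per_instance = reqs_per_sec / instances            # about 100

row_bytes = 500
batch_gb = pages * row_bytes / 2**30                    # about 325 (binary GB, assumption)

batches_per_month = 30 / crawl_days                     # one full crawl every two days
monthly_tb = batch_gb * batches_per_month / 1000        # about 4.9

# Dollar figures as quoted on the slide (from the AWS calculator), not computed here.
transfer_usd = 584
compute_usd = 3074
total_usd = transfer_usd + compute_usd

print(f"{reqs_per_sec:.0f} reqs/sec, {monthly_tb:.1f} TB/month, ${total_usd}/month")
```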
We can do better
http://aws.amazon.com/ec2/purchasing-options/spot-instances/spot-tutorials/
START of ugly slides made by Amazon
END of ugly slides made by Amazon
Main server plus workers W1, W2, ..., Wn
Isn’t Python slow?
Crawling is mostly I/O bound
Parsing with bindings to fast C libraries
Python rocks!
The case of the 10x improvement with one line of code
import re2 as re
https://code.google.com/p/re2/
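The one-line swap works because the RE2 bindings expose the same module-level calls as the standard `re` module for common patterns. A minimal sketch, assuming Python bindings that install a module named `re2`; the pattern and the `script_sources` helper are hypothetical, not taken from the talk. RE2 does not support backreferences or lookaround, so patterns that rely on them still need the standard module.

```python
try:
    import re2 as re   # assumption: Python bindings for RE2 are installed
except ImportError:
    import re          # standard library fallback, same calls for this pattern

# Hypothetical pattern: collect the src of every <script> tag on a home page.
SCRIPT_SRC = re.compile(r'<script[^>]+src="([^"]+)"')

def script_sources(html):
    """Return the script URLs referenced by an HTML document."""
    return SCRIPT_SRC.findall(html)

html = '<script src="/js/jquery.min.js"></script><script src="/js/app.js"></script>'
print(script_sources(html))
```

The fallback keeps the rest of the code unchanged on machines without the bindings; the 10x figure is the speakers' own measurement and will depend on the patterns used.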
Main server: on-demand instance
Workers W1, W2, ..., Wn: spot instances
Same availability zone
? ? ?
Robust messaging for applications
Easy to use
Runs on all major operating systems
Supports a huge number of developer platforms
Open source and commercially supported
[from their website]
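The messaging layer sits between the main server and the workers: one side publishes domains to crawl, the other consumes them. The sketch below shows that producer/consumer shape only. It uses Python's in-process `queue.Queue` and threads as a stand-in for a real message broker and separate machines, and `crawl` is a hypothetical placeholder for fetching and parsing a home page.

```python
import queue
import threading

def crawl(domain):
    """Hypothetical stand-in for fetching and parsing one home page."""
    return {"domain": domain, "libs": []}

def worker(tasks, results):
    """Consume domains until the sentinel None arrives."""
    while True:
        domain = tasks.get()
        if domain is None:
            break
        results.put(crawl(domain))

def run(domains, n_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for d in domains:      # the "main server" publishes work
        tasks.put(d)
    for _ in threads:      # one sentinel per worker so each one stops
        tasks.put(None)
    for t in threads:
        t.join()
    # All workers have finished, so qsize() is stable here.
    return [results.get() for _ in range(results.qsize())]
```

Calling `run(["a.com", "b.com"])` returns one result dictionary per domain, in whatever order the workers finished. With a real broker, workers can come and go without the producer knowing about them, which matters when they run on spot instances that may be reclaimed.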
Workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone
Cassandra Key Features
Distributed and Decentralized
High Performance (3k writes/sec)
Fault Tolerant
Highly Available
Column Oriented Key Value
Also comes with compression, incremental backups and many problems
Workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone; dedicated server (beefy machine outside Amazon)
Bloom filter
"A space-efficient probabilistic data structure that is used to test whether an element is a member of a set"
bitarray + fast hash function
bits:  0 0 0 1 0 0 1 0
index: 0 1 2 3 4 5 6 7
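A minimal sketch of the "bitarray + hash function" idea, using only the standard library. It is not the speakers' implementation: a `bytearray` stands in for the bit array, and salted SHA-256 stands in for the fast hash on the slide (a non-cryptographic hash would be quicker in practice). The class name and default sizes are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a single hash function.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("example.com")
print("example.com" in seen)  # prints "True"
```

A lookup can return a false positive but never a false negative, so a crawler could use it as a cheap "have I already seen this?" check before touching the database; that use is an assumption, since the slide does not say what the filter guards.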
Workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone; dedicated server; Bloom filter
$ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'
… and a nice REST API
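The same query can be issued from Python with the standard library. This is a sketch: `search_url` and `search` are hypothetical helper names, and the URL shape (port 9200, `_search?q=`) is read as Elasticsearch's URI search, which is an assumption because the transcript does not name the product. `search` needs a server listening on the given host, so only the URL builder is exercised here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def search_url(index, query, host="http://localhost:9200"):
    """Build the URI-search URL for a query-string search."""
    return f"{host}/{index}/_search?{urlencode({'q': query})}"

def search(index, query, host="http://localhost:9200"):
    """Run the search and return the decoded JSON response."""
    with urlopen(search_url(index, query, host)) as response:
        return json.load(response)

print(search_url("twitter", "user:kimchy"))
# prints "http://localhost:9200/twitter/_search?q=user%3Akimchy"
```

`urlencode` percent-encodes the colon, which is equivalent to the literal `user:kimchy` in the curl command.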
Thank you!
Jorge Lería - @jorgeleria
William Viana - @vianasw
Follow us!