Big data de andar por casa | Shirt-sleeve Big Data
jorgeleria
November 21, 2014
#codemotion_es #codemotion #2014 #codemotion2014
Jorge Lería - @jorgeleria
William Viana - @vianasw
Transcript
Shirt-sleeve big data
“Any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications”
Jorge Lería @jorgeleria
William Viana @vianasw
http://youtu.be/vWa9CUsNzdw
A Search Engine
All domains: ~700,000,000
Home only
Metadata
~60 KB
libs, frameworks
~40 TB
$ $ $ $ 40 TB of small documents
Considerations: cheap; fixed problem; fresh data
Single machine approach: cheap; fixed problem; fresh data (10 GB); single point of failure; waste of resources; less fun?
Multi machine approach: cheap?; fixed problem; fresh data (10 GB)
Multi machine approach: main server; workers W1, W2, ..., Wn
AWS Simple Monthly Calculator http://calculator.s3.amazonaws.com/index.html
Some Numbers [on a napkin]
700,000,000 pages / 2 days = 4050 reqs/sec
700,000,000 rows * 500 bytes/row = 325 GB
325 GB every two days from AWS = $584 a month
4050 reqs/sec / 40 instances = 100 reqs/sec per instance
40 c3.large * 30 days = $3074 per month
Total: 4.9 TB and $3658 per month
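The napkin figures above can be re-derived in a few lines. This is only a sketch: the two dollar amounts are copied from the slides rather than computed, and dividing by 2**30 (to reproduce the slide's 325 figure) and assuming 15 two-day crawls per 30-day month are my assumptions, not something the slides state.

```python
# Napkin math from the slides, re-derived. Constant names are illustrative.
PAGES = 700_000_000              # home pages to crawl
CRAWL_SECONDS = 2 * 24 * 60 * 60  # one full crawl every two days
BYTES_PER_ROW = 500              # stored bytes per page (from the slide)
INSTANCES = 40                   # c3.large workers (from the slide)

TRANSFER_COST = 584    # USD per month, taken from the slide, not computed
INSTANCE_COST = 3074   # USD per month, taken from the slide, not computed

reqs_per_sec = PAGES / CRAWL_SECONDS
reqs_per_instance = reqs_per_sec / INSTANCES

# Assumption: the slide's "325 GBs" is binary gigabytes (2**30 bytes).
gib_per_crawl = PAGES * BYTES_PER_ROW / 2**30

# Assumption: 15 two-day crawls in a 30-day month.
gib_per_month = gib_per_crawl * (30 / 2)

total_cost = TRANSFER_COST + INSTANCE_COST

print(f"{reqs_per_sec:.0f} reqs/sec, {reqs_per_instance:.0f} per instance")
print(f"{gib_per_crawl:.0f} GiB per crawl, {gib_per_month:.0f} GiB per month")
print(f"${total_cost} per month")
```

The per-instance rate comes out slightly above the slide's rounded 100 reqs/sec, and the monthly transfer lands near the slide's 4.9 TB once rounded.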
We can do better
http://aws.amazon.com/ec2/purchasing-options/spot-instances/spot-tutorials/
START of ugly slides made by Amazon
END of ugly slides made by Amazon
Main server; workers W1, W2, ..., Wn
Isn’t Python slow?
Crawling is mostly I/O bound
Parsing with bindings to fast C libraries
Python rocks!
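To illustrate why an I/O-bound crawl is a reasonable fit for Python, here is a minimal sketch using a thread pool. It is not the speakers' crawler: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for an HTTP request; sleeps instead of using the network."""
    time.sleep(0.05)   # simulated network latency
    return url, 200    # illustrative (url, status) result


def crawl(urls, max_workers=20):
    """Fetch all URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


urls = ["http://example-%d.test/" % i for i in range(40)]
results = crawl(urls)
print(len(results), "pages fetched")
```

While a thread waits on the (simulated) network it releases the interpreter, so many requests can be in flight at once even though only one thread runs Python code at a time.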
The case of the 10x improvement with one line of code
import re2 as re
https://code.google.com/p/re2/
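The one-line swap works because the RE2 bindings expose an interface modeled on the standard `re` module. A hedged sketch of how such a drop-in might look in practice follows; the fallback to the standard library, the `SCRIPT_SRC` pattern and the `script_sources` helper are my own illustrations, not code from the talk.

```python
try:
    # Assumption: Python bindings for RE2 are installed under the name `re2`.
    import re2 as re
except ImportError:
    # Fall back to the standard library so the rest of the code still runs.
    import re

# Hypothetical pattern: pull the src attribute out of <script> tags,
# e.g. to detect which libraries a page loads. It avoids backreferences
# and lookarounds, which RE2 does not support.
SCRIPT_SRC = re.compile(r'<script[^>]+src="([^"]+)"')


def script_sources(html):
    """Return the src values of all <script> tags found in html."""
    return SCRIPT_SRC.findall(html)


page = '<script src="jquery.js"></script><script src="angular.js"></script>'
print(script_sources(page))
```

Because the rest of the code only refers to `re`, the parsing logic does not need to know which engine is underneath.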
Main server; workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone; ? ? ?
Robust messaging for applications; easy to use; runs on all major operating systems; supports a huge number of developer platforms; open source and commercially supported [from their website]
Workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone
Cassandra Key Features: Distributed and Decentralized; High Performance (3k writes/sec); Fault Tolerant; Highly Available; Column Oriented Key Value
Also comes with compression, incremental backups and many problems
Workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone; dedicated server (beefy machine outside Amazon)
Bloom filter
"A space-efficient probabilistic data structure that is used to test whether an element is a member of a set"
bitarray + fast hash function
Bits:  0 0 0 1 0 0 1 0
Index: 0 1 2 3 4 5 6 7
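The "bitarray + fast hash function" idea can be sketched with the standard library alone. This is a minimal illustration, not the filter used in the talk: the class name, the default sizes and the choice of salted BLAKE2b as the hash are all assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus a few salted hashes."""

    def __init__(self, num_bits=8 * 1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        """Yield one bit position per hash function for the given string."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen)
```

For a crawler, the attraction is that a set of visited URLs fits in a fixed amount of memory: an added URL is always reported as present, and the price is an occasional false positive for a URL that was never added.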
Workers W1, W2, ..., Wn (spot instances); on-demand instance; same availability zone; dedicated server; Bloom filter
$ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'
… and a nice REST API
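The curl call above is a plain HTTP GET with a `q=field:value` query string, so the same request is easy to build from Python. The sketch below only constructs the URL; the `search_url` helper and its defaults are hypothetical, and the host, index and query values are the ones shown on the slide.

```python
from urllib.parse import urlencode


def search_url(index, field, value, host="http://localhost:9200"):
    """Build a URI search URL of the form <host>/<index>/_search?q=field:value."""
    # safe=":" keeps the field:value separator readable instead of %3A.
    query = urlencode({"q": "%s:%s" % (field, value)}, safe=":")
    return "%s/%s/_search?%s" % (host, index, query)


url = search_url("twitter", "user", "kimchy")
print(url)
# To run the search against a live server, one could pass `url` to
# urllib.request.urlopen and decode the JSON body of the response.
```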
Thank you!
Jorge Lería - @jorgeleria
William Viana - @vianasw
Follow us!