Big data de andar por casa | Shirt-sleeve Big Data

jorgeleria
November 21, 2014

#codemotion_es #codemotion #2014 #codemotion2014
Jorge Lería - @jorgeleria
William Viana - @vianasw

Transcript

  1. “Any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications”
  2. “Any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications”
  3. Single machine approach: cheap; fixed problem; fresh data (10 GB); single point of failure; waste of resources; less funny?
  4. Some Numbers [on a napkin]: 700,000,000 pages / 2 days = 4050 reqs/sec; 700,000,000 rows * 500 bytes/row = 325 GB; 325 GB every two days from AWS = $584 a month
  5. Some Numbers [on a napkin]: 700,000,000 pages / 2 days = 4050 reqs/sec; 4050 reqs/sec / 40 instances ≈ 100 reqs/sec per instance; 40 c3.large * 30 days = $3074 per month
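The napkin figures on the two slides above can be re-derived in a few lines. This is only a sketch of the arithmetic: the constant names are ours, and the dollar amounts are left out because the slides do not state the unit prices behind them.

```python
# Inputs taken from the slides.
PAGES = 700_000_000            # pages to crawl
WINDOW_SECONDS = 2 * 24 * 3600 # two days
BYTES_PER_ROW = 500            # stored bytes per page/row
INSTANCES = 40                 # c3.large workers

# Required crawl rate to finish within the window.
reqs_per_sec = PAGES / WINDOW_SECONDS

# Storage volume, expressed in binary gigabytes (2**30 bytes).
total_gib = PAGES * BYTES_PER_ROW / 2**30

# Load each worker has to sustain.
reqs_per_instance = reqs_per_sec / INSTANCES

print(f"{reqs_per_sec:.0f} reqs/sec")
print(f"{total_gib:.0f} GiB per crawl")
print(f"{reqs_per_instance:.0f} reqs/sec per instance")
```

The slides round down (4050, 325, 100). The 325 figure matches binary gigabytes; in decimal units the same volume is 350 GB.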
  6. Isn’t Python slow? Crawling is mostly I/O bound. Parsing with bindings to fast C libraries. Python rocks!
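To illustrate why an I/O-bound crawler is not held back much by interpreter speed, here is a minimal sketch using a thread pool. It assumes nothing about the authors' actual crawler: `fake_fetch` is a hypothetical stand-in that sleeps instead of performing a real HTTP request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, latency=0.05):
    """Hypothetical stand-in for an HTTP request.

    time.sleep releases the GIL, as blocking network I/O does,
    so other threads can run while this one waits.
    """
    time.sleep(latency)
    return url, len(url)

urls = [f"http://example.com/page/{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    # map() preserves input order even though the calls overlap in time.
    results = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

Run sequentially, the 20 simulated requests would take about one second in total; with the waits overlapped, the wall-clock time is close to a single request's latency.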
  7. [Diagram] Main server; workers W1, W2, ..., Wn; spot instances; on-demand instance; same availability zone
  8. [Diagram] Main server; workers W1, W2, ..., Wn; spot instances; on-demand instance; same availability zone; ? ? ?
  9. Robust messaging for applications; easy to use; runs on all major operating systems; supports a huge number of developer platforms; open source and commercially supported [from their website]
  10. Cassandra Key Features: distributed and decentralized; high performance (3k writes/sec); fault tolerant; highly available; column-oriented key-value. Also comes with compression, incremental backups and many problems
  11. [Diagram] Workers W1, W2, ..., Wn; spot instances; on-demand instance; same availability zone; beefy machine outside Amazon
  12. [Diagram] Workers W1, W2, ..., Wn; spot instances; on-demand instance; same availability zone; dedicated server
  13. Bloom filter: "A space-efficient probabilistic data structure, that is used to test whether an element is a member of a set". bitarray + fast hash function. Example bit array: bits 0 0 0 1 0 0 1 0 at indices 0 1 2 3 4 5 6 7
  14. [Diagram] Workers W1, W2, ..., Wn; spot instances; on-demand instance; same availability zone; dedicated server
  15. [Diagram] Workers W1, W2, ..., Wn; spot instances; on-demand instance; same availability zone; dedicated server; Bloom filter
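The "bitarray + fast hash function" recipe from the Bloom filter slide can be sketched as follows. This is an illustration under our own assumptions, not the authors' implementation: the class name, sizes, and the use of salted SHA-256 (chosen for portability, not speed) are all hypothetical choices, and a real crawler would likely use a faster non-cryptographic hash.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the input with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # an added item is always reported as present
```

A lookup can return a false positive but never a false negative, which suits URL deduplication: at worst a page is skipped, but no page already seen is crawled twice.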