Meetup UFRJ

Julio Faerman

August 21, 2015

Transcript

  1. Since 1997
    •  2,000+ employees
    •  40M+ users
    •  100% Amazon Web Services
    •  34.2% of all US bandwidth during prime time
    http://aws.amazon.com/solutions/case-studies/netflix/
    http://techblog.netflix.com/2013/12/netflix-presentation-videos-from-aws.html
  2. Amazon Simple Storage Service
    •  Durable, scalable and fast storage (99.999999999% durability)
    •  2+ trillion (10^12) objects
    •  1.1+ million requests per second (RPS)
    •  Native HTTP/S
    •  And more: Permissions, Static Hosting, Logging, Versioning, Archival and Expiration Lifecycle, Torrent, Tags, Redundancy, Requester Pays, Cryptography, Reduced Redundancy
    http://aws.amazon.com/s3/
  3. “Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from televisions, laptops, and mobile devices every hour captured by our log data pipeline, plus dimension data from Cassandra supplied by our Aegisthus pipeline.”
    http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
    “87% Cost Reduction per Streaming Start.”
    http://youtu.be/XBgkZxAljbs
    “In terms of scale, we have a 10 petabyte data warehouse on S3.”
    http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
  4. Amazon Elastic MapReduce
    •  Distributed processing with Apache Hadoop
    http://aws.amazon.com/elasticmapreduce/
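The slide only names the service, so the following is a minimal, single-process sketch of the map, shuffle, and reduce phases that Hadoop distributes across a cluster. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration; this is not Hadoop's or EMR's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["s3 emr s3", "emr redshift"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process purely to show the data flow.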
  5. Once upon a time…
    •  Structured: Relational, On-Line, GB-TB-PB
    •  Semi-structured: Map Reduce, Batch, TB-PB-EB
  6. Nowadays…
    •  Structured, Semi-structured, Unstructured
    •  GB, TB, PB, EB
    •  On-Line, Batch, Real Time
    •  Relational Database, NoSQL, Graph Database, Document Database, Columnar Database, Data Structure Server
    •  Distributed Cache, In-Memory Data Grid
    •  Map Reduce, ETL (Extract-Transform-Load), Stream Processing, Rule Engine, Machine Learning
    http://nathanmarz.com/
  7. 50K -> 17M users in 9 months
    •  12- employees
    •  48M users
    •  8 billion objects
    •  400+ TB of data
    http://aws.amazon.com/solutions/case-studies/pinterest/
  8. April 2013:
    •  400+ Web Engines
    •  400+ API Engines
    •  70x2+ MySQL DBs
    •  100+ Redis Instances
    •  230+ Memcache Instances
    •  10 Redis Task Manager
    •  500 Redis Task Processors
    •  80 Sharded Solr
    •  20 HBase
    •  12 Kafka + Azkaban
    •  8 Zookeeper Instances
    •  12 Varnish
    http://www.infoq.com/presentations/scaling-pinterest
  9. Amazon Relational Database Service
    •  MySQL, Postgres, Oracle or SQL Server
    http://aws.amazon.com/rds/
  10. Amazon ElastiCache
    •  Memcached and Redis
    •  Replication
    •  Backup and Restore
    •  Managed patching, failure detection and recovery
    •  Elastic
    •  Reliable
    http://aws.amazon.com/elasticache/
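A common way to put a Memcached or Redis cache in front of a database is the cache-aside pattern: look in the cache first, fall back to the database on a miss, then store the result with an expiry. The sketch below uses an in-memory dictionary as a stand-in for the cache so that it runs without any server; `TTLCache`, `get_with_cache`, and the `load_from_db` callback are hypothetical names, not part of ElastiCache or of any client library.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Memcached/Redis client (assumption for the sketch)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (self._clock() + self._ttl, value)


def get_with_cache(cache: TTLCache, key: str, load_from_db: Callable[[str], str]) -> str:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    cache.set(key, value)
    return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=60)
    print(get_with_cache(cache, "user:1", lambda k: "row for " + k))
    print(get_with_cache(cache, "user:1", lambda k: "never called on a hit"))
```

The expiry bounds how stale a cached row can get; a write path would also need to update or invalidate the cached key, which this sketch leaves out.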
  11. Amazon Redshift
    •  Petabyte Scale Data Warehousing
    •  Massively parallel Online Analytical Processing (OLAP)
    •  Resizable without downtime
    •  Managed provisioning and administration
    •  Compatible with PostgreSQL
    http://aws.amazon.com/redshift/
  12. Amazon Redshift Architecture
    Leader Node
    •  SQL endpoint
    •  Stores metadata
    •  Coordinates query execution
    Compute Nodes
    •  Local, columnar storage
    •  Execute queries in parallel
    •  Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
    Two hardware platforms
    •  Optimized for data processing
    •  DW1: HDD; scale from 2TB to 1.6PB
    •  DW2: SSD; scale from 160GB to 256TB
    [Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC; the leader node and three compute nodes (each 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC); ingestion, backup and restore go through Amazon S3 / DynamoDB / SSH]
  13. ETL from EMR/Hive to Amazon Redshift through Amazon S3
    •  EMR: Unstructured, Unclean
    •  Extract & Transform (EMR -> S3)
    •  S3: Structured, Clean
    •  Load (S3 -> Redshift)
    •  Redshift: Columnar, Compressed
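The load step from S3 into Redshift is typically done with Redshift's SQL `COPY` command. The sketch below only builds the statement text and does not connect to a cluster. The table name, bucket path, and role ARN are hypothetical placeholders, and the choice of pipe-delimited, gzip-compressed files is an assumption about what the EMR/Hive step writes to S3.

```python
import re

# Accepts "table" or "schema.table" made of plain identifier characters only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_copy_statement(
    table: str,
    s3_prefix: str,
    iam_role_arn: str,
    delimiter: str = "|",
    gzip: bool = True,
) -> str:
    """Return a Redshift COPY statement that loads delimited files from S3."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError("table must be a plain identifier")
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    for value in (s3_prefix, iam_role_arn, delimiter):
        if "'" in value:
            raise ValueError("single quotes are not allowed in parameters")

    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"DELIMITER '{delimiter}'",
    ]
    if gzip:
        parts.append("GZIP")
    return " ".join(parts) + ";"


if __name__ == "__main__":
    # All values below are made-up examples.
    print(
        build_copy_statement(
            table="events",
            s3_prefix="s3://example-bucket/clean/events/",
            iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
        )
    )
```

Pointing `COPY` at a prefix rather than a single file lets Redshift read the files under it in parallel across compute nodes, which is why the transform step usually writes many part files.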
  14. Amazon Redshift at Pinterest Today
    •  16 node 256TB cluster
    •  2TB data per day
    •  100+ regular users
    •  500+ queries per day: 75% <= 35 seconds, 90% <= 2 minutes
    •  Operational effort <= 5 hours/week
  15. Amazon DynamoDB
    •  NoSQL Database
    •  Provisioned Throughput
    •  Seamless Scalability
    •  Zero Admin
    •  Single digit millisecond latency
    http://aws.amazon.com/dynamodb/
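Provisioned throughput means declaring read and write capacity up front. As a rough sizing sketch, assuming DynamoDB's documented unit sizes (one read capacity unit covers one strongly consistent read per second of an item up to 4 KB, an eventually consistent read costs half, and one write capacity unit covers one write per second of an item up to 1 KB), the required units can be estimated as below. The function names and the example workload are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # assumed size covered by one read capacity unit
WRITE_UNIT_BYTES = 1024      # assumed size covered by one write capacity unit


def read_capacity_units(
    reads_per_second: float, item_size_bytes: int, strongly_consistent: bool = True
) -> int:
    """Estimate read capacity units for a steady read rate."""
    units_per_read = max(1, math.ceil(item_size_bytes / READ_UNIT_BYTES))
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        # Eventually consistent reads consume half the capacity.
        total = total / 2
    return math.ceil(total)


def write_capacity_units(writes_per_second: float, item_size_bytes: int) -> int:
    """Estimate write capacity units for a steady write rate."""
    units_per_write = max(1, math.ceil(item_size_bytes / WRITE_UNIT_BYTES))
    return math.ceil(writes_per_second * units_per_write)


if __name__ == "__main__":
    # Hypothetical workload: 100 reads/s and 50 writes/s.
    print(read_capacity_units(100, item_size_bytes=6000))
    print(read_capacity_units(100, item_size_bytes=6000, strongly_consistent=False))
    print(write_capacity_units(50, item_size_bytes=1500))
```

Item size is rounded up per request, so a 6,000-byte item costs two read units even though it is well under 8 KB; keeping items small is the main lever on provisioned cost.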
  16. ~5TB in the database
    •  1 billion requests/month
    •  67,000 requests/minute
    •  34 million recommendations/day
    •  4 million products
    •  27 million users
    "We can't afford the luxury of throwing information away"
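The monthly and per-minute figures on this slide can be put on the same scale with a little arithmetic. The sketch below assumes a 30-day month and treats the 67,000 requests/minute figure as a peak; the slide does not say whether it is a peak or an average, so that reading is an assumption.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_minute(requests_per_month: float, days_in_month: int = 30) -> float:
    """Average request rate per minute, assuming evenly spread traffic."""
    return requests_per_month / (days_in_month * MINUTES_PER_DAY)


# Figures taken from the slide.
requests_per_month = 1_000_000_000
stated_requests_per_minute = 67_000
recommendations_per_day = 34_000_000

avg_per_minute = average_per_minute(requests_per_month)          # about 23,148
peak_to_average = stated_requests_per_minute / avg_per_minute    # about 2.9
recommendations_per_second = recommendations_per_day / SECONDS_PER_DAY  # about 394

if __name__ == "__main__":
    print(f"average requests/minute: {avg_per_minute:,.0f}")
    print(f"stated rate vs average:  {peak_to_average:.1f}x")
    print(f"recommendations/second:  {recommendations_per_second:,.0f}")
```

Under these assumptions the stated per-minute rate is close to three times the monthly average, which is the kind of gap between average and peak load that capacity planning has to cover.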
  17. Stage 2
    •  Availability Zone: Tomcat 6, EhCache, NewRelic, MySQL primary (EBS RAID0)
    •  Availability Zone: MySQL secondary (EBS RAID0)
    •  Replication between the MySQL primary and secondary
  18. Stage 3
    •  Elastic Load Balancer
    •  Nginx, HAProxy
    •  Tomcat 6 + EhCache
    •  Memcached
    •  MySQL 1 (EBS RAID0) and MySQL 2 (EBS RAID0), with replication
    •  Components spread across multiple Availability Zones
  19. Stage 4
    •  Auto Scaling group: Nginx, HAProxy, Jetty, EhCache
    •  Memcached
    •  Multiple Availability Zones, multiple regions
  20. Coming up in the next episodes…
    •  Amazon Kinesis: http://aws.amazon.com/kinesis/
    •  AWS Data Pipeline: http://aws.amazon.com/datapipeline/