
Analytics System based on AWS Redshift and Kinesis


Presentation from AWS User Group (2015.09.24) in Warsaw by Artur Boniecki from Vivid Games S.A.

AWS User Group Poland

September 24, 2015

Transcript

  1. VG-A: Agenda
     1. Short introduction,
     2. VG Analytics System - the business need,
     3. Technology research,
     4. Overview of AWS Redshift & Kinesis.
  2. VG-A: Short Introduction
     • VIVID GAMES S.A.
       ◦ the most technologically advanced independent development studio in Europe and the biggest producer of mobile games in Poland,
       ◦ over 160 projects for smartphones and mobile devices, with the two best-known brands being Real Boxing & GodFire; now preparing Real Boxing 2… coming soon.
     • Artur Boniecki
       ◦ Head of the R&D Department at Vivid Games S.A., previously worked 14 years for Alcatel-Lucent,
       ◦ involved in a project around processing large quantities of data on the devices, traits and behavior of gamers, and analyzing the collected data.
  3. VG-A: The business need
     Need:
     • a system to gather billions of events generated in mobile applications used by millions of users,
     • a system to process the gathered data and present results:
       ◦ metrics like DAU, MAU, RET, ARPU, LTV etc. (a DAU query sketch follows after this slide),
       ◦ aggregation per 1 hr, 1 d, 1 w, 1 m etc.,
       ◦ segmentation per sex, age, country, device etc.,
       ◦ targeting,
       ◦ funnels,
       ◦ A/B tests,
       ◦ data mining algorithms.
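
A minimal sketch of the kind of metric query this calls for, assuming a Redshift cluster already loaded with a hypothetical events(user_id, event_time, ...) table (the endpoint, database, credentials and column names are placeholders, not from the deck): DAU over the last 30 days, computed with plain SQL through psycopg2.

```python
import psycopg2

# Endpoint, database and credentials are placeholders.
conn = psycopg2.connect(
    host="vg-analytics.xxxxxxxx.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="vg_admin", password="<password>")
cur = conn.cursor()

# DAU = number of distinct users who generated at least one event that day.
cur.execute("""
    SELECT TRUNC(event_time)        AS day,
           COUNT(DISTINCT user_id)  AS dau
    FROM events
    WHERE event_time >= DATEADD(day, -30, GETDATE())
    GROUP BY 1
    ORDER BY 1;
""")
for day, dau in cur.fetchall():
    print(day, dau)
```
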
  4. VG-A: The time pressure & cost limitations
     Time for development & cost limitations:
     ◦ very short time, effects needed ASAP, no time for deep research and months of prototyping,
     ◦ must be a system that does not require niche technology expertise,
     ◦ no time to build our own reliability architecture,
     ◦ we don't want to spend much on maintenance/administration,
     ◦ the DevOps model is important.
  5. VG-A: Technology research
     Research:
     • mostly oriented towards the data warehouse, but a data stream processor is also in scope; for research purposes, let's take a look at the data store...
       ◦ HBASE
         ▪ free and can handle massive amounts of data, but no friendly data-retrieval API (this is not SQL :) ), and setting up HA / multi-node is very complex,
       ◦ CASSANDRA
         ▪ free and can handle huge data sets, has an SQL-like language (good), but setting up HA / multi-node is still very complex and designing a columnar schema is hard,
       ◦ COUCHBASE
         ▪ free NoSQL DB, very fast, trends are optimistic, but not sure if it is used in the real world for purposes like ours - yes, some analytics companies are using it; we would need to get experts or invest in research and a lot of self-study.
  6. VG-A: Tech research - WINNER
     Research:
     • VG is on Amazon and we are in the DevOps model, so let's see what we have here...
       ◦ REDSHIFT
         ▪ cite: “Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud”, © Amazon Web Services,
         ▪ SQL interface - no need to invest in research, development self-study, etc.,
         ▪ no need to take care of HA, backups or scaling - the managed service does all that,
       ◦ KINESIS
         ▪ cite: “Kinesis is a fully managed streaming data service. You can continuously add various types of data such as clickstreams, application logs, and social media to an Amazon Kinesis stream from hundreds of thousands of sources”, © Amazon Web Services,
         ▪ HTTP-based interface, multi-platform, anywhere & anything,
         ▪ no need to take care of HA or scaling - the managed service does all that.
  7. VG-A: REDSHIFT
     • WHY REDSHIFT (a cluster-management sketch follows after this list):
       ◦ can store and query massive amounts of data,
       ◦ columnar storage technology to improve I/O efficiency,
         ▪ parallelizing queries across multiple nodes,
       ◦ fully managed service with a PayG pricing model:
         ▪ storage size and type, node type,
       ◦ scalable - need more storage or I/O? add a node and go,
       ◦ relational model on top - easy to pick up for developers who are familiar with other RDBMSs (VERY NICE!!!),
       ◦ the columnar model is not exposed (we don't need it),
       ◦ easy backup and restore + data safety - data is replicated across nodes.
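
As referenced above, a minimal sketch of what "fully managed" and "add a node and go" look like through the API (the identifiers, region, node type and password are placeholder assumptions): launch a small multi-node cluster with boto3 and later resize it by adding nodes.

```python
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

# Launch a small 2-node cluster; Redshift takes care of replication,
# backups and the columnar storage layer behind a regular SQL endpoint.
redshift.create_cluster(
    ClusterIdentifier="vg-analytics",
    ClusterType="multi-node",
    NodeType="dc1.large",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="vg_admin",
    MasterUserPassword="<password>",
)

# Need more storage or I/O later? Resize by adding nodes (scale out).
# Remember: there is no elastic scale-in - going down is an admin procedure.
redshift.modify_cluster(
    ClusterIdentifier="vg-analytics",
    NumberOfNodes=4,
)
```
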
  8. VG-A: REDSHIFT
     • REDSHIFT LIMITS:
       ◦ default number of nodes per cluster (based on node type) - can be raised via a support contact,
       ◦ tables - max 9,990 per cluster!!!,
       ◦ databases - max 60 per cluster,
       ◦ schemas - max 256 per database,
       ◦ concurrent connections to Redshift - max 500,
       ◦ AWS accounts per cluster - max 20,
       ◦ maximum size of a single row loaded with the COPY command - 4 MB,
       ◦ naming (cluster, DB, parameter group, user, security group, subnet group, snapshot) - must be lowercase only and unique at the account level,
       + other limits - see the docs.
  9. VG-A: REDSHIFT
     • REDSHIFT TIPS (a COPY/schema sketch follows after this list):
       ◦ heavy insert penalty - use the COPY mechanism instead (aggregate and batch-load it),
       ◦ when using COPY, don't store the data file on S3 (uploading takes time); just put a lightweight link/manifest on S3 pointing to your records on EC2,
       ◦ don't forget about effective hashing: distribution key and sort key,
       ◦ S3 & Redshift should be in the same region to improve performance,
       ◦ use the column compression feature to reduce storage space and the amount of data read, which improves performance,
       ◦ remember you can scale out, but you CANNOT elastically scale in (you must run admin procedures to go down).
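
A minimal sketch of the batch-load tip above, assuming a hypothetical events table (the schema, endpoint, manifest path and credentials are placeholders): create the table with an explicit DISTKEY, SORTKEY and column compression, then load it with a single COPY driven by a lightweight SSH manifest stored on S3 that points at the record files sitting on EC2, instead of row-by-row INSERTs.

```python
import psycopg2

# Endpoint, credentials, table layout and the manifest path are placeholders.
conn = psycopg2.connect(
    host="vg-analytics.xxxxxxxx.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="vg_admin", password="<password>")
conn.autocommit = True
cur = conn.cursor()

# Distribute rows on user_id, sort by event time, and compress the wide
# columns to cut both storage and the amount of data read per query.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id     BIGINT        ENCODE delta,
        event_name  VARCHAR(64)   ENCODE lzo,
        event_time  TIMESTAMP,
        country     CHAR(2)       ENCODE bytedict,
        payload     VARCHAR(4096) ENCODE lzo
    )
    DISTKEY (user_id)
    SORTKEY (event_time);
""")

# The manifest on S3 is only a small JSON pointer to the record files on the
# EC2 collectors, so the raw data never has to be uploaded to S3 first.
cur.execute("""
    COPY events
    FROM 's3://vg-analytics/manifests/events_ssh_manifest.json'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    DELIMITER '|'
    SSH;
""")
```
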
  10. VG-A: REDSHIFT
      • REDSHIFT INFORMATION:
        ◦ http://docs.aws.amazon.com/redshift/
        ◦ http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-limits.html
        ◦ http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html
        ◦ http://aws.amazon.com/solutions/case-studies/nokia/
  11. VG-A: KINESIS
      • WHY KINESIS (a producer sketch follows after this list):
        ◦ if we are on AWS… why grow e.g. Kafka ourselves and maintain it ;),
        ◦ fully managed service with a PayG pricing model:
          ▪ stream x number of shards x HTTP request count,
        ◦ scalable - bigger traffic -> add shards and go, lower traffic -> remove shards and go,
        ◦ keeps records queued for up to 24 hours.
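
A minimal sketch of the producer side referenced above (the stream name, region and event shape are assumptions): create a stream, wait for it to become ACTIVE, and put a single event over the HTTP API via boto3.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# Each shard accepts up to 1 MB/s (1,000 records/s) in and serves 2 MB/s out;
# shards are added or merged as traffic grows or drops.
kinesis.create_stream(StreamName="vg-events", ShardCount=2)
# Stream creation takes a few minutes; wait until it is ACTIVE before writing.
kinesis.get_waiter("stream_exists").wait(StreamName="vg-events")

event = {"user_id": 42, "event_name": "level_completed", "level": 3}
kinesis.put_record(
    StreamName="vg-events",
    Data=json.dumps(event).encode("utf-8"),   # base64-encoded on the wire
    PartitionKey=str(event["user_id"]),       # decides which shard gets the record
)
```
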
  12. VG-A: KINESIS
      • KINESIS LIMITS (an aggregation sketch follows after this list):
        ◦ default shard limit - 10 (contact Amazon to increase),
        ◦ input data cached for max 24 h (this is a nice feature!!!),
        ◦ max size of a data blob (payload before base64 encoding) is 50 KB - aggregate data when possible,
        ◦ writes per shard - max 1,000 records/s and 1 MB/s,
        ◦ reads per shard - max 5 reads/s and 2 MB/s.
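
A minimal sketch of the "aggregate data when possible" tip (the stream name is assumed, and event_source() is a hypothetical generator of event dicts): pack many small events into one record so each put stays under the pre-base64 data-blob limit and consumes fewer of the 1,000 writes/s allowed per shard.

```python
import json
import boto3

MAX_BLOB_BYTES = 50 * 1024           # per-record payload limit (before base64)

kinesis = boto3.client("kinesis", region_name="eu-west-1")


def flush(batch):
    """Send one aggregated record carrying a whole batch of small events."""
    if not batch:
        return
    kinesis.put_record(
        StreamName="vg-events",
        Data=json.dumps(batch).encode("utf-8"),
        # The partition key only picks the target shard; reuse the first
        # event's user id here.
        PartitionKey=str(batch[0]["user_id"]),
    )


batch, size = [], 2                   # 2 bytes for the enclosing "[]"
for event in event_source():          # event_source() is hypothetical
    encoded = len(json.dumps(event).encode("utf-8")) + 1   # +1 for the comma
    if size + encoded > MAX_BLOB_BYTES:
        flush(batch)
        batch, size = [], 2
    batch.append(event)
    size += encoded
flush(batch)
```
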
  13. VG-A: KINESIS
      • KINESIS TIPS (a consumer sketch follows after this list):
        ◦ creating a stream takes several minutes at most (up to a 10-minute wait for ACTIVE),
        ◦ the data sequence is not guaranteed unless you set SequenceNumberForOrdering,
        ◦ distribute traffic evenly across the shards - use random partition keys, but make sure traffic from one client always goes to the same shard,
        ◦ there is about a 3-second latency from the time a record is added to the stream to the time it is available from GetRecords(),
        ◦ resharding - split & merge; always remember to read from the parent shard(s) first, as there may be duplicates in the child shard; when merging, the “killed” shard is still available for reading queued messages (24 h), but is no longer active.
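
A minimal consumer sketch, assuming the same vg-events stream as above: walk one shard with GetShardIterator/GetRecords; records show up a few seconds after they are put, and the per-shard read limit is respected by pausing between calls. A production reader (or the Kinesis Client Library) would also track parent/child shards across resharding, which this sketch skips.

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# Grab the first shard; after a split/merge you would drain parent shards
# before their children to preserve ordering (see the resharding tip above).
description = kinesis.describe_stream(StreamName="vg-events")["StreamDescription"]
shard_id = description["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="vg-events",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",      # start from the oldest retained record
)["ShardIterator"]

while iterator:
    out = kinesis.get_records(ShardIterator=iterator, Limit=500)
    for record in out["Records"]:
        # Data is the raw (base64-decoded) payload that was put on the stream.
        print(record["SequenceNumber"], record["Data"])
    iterator = out["NextShardIterator"]
    time.sleep(1)                          # stay within the 5 reads/s per-shard limit
```
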