• Vivid Games S.A. ◦ an advanced, independent development studio in Europe, and the biggest producer of mobile games in Poland, ◦ over 160 projects for smartphones and mobile devices, with two best-known brands: Real Boxing & GodFire; now preparing Real Boxing 2… coming soon. • Artur Boniecki ◦ Head of R&D Department at Vivid Games S.A.; previously worked 14 years at Alcatel-Lucent, ◦ involved in a project related to processing large quantities of data identifying gamers' devices, traits, and behavior, and to analyzing the collected data.
• billions of events generated in mobile applications used by millions of users, • a system to process the gathered data and present the results: ◦ metrics like DAU, MAU, RET, ARPU, LTV, etc., ◦ aggregation per 1 hr, 1 d, 1 w, 1 m, etc., ◦ segmentation per sex, age, country, device, etc., ◦ targeting, ◦ funnels, ◦ A/B tests, ◦ data mining algorithms.
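For illustration, here is a minimal sketch of how one of these metrics (DAU, daily active users) could be computed from a raw events table; the table and column names (events, user_id, event_time) are assumptions, not the actual schema:

```python
# Hypothetical DAU query over a raw events table.
# `events`, `user_id`, and `event_time` are illustrative names only.
DAU_SQL = """
SELECT DATE_TRUNC('day', event_time) AS day,
       COUNT(DISTINCT user_id)       AS dau
FROM events
GROUP BY 1
ORDER BY 1;
"""
```

MAU would presumably be the same query with 'month' in DATE_TRUNC; retention and LTV metrics would need joins against install and payment events.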
• time & cost limitations: ◦ very short time, need results ASAP, no time for deep research or months of prototyping, ◦ must be a system that does not require niche technology expertise, ◦ no time for building our own reliability architecture, ◦ don't want to spend much on maintenance/administration, ◦ the DevOps model is important.
• a data stream processor is of course also in scope, but for research purposes let's take a look at the data store first... ◦ HBASE ▪ free and can handle massive amounts of data, but there is no friendly data-retrieval API (this is not SQL :) ) and setting up HA / multi-node is very complex, ◦ CASSANDRA ▪ free and can handle huge data sets, has an SQL-like language (good), but setting up HA / multi-node is still very complex and designing a columnar schema is hard, ◦ COUCHBASE ▪ free NoSQL DB, very fast, trends are optimistic, but not sure if it is used in the real world for purposes like ours; yes… some analytics companies are using it, but we would need to get experts or invest in research and a lot of self-study.
• AWS REDSHIFT ◦ handles a massive amount of data, ◦ columnar storage technology to improve I/O efficiency, ▪ parallelizes queries across multiple nodes, ◦ fully managed service with a PayG pricing model: ▪ priced by storage size and type, and node type, ◦ scalable: need more storage or I/O? add a node and go (see the sketch below), ◦ relational model on top: easy to pick up for developers who are familiar with other RDBMSs (VERY NICE!!!), ◦ the columnar model is not exposed (we don't need it), ◦ easy backup and restore + data safety: data is replicated across nodes.
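As a rough sketch of the "add a node and go" workflow, using the boto3 SDK (the cluster name, region, node type, and credentials below are placeholders):

```python
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

# Create a small multi-node cluster; all names and the node type are
# placeholders for illustration.
redshift.create_cluster(
    ClusterIdentifier="game-analytics",
    ClusterType="multi-node",
    NodeType="dc1.large",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME",  # never hard-code real credentials
)

# Need more storage or I/O? Add nodes (scale-out). Note there is no
# elastic scale-in, as the best-practices list below points out.
redshift.modify_cluster(
    ClusterIdentifier="game-analytics",
    NumberOfNodes=4,
)
```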
• Redshift limits: ◦ nodes per cluster: based on node type, can be changed via a support contact, ◦ tables: max 9,990 per cluster!!!, ◦ databases: max 60 per cluster, ◦ schemas: max 256 per database, ◦ concurrent connections to Redshift: max 500, ◦ AWS accounts per cluster: max 20, ◦ maximum size of a single row loaded using the COPY command: 4 MB, ◦ naming (cluster, db, param_grp, user, sec_grp, subnet_grp, snapshot): must be lowercase only and unique at the account level; for other limits, see the docs.
• best practices: ◦ don't insert records one by one; use the COPY mechanism instead (aggregate and batch-load it), ◦ when using COPY, don't store the data file on S3 (it takes time to upload); just put a lightweight link/manifest on S3 pointing to your records on EC2, ◦ don't forget about an effective hashing dist key and sort key (see the sketch below), ◦ S3 & RS should be in the same region to improve performance, ◦ use the column compression feature to reduce storage space and the amount of data read, improving performance, ◦ remember you can scale out but you CANNOT elastically scale in (you must run admin procedures to go down).
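A sketch of what those practices can look like in practice, connecting with psycopg2 (Redshift speaks the PostgreSQL protocol); the host, table layout, column encodings, S3 bucket, and IAM role are all made up, and IAM_ROLE is the newer form of COPY credentials:

```python
import psycopg2

# All connection details, the table layout, the S3 bucket, and the IAM
# role below are placeholders for illustration.
conn = psycopg2.connect(
    host="game-analytics.example.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="REPLACE_ME",
)

with conn, conn.cursor() as cur:
    # DISTKEY co-locates one user's rows on a node; SORTKEY speeds up
    # time-range scans; per-column ENCODE turns on compression.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS game_events (
            user_id    BIGINT      ENCODE lzo,
            event_name VARCHAR(64) ENCODE lzo,
            event_time TIMESTAMP   ENCODE delta,
            country    CHAR(2)     ENCODE bytedict
        )
        DISTKEY (user_id)
        SORTKEY (event_time);
    """)
    # Batch-load with COPY through a lightweight manifest on S3 instead
    # of inserting rows one by one.
    cur.execute("""
        COPY game_events
        FROM 's3://my-bucket/batches/events.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        MANIFEST GZIP;
    """)
```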
• AWS KINESIS ◦ it's on AWS… why grow e.g. Kafka and maintain it ourselves ;), ◦ fully managed service with a PayG pricing model: ▪ priced by stream × number of shards × HTTP request count, ◦ scalable: bigger traffic -> add shards and go; lower traffic -> remove shards and go, ◦ keeps requests queued for up to 24 hours.
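A minimal boto3 sketch of standing up a stream and writing a record (stream name, region, and payload are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# Shard count drives both throughput and the PayG price.
kinesis.create_stream(StreamName="game-events", ShardCount=2)

# Creation is asynchronous; block until the stream becomes ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="game-events")

# Write one record; the partition key determines the target shard.
kinesis.put_record(
    StreamName="game-events",
    Data=b'{"event": "session_start", "user_id": 42}',
    PartitionKey="42",
)
```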
• Kinesis limits: ◦ shards: max 10 (need to contact Amazon to increase), ◦ input data cached for max 24 h (this is a nice feature!!!), ◦ max size of a data blob (payload before Base64 encoding): 50 KB, so aggregate data when possible (see the sketch below), ◦ writes per shard: max 1000 writes/s; 1 MB/s, ◦ reads per shard: max 5 reads/s; 2 MB/s.
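One way to respect the blob limit is to pack several small events into one record before calling put_record; a rough sketch under the 50 KB figure quoted above (the event format, stream name, and helper are made up):

```python
import json
import boto3

MAX_BLOB = 50 * 1024  # payload limit quoted above (before Base64)

kinesis = boto3.client("kinesis", region_name="eu-west-1")

def put_batched(events, partition_key):
    """Pack as many JSON events as fit under the blob limit per record."""
    batch, size = [], 2  # 2 bytes for the surrounding "[]"
    for event in events:
        encoded = json.dumps(event)
        # Flush the current batch if this event would push us over.
        if size + len(encoded) + 1 > MAX_BLOB and batch:
            kinesis.put_record(
                StreamName="game-events",
                Data=("[" + ",".join(batch) + "]").encode(),
                PartitionKey=partition_key,
            )
            batch, size = [], 2
        batch.append(encoded)
        size += len(encoded) + 1  # +1 for the separating comma
    if batch:
        kinesis.put_record(
            StreamName="game-events",
            Data=("[" + ",".join(batch) + "]").encode(),
            PartitionKey=partition_key,
        )
```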
• Kinesis gotchas: ◦ stream creation takes several minutes max (up to a 10 min wait for ACTIVE), ◦ the data sequence is not guaranteed unless you set SequenceNumberForOrdering, ◦ distribute traffic evenly across the shards: use random partition keys, but make sure traffic from one client always goes to the same shard, ◦ there is about 3 s of latency from the time a record is added to the stream to the time it is available from GetRecords(), ◦ resharding = split & merge; always remember to read from the parent shard(s) first, as there may be duplicates in the child shard; when merging, the "killed" shard is still available for reading queued msgs (24 h), but it is no longer active (see the consumer sketch below).
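Finally, a sketch of a simple consumer that enumerates shards and drains their records; TRIM_HORIZON starts at the oldest record still queued, and the same loop is how you would drain a parent shard before moving on to its children after a reshard (the stream name and handler are placeholders):

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

def process(data: bytes) -> None:
    # Placeholder handler; real code would parse and store the events.
    print(data)

stream = kinesis.describe_stream(StreamName="game-events")
for shard in stream["StreamDescription"]["Shards"]:
    iterator = kinesis.get_shard_iterator(
        StreamName="game-events",
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # oldest queued record (<= 24 h)
    )["ShardIterator"]

    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in resp["Records"]:
            process(record["Data"])
        if not resp["Records"]:
            break  # caught up; an always-on consumer would keep polling
        iterator = resp.get("NextShardIterator")
        time.sleep(0.25)  # stay under the 5 reads/s per-shard limit
```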