Analytics System based on AWS Redshift and Kinesis

Slide 1

Slide 1 text

Analytics System based on AWS Redshift & Kinesis

Slide 2

Slide 2 text

VG-A: Agenda 1. Short introduction, 2. VG Analytics System - business need, 3. Technology research, 4. Overview of AWS Redshift & Kinesis.

Slide 3

Slide 3 text

VG-A: Short Introduction ● VIVID GAMES S.A. ○ most technologically advanced, independent development studio in Europe, and the biggest producer of mobile games in Poland, ○ over 160 projects for smartphones and mobile devices with two best known brands: Real Boxing & GodFire, now prepare for Real Boxing 2… coming soon. ● Artur Boniecki ○ Head of R&D Department at Vivid Games S.A., previously was working 14 years for Alcatel-Lucent, ○ involved in a project related to processing of large quantities of data identifying devices, traits and behavior of gamers and analysis of collected data.

Slide 4

Slide 4 text

VG-A: The business need Need: ● a system to gather billions of events generated in mobile applications used by millions of users, ● a system to process the gathered data and present results: ○ metrics like DAU, MAU, RET, ARPU, LTV etc., ○ aggregation per 1hr, 1d, 1w, 1m etc., ○ segmentation per sex, age, country, device etc., ○ targetting, ○ funnels, ○ a/b tests, ○ data mining algorithms.

Slide 5

Slide 5 text

VG-A: The time pressure & cost limitations Time for development & cost limitations: ○ very short time, need effects asap, no time for deep research and months of prototyping, ○ must be a system which does not require niche technology expertise, ○ no time for building own reliability architecture, ○ don’t want to spend much for maintenance/administration, ○ DevOps model is important.

Slide 6

Slide 6 text

VG-A: Technology research Research: ● mostly oriented on data warehouse, but of course also data stream processor in scope, for research purposes let’s take a look at data store... ○ HBASE ■ free and can handle massive amount of data, but no friendly data retrieval API (this is not sql :) ), setting up HA / multi node is very complex, ○ CASSANDRA ■ free and can handle huge data sets, has SQL-like language (good) but still need to set up HA / multi node is very complex and columnar schema designing is hard, ○ COUCHBASE ■ free, nosql db, very fast, trends are optimistic, but not sure if used in real world for such purposes as we have, yes… some analytics companies are using that, need to get experts or invest in research, self-study a lot

Slide 7

Slide 7 text

VG-A: Tech research - WINNER Research: ● VG is on Amazon, we are in DevOps model, lets see what we have here... ○ REDSHIFT ■ cite: “Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud”, © Amazon Web Services ■ SQL interface, no need to invest in research, development self-study, etc. ■ no need to take care of HA, backups, scaling, managed service all does that, ○ KINESIS ■ cite: “Kinesis is a fully managed streaming data service. You can continuously add various types of data such as clickstreams, application logs, and social media to an Amazon Kinesis stream from hundreds of thousands of sources”, © Amazon Web Services ■ HTTP based interface, multiplatform, anywhere & anything, ■ no need to take care of HA, scaling, managed service all does that.

Slide 8

Slide 8 text

VG-A: REDSHIFT ● WHY REDSHIFT: ○ can store and query massive amount of data, ○ columnar storage technology to improve I/O efficiently ■ parallelizing queries across multiple nodes, ○ fully managed service with PayG pricing model: ■ storage size and type, node type, ○ scalable - more store or io? add node and go with that, ○ relational model on top – easy to pick up for developers who are familiar with other RDBMS (VERY NICE!!!), ○ columnar model is not exposed (we don’t need it), ○ easy backup and restore + data safety – data is replicated across nodes.

Slide 9

Slide 9 text

VG-A: REDSHIFT ● REDSHIFT ARCHITECTURE: © Amazon Web Services

Slide 10

Slide 10 text

VG-A: REDSHIFT ● REDSHIFT LIMITS: ○ default number of nodes per cluster /based on node type - can be changed via support contact, ○ tables in cluster - max 9990 / per cluster!!!, ○ max databases - 60 / per cluster, ○ schemas max 256 / per database, ○ max concurrent connections to redshift - 500, ○ max aws accounts per cluster - 20, ○ maximum size of a single row loaded by using the COPY command is 4 MB, ○ naming (cluster, db, param_grp, user, sec_grp, subnet_grp, snapshot) - must be lowercase only, unique on account level + other limits - see doc.

Slide 11

Slide 11 text

VG-A: REDSHIFT ● REDSHIFT TIPS: ○ heavy insert penalty - use COPY mechanism instead (agregate and batch- load it), ○ when using COPY don’t store data file on S3 (takes time to upload), just put link/manifest (lightweight) on S3 to your records on EC2, ○ don’t forget about effective hashing dist key and sort key ○ S3 & RS should be in the same region to improve performance, ○ use column compression feature to reduce storage space and reduce of data read to improve performance, ○ remember you can scale out but you CAN NOT elastically scale in (you must do admin procedures to go down).

Slide 12

Slide 12 text

VG-A: REDSHIFT ● REDSHIFT INFORMATION: ○ http://docs.aws.amazon.com/redshift/ ○ http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-limits.html ○ http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html ○ http://aws.amazon.com/solutions/case-studies/nokia/

Slide 13

Slide 13 text

VG-A: KINESIS AND NOW DATA STREAM - Kinesis...

Slide 14

Slide 14 text

VG-A: KINESIS ● WHY KINESIS: ○ if we are on AWS… why to grow eg. Kafka and maintain it ;), ○ fully managed service with PayG pricing model: ■ stream x nof shards x HTTP request count, ○ scalable - bigger traffic -> add shards and go, lower traffic -> downgrade shards and go, ○ keeps requests queued up to 24 hours.

Slide 15

Slide 15 text

VG-A: KINESIS ● KINESIS LIMITS: ○ default shard max limit - 10 (need contact amazon to increase), ○ input data cached max 24h (this is nice feature !!!), ○ max size of datablob (payload before base64 encoding) is 50kb / aggregate data when possible, ○ write per shard - max 1000 writes / s; 1MB / s, ○ read per shard - max 5 reads / s; 2MB / s.

Slide 16

Slide 16 text

VG-A: KINESIS ● KINESIS TIPS: ○ create stream - takes several minutes max (up to 10min wait for ACTIVE), ○ the data sequence is not a guaranteed, unless you set - SequenceNumberForOrdering, ○ distribute traffic evenly across the shards - use random partition keys, make sure traffic from one client goes always to the same shard, ○ there is about 3sec latency from the time that a record is added to the stream to the time that it is available from GetRecords(), ○ resharding - split & merge, always remember to read from parent shard(s) first. There may be duplicates in child shard. When merging, the “killed” shard is still available to read queued msgs (24h), but not active.

Slide 17

Slide 17 text

VG-A: KINESIS ● KINESIS INFORMATION: ○ http://docs.aws.amazon.com/kinesis ○ http://docs.aws.amazon.com/kinesis/latest/dev/service-sizes-and-limits.html ○ http://aws.amazon.com/solutions/case-studies/supercell/

Slide 18

Slide 18 text

THANK YOU [email protected]