
Lessons Learned building a 150 million events serverless data pipeline

Aletheia
February 24, 2021


In recent years the amount of data generated by brands has increased dramatically, thanks to affordable storage costs and faster internet connections. In this talk we explore the advantages serverless technologies offer when dealing with large amounts of data, the common pitfalls of these architectures, and the tips everyone should know before starting their next big data project. At Neosperience, building our SaaS cloud on AWS, we leverage a number of AWS services. This talk is a deep dive into the choices we made, and the reasons behind them, that made us evolve a standard API Gateway + Lambda + DynamoDB pipeline into an architecture able to process hundreds of events per second. In this journey we discover some unexpected behaviors, tips, and hidden gems of AWS services, and how to use them in a real-life use case. Basic knowledge of AWS services is required.


Transcript

  1. www.neosperience.com | blog.neosperience.com | info@neosperience.com Neosperience Empathy in Technology Lessons

    Learned building a 150 million events serverless data pipeline
  2. Luca Bianchi Who am I? github.com/aletheia https://it.linkedin.com/in/lucabianchipavia https://speakerdeck.com/aletheia Chief Technology

    Officer @ Neosperience, Chief Technology Officer @ WizKey. Serverless Meetup and ServerlessDays Italy co-organizer. www.bianchiluca.com @bianchiluca
  3. Understand what makes every customer unique, engage them in 1:1 experiences, and

    grow your customer base. Neosperience Cloud Understand Engage Grow
  4. Neosperience Cloud Understand Engage Grow

  5. Understand customers’ behaviour and their Psychographics Profile. A

    Customer Analytics solution. Neosperience User Insight is the customer analytics tool you need to track your website and app users’ behavior. Collect and explore relevant insights that provide you with a complete understanding of the psychographic and behavioral characteristics of each person, so that you can offer them hyper-personalized content, products, and experiences.
  6. events from the browser

  7. Client JS library embedded into the webpage Browser generated events

    A user navigating the webpage produces events with a flexible structure that are sent to the backend. Three types of events: • low-level: in response to mouse/touch events, agnostic • mid-level: related to webpage actions, domain-specific • high-level: structured customer-specific events. Constraints: • response time: beacon support is strict on time • volume: millions of events within a single month • throughput: events could peak to thousands within a few seconds
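To make the three event levels concrete, here is a minimal sketch of what such payloads might look like. The field names and values are illustrative assumptions, not the actual User Insight schema.

```python
# Hypothetical examples of the three event levels described above;
# field names are illustrative, not the actual User Insight schema.

low_level = {
    "type": "low",
    "name": "mousemove",          # agnostic browser event
    "ts": 1614173520123,          # epoch milliseconds
    "payload": {"x": 312, "y": 87},
}

mid_level = {
    "type": "mid",
    "name": "add_to_cart",        # domain-specific webpage action
    "ts": 1614173523456,
    "payload": {"sku": "SKU-123", "qty": 1},
}

high_level = {
    "type": "high",
    "name": "checkout_funnel_completed",  # customer-defined event sequence
    "ts": 1614173530789,
    "payload": {"funnel_id": "f-42"},
}
```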
  8. a pipeline to ingest and process events

  9. convert analyze ingest Collect and process events through pipeline stages.

    From the browser to customer insights. Data processing pipeline. A service unable to ramp up as quickly as the events flow into the system would result in loss of data. The User Insight ingestion service collects data from many different customers, leading to unpredictable load. Events need to be stored, then processed and consolidated into a user profile. ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  10. 1. ingest

  11. convert analyze ingest The server receiving data IS a bottleneck

    and a point of failure Ingestion Pipeline ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  12. Focus on ingestion pipeline, from the browser to data storage

    Which architecture to manage data? ingest events collect and send to storage store raw events. The first architecture to be exploited is a standard microservice architecture with separated concerns between each component. Since we do not want the pain of managing servers and containers, we embraced serverless from day one, to ease development and focus only on what matters most: our business logic.
  13. a computing model built for scalability

  14. Adopting a serviceful model Serverless (a simplified model)

    VM container OS ✓ do not manage servers nor VMs ✓ no OS to manage, config or patch ✓ do not pay for idle ✓ built-in scalability ✓ do not manage resource provisioning ✓ no runtime to install or manage ✓ focus only on your code runtime code
  15. someone else’s duty your business Adopting a serviceful model Serverless

    (a simplified model) VM container OS ✓ do not manage servers nor VMs ✓ no OS to manage, config or patch ✓ do not pay for idle ✓ built-in scalability ✓ do not manage resource provisioning ✓ no runtime to install or manage ✓ focus only on your code runtime code
  16. which architecture?

  17. Choosing the right database for the right job A serverless

    microservices architecture. Aurora Serverless efficiently scales up within seconds. DocumentDB is great to store nested documents. DynamoDB is great to handle millions of events, but… API Gateway Lambda. Focusing on managed database services, to avoid scaling the cluster and managing data recovery.
  18. Using DynamoDB with no clear access pattern is not the

    most suitable use case for this technology. Using DynamoDB as an analytics database. The amount of data collected by the database grows to millions of data points very quickly, e.g. for a single customer, ~130M events collected in just one month. The data access pattern is not well defined (parameters within query) and could change whenever high-level events are managed for a customer-specific context. Pulling data from DynamoDB with no clear access pattern means a full table scan for each query. It is not just slow, but also very expensive.
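A back-of-the-envelope calculation shows why full scans hurt at this scale. The numbers below are assumptions for illustration (on-demand pricing of roughly $0.25 per million read request units, 1 KB items, eventually consistent reads at 0.5 RRU per 4 KB); actual pricing varies by region.

```python
# Rough cost of a full table scan vs a keyed query on ~130M events.
# All pricing figures are illustrative assumptions, not a quote.

ITEMS = 130_000_000            # events for one customer in a month
ITEM_SIZE_KB = 1               # assumed average item size
PRICE_PER_MILLION_RRU = 0.25   # assumed on-demand read price (USD)

# A scan touches every item; eventually consistent reads cost 0.5 RRU per 4 KB.
scan_rrus = ITEMS * ITEM_SIZE_KB / 4 * 0.5
scan_cost = scan_rrus / 1_000_000 * PRICE_PER_MILLION_RRU

# A well-designed partition key lets a query touch only the items it returns.
query_items = 10_000
query_rrus = query_items * ITEM_SIZE_KB / 4 * 0.5
query_cost = query_rrus / 1_000_000 * PRICE_PER_MILLION_RRU

print(f"full scan : ~{scan_rrus:,.0f} RRUs, ~${scan_cost:.2f} per query")
print(f"keyed query: ~{query_rrus:,.0f} RRUs, ~${query_cost:.5f} per query")
```

Dollars per analytical query, multiplied by every dashboard refresh, is what makes the data-lake approach on S3 attractive.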
  19. A consolidated technology with unparalleled flexibility Introducing the Data

    Lake. Amazon S3 - an object storage - 99.99% availability - designed from the ground up to handle traffic for any Internet application - multi-AZ reliability - cost effective. Amazon S3
  20. Amazon Kinesis Firehose provides streaming support for storing data into

    Amazon S3. Streaming to S3. API Gateway Lambda - Up to 5000 records/second (can be increased to 10K records/second) - supports data buffering (to decouple input/output frequency) - stores with partition year=<year> / month=<month> / day=<day>. Kinesis Firehose S3
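The year=/month=/day= layout above is a Hive-style partition scheme, which later lets Glue and Athena prune data by date. A minimal sketch of how such a prefix is derived from an event timestamp (the function name is ours, not an AWS API):

```python
from datetime import datetime, timezone

def s3_partition_prefix(ts_millis: int) -> str:
    """Build the Hive-style year=/month=/day= prefix for an event
    timestamp in epoch milliseconds, as in the partition scheme above."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/"

print(s3_partition_prefix(1614173520123))  # an event from 2021-02-24 UTC
```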
  21. Lambda — code complexity

  22. AWS Lambda is used to transport events from API Gateway

    to Amazon Kinesis Firehose. A closer look at Lambda. Event re-mapping: unwraps event payloads from the API Gateway Lambda proxy event and packages them into Amazon Kinesis Firehose record payloads. Lambda presents issues that can be detected by looking into the metrics. Lambda validates events sent by the browser using a custom event schema.
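A minimal sketch of such a transport Lambda, assuming a proxy-integration POST body carrying a JSON array of events; the validation rule and stream name are placeholders, not the production code.

```python
import json

def handler(event, context):
    """Unwrap the API Gateway Lambda proxy payload and repackage each
    event as a Kinesis Firehose record (a sketch, not production code)."""
    body = json.loads(event["body"])          # proxy integration wraps the POST body
    events = body if isinstance(body, list) else [body]

    records = [
        {"Data": (json.dumps(e) + "\n").encode("utf-8")}  # newline-delimited JSON
        for e in events
        if "type" in e and "ts" in e          # minimal stand-in for schema validation
    ]

    # In production the batch would be forwarded with something like:
    # boto3.client("firehose").put_record_batch(
    #     DeliveryStreamName="events-stream", Records=records)
    return {"statusCode": 200, "body": json.dumps({"accepted": len(records)})}
```

Even this thin function adds cold starts, concurrency limits, and per-invocation cost to the hot path, which is exactly what the following slides measure.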
  23. Lambda — metrics

  24. Lambda — invocation errors

  25. Lambda — concurrent executions

  26. Lambda — concurrent executions

  27. Lambda — invocations

  28. Lambda — invocations

  29. use Lambda to transform data, not to transport data

  30. Cold starts have an unpredictable impact Improving architecture API Gateway

    Lambda - Lambda suffers cold start issues with unpredictable patterns - Provisioned concurrency could fix this issue, thanks to the cyclic nature of the invocation pattern. Kinesis Firehose S3
  31. Removing AWS Lambda Improving architecture - Amazon API Gateway supports

    direct connection to Amazon Kinesis Firehose - API to REST method mapping is achieved through VTL templates - Event validation can be achieved at the gateway level through model validation API Gateway Kinesis Firehose S3
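For illustration, a request mapping template for this direct API Gateway-to-Firehose integration might look like the sketch below. The delivery stream name is an assumption; `$input.json` and `$util.base64Encode` are standard API Gateway VTL helpers (Firehose expects the record data base64-encoded).

```
{
    "DeliveryStreamName": "events-stream",
    "Record": {
        "Data": "$util.base64Encode($input.json('$'))"
    }
}
```

With this in place, the gateway calls the Firehose `PutRecord` API directly and no Lambda sits on the hot path.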
  32. let’s talk about events

  33. Each browser generates different types of events Event

    Types. Low-level events are massively produced when the user moves the mouse, clicks, or scrolls a web page. Ingesting means correlating and storing them, then building additional metadata such as session, path, and patterns. Mid-level events are produced when users perform some expected navigation behavior. Ingesting means storing data, then building statistical metrics upon them. High-level events are produced when users trigger a unique sequence of events defined by a customer analyst, which focuses on a particular behavior.
  34. Split event pipeline based on event type Fan-out strategies -

    add a fan-out stage to route events - dedicated processing strategy for each stream type such as Amazon Kinesis Analytics SQL https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-kinesis.html https://github.com/alexcasalboni/kinesis-streams-fan-out-kinesis-analytics raw events stored on S3. Kinesis Analytics applies counting and basic statistical logic to mid-level events, then stored on S3. High-level events are ready to be counted and stored into Amazon Elasticsearch. PREVIEW
  35. Split event pipeline based on event type, from Amazon API

    Gateway. Fan-out strategies (improved) - dispatch events using VTL templates - map the event type attribute to the right Amazon Kinesis Firehose or Amazon Kinesis Data Stream (based on stream id). raw events stored on S3. Kinesis Analytics applies counting and basic statistical logic to mid-level events, then stored on S3. High-level events are ready to be counted and stored into Amazon Elasticsearch. PREVIEW
  36. Split event pipeline based on event type, from Amazon API

    Gateway. Fan-out strategies (improved, cost-effective) - use Amazon SQS and AWS Lambda instead of Amazon Kinesis Data Stream and Kinesis Data Analytics - make the solution fully serverless (no hourly costs). raw events stored on S3. Kinesis Analytics applies counting and basic statistical logic to mid-level events, then stored on S3. High-level events are ready to be counted and stored into Amazon Elasticsearch. PREVIEW
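The dispatch rule in this SQS-based fan-out reduces to a lookup on the event's type attribute. A sketch of that routing logic, with hypothetical queue URLs and the actual send call left as a comment:

```python
# Hypothetical routing table: event type -> destination SQS queue URL.
QUEUE_URLS = {
    "low":  "https://sqs.eu-west-1.amazonaws.com/123456789012/low-level-events",
    "mid":  "https://sqs.eu-west-1.amazonaws.com/123456789012/mid-level-events",
    "high": "https://sqs.eu-west-1.amazonaws.com/123456789012/high-level-events",
}

def route(event: dict) -> str:
    """Pick the destination queue for an event based on its type
    attribute (a sketch of the fan-out dispatch, not production logic)."""
    queue_url = QUEUE_URLS.get(event.get("type"))
    if queue_url is None:
        raise ValueError(f"unknown event type: {event.get('type')!r}")
    # In production:
    # boto3.client("sqs").send_message(
    #     QueueUrl=queue_url, MessageBody=json.dumps(event))
    return queue_url
```

Pay-per-message SQS plus on-demand Lambda is what removes the hourly shard costs of Kinesis Data Streams and Kinesis Data Analytics.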
  37. 2. convert

  38. convert analyze ingest Data conversion stage User Insight data pipeline

    ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  39. Extract, Transform raw events and load into a data catalog

    User Insight data pipeline. extract, transform, load - Processes events and loads them into the AWS Glue catalog, then saves to S3 - Aggregates events based on their visit time, extracting user sessions - Transforms events, encoding respective types into a readable and compact format - Uses Apache Spark to build processing jobs
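The Glue job runs on Spark, but the core sessionization rule it applies can be sketched in plain Python: a user's events belong to the same session while the gap between consecutive events stays under a threshold (30 minutes here is an assumed value, not necessarily the production one).

```python
SESSION_GAP_MS = 30 * 60 * 1000  # assumed session timeout: 30 minutes

def sessionize(timestamps: list[int]) -> list[list[int]]:
    """Group one user's event timestamps (epoch millis) into sessions:
    a new session starts whenever the gap exceeds SESSION_GAP_MS."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP_MS:
            sessions[-1].append(ts)   # continue the current session
        else:
            sessions.append([ts])     # gap too large: open a new session
    return sessions
```

In the actual job the same rule would be expressed over the Glue catalog tables, e.g. with a Spark window function partitioned by user and ordered by timestamp.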
  40. 3. analyze

  41. convert analyze ingest Data analysis stage User Insight data pipeline

    ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  42. Data analysis stage User Insight data pipeline store baked events

    process events to build insights store customer profile - Data is loaded into AWS Glue catalog and S3 from previous stage - Amazon Athena queries build customer insights, leveraging external ML services through Amazon SageMaker - Resulting insights are stored into Amazon Elasticsearch
  43. Data analysis stage User Insight data pipeline - Data is loaded

    into AWS Glue catalog and into Amazon S3 from previous stage - Amazon Athena queries build customer insights, leveraging external ML services through Amazon SageMaker - Resulting insights are stored into Amazon Elasticsearch
  44. Amazon Athena to query data

  45. putting it all together…

  46. User Insight data pipeline

  47. Empathy in Technology

  48. Thank you.

  49. www.neosperience.com | blog.neosperience.com | info@neosperience.com