
Lessons Learned building a 150 million events serverless data pipeline

Aletheia
February 24, 2021


In recent years the amount of data generated by brands has increased dramatically, thanks to affordable storage costs and faster internet connections. In this talk we explore the advantages serverless technologies offer when dealing with large amounts of data, the common pitfalls of these architectures, and the tips everyone should know before starting their next big data project. At Neosperience, building our SaaS cloud on AWS, we leverage a number of AWS services. This talk is a deep dive into the choices we made, and the reasons behind them, that made us evolve a standard API Gateway + Lambda + DynamoDB pipeline into an architecture able to process hundreds of events per second. In this journey we discover some unexpected behaviors, tips, and hidden gems of AWS services, and how to use them in a real-life use case. Basic knowledge of AWS services is required.


Transcript

  1. www.neosperience.com | blog.neosperience.com | info@neosperience.com Neosperience Empathy in Technology Lessons

    Learned building a 150 million events serverless data pipeline
  2. Luca Bianchi Who am I? github.com/aletheia https://it.linkedin.com/in/lucabianchipavia https://speakerdeck.com/aletheia Chief Technology

    Officer @ Neosperience, Chief Technology Officer @ WizKey. Serverless Meetup and ServerlessDays Italy co-organizer. www.bianchiluca.com @bianchiluca
  3. Understand what makes every customer unique, engage them in 1:1 experiences, and

    grow your customer base. Neosperience Cloud Understand Engage Grow
  4. Neosperience Cloud Understand Engage Grow

  5. Understand customers’ behaviour and their Psychographics Profile. A

    Customer Analytics solution. Neosperience User Insight is the customer analytics tool you need to track your website and app users’ behavior. Collect and explore relevant insights that provide you with a complete understanding of the psychographic and behavioral characteristics of each person, so that you can offer them hyper-personalized content, products, and experiences.
  6. events from the browser

  7. Client JS library embedded into the webpage Browser generated events

    A user navigating the webpage produces events with a flexible structure that are sent to the backend. Three types of events: • low-level: in response to mouse/touch events, agnostic • mid-level: related to webpage actions, domain-specific • high-level: structured customer-specific events. Constraints: • response time: beacon support is strict on time • volume: millions of events within a single month • throughput: events could peak to thousands within a few seconds
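To make the three event levels concrete, here is a minimal sketch of what such payloads might look like. The field names and values are illustrative assumptions, not the actual User Insight schema.

```python
# Hypothetical examples of the three event levels described above;
# field names are illustrative, not the actual User Insight schema.

low_level = {
    "type": "low",
    "name": "mousemove",          # agnostic browser event
    "ts": 1614173520123,          # epoch milliseconds
    "payload": {"x": 312, "y": 87},
}

mid_level = {
    "type": "mid",
    "name": "add_to_cart",        # domain-specific webpage action
    "ts": 1614173523456,
    "payload": {"sku": "SKU-123", "qty": 1},
}

high_level = {
    "type": "high",
    "name": "checkout_funnel_completed",  # customer-defined event sequence
    "ts": 1614173530789,
    "payload": {"funnel_id": "f-42"},
}
```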
  8. a pipeline to ingest and process events

  9. convert analyze ingest Collect and process events through pipeline stages.

    From the browser to customer insights. Data processing pipeline. A service unable to ramp up as quickly as the events flow into the system would result in loss of data. The User Insight ingestion service collects data from many different customers, leading to unpredictable load. Events need to be stored, then processed and consolidated into a user profile. ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  10. 1. ingest

  11. convert analyze ingest The server receiving data IS a bottleneck

    and a point of failure Ingestion Pipeline ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  12. Focus on ingestion pipeline, from the browser to data storage

    Which architecture to manage data? ingest events collect and send to storage store raw events. The first architecture to be exploited is a standard microservice architecture with separated concerns between each component. Since we do not want the pain of managing servers and containers, we embraced serverless from day one, to ease development and focus only on what matters most: our business logic.
  13. a computing model built for scalability

  14. Adopting a serviceful model Serverless (a simplified model)

    VM container OS ✓ do not manage servers nor VMs ✓ no OS to manage, config or patch ✓ do not pay for idle ✓ built-in scalability ✓ do not manage resource provisioning ✓ no runtime to install or manage ✓ focus only on your code runtime code
  15. someone else’s duty your business Adopting a serviceful model Serverless

    (a simplified model) VM container OS ✓ do not manage servers nor VMs ✓ no OS to manage, config or patch ✓ do not pay for idle ✓ built-in scalability ✓ do not manage resource provisioning ✓ no runtime to install or manage ✓ focus only on your code runtime code
  16. which architecture?

  17. Choosing the right database for the right job A serverless

    microservices architecture. Aurora Serverless efficiently scales up within seconds. DocumentDB is great to store nested documents. DynamoDB is great to handle millions of events, but… API Gateway Lambda. Focusing on managed database services, to avoid scaling the cluster and managing data recovery.
  18. Using DynamoDB with no clear access pattern is not the

    most suitable use case for this technology. Using DynamoDB as an analytics database. The amount of data collected by the database grows to millions of data points very quickly, e.g. for a single customer, ~130M events collected in just one month. The data access pattern is not well defined (parameters within query) and could change whenever high-level events are managed for a customer-specific context. Pulling data from DynamoDB with no clear access pattern means a full table scan for each query. It is not just slow, but also very expensive.
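A back-of-the-envelope calculation shows why full scans hurt at this scale. The numbers below are assumptions for illustration (on-demand pricing of roughly $0.25 per million read request units, 1 KB items, eventually consistent reads at 0.5 RRU per 4 KB); actual pricing varies by region.

```python
# Rough cost of a full table scan vs a keyed query on ~130M events.
# All pricing figures are illustrative assumptions, not a quote.

ITEMS = 130_000_000            # events for one customer in a month
ITEM_SIZE_KB = 1               # assumed average item size
PRICE_PER_MILLION_RRU = 0.25   # assumed on-demand read price (USD)

# A scan touches every item; eventually consistent reads cost 0.5 RRU per 4 KB.
scan_rrus = ITEMS * ITEM_SIZE_KB / 4 * 0.5
scan_cost = scan_rrus / 1_000_000 * PRICE_PER_MILLION_RRU

# A well-designed partition key lets a query touch only the items it returns.
query_items = 10_000
query_rrus = query_items * ITEM_SIZE_KB / 4 * 0.5
query_cost = query_rrus / 1_000_000 * PRICE_PER_MILLION_RRU

print(f"full scan : ~{scan_rrus:,.0f} RRUs, ~${scan_cost:.2f} per query")
print(f"keyed query: ~{query_rrus:,.0f} RRUs, ~${query_cost:.5f} per query")
```

Dollars per analytical query, multiplied by every dashboard refresh, is what makes the data-lake approach on S3 attractive.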
  19. A consolidated technology with unparalleled flexibility Introducing the Data

    Lake. Amazon S3 - an object storage - 99.99% availability - designed from the ground up to handle traffic for any Internet application - multi-AZ reliability - cost effective. Amazon S3
  20. Amazon Kinesis Firehose provides streaming support for storing data into

    Amazon S3. Streaming to S3. API Gateway Lambda - Up to 5000 records/second (can be increased to 10K records/second) - supports data buffering (to decouple input/output frequency) - stores with partition year=<year> / month=<month> / day=<day>. Kinesis Firehose S3
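The year=/month=/day= layout above is a Hive-style partition scheme, which later lets Glue and Athena prune data by date. A minimal sketch of how such a prefix is derived from an event timestamp (the function name is ours, not an AWS API):

```python
from datetime import datetime, timezone

def s3_partition_prefix(ts_millis: int) -> str:
    """Build the Hive-style year=/month=/day= prefix for an event
    timestamp in epoch milliseconds, as in the partition scheme above."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/"

print(s3_partition_prefix(1614173520123))  # an event from 2021-02-24 UTC
```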
  21. Lambda — code complexity

  22. AWS Lambda is used to transport events from API Gateway

    to Amazon Kinesis Firehose. A closer look at Lambda. Event re-mapping: unwraps event payloads from the API Gateway Lambda proxy event and packages them into Amazon Kinesis Firehose record payloads. Lambda presents issues that can be detected by looking into the metrics. Lambda validates events sent by the browser using a custom event schema.
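A minimal sketch of such a transport Lambda, assuming a proxy-integration POST body carrying a JSON array of events; the validation rule and stream name are placeholders, not the production code.

```python
import json

def handler(event, context):
    """Unwrap the API Gateway Lambda proxy payload and repackage each
    event as a Kinesis Firehose record (a sketch, not production code)."""
    body = json.loads(event["body"])          # proxy integration wraps the POST body
    events = body if isinstance(body, list) else [body]

    records = [
        {"Data": (json.dumps(e) + "\n").encode("utf-8")}  # newline-delimited JSON
        for e in events
        if "type" in e and "ts" in e          # minimal stand-in for schema validation
    ]

    # In production the batch would be forwarded with something like:
    # boto3.client("firehose").put_record_batch(
    #     DeliveryStreamName="events-stream", Records=records)
    return {"statusCode": 200, "body": json.dumps({"accepted": len(records)})}
```

Even this thin function adds cold starts, concurrency limits, and per-invocation cost to the hot path, which is exactly what the following slides measure.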
  23. Lambda — metrics

  24. Lambda — invocation errors

  25. Lambda — concurrent executions

  26. Lambda — concurrent executions

  27. Lambda — invocations

  28. Lambda — invocations

  29. use Lambda to transform data, not to transport data

  30. Cold starts have an unpredictable impact Improving architecture API Gateway

    Lambda - Lambda suffers cold start issues with unpredictable patterns - Provisioned concurrency could fix this issue, thanks to the cyclic nature of the invocation pattern. Kinesis Firehose S3
  31. Removing AWS Lambda Improving architecture - Amazon API Gateway supports

    direct connection to Amazon Kinesis Firehose - API to REST method mapping is achieved through VTL templates - Event validation can be achieved at the gateway level through model validation API Gateway Kinesis Firehose S3
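For illustration, a request mapping template for this direct API Gateway-to-Firehose integration might look like the sketch below. The delivery stream name is an assumption; `$input.json` and `$util.base64Encode` are standard API Gateway VTL helpers (Firehose expects the record data base64-encoded).

```
{
    "DeliveryStreamName": "events-stream",
    "Record": {
        "Data": "$util.base64Encode($input.json('$'))"
    }
}
```

With this in place, the gateway calls the Firehose `PutRecord` API directly and no Lambda sits on the hot path.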
  32. let’s talk about events

  33. Each browser generates different types of events Event

    Types. Low-level events are massively produced when the user moves the mouse, clicks, or scrolls a web page. Ingesting means correlating and storing them, then building additional metadata such as session, path, and patterns. Mid-level events are produced when users perform some expected navigation behavior. Ingesting means storing data, then building statistical metrics upon them. High-level events are produced when users trigger a unique sequence of events defined by a customer analyst, which focuses on a particular behavior.
  34. Split event pipeline based on event type Fan-out strategies -

    add a fan-out stage to route events - dedicated processing strategy for each stream type such as Amazon Kinesis Analytics SQL https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-kinesis.html https://github.com/alexcasalboni/kinesis-streams-fan-out-kinesis-analytics raw events stored on S3. Kinesis Analytics applies counting and basic statistical logic to mid-level events, then stored on S3. High-level events are ready to be counted and stored into Amazon Elasticsearch. PREVIEW
  35. Split event pipeline based on event type, from Amazon API

    Gateway. Fan-out strategies (improved) - dispatch events using VTL templates - map the event type attribute to the right Amazon Kinesis Firehose or Amazon Kinesis Data Stream (based on stream id). raw events stored on S3. Kinesis Analytics applies counting and basic statistical logic to mid-level events, then stored on S3. High-level events are ready to be counted and stored into Amazon Elasticsearch. PREVIEW
  36. Split event pipeline based on event type, from Amazon API

    Gateway. Fan-out strategies (improved, cost-effective) - use Amazon SQS and AWS Lambda instead of Amazon Kinesis Data Stream and Kinesis Data Analytics - make the solution fully serverless (no hourly costs). raw events stored on S3. Kinesis Analytics applies counting and basic statistical logic to mid-level events, then stored on S3. High-level events are ready to be counted and stored into Amazon Elasticsearch. PREVIEW
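The dispatch rule in this SQS-based fan-out reduces to a lookup on the event's type attribute. A sketch of that routing logic, with hypothetical queue URLs and the actual send call left as a comment:

```python
# Hypothetical routing table: event type -> destination SQS queue URL.
QUEUE_URLS = {
    "low":  "https://sqs.eu-west-1.amazonaws.com/123456789012/low-level-events",
    "mid":  "https://sqs.eu-west-1.amazonaws.com/123456789012/mid-level-events",
    "high": "https://sqs.eu-west-1.amazonaws.com/123456789012/high-level-events",
}

def route(event: dict) -> str:
    """Pick the destination queue for an event based on its type
    attribute (a sketch of the fan-out dispatch, not production logic)."""
    queue_url = QUEUE_URLS.get(event.get("type"))
    if queue_url is None:
        raise ValueError(f"unknown event type: {event.get('type')!r}")
    # In production:
    # boto3.client("sqs").send_message(
    #     QueueUrl=queue_url, MessageBody=json.dumps(event))
    return queue_url
```

Pay-per-message SQS plus on-demand Lambda is what removes the hourly shard costs of Kinesis Data Streams and Kinesis Data Analytics.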
  37. 2. convert

  38. convert analyze ingest Data conversion stage User Insight data pipeline

    ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  39. Extract, Transform raw events and load into a data catalog

    User Insight data pipeline. extract, transform, load - Processes events and loads them into the AWS Glue catalog, then saves to S3 - Aggregates events based on their visit time, extracting user sessions - Transforms events, encoding respective types into a readable and compact format - Uses Apache Spark to build processing jobs
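The Glue job runs on Spark, but the core sessionization rule it applies can be sketched in plain Python: a user's events belong to the same session while the gap between consecutive events stays under a threshold (30 minutes here is an assumed value, not necessarily the production one).

```python
SESSION_GAP_MS = 30 * 60 * 1000  # assumed session timeout: 30 minutes

def sessionize(timestamps: list[int]) -> list[list[int]]:
    """Group one user's event timestamps (epoch millis) into sessions:
    a new session starts whenever the gap exceeds SESSION_GAP_MS."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP_MS:
            sessions[-1].append(ts)   # continue the current session
        else:
            sessions.append([ts])     # gap too large: open a new session
    return sessions
```

In the actual job the same rule would be expressed over the Glue catalog tables, e.g. with a Spark window function partitioned by user and ordered by timestamp.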
  40. 3. analyze

  41. convert analyze ingest Data analysis stage User Insight data pipeline

    ingest events collect and send to storage store raw events extract, transform, load store baked events process events to build insights store customer profile
  42. Data analysis stage User Insight data pipeline store baked events

    process events to build insights store customer profile - Data is loaded into AWS Glue catalog and S3 from previous stage - Amazon Athena queries build customer insights, leveraging external ML services through Amazon SageMaker - Resulting insights are stored into Amazon Elasticsearch
  43. Data analysis stage User Insight data pipeline - Data is loaded

    into AWS Glue catalog and into Amazon S3 from previous stage - Amazon Athena queries build customer insights, leveraging external ML services through Amazon SageMaker - Resulting insights are stored into Amazon Elasticsearch
  44. Amazon Athena to query data

  45. putting it all together…

  46. User Insight data pipeline

  47. Empathy in Technology

  48. Thank you.

  49. www.neosperience.com | blog.neosperience.com | info@neosperience.com