
A serverless data pipeline for Insurance Telematics

How hard is it to scale up to ingest hundreds of millions of messages per day? Can you get real-time insights from big data? During this session we learned how to build a data pipeline with a completely serverless architecture, and discussed its advantages in terms of scalability and reliability.

Francesco Lerro

May 21, 2019


  1. Build a serverless data pipeline for Insurance Telematics and sleep at night!
    HPC for Industry Workshop, Milan, 21/05/2019
  2. “How hard is it to ingest hundreds of millions of messages per day and get real-time insights?”
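To put the question above in perspective, a daily volume can be converted into a sustained per-second rate. The following is a back-of-the-envelope sketch only: the figure of 300 million messages per day and the peak factor are illustrative assumptions, not numbers taken from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(messages_per_day: int) -> float:
    """Average messages per second for a given daily volume."""
    return messages_per_day / SECONDS_PER_DAY


def peak_rate(messages_per_day: int, peak_factor: float) -> float:
    """Rate during a peak, assuming traffic is peak_factor times the average.

    peak_factor is a hypothetical parameter; real traffic shapes vary.
    """
    return average_rate(messages_per_day) * peak_factor


if __name__ == "__main__":
    daily = 300_000_000  # assumed example volume
    print(f"average: {average_rate(daily):.0f} msg/s")
    print(f"peak (x3): {peak_rate(daily, 3.0):.0f} msg/s")
```

Even the average rate is in the thousands of messages per second, which is why ingestion elasticity matters more than raw storage capacity.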
  3. On-premise solution
    • Devices send streaming data (GPS, acceleration)
    • Kafka buffers data in pub/sub topics to be consumed
    • Spark jobs filter, aggregate or transform data
    • HDFS cluster stores data for analytics or further processing
  5. Hello! I am Francesco Lerro
    Delivering highly available solutions with millions of interactions since 2005
    In love with cloud computing and buffalo mozzarella
    @flerro
  6. A factory which develops data-intensive solutions, applications or components, pursuing the following goals:
    • preserve/extract value/enrich existing data
    • push forward services innovation/digital transformation
    • investigate future technology scenarios for Insurance
    We are an InsurTech inside
  7. Data Science
    • Data mining • Big Data • Machine Learning • Image processing • Natural Language Processing
    Computer Science
    • SW Engineering • Data Engineering • Big Data architectures • High Perf Computing • IoT & Mobile • Signal processing
    #WeAreHiring Data Scientists and SW Engineers
  8. Limits of on-premise
    • Complex system with high operational costs
    • Data ingestion delays during usage peaks
    • Data analysis jobs may take too long
    • Analysis of ingested data is not always easy
  10. Serverless is a Paradigm Shift
    • Automated high availability
    • Flexible scalability
    • Pay for what you use
    • Focus on your business
  11. AWS IoT Core
    • Managed platform to handle IoT devices
    • Provides a rule engine to build an “IoT application”
    • Routes incoming messages to other AWS services (Kinesis, Lambda, ...)
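As an illustration of the rule engine described above, here is a minimal sketch of a rule definition that forwards device messages to a Firehose delivery stream. The dictionary follows the shape accepted by boto3's `iot.create_topic_rule(ruleName=..., topicRulePayload=...)`; the topic filter, stream name and role ARN are hypothetical placeholders, and the slides do not show the rules actually used.

```python
def build_topic_rule_payload(topic_filter: str, delivery_stream: str, role_arn: str) -> dict:
    """Build a topic rule payload that routes matching MQTT messages to Firehose.

    All three arguments are placeholders chosen for this example.
    """
    return {
        # IoT SQL: select the whole message from every topic matching the filter
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "deliveryStreamName": delivery_stream,
                    "roleArn": role_arn,
                    # newline-separate records so downstream tools can split them
                    "separator": "\n",
                }
            }
        ],
    }


if __name__ == "__main__":
    payload = build_topic_rule_payload(
        "telematics/+/data",  # '+' matches one topic level, e.g. a device id
        "telematics-raw",
        "arn:aws:iam::123456789012:role/example-iot-to-firehose",
    )
    # With AWS credentials available, this payload could be passed to
    # boto3.client("iot").create_topic_rule(ruleName="...", topicRulePayload=payload)
    print(payload["sql"])
```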
  12. Amazon Kinesis Firehose
    • Managed service to stream data to storage services
    • Available destinations include S3, Redshift, Splunk
    • Max data delay to storage is 60 seconds
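The delivery delay mentioned above comes from buffering: records are accumulated and written out as a batch once a size threshold or a time threshold is reached, whichever comes first. The sketch below is a local simulation of that idea, not the Firehose API; the class name and thresholds are invented for the example, and unlike the managed service it only checks the time threshold when a new record arrives.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DeliveryBuffer:
    """Toy buffer that flushes on size or age, whichever threshold is hit first."""

    max_bytes: int
    max_seconds: float
    _records: List[bytes] = field(default_factory=list)
    _size: int = 0
    _opened_at: Optional[float] = None

    def put(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return the flushed batch if a threshold was reached, else None."""
        if self._opened_at is None:
            self._opened_at = now  # the batch age starts with its first record
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        return self.flush() if (too_big or too_old) else None

    def flush(self) -> List[bytes]:
        """Return the current batch and reset the buffer."""
        batch, self._records = self._records, []
        self._size = 0
        self._opened_at = None
        return batch
```

Smaller thresholds reduce latency but produce many small objects on storage, which is one reason tuning can affect cost.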
  13. Amazon S3
    • Managed blob storage service available via API
    • High durability and availability
    • Can trigger AWS Lambda on data change
    • Supports data lifecycle management automation
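To make the lifecycle management point concrete, here is a sketch of a lifecycle configuration that moves older objects to cheaper storage classes and eventually expires them. The dictionary follows the shape accepted by boto3's `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)`; the prefix, rule ID and day counts are assumptions for illustration, not values from the talk.

```python
def build_lifecycle_configuration(
    prefix: str, ia_after_days: int, glacier_after_days: int, expire_after_days: int
) -> dict:
    """Build a single-rule lifecycle configuration for objects under a prefix."""
    if not (ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError("expected ia_after_days < glacier_after_days < expire_after_days")
    return {
        "Rules": [
            {
                "ID": f"tiering-{prefix.strip('/') or 'all'}",  # hypothetical naming scheme
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    config = build_lifecycle_configuration("raw/", 30, 90, 365)
    # With AWS credentials available, this could be applied with
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="example-bucket", LifecycleConfiguration=config)
    print(config["Rules"][0]["ID"])
```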
  14. AWS Lambda
    • Easy to write lightweight processing functions
    • Triggered by events from other AWS components
    • Many supported runtimes (Node, Python, Java, ...)
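A lightweight function triggered by an S3 event might look like the sketch below. It only parses the standard S3 notification structure (`Records[].s3.bucket.name` and `Records[].s3.object.key`) and reports which objects arrived; what the real pipeline does with each object is not described in the slides, so the processing step is left out. The bucket and key in the usage example are made up.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 events, so decode them first
        objects.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return objects


def handler(event, context):
    """Lambda entry point: list the newly written objects."""
    objects = extract_s3_objects(event)
    return {
        "processed": len(objects),
        "objects": [f"s3://{bucket}/{key}" for bucket, key in objects],
    }


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "raw/2019/05/21/trip+data%3D1.json"}}}
        ]
    }
    print(handler(sample_event, None))
```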
  15. Amazon Athena
    • Interactive query service for structured data on S3
    • SQL expression support
    • No data preparation or ETL needed, just a schema definition
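The "just a schema definition" point can be sketched as a table declared over JSON files already sitting on S3, followed by an ordinary SQL query. The column names, database, table and bucket below are hypothetical, since the slides do not show the real schema; the statements are built as strings that could be submitted through the Athena console or boto3's `athena.start_query_execution`.

```python
def create_table_ddl(database: str, table: str, location: str) -> str:
    """DDL for an external table over newline-delimited JSON files on S3.

    The columns are assumed for illustration; adapt them to the real messages.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  device_id string,\n"
        "  event_time string,\n"
        "  latitude double,\n"
        "  longitude double,\n"
        "  speed_kmh double\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


def top_speed_query(database: str, table: str, limit: int = 10) -> str:
    """Example analytic query: highest recorded speed per device."""
    return (
        "SELECT device_id, max(speed_kmh) AS max_speed\n"
        f"FROM {database}.{table}\n"
        "GROUP BY device_id\n"
        "ORDER BY max_speed DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(create_table_ddl("telematics", "trips", "s3://example-bucket/raw/"))
    print(top_speed_query("telematics", "trips"))
```

No data is moved or converted by the DDL: the table is only metadata pointing at the S3 location, which is what makes the ETL step optional.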
  16. Serverless solution benefits
    • Unlimited and reliable storage, managed by AWS
    • Easier to reason about smaller units of computation when building a data analysis/transformation pipeline
    • Elastic data ingestion, data always delivered on time
    • No maintenance or upgrade costs
  17. Serverless solution limits
    • Lambda functions have execution time, memory and max concurrency limits
    • Tuning Kinesis for cost-effectiveness can be tricky
    • Storing on S3 with no data lifecycle management can be expensive in the long run
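One common way to live with the Lambda execution time limit is to stop work before the deadline and hand the remainder to a later invocation. The sketch below assumes the standard Python Lambda context method `get_remaining_time_in_millis()`; the function name, the safety margin and the idea of returning leftovers are illustrative choices, not something the slides prescribe.

```python
from typing import Callable, List, Sequence


def process_until_deadline(
    items: Sequence,
    context,
    handle: Callable,
    safety_margin_ms: int = 5000,
) -> List:
    """Process items until the remaining execution time drops below the margin.

    Returns the items that were NOT processed, so the caller can re-queue them
    (for example by invoking the function again or writing them to a queue).
    """
    done = 0
    for item in items:
        # Check the clock before each item rather than risk a hard timeout
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break
        handle(item)
        done += 1
    return list(items[done:])
```

The safety margin should be larger than the time a single item can take, otherwise the function may still be cut off mid-item.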