Slide 1

Slide 1 text

Build a serverless data pipeline for Insurance Telematics and sleep at night!
HPC for Industry Workshop, Milan, 21/05/2019

Slide 2

Slide 2 text

How hard is it to ingest hundreds of millions of messages per day and get real-time insights?

Slide 3

Slide 3 text

On-premise solution
● Devices send streaming data (GPS, acceleration)
● Kafka buffers data in pub/sub topics to be consumed
● Spark jobs filter, aggregate or transform data
● HDFS cluster stores data for analytics or further processing

Slide 4

Slide 4 text

Image credits: https://www.edureka.co/blog/hadoop-ecosystem

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Hello! I am Francesco Lerro
● Delivering highly available solutions with millions of interactions since 2005
● In love with cloud computing and buffalo mozzarella
@flerro

Slide 7

Slide 7 text

We are an InsurTech inside
A factory which develops data-intensive solutions, applications or components, pursuing the following goals:
● preserve, extract value from and enrich existing data
● push forward service innovation and digital transformation
● investigate future technology scenarios for Insurance

Slide 8

Slide 8 text

Data Science
● Data mining
● Big Data
● Machine Learning
● Image processing
● Natural Language Processing
Computer Science
● SW Engineering
● Data Engineering
● Big Data architectures
● High Perf Computing
● IoT & Mobile
● Signal processing
#WeAreHiring Data Scientists and SW Engineers

Slide 9

Slide 9 text

● 10,000,000 vehicles with UnipolSai insurance
● 4,000,000 black boxes installed on vehicles
● 150,000,000 events produced daily
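As a rough sizing check, the daily event count can be turned into average rates. This is a back-of-the-envelope sketch that assumes events are spread evenly over the day, which the slides do not claim; real traffic has peaks, so peak rates will be higher than these averages.

```python
# Figures taken from the slide; the even-distribution assumption is ours.
EVENTS_PER_DAY = 150_000_000
BLACK_BOXES = 4_000_000
SECONDS_PER_DAY = 24 * 60 * 60

avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
avg_events_per_box_per_day = EVENTS_PER_DAY / BLACK_BOXES

print(f"average ingest rate: {avg_events_per_second:.0f} events/s")      # 1736 events/s
print(f"average per black box: {avg_events_per_box_per_day:.1f} events/day")  # 37.5 events/day
```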

Slide 10

Slide 10 text

Limits of on-premise
● Complex system with high operational costs
● Data ingestion delays during usage peaks
● Jobs for data analysis may take too long
● Analysis of ingested data is not always easy

Slide 11

Slide 11 text


Slide 12

Slide 12 text

Serverless is a Paradigm Shift
● Automated high availability
● Flexible scalability
● Pay for what you use
● Focus on your business

Slide 13

Slide 13 text

AWS IoT Core
● Managed platform to handle IoT devices
● Provides a rules engine to build an “IoT application”
● Routes incoming messages to other AWS services (Kinesis, Lambda, ...)
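To make the routing idea concrete, here is a minimal sketch of a rule definition that forwards messages from an MQTT topic to a Kinesis Firehose delivery stream. The topic layout, stream name, rule name and role ARN are hypothetical placeholders, not values from the talk, and the choice of Firehose as the rule action is an assumption. The function only builds the payload, so it runs without AWS credentials.

```python
def build_topic_rule_payload(topic_filter, delivery_stream, role_arn):
    """Return a topicRulePayload dict routing matching MQTT messages to Firehose."""
    return {
        # The rules engine selects messages with an SQL-like statement.
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "roleArn": role_arn,
                    "deliveryStreamName": delivery_stream,
                    "separator": "\n",  # one JSON message per line in the output
                }
            }
        ],
    }


payload = build_topic_rule_payload(
    topic_filter="telematics/+/events",  # hypothetical topic layout
    delivery_stream="telematics-raw",    # hypothetical stream name
    role_arn="arn:aws:iam::123456789012:role/iot-to-firehose",  # placeholder
)
print(payload["sql"])
```

With boto3 installed and credentials configured, the payload could be submitted with `boto3.client("iot").create_topic_rule(ruleName="telematics_to_firehose", topicRulePayload=payload)`.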

Slide 14

Slide 14 text

Amazon Kinesis Firehose
● Managed service to stream data to storage services
● Available destinations include S3, Redshift, Splunk
● Max data delay to storage is 60 seconds
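The delay bound comes from buffering: a delivery stream accumulates records and writes a batch once a size threshold or a time interval is reached, whichever comes first. The class below is a toy model of that behaviour, not the Firehose implementation; the class name, thresholds and explicit `now` clock are assumptions chosen to keep the example deterministic.

```python
class BufferedDelivery:
    """Toy model of size-or-interval buffering (illustrative only)."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # arrival time of the oldest buffered record
        self.delivered = []    # list of (flush_time, batch)

    def put(self, record: bytes, now: float):
        self.tick(now)  # flush first if the buffer has already aged out
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes:
            self._flush(now)

    def tick(self, now: float):
        """Flush if the oldest record has waited max_seconds or more."""
        if self.opened_at is not None and now - self.opened_at >= self.max_seconds:
            self._flush(now)

    def _flush(self, now: float):
        self.delivered.append((now, self.records))
        self.records, self.size, self.opened_at = [], 0, None


buf = BufferedDelivery(max_bytes=1024, max_seconds=60)
buf.put(b"x" * 100, now=0.0)
buf.put(b"y" * 100, now=30.0)
buf.tick(now=60.0)              # interval reached: both records go out together
buf.put(b"z" * 2000, now=61.0)  # size threshold exceeded: flushed immediately
print([(t, len(batch)) for t, batch in buf.delivered])  # [(60.0, 2), (61.0, 1)]
```

In this model no record waits longer than `max_seconds` as long as `tick` is called regularly, which is the property the slide relies on.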

Slide 15

Slide 15 text

Amazon S3
● Managed blob storage service available via API
● High durability and availability
● Can trigger AWS Lambda on data change
● Supports automated data lifecycle management

Slide 16

Slide 16 text

AWS Lambda
● Easy to write lightweight processing functions
● Triggered by events from other AWS components
● Many supported runtimes (Node, Python, Java, ...)
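As an illustration of a lightweight, event-triggered function, the sketch below shows a Python handler that reads an S3 event notification and lists the objects that changed. The bucket name and key are made up for the example, and what the talk's functions actually do with each object is not stated in the slides.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Return the (bucket, key) pairs named in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Trimmed-down sample event with hypothetical names, for local testing.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "telematics-raw"},
                "object": {"key": "2019/05/21/batch+001.json"},
            }
        }
    ]
}
print(handler(sample_event, None))  # [('telematics-raw', '2019/05/21/batch 001.json')]
```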

Slide 17

Slide 17 text

Serverless data ingestion
● Devices send streaming data (GPS, acceleration)
● AWS IoT + Amazon Kinesis
● Amazon S3
● AWS Lambda

Slide 18

Slide 18 text

Amazon Athena
● Interactive query service for structured data on S3
● SQL expression support
● No data preparation or ETL needed, just a schema definition
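A schema definition plus a query might look like the sketch below. The table name, columns, bucket paths and database name are hypothetical, since the slides do not show the event format; JSON-encoded events are assumed. The helper only builds the arguments for boto3's `start_query_execution`, so it runs without AWS access.

```python
# Hypothetical schema over JSON files already sitting in S3.
TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS telematics_events (
  device_id string,
  event_time string,
  latitude double,
  longitude double,
  acceleration double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-telematics-bucket/events/'
"""

# Plain SQL over the table: the ten devices that produced the most events.
TOP_DEVICES_QUERY = """
SELECT device_id, count(*) AS events
FROM telematics_events
GROUP BY device_id
ORDER BY events DESC
LIMIT 10
"""


def athena_request(sql, database, output_location):
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql.strip(),
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_request(
    TOP_DEVICES_QUERY,
    database="telematics",  # hypothetical database name
    output_location="s3://example-telematics-bucket/athena-results/",
)
print(request["QueryString"])
```

With boto3 available, the request could be sent with `boto3.client("athena").start_query_execution(**request)`.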

Slide 19

Slide 19 text

Serverless solution benefits
● Virtually unlimited, reliable storage, managed by AWS
● Easier to reason about: data analysis/transformation pipelines are built from small units of computation
● Elastic data ingestion, data always delivered on time
● No maintenance or upgrade costs

Slide 20

Slide 20 text

Serverless solution limits
● Lambda functions have execution time, memory and maximum concurrency limits
● Tuning Kinesis for cost-effectiveness can be tricky
● Storing on S3 with no data lifecycle management can be expensive in the long run
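The storage-cost point can be addressed with a lifecycle rule that moves older objects to cheaper storage and eventually deletes them. The sketch below builds such a rule in the shape boto3's `put_bucket_lifecycle_configuration` expects; the prefix and retention periods are illustrative assumptions, not values from the talk.

```python
def build_lifecycle_configuration(prefix, glacier_after_days, expire_after_days):
    """Return an S3 lifecycle configuration: archive first, then expire."""
    if glacier_after_days >= expire_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical policy: archive raw events after 90 days, delete after 2 years.
config = build_lifecycle_configuration("events/", 90, 730)
print(config["Rules"][0]["Transitions"])
```

With boto3 available, it could be applied with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="example-telematics-bucket", LifecycleConfiguration=config)`.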

Slide 21

Slide 21 text

Credits: Forrest Brazeal

Slide 22

Slide 22 text

Thanks! Any questions?
@flerro
francesco.lerro@leitha.eu
We are hiring Data Scientists and SW Engineers
Presentation template by SlidesCarnival