Architecting for Real-Time Insights with Streaming Data (AWS Kinesis / Apache Kafka)

Slide 1

Slide 1 text

Architecting for Real-Time Insights with Streaming Data
Antonello Mantuano, Head of Software Engineering, Cerved
Dr Frank Munz, Senior Technical Evangelist, Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT

Slide 2

Slide 2 text

Introductory - 200
"These sessions provide an overview of AWS services and features, and they assume that attendees are new to the topic. These sessions highlight basic use cases, features, functions, and benefits."

Slide 3

Slide 3 text

Agenda
- Streaming Architectures
- Amazon Kinesis
- Serverless Stream Processing
- Amazon Managed Streaming for Kafka (MSK)
- Customer Success Story: Antonello Mantuano from Cerved

Slide 4

Slide 4 text

Streaming Data
Sources: Web Clickstream, Application Logs, IoT Sensors
Example log line: [Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Continuously generated, small size events, low latency requirements

Slide 5

Slide 5 text

Transform and Process Continuously
Streaming
• Ingest video & data as it’s generated
• Process data on the fly
• Real-time analytics/ML, alerts, actions

Slide 6

Slide 6 text

From Batch to Streaming Analytics
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Amazon Kinesis
Real-time data streaming and analytics
Easily collect, process, and analyze streams in real time
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL or Java
NEW!

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Data Ingestion from a Variety of Sources
Sources feeding Kinesis Data Streams: Transactions, ERP, Web logs/cookies, Connected devices
AWS SDKs
• Publish directly from application code via APIs
• AWS Mobile SDK
• Managed AWS sources: CloudWatch Logs, AWS IoT, Kinesis Data Analytics and more
• RDS Aurora via Lambda
Kinesis Agent
• Monitors log files and forwards lines as messages to Kinesis Data Streams
3rd party and open source
• Log4j appender
• Apache Kafka
• Flume, fluentd, and more …
Kinesis Producer Library (KPL)
• Background process aggregates and batches messages
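
The batching idea behind the producer options above can be sketched in a few lines of Python. This is a simplified illustration, not the KPL's actual aggregation format: it only groups (partition key, payload) pairs so that each group stays within the PutRecords request limits of 500 records and 5 MB. The names `batch_records` and `Record` are hypothetical; in a real producer each yielded batch would be handed to a PutRecords call.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_RECORDS_PER_BATCH = 500          # PutRecords: at most 500 records per request
MAX_BATCH_BYTES = 5 * 1024 * 1024    # PutRecords: at most 5 MB per request

Record = Tuple[str, bytes]           # (partition_key, payload); hypothetical type alias


def batch_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Group records into batches that respect both request limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for key, data in records:
        # The partition key counts toward the request size as well as the payload.
        size = len(key.encode("utf-8")) + len(data)
        full = len(batch) >= MAX_RECORDS_PER_BATCH or batch_bytes + size > MAX_BATCH_BYTES
        if batch and full:
            yield batch
            batch, batch_bytes = [], 0
        batch.append((key, data))
        batch_bytes += size
    if batch:
        yield batch
```

Batching trades a little latency for far fewer API calls, which is the same trade-off the KPL makes in its background process.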

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Kinesis Data Streams: Standard consumers
Diagram: Data Producer → Stream (Shard 1, Shard 2, Shard 3, … Shard n) → GetRecords() → Consumer Application A
• Data Producer: up to 1 MB or 1000 records per second, per shard
• GetRecords(): 5 transactions or 2 MB per second, per shard
• With one consumer application, records can be retrieved every 200 ms.
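
The per-shard limits on this slide translate directly into a sizing calculation. The sketch below is a back-of-the-envelope estimate only, using the slide's figures (1 MB/s or 1000 records/s in, 2 MB/s and 5 GetRecords calls per second out, per shard). The function names are hypothetical, and `read_mb_s` is assumed to be the combined read rate of all standard consumers sharing the stream.

```python
import math

WRITE_MB_PER_SHARD = 1.0             # ingest limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000       # ingest limit per shard, records/s
READ_MB_PER_SHARD = 2.0              # shared read limit per shard, MB/s
GET_RECORDS_CALLS_PER_SHARD = 5      # GetRecords transactions per second, per shard


def shards_needed(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


def min_poll_interval_ms(standard_consumers: int) -> float:
    """How often each standard consumer can call GetRecords on one shard."""
    return standard_consumers * 1000 / GET_RECORDS_CALLS_PER_SHARD
```

For one consumer the second function gives 200 ms, matching the slide; every additional standard consumer stretches that interval because the 5 calls per second are shared.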

Slide 14

Slide 14 text

Kinesis Data Streams: Enhanced fan-out consumers
Diagram: Data Producer → Stream (Shard 1) → one EFO pipe per consumer → Consumer Application A, Consumer Application B
• Each consumer calls RegisterStreamConsumer() and then SubscribeToShard()
• Every consumer gets dedicated 2 MB per second, per shard.
• Latency is typically less than 70 ms.
• HTTP/2: Consumers do not poll. Messages are pushed to the consumer as they arrive.
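
The difference between the two consumer types comes down to whether the 2 MB/s per shard is shared or dedicated. A minimal arithmetic sketch, using only the figures from the slides (the function name is hypothetical):

```python
READ_MB_PER_SHARD = 2.0   # MB/s per shard, from the slides


def per_consumer_read_mb_s(consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput each consumer can expect from a single shard."""
    if consumers < 1:
        raise ValueError("at least one consumer is required")
    if enhanced_fan_out:
        # Each registered consumer gets its own 2 MB/s pipe.
        return READ_MB_PER_SHARD
    # Standard consumers split the shard's 2 MB/s between them.
    return READ_MB_PER_SHARD / consumers
```

With four consumers on one shard, standard consumption leaves 0.5 MB/s each, while enhanced fan-out keeps every consumer at 2 MB/s.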

Slide 15

Slide 15 text

Slide 16

Slide 16 text

The Serverless Operational Model
• No provisioning, no management
• Pay for value
• Automatic scaling
• Highly available and secure

Slide 17

Slide 17 text

Processing a Data Stream with Lambda
Diagram labels: data producer, Kinesis Data Streams, Lambda service, Lambda function A, Lambda function B, Amazon SNS
• Continuously stream data
• Continuously polls for new data, 1 poll per second
• Automatically invokes your function(s) when data found
• Lambda polls each shard once per second
• Lambda’s maximum execution time is 15 minutes
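
A function invoked this way receives a batch of records whose payloads arrive base64-encoded under `Records[i].kinesis.data`. The handler below is a minimal sketch of that decoding step. It assumes, purely for illustration, that producers send JSON objects; the handler name and the returned summary are hypothetical, not part of the talk.

```python
import base64
import json
from typing import Any, Dict, List


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Decode every Kinesis record in the batch handed over by Lambda."""
    decoded: List[Dict[str, Any]] = []
    for record in event.get("Records", []):
        kinesis = record["kinesis"]
        # Kinesis payloads are delivered to Lambda as base64 text.
        payload = base64.b64decode(kinesis["data"])
        decoded.append({
            "partition_key": kinesis["partitionKey"],
            "body": json.loads(payload),   # assumption: producers send JSON
        })
    # A real function would process or forward the records here.
    return {"processed": len(decoded), "records": decoded}
```

Because Lambda invokes the function per batch, the handler should finish well within the execution time limit mentioned on the slide.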

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Kinesis Streaming Data Analytics / Apache Flink
Framework and engine for stateful processing of data streams.
• Simple programming: Easy to use and flexible APIs make building apps fast
• High performance: In-memory computing provides low latency & high throughput
• Stateful Processing: Durable application state saves
• Strong data integrity: Exactly-once processing and consistent state

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Kinesis Data Firehose: How it Works
Ingest → Transform → Deliver
• Ingest from: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
• Deliver to: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service

Slide 23

Slide 23 text

Kinesis Data Firehose: Record Format Conversion
Diagram labels: Data Producer, JSON data, Kinesis Data Firehose, schema, Glue Data Catalog, convert to columnar format, Amazon S3, /failed
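
To make the idea of "convert to columnar format" concrete, the sketch below pivots JSON lines into per-column lists according to a schema and sets aside records that do not fit, loosely mirroring the /failed path in the diagram. It is a pure-Python illustration under stated assumptions, not how Firehose works internally: the real service writes Apache Parquet or ORC using a schema from the Glue Data Catalog, whereas the `schema` mapping and the function name here are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Tuple

Schema = Dict[str, Callable[[Any], Any]]   # column name -> cast function (hypothetical)


def to_columns(lines: Iterable[str], schema: Schema) -> Tuple[Dict[str, List[Any]], List[str]]:
    """Pivot row-oriented JSON lines into columns; collect lines that fail."""
    columns: Dict[str, List[Any]] = {name: [] for name in schema}
    failed: List[str] = []
    for line in lines:
        try:
            row = json.loads(line)
            # Cast every field first so a bad row never leaves a partial entry.
            values = [cast(row[name]) for name, cast in schema.items()]
        except (ValueError, KeyError, TypeError):
            failed.append(line)
            continue
        for name, value in zip(schema, values):
            columns[name].append(value)
    return columns, failed
```

Storing values column by column is what lets analytics engines scan only the fields a query touches.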

Slide 24

Slide 24 text

Amazon Kinesis Data – Streams vs. Firehose
• Kinesis Data Streams: Scalable and durable real-time data streaming service with provisioned throughput and sub-second latency that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
• Kinesis Data Firehose: Capture, transform, convert and load data streams into AWS data stores for near real-time analytics. Data latency 60 seconds.
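
The latency gap between the two services follows from buffering: a delivery service collects records and writes them out in batches. The class below is a rough sketch of such a buffer that flushes on size or age, whichever comes first. The 60-second default mirrors the latency on the slide; the 1 MiB size threshold, the class name, and the `deliver` callback are illustrative assumptions. For simplicity the age check runs only when a record arrives, so a real implementation would also need a timer.

```python
import time
from typing import Callable, List, Optional


class DeliveryBuffer:
    """Buffer records and deliver them in batches by size or age."""

    def __init__(self, deliver: Callable[[List[bytes]], None],
                 max_bytes: int = 1024 * 1024, max_age_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._deliver = deliver
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock
        self._records: List[bytes] = []
        self._bytes = 0
        self._first_at: Optional[float] = None   # arrival time of oldest buffered record

    def put(self, record: bytes) -> None:
        now = self._clock()
        if self._first_at is None:
            self._first_at = now
        self._records.append(record)
        self._bytes += len(record)
        if self._bytes >= self._max_bytes or now - self._first_at >= self._max_age_s:
            self.flush()

    def flush(self) -> None:
        if self._records:
            self._deliver(self._records)
        self._records, self._bytes, self._first_at = [], 0, None
```

Larger buffers mean fewer, bigger objects in the destination store at the cost of higher end-to-end latency.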

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Comparing Amazon Kinesis Data Streams to MSK
[Figure: Amazon Kinesis Data Streams, a stream with 3 shards (Shard 1, Shard 2, Shard 3); writes from producers are appended to each shard, with records numbered from oldest data (0) to newest data.]
[Figure: Amazon MSK, a topic with 3 partitions (Partition 1, Partition 2, Partition 3); writes from producers are appended to each partition, with records numbered from oldest data (0) to newest data.]
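
Both services share the same underlying model: an append-only log split into shards or partitions, with each record addressed by its position. The sketch below illustrates that model in memory. It routes keys by taking the MD5 hash of the key as a 128-bit number and splitting the hash space evenly, which resembles how Kinesis maps partition keys to shard hash ranges; Kafka's default partitioner uses a different hash, which this sketch does not reproduce. The class and method names are hypothetical.

```python
import hashlib
from typing import List, Tuple


class PartitionedLog:
    """In-memory append-only log split into a fixed number of partitions."""

    def __init__(self, partitions: int) -> None:
        self.partitions: List[List[bytes]] = [[] for _ in range(partitions)]

    def partition_for(self, key: str) -> int:
        # Interpret the MD5 digest as a 128-bit integer and scale it to a partition index.
        h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
        return (h * len(self.partitions)) >> 128

    def append(self, key: str, value: bytes) -> Tuple[int, int]:
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int) -> List[bytes]:
        """Return all records in a partition starting at the given offset."""
        return self.partitions[partition][offset:]
```

Records with the same key always land in the same partition, so ordering is preserved per key but not across partitions.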

Slide 30

Slide 30 text

Apache Kafka: Partitioned, Replicated Commit Log
[Diagram: a Producer and a Cluster holding TopicA Partition1, TopicA Partition2, and TopicA Partition3, each shown with a Replica; three ZooKeeper nodes hold State & Config.]

Slide 31

Slide 31 text

Challenges operating Apache Kafka
• Difficult to set up, configure and operate
• Hard to achieve high availability
• Tricky to scale
• AWS integrations = development
• No console, no visible metrics

Slide 32

Slide 32 text

Getting started with Amazon MSK Preview is easy

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Fast Data from Legacy to Cloud
The battle to overcome the gravity
March ’19
Antonello Mantuano, Head of Software Engineering
antonellomantuano @manant74

Slide 35

Slide 35 text

Cerved – the Data Driven Company
We are deeply passionate about data. Our data enables various financial services, from credit risk analysis to marketing solutions to managing non-performing loans and bad debt.
Business areas: Credit Information, Credit Management, Marketing Solutions
Services: LEAD GENERATION, CREDIT COLLECTION, DATA PROVIDING & MARKETING ANALYSIS, CREDIT INFORMATION, CREDIT SCORING, BAD CREDITS EVALUATION
• Web Data: 1M company sites
• Open Data: 4M info from open data sets
• Cerved Data: 70M payments, 60M scoring
• Company Data: 70M real estate properties, 20M companies, 16M shareholders
2,600 people · 34,000 customers · €40M in data & technologies · 30M decisions · 1,400 TB of data

Slide 36

Slide 36 text

Why Cerved in Cloud?
Benefits of Cloud for Cerved
• TIME-TO-MARKET: Rapid implementation for basic services.
• PRIVACY & SEGREGATION: Manage customers' data securely
• AVAILABILITY: Services available 24x7
• SCALABILITY: Infrastructure quickly adaptable to the load

Slide 37

Slide 37 text

The Data Gravity
"As data accumulates, it begins to have gravity. This Data Gravity pulls services and applications closer to the data." - Dave McCrory, 2010
Diagram labels: DATA, Services, Apps, Latency, Throughput
This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.

Slide 38

Slide 38 text

Data Ecosystem in Cerved
Diagram labels: Sourcing Level 2 Sourcing Level 1 REPOS SYNTH Mondo Dati Lince Dati clienti NCA ERG EBS HUB DWH MBD Teradata Tabula Mongo4 DW DB4You XPCH 2 MATCH NEMO Quaestio LUDO Tabula (on AWS) Aracne G4U MBD1 R3 Pragma Splunk CDR Mambo CAS Dedalo ELK CSS CR-RIBA (Payline)

Slide 39

Slide 39 text

How to overcome Gravity? How to lift to the cloud with Data Gravity?

Slide 40

Slide 40 text

Cerved Data in Cloud Architecture
Streaming is the new ETL
Streaming is the Anti-Gravity
Diagram labels: Cerved DBs CDC DBs Operational Online Data OLTP Processes Batch Hadoop DataLake NoSql Tabula Cloud DB DynamoDB RDS S3 CDC Producer Raw Events Aggregator Basic Events Aggregator HL Events Aggregator NoSql Ingestion Sync Connector For Cloud Hadoop Ingestion Stream Processing
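
The "streaming is the new ETL" idea rests on change data capture (CDC): instead of bulk-copying tables, each change is emitted as an event and replayed against the target store. The sketch below shows only that replay step against an in-memory dictionary. The event shape (`op`, `key`, `row`) and the function name are hypothetical and are not taken from Cerved's pipeline.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(target: Dict[str, Row], events: Iterable[Dict[str, Any]]) -> Dict[str, Row]:
    """Replay CDC events in order so the target mirrors the source tables."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Merge changed columns into whatever the target already holds.
            target[key] = {**target.get(key, {}), **event["row"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return target
```

Because events are applied in order per key, the target converges on the source state without a batch reload.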

Slide 41

Slide 41 text

Cerved API: a Data In Cloud use case
Diagram labels: Cerved DBs CDC DBs Operational Online Data OLTP Processes Batch Hadoop DataLake NoSql Tabula Cloud DB DynamoDB RDS S3 CDC Producer Raw Events Aggregator Basic Events Aggregator HL Events Aggregator NoSql Ingestion Sync Connector to Cloud Hadoop Ingestion Stream Processing Back End AWS Lambda Spring Boot API Gateway Front End ReactJs Redux Swagger

Slide 42

Slide 42 text

The Results of API in Cloud
• SLA: API available 24x7x365, 99.998% in January 2019
• PERFORMANCE: High scalability with quick adaptability to load
• COSTS: With AWS Lambda, DynamoDB, S3, etc., the cost of infrastructure grows with the load
• DATA SYNC: Data is continuously updated in near real time

Slide 43

Slide 43 text

Future use case of Data in Cloud
Diagram labels: Cerved DBs CDC DBs Operational Online Data OLTP Processes Batch Hadoop DataLake NoSql Tabula Cloud DB DynamoDB RDS S3 CDC Producer Raw Events Aggregator Basic Events Aggregator HL Events Aggregator NoSql Ingestion Sync Connector To Cloud Hadoop Ingestion Stream Processing Back End AWS Lambda API Gateway EMR SageMaker AWS Kinesis or Managed Streaming for Kafka Data Scientist in Cloud Real Time Apps DR & Backup Use Cases

Slide 44

Slide 44 text

THANK YOU
Moving Fast Data in cloud creates a new gravity for new and innovative apps and services
Antonello Mantuano, Head of Software Engineering
antonellomantuano @manant74