Architecting for Real-Time Insights with Streaming Data (AWS Kinesis / Apache Kafka)

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT

Architecting for Real-Time Insights with Streaming Data
Antonello Mantuano, Head of Software Engineering, Cerved
Dr Frank Munz, Senior Technical Evangelist, Amazon Web Services

Introductory - 200
"These sessions provide an overview of AWS services and features, and they assume that attendees are new to the topic. These sessions highlight basic use cases, features, functions, and benefits."

Agenda
- Streaming Architectures
- Amazon Kinesis
- Serverless Stream Processing
- Amazon Managed Streaming for Kafka (MSK)
- Customer Success Story: Antonello Mantuano from Cerved

Streaming Data
Sources: web clickstream, application logs, IoT sensors.
Example log line: [Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Continuously generated, small-size events, low-latency requirements.

Streaming: Transform and Process Continuously
• Ingest video & data as it's generated
• Process data on the fly
• Real-time analytics/ML, alerts, actions

From Batch to Streaming Analytics
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Amazon Kinesis
Real-time data streaming and analytics. Easily collect, process, and analyze streams in real time.
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL or Java (NEW!)

Amazon Kinesis Data Streams Overview

Data Ingestion from a Variety of Sources
Sources feeding Kinesis Data Streams: transactions, ERP, web logs/cookies, connected devices.
AWS SDKs
• Publish directly from application code via APIs
• AWS Mobile SDK
• Managed AWS sources: CloudWatch Logs, AWS IoT, Kinesis Data Analytics and more
• RDS Aurora via Lambda
Kinesis Agent
• Monitors log files and forwards lines as messages to Kinesis Data Streams
3rd party and open source
• Log4j appender
• Apache Kafka
• Flume, fluentd, and more …
Kinesis Producer Library (KPL)
• Background process aggregates and batches messages
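To make the "batches messages" idea concrete, here is a minimal sketch of producer-side batching. It is not the KPL (which additionally aggregates several user records into one Kinesis record); it only groups records into request-sized batches. The limits used (500 records and 5 MB per request, 1 MB per record) are assumed to match the PutRecords limits and should be checked against current documentation. The names `batch_records` and `record_size` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed PutRecords limits (verify against current AWS documentation).
MAX_RECORDS = 500                  # records per request
MAX_BATCH_BYTES = 5 * 1024 * 1024  # bytes per request
MAX_RECORD_BYTES = 1024 * 1024     # bytes per record, partition key included

Record = Tuple[str, bytes]  # (partition_key, payload)


def record_size(record: Record) -> int:
    """Size of one record: partition key bytes plus payload bytes."""
    key, payload = record
    return len(key.encode("utf-8")) + len(payload)


def batch_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield lists of records that each stay within the assumed limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        size = record_size(record)
        if size > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the per-record size limit")
        # Close the current batch if adding this record would break a limit.
        if batch and (len(batch) == MAX_RECORDS
                      or batch_bytes + size > MAX_BATCH_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be handed to whatever client call sends records to the stream; that call is deliberately left out of the sketch.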


Kinesis Data Streams: Standard consumers
Diagram: a data producer writes to a stream with shards 1 to n; Consumer Application A reads the data with GetRecords().
• Producer: up to 1 MB or 1000 records per second, per shard
• GetRecords(): 5 transactions or 2 MB per second, per shard
• With one consumer application, records can be retrieved every 200 ms.
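The per-shard limits on this slide translate directly into sizing arithmetic. The sketch below uses only the numbers quoted above (1 MB or 1000 records per second in, 5 GetRecords() calls per second out); the function names are hypothetical, and the assumption that standard consumers share the 5 calls per second evenly is a simplification.

```python
import math

WRITE_MB_PER_SHARD = 1.0         # from the slide: up to 1 MB per second, per shard
WRITE_RECORDS_PER_SHARD = 1000   # from the slide: up to 1000 records per second, per shard
GETRECORDS_CALLS_PER_SHARD = 5   # from the slide: 5 transactions per second, per shard


def shards_needed(write_mb_per_s: float, records_per_s: float) -> int:
    """Smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)


def poll_interval_ms(standard_consumers: int) -> float:
    """Polling interval per consumer if the 5 calls/s are shared evenly (assumption)."""
    if standard_consumers < 1:
        raise ValueError("need at least one consumer")
    return 1000 * standard_consumers / GETRECORDS_CALLS_PER_SHARD
```

For example, 3.5 MB per second at 2000 records per second is bounded by bytes and needs 4 shards, and a single consumer application can poll every 200 ms, matching the figure on the slide.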

Kinesis Data Streams: Enhanced fan-out consumers
Diagram: a data producer writes to Shard 1 of a stream; Consumer Applications A and B each call RegisterStreamConsumer() and SubscribeToShard() and each receive data through their own EFO pipe, up to 2 MB per second.
• Every consumer gets a dedicated 2 MB per second, per shard.
• Latency is typically less than 70 ms.
• HTTP/2: consumers do not poll. Messages are pushed to the consumer as they arrive.
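The difference between the two consumer types can be reduced to one line of arithmetic: standard consumers share a shard's 2 MB per second, while each enhanced fan-out consumer gets its own 2 MB per second. A rough sketch follows, assuming standard consumers split the shared throughput evenly; the function name is hypothetical.

```python
READ_MB_PER_SHARD = 2.0  # from the slides: 2 MB per second, per shard


def read_mb_per_consumer(consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput available to each consumer on one shard."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        # Dedicated pipe per registered consumer.
        return READ_MB_PER_SHARD
    # Standard consumers share the shard's read throughput (even split assumed).
    return READ_MB_PER_SHARD / consumers
```

With four consumers on a shard, this gives 0.5 MB per second each for standard consumers and 2 MB per second each with enhanced fan-out.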

The Serverless Operational Model
• No provisioning, no management
• Pay for value
• Automatic scaling
• Highly available and secure

Processing a Data Stream with Lambda
Diagram elements: data producer, Kinesis Data Streams, Lambda service, Lambda function A, Lambda function B, Amazon SNS.
• The data producer continuously streams data into Kinesis Data Streams.
• The Lambda service continuously polls for new data, 1 poll per second, and automatically invokes your function(s) when data is found.
• Lambda polls each shard once per second.
• Lambda's maximum execution time is 15 minutes.
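When the Lambda service invokes a function for a Kinesis stream, it passes a batch of records whose payloads are base64-encoded under `Records[].kinesis.data`. The sketch below shows a minimal handler that decodes such a batch. It assumes, for illustration only, that producers send JSON payloads; the return value is also just an illustration.

```python
import base64
import json


def handler(event, context):
    """Decode every Kinesis record in the batch delivered by the Lambda service."""
    decoded = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded in the event.
        payload = base64.b64decode(record["kinesis"]["data"])
        # Assumption: the producer wrote JSON documents.
        decoded.append(json.loads(payload))
    return {"processed": len(decoded), "items": decoded}
```

A local check can build a fake event with `base64.b64encode(json.dumps({...}).encode())` as the `data` field and call `handler(event, None)`.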

Kinesis Streaming Data Analytics: SQL or Apache Flink (Java)

Kinesis Streaming Data Analytics / SQL

Kinesis Streaming Data Analytics / Apache Flink
Framework and engine for stateful processing of data streams.
• Simple programming: easy-to-use and flexible APIs make building apps fast
• High performance: in-memory computing provides low latency & high throughput
• Stateful processing: durable application state saves
• Strong data integrity: exactly-once processing and consistent state
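As an illustration of what "stateful processing" means, the sketch below keeps a running count per key and per tumbling time window. This is plain Python, not Flink or Kinesis Data Analytics code: it has no durability, no exactly-once guarantees, and no handling of late events. The function name and the event shape `(timestamp_s, key)` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_s: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) for fixed, non-overlapping windows."""
    if window_s < 1:
        raise ValueError("window_s must be positive")
    # The dictionary is the operator state: it survives across events.
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp_s, key in events:
        window_start = (timestamp_s // window_s) * window_s
        counts[(window_start, key)] += 1
    return dict(counts)
```

A stream engine maintains this kind of state for you and checkpoints it durably, which is the part the sketch leaves out.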

Kinesis Data Firehose: Ingest Transform Load (ITL)

Kinesis Data Firehose: How it Works
Stages: Ingest, Transform, Deliver
Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
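For the Transform stage, Firehose can invoke a Lambda function on each buffered batch. The function receives records with a `recordId` and base64-encoded `data`, and returns each `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below follows that contract; the JSON payload format and the added `processed` field are assumptions made purely for illustration.

```python
import base64
import json


def handler(event, context):
    """Transform each Firehose record and report a per-record result."""
    output = []
    for record in event["records"]:
        try:
            item = json.loads(base64.b64decode(record["data"]))
            item["processed"] = True  # hypothetical enrichment step
            line = (json.dumps(item) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line).decode("ascii"),
            })
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are returned unchanged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Appending a newline to each transformed record is a common choice so that delivered objects contain one JSON document per line.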

Kinesis Data Firehose: Record Format Conversion
Diagram: a data producer sends JSON data to Kinesis Data Firehose, which uses a schema from the Glue Data Catalog to convert the records to a columnar format and delivers them to Amazon S3. The diagram also shows a /failed path.

Amazon Kinesis Data: Streams vs. Firehose
Kinesis Data Streams: scalable and durable real-time data streaming service with provisioned throughput and sub-second latency that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
Kinesis Data Firehose: capture, transform, convert and load data streams into AWS data stores for near real-time analytics. Data latency: 60 seconds.

Demo Architecture

Live Demo!
Use your phone & connect to: XXX
1. Preparation
2. Mode
3. Mode

Comparing Amazon Kinesis Data Streams to MSK
Amazon Kinesis Data Streams: a stream with 3 shards (Shard 1, Shard 2, Shard 3). Producers write to the shards; each shard holds a sequence of records numbered from 0, running from oldest to newest data.
Amazon MSK: a topic with 3 partitions (Partition 1, Partition 2, Partition 3). Producers write to the partitions; each partition holds a sequence of records numbered from 0, running from oldest to newest data.
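How a record ends up in a particular shard can be sketched as follows. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit value and routes the record to the shard whose hash key range contains it. The sketch assumes the hash space is split evenly across shards in order and that the digest is read as a big-endian integer; real streams may have uneven ranges after resharding. Kafka uses its own partitioner, which is not shown. The function name is hypothetical and shard indices here are zero-based.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Zero-based shard index for a key, assuming evenly split hash key ranges."""
    if shard_count < 1:
        raise ValueError("shard_count must be positive")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the hash into [0, shard_count) without floating point.
    return hash_key * shard_count // HASH_SPACE
```

Because the mapping depends only on the key, all records with the same partition key land in the same shard and keep their relative order there.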

Apache Kafka: Partitioned, Replicated Commit Log
Diagram: a producer writes to a cluster holding TopicA Partition1, TopicA Partition2, and TopicA Partition3 together with partition replicas; three ZooKeeper nodes hold state and configuration.

Challenges operating Apache Kafka
• Difficult to set up, configure and operate
• Hard to achieve high availability
• Tricky to scale
• AWS integrations = development
• No console, no visible metrics

Getting started with Amazon MSK Preview is easy

Fast Data from Legacy to Cloud
The battle to overcome the gravity
March '19
Antonello Mantuano, Head of Software Engineering
antonellomantuano @manant74

Cerved – the Data Driven Company
We are deeply passionate about data. Our data enables various financial services, from credit risk analysis to marketing solutions to managing non-performing loans and bad debt.
Business areas: Credit Information, Credit Management, Marketing Solutions
Services: lead generation, credit collection, data providing & marketing analysis, credit information, credit scoring, bad credits evaluation
Data assets:
• Web Data: 1M company sites
• Open Data: 4M items of information from open data sets
• Cerved Data: 70M payments, 60M scorings
• Company Data: 70M real estate properties, 20M companies, 16M shareholders
Key figures: 2,600 people, 34,000 customers, €40M in data & technologies, 30M decisions, 1,400 TB of data

Why Cerved in Cloud?
Benefits of Cloud for Cerved
• TIME-TO-MARKET: Rapid implementation for basic services.
• PRIVACY & SEGREGATION: Manage customer data securely.
• AVAILABILITY: Services available 24x7.
• SCALABILITY: Infrastructure adapts quickly to the load.

The Data Gravity
As data accumulates, it begins to have gravity. This Data Gravity pulls services and applications closer to the data. - Dave McCrory, 2010
Diagram labels: DATA, Services, Apps, Latency, Throughput.
This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.

Data Ecosystem in Cerved
Diagram labels: Sourcing Level 2, Sourcing Level 1; systems shown: REPOS SYNTH Mondo Dati Lince Dati clienti NCA ERG EBS HUB DWH MBD Teradata Tabula Mongo4 DW DB4You XPCH 2 MATCH NEMO Quaestio LUDO Tabula (on AWS) Aracne G4U MBD1 R3 Pragma Splunk CDR Mambo CAS Dedalo ELK CSS CR-RIBA (Payline)

How to overcome Gravity?
How to lift to the cloud with Data Gravity?

Cerved Data in Cloud Architecture
Streaming is the new ETL. Streaming is the Anti-Gravity.
Diagram labels, data side: Cerved DBs, CDC DBs, Operational Online Data, OLTP Processes, Batch, Hadoop DataLake, NoSql Tabula, Cloud DB (DynamoDB, RDS, S3)
Diagram labels, streaming side: CDC Producer, Raw Events Aggregator, Basic Events Aggregator, HL Events Aggregator, NoSql Ingestion, Sync Connector to Cloud, Hadoop Ingestion, Stream Processing

Cerved API: a Data In Cloud use case
Diagram labels, data side: Cerved DBs, CDC DBs, Operational Online Data, OLTP Processes, Batch, Hadoop DataLake, NoSql Tabula, Cloud DB (DynamoDB, RDS, S3)
Diagram labels, streaming side: CDC Producer, Raw Events Aggregator, Basic Events Aggregator, HL Events Aggregator, NoSql Ingestion, Sync Connector to Cloud, Hadoop Ingestion, Stream Processing
Back End: AWS Lambda, Spring Boot, API Gateway
Front End: ReactJs, Redux, Swagger

The Results of API in Cloud
• SLA: API available 24x7x365, 99.998% in January 2019
• PERFORMANCE: High scalability with quick adaptation to load
• COSTS: With AWS Lambda, DynamoDB, S3, etc., the cost of infrastructure grows with the load
• DATA SYNC: Data is continuously updated in near real time

Future use case of Data in Cloud
Diagram labels, data side: Cerved DBs, CDC DBs, Operational Online Data, OLTP Processes, Batch, Hadoop DataLake, NoSql Tabula, Cloud DB (DynamoDB, RDS, S3)
Diagram labels, streaming side: CDC Producer, Raw Events Aggregator, Basic Events Aggregator, HL Events Aggregator, NoSql Ingestion, Sync Connector to Cloud, Hadoop Ingestion, Stream Processing
Back End: AWS Lambda, API Gateway, EMR, SageMaker, AWS Kinesis or Managed Streaming for Kafka
Use Cases: Data Scientist in Cloud, Real Time Apps, DR & Backup

THANK YOU
Moving Fast Data in cloud creates a new gravity for new and innovative apps and services
Antonello Mantuano, Head of Software Engineering
antonellomantuano @manant74
