rights reserved. S U M M I T
Architecting for Real-Time Insights with Streaming Data
Antonello Mantuano, Head of Software Engineering, Cerved
Dr Frank Munz, Senior Technical Evangelist, Amazon Web Services
Agenda
- Streaming Architectures
- Amazon Kinesis
- Serverless Stream Processing
- Amazon Managed Streaming for Kafka (MSK)
- Customer Success Story: Antonello Mantuano from Cerved
Streaming Data
• Web Clickstream
• Application Logs
• IoT Sensors
• [Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Continuously generated, small events, low latency requirements
Amazon Kinesis
Real-time data streaming and analytics. Easily collect, process, and analyze streams in real time.
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL or Java (NEW!)
Data Ingestion from a Variety of Sources
Sources feeding Kinesis Data Streams: Transactions, ERP, Web logs/cookies, Connected devices
AWS SDKs
• Publish directly from application code via APIs
• AWS Mobile SDK
• Managed AWS sources: CloudWatch Logs, AWS IoT, Kinesis Data Analytics and more
• RDS Aurora via Lambda
Kinesis Agent
• Monitors log files and forwards lines as messages to Kinesis Data Streams
3rd party and open source
• Log4j appender
• Apache Kafka
• Flume, fluentd, and more …
Kinesis Producer Library (KPL)
• Background process aggregates and batches messages
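The batching idea behind the KPL bullet can be illustrated with a minimal Python sketch. It covers only the batching half (grouping records into PutRecords-sized requests), not the KPL's aggregation of several user records into one Kinesis record. `batch_records` and `record_size` are hypothetical helpers, not part of the KPL or any AWS SDK, and the limits of 500 records and 5 MB per request are assumptions to verify against the current API reference.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed PutRecords request limits; check the current Kinesis API reference.
MAX_RECORDS_PER_REQUEST = 500
MAX_BYTES_PER_REQUEST = 5 * 1024 * 1024

Record = Tuple[str, bytes]  # (partition key, payload)


def record_size(record: Record) -> int:
    """Size counted against the request limit: partition key plus payload."""
    key, data = record
    return len(key.encode("utf-8")) + len(data)


def batch_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Group records into batches that respect both request limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        size = record_size(record)
        full = len(batch) >= MAX_RECORDS_PER_REQUEST
        too_big = batch_bytes + size > MAX_BYTES_PER_REQUEST
        if batch and (full or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be handed to a single PutRecords call; retrying partially failed requests is left out of this sketch.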
Kinesis Data Streams: Standard consumers
• Data Producer: up to 1 MB or 1000 records per second, per shard
• Stream: Shard 1, Shard 2, Shard 3, ... Shard n
• Consumer Application A calls GetRecords(): up to 5 transactions per second and up to 2 MB per second, per shard
• With one consumer application, records can be retrieved every 200 ms.
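The per-shard limits on this slide translate into a simple sizing calculation. The following Python sketch assumes those limits are the only constraints and that every standard consumer reads the full stream; `shards_needed` and `polling_interval_ms` are hypothetical helper names introduced for illustration.

```python
import math

# Per-shard limits as stated on the slide.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0           # shared by all standard consumers
GET_RECORDS_CALLS_PER_SHARD = 5   # per second, shared by all standard consumers


def shards_needed(ingest_mb_per_s: float, records_per_s: float,
                  standard_consumers: int = 1) -> int:
    """Smallest shard count that satisfies write and shared-read limits."""
    by_write_bytes = math.ceil(ingest_mb_per_s / WRITE_MB_PER_SHARD)
    by_write_records = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    # Each standard consumer reads everything, so read volume multiplies.
    by_read = math.ceil(ingest_mb_per_s * standard_consumers / READ_MB_PER_SHARD)
    return max(1, by_write_bytes, by_write_records, by_read)


def polling_interval_ms(standard_consumers: int) -> float:
    """Average time between GetRecords calls per consumer on one shard."""
    return 1000 * standard_consumers / GET_RECORDS_CALLS_PER_SHARD
```

For example, `polling_interval_ms(1)` gives the 200 ms figure from the slide, and a second standard consumer on the same shard doubles it to 400 ms.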
Kinesis Data Streams: Enhanced fan-out consumers
• Every consumer gets a dedicated 2 MB per second, per shard. Latency is typically less than 70 ms.
• Consumer Application A and Consumer Application B each call RegisterStreamConsumer() and SubscribeToShard(), and each receives data from Shard 1 over its own EFO pipe: up to 2 MB per second.
• HTTP/2: Consumers do not poll. Messages are pushed to the consumer as they arrive.
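The two calls named on this slide could be used roughly as in the Python sketch below. It assumes a boto3-style Kinesis client is passed in (for example `boto3.client("kinesis")`), which keeps the sketch free of credentials and network access. `register_consumer` and `pushed_records` are hypothetical wrapper names. The sketch does not wait for the new consumer to become active before subscribing, and it does not renew the subscription when it expires; a real consumer would need both.

```python
from typing import Any, Iterator


def register_consumer(client: Any, stream_arn: str, consumer_name: str) -> str:
    """Register an enhanced fan-out consumer and return its ARN."""
    response = client.register_stream_consumer(
        StreamARN=stream_arn,
        ConsumerName=consumer_name,
    )
    return response["Consumer"]["ConsumerARN"]


def pushed_records(client: Any, consumer_arn: str, shard_id: str) -> Iterator[dict]:
    """Yield records as the service pushes them for one shard (no polling)."""
    response = client.subscribe_to_shard(
        ConsumerARN=consumer_arn,
        ShardId=shard_id,
        StartingPosition={"Type": "LATEST"},
    )
    # The response carries an event stream; each event may hold several records.
    for event in response["EventStream"]:
        shard_event = event.get("SubscribeToShardEvent", {})
        for record in shard_event.get("Records", []):
            yield record
```

Passing the client in as a parameter also makes it easy to substitute a stub when trying the logic out locally.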
Processing a Data Stream with Lambda
[Diagram: data producer, Kinesis Data Streams, Lambda service, Lambda function A, Lambda function B, Amazon SNS]
• The data producer continuously streams data
• The Lambda service continuously polls for new data, 1 poll per second per shard
• It automatically invokes your function(s) when data is found
• Lambda's maximum execution time is 15 minutes
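A function invoked this way receives a batch of records whose payloads are base64-encoded. The following Python sketch shows one way such a handler might decode them. It assumes the producers send JSON payloads and uses a hypothetical `level` field to count errors; neither assumption comes from the slides.

```python
import base64
import json
from typing import Any, Dict, List


def decode_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the base64 payload of every Kinesis record in the event."""
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))  # assumes JSON payloads
    return decoded


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Entry point: count received items and those flagged as errors."""
    items = decode_records(event)
    errors = [item for item in items if item.get("level") == "error"]
    return {"received": len(items), "errors": len(errors)}
```

A real function would typically forward the error items somewhere, for example to a notification topic, instead of only counting them.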
Kinesis Streaming Data Analytics / Apache Flink
Framework and engine for stateful processing of data streams.
• Simple programming: Easy to use and flexible APIs make building apps fast
• High performance: In-memory computing provides low latency & high throughput
• Stateful Processing: Durable application state saves
• Strong data integrity: Exactly-once processing and consistent state
Kinesis Data Firehose: How it Works
Ingest, Transform, Deliver
• Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
• Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
Amazon Kinesis Data – Streams vs. Firehose
• Kinesis Data Streams: Scalable and durable real-time data streaming service with provisioned throughput and sub-second latency that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
• Kinesis Data Firehose: Capture, transform, convert and load data streams into AWS data stores for near real-time analytics. Data latency: 60 seconds.
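The 60-second latency figure for Firehose follows from buffering: incoming records are held until a size or time threshold is reached and then delivered together. The Python sketch below simulates that behaviour and is not Firehose's implementation. `DeliveryBuffer` and its default thresholds are hypothetical, and the sketch only checks the thresholds when a record arrives, whereas a real service would also flush on a timer.

```python
from typing import List, Optional


class DeliveryBuffer:
    """Hold records until a size or age threshold is reached, then flush."""

    def __init__(self, max_bytes: int = 1024 * 1024,
                 max_seconds: float = 60.0) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._records: List[bytes] = []
        self._bytes = 0
        self._started = 0.0

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return the batch to deliver if a threshold was hit."""
        if not self._records:
            self._started = now  # age is measured from the first buffered record
        self._records.append(record)
        self._bytes += len(record)
        too_large = self._bytes >= self.max_bytes
        too_old = now - self._started >= self.max_seconds
        if too_large or too_old:
            batch, self._records, self._bytes = self._records, [], 0
            return batch
        return None
```

With a low-volume stream the time threshold dominates, which is why delivery latency sits near the buffer interval instead of the sub-second range of Kinesis Data Streams.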
Comparing Amazon Kinesis Data Streams to MSK
[Diagram: Amazon Kinesis Data Streams shows a stream with 3 shards (Shard 1, Shard 2, Shard 3); Amazon MSK shows a topic with 3 partitions (Partition 1, Partition 2, Partition 3). In both, writes from producers are numbered 0, 1, 2, ... within each shard or partition, from oldest data to newest data.]
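How a producer's write ends up in one particular shard can be sketched as follows. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The Python sketch assumes the hash space is split evenly across shards, which is only an approximation once a stream has been resharded; `shard_for_key` is a hypothetical helper. Kafka's default partitioner uses a different hash function, so the same key will generally not land on the same index in both systems.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to an integer in the 128-bit hash space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose (assumed even) hash range holds the key."""
    width = HASH_SPACE // shard_count
    # Clamp so the remainder at the top of the hash space maps to the last shard.
    return min(hash_key(partition_key) // width, shard_count - 1)
```

Because the mapping is deterministic, all records with the same partition key go to the same shard, which is what preserves their order.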
Apache Kafka: Partitioned, Replicated Commit Log
[Diagram: a Producer writes to a Cluster holding TopicA Partition1, TopicA Partition2 and TopicA Partition3, each with a Replica; three ZooKeeper nodes hold State & Config.]
Challenges operating Apache Kafka
• Difficult to set up, configure and operate
• Hard to achieve high availability
• Tricky to scale
• AWS integrations = development
• No console, no visible metrics
Management, Marketing Solutions
LEAD GENERATION, CREDIT COLLECTION, DATA PROVIDING & MARKETING ANALYSIS, CREDIT INFORMATION, CREDIT SCORING, BAD CREDITS EVALUATION
We are deeply passionate about data. Our data enables various financial services, from credit risk analysis to marketing solutions to managing non-performing loans and bad debt.
Web Data
• 1M company sites
Open Data
• 4M info from open data sets
Cerved Data
• 70M payments
• 60M scoring
Company Data
• 70M Real Estate Property
• 20M companies
• 16M shareholders
2,600 people, 34,000 customers, €40M in Data & Technologies, 30M decisions, 1,400 TB of data
have gravity. This Data Gravity pulls services and applications closer to the data.
- Dave McCrory, 2010
[Diagram: DATA at the center, attracting Services and Apps; labels: Latency, Throughput]
This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.
[Diagram: system landscape. Labels: REPOS SYNTH Mondo Dati Lince Dati clienti NCA ERG EBS HUB DWH MBD Teradata Tabula Mongo4 DW DB4You XPCH 2 MATCH NEMO Quaestio LUDO Tabula (on AWS) Aracne G4U MBD1 R3 Pragma Splunk CDR Mambo CAS Dedalo ELK CSS CR-RIBA (Payline)]
Streaming is the new ETL
Streaming is the Anti-Gravity
[Diagram labels:
• Online Data: OLTP Processes, Batch
• CDC Producer
• Stream Processing: Raw Events Aggregator, Basic Events Aggregator, HL Events Aggregator
• NoSql Ingestion, Sync Connector For Cloud, Hadoop Ingestion
• Hadoop DataLake, NoSql Tabula, Cloud DB: DynamoDB, RDS, S3]
[Diagram labels:
• CDC, DBs Operational
• Online Data: OLTP Processes, Batch
• CDC Producer
• Stream Processing: Raw Events Aggregator, Basic Events Aggregator, HL Events Aggregator
• NoSql Ingestion, Sync Connector to Cloud, Hadoop Ingestion
• Hadoop DataLake, NoSql Tabula, Cloud DB: DynamoDB, RDS, S3
• Back End: AWS Lambda, Spring Boot, API Gateway
• Front End: ReactJs, Redux, Swagger]
• 7x24x365: 99.998% in January 2019
• PERFORMANCE: High scalability with quick adaptation to load
• COSTS: With AWS Lambda, DynamoDB, S3, etc., the cost of infrastructure grows with the load
• DATA SYNC: Data are continuously updated in near real time
Use Cases
[Diagram labels:
• DBs Operational
• Online Data: OLTP Processes, Batch
• CDC Producer
• Stream Processing: Raw Events Aggregator, Basic Events Aggregator, HL Events Aggregator
• NoSql Ingestion, Sync Connector To Cloud, Hadoop Ingestion
• Hadoop DataLake, NoSql Tabula, Cloud DB: DynamoDB, RDS, S3
• Back End: AWS Lambda, API Gateway
• EMR, SageMaker
• AWS Kinesis or Managed Streaming for Kafka
• Data Scientist in Cloud, Real Time Apps, DR & Backup]