Slide 1

Slide 1 text

AWS DENVER DATA PIPELINES – PART 1
AWS Meetup Group, April 29 - Kevin Tinn

Slide 2

Slide 2 text

INTRO: KEVIN TINN
• Cloud Application Architect Practice Lead at World Wide Technology
• Over a decade of experience in the industry
• Software Development
  • .NET (C# and VB.NET)
  • JVM (Scala)
  • JavaScript
• Data Streaming Architectures
  • Kafka, Kinesis, Event Hubs
• Application Architecture
  • AWS, Azure, and on-prem solutions
• Incessant traveler with a new-found skiing addiction
• I live in Denver, by way of St. Louis, and grew up in TN

Slide 3

Slide 3 text

AGENDA
• Overview of data pipelines and event streaming - 6:00, 10 minutes
• POC overview and pipeline architecture overview - 20 minutes
• Overview of components and IaC - 30 minutes
  • Kinesis Streams, Firehose, Analytics, S3, and Athena
• Demo and console tour of components - 20 minutes
  • Manual start of Analytics apps and creation of Athena tables
• Dive into IaC - 20 minutes
• Q&A

Slide 4

Slide 4 text

REPO INFO
• Source code for the AWS infrastructure and the Kinesis producer is available at https://github.com/kevasync/aws-meetup-group-data-services
• This link is also in the comments of the Meetup details: https://www.meetup.com/AWSMeetupGroup/events/269768602/
• Please join the Meetup if you haven't already

Slide 5

Slide 5 text

GOAL: MOVE FROM BATCH TO EVENT
Traditional architectures move data around in batches, so applications are only as up to date as the last batch run; highly dependent systems may even write directly to each other's persistence layers.

Slide 6

Slide 6 text

GOAL: MOVE FROM BATCH TO EVENT
As the number of apps grows, the communication mesh becomes unmanageable.

Slide 7

Slide 7 text

GOAL: MOVE FROM BATCH TO EVENT

Slide 8

Slide 8 text

GOAL: MOVE FROM BATCH TO EVENT

Slide 9

Slide 9 text

DEMO DATA PIPELINE OVERVIEW
• Problem statement: simulate the ingestion and transformation of sensor data
• Produce messages for temperature and pressure readings at a manufacturing site
• Raw data is stored for compliance purposes
• Data is split into separate streams: one for temperature, the other for pressure
• Enrich pressure data with altitude reference data
• Enrich temperature data with ambient weather reference data
• Store enriched data in an S3 data lake with Athena query capabilities

Slide 10

Slide 10 text

DATA PIPELINE ARCHITECTURE

Slide 11

Slide 11 text

DATA PIPELINE ARCHITECTURE
Is something missing? Nope: no VPCs, subnets, CIDR blocks, VMs, Security Groups, or Network ACLs.

Slide 12

Slide 12 text

INTRO TO CODED INFRA AND DEPLOYMENT
• Using Pulumi to create Infrastructure as Code
  • Check out the repo from my Terraform and Pulumi Meetup: https://github.com/kevasync/aws-meetup-group-terraform
• Wanted to use Pulumi on this to try out the new v2.0 release
  • Introduces full fidelity between languages, including full C# support
  • Love Terraform too
• Demo
  • Get into the repo and take a spin around the project
  • Deploy from the Pulumi CLI

Slide 13

Slide 13 text

COMPONENTS: KINESIS STREAM
• Amazon Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data
• Allows for many-to-many communication with extremely low latency
• Fully managed service
• A stream consists of shards, which allow for parallelism
• Messages are produced with a partition key to allow for time-ordered processing within a shard
• When dealing with stream processing, always make consumption an idempotent process
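
Putting a record on a stream might look like the sketch below (boto3; the stream, field, and sensor names are illustrative, not the repo's actual producer):

```python
import json
import time

def encode_reading(sensor_id: str, metric: str, value: float) -> dict:
    """Build a put_record payload; using the sensor id as the partition key
    keeps all of one sensor's readings on the same shard, preserving order."""
    payload = {"sensorId": sensor_id, "metric": metric, "value": value, "ts": time.time()}
    return {"Data": json.dumps(payload).encode("utf-8"), "PartitionKey": sensor_id}

def send(stream_name: str, record: dict) -> None:
    import boto3  # real AWS call kept out of the pure helper above
    boto3.client("kinesis").put_record(StreamName=stream_name, **record)

record = encode_reading("sensor-42", "temperature", 21.7)
```

Because the partition key maps each sensor to one shard, a consumer reading that shard sees that sensor's readings in order.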

Slide 14

Slide 14 text

COMPONENTS: KINESIS FIREHOSE DELIVERY STREAM
• Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations
• Dump (commonly referred to as "sink") data to destinations without writing code
• Mapping transformations allow for light ETL tasks
• Various destinations are supported:
  • S3
  • Redshift
  • Elasticsearch
  • Splunk
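
Producers can also write straight to a delivery stream; a minimal sketch (boto3, names illustrative), noting that PutRecordBatch caps each call at 500 records:

```python
def chunk(records, size=500):
    """Firehose's PutRecordBatch accepts at most 500 records per call,
    so larger payloads must be split into batches."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def deliver(stream_name, raw_records):
    import boto3  # network call isolated here; chunk() stays testable offline
    firehose = boto3.client("firehose")
    for batch in chunk(raw_records):
        firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": data} for data in batch],
        )

batches = chunk([b"reading"] * 1200)
```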

Slide 15

Slide 15 text

COMPONENTS: KINESIS ANALYTICS APPLICATION
• With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time-series analytics, feed real-time dashboards, and create real-time metrics
• Allows stream input to be combined with other streams, as well as reference data from S3
• Uses SQL syntax that is relatively intuitive
• Apache Flink can be used as well
• Similar to KSQL in the Confluent Platform
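
The enrichment step the Analytics app performs, joining streaming records with S3 reference data, is expressed as streaming SQL in the service; the plain-Python equivalent below (field names are made up for illustration) shows the idea:

```python
def enrich(readings, reference_rows, key):
    """Left-join each streaming record against reference rows on `key` --
    what the Analytics SQL JOIN with an S3 reference table does."""
    ref_by_key = {row[key]: row for row in reference_rows}
    for rec in readings:
        ref = ref_by_key.get(rec[key], {})
        # merge reference columns into the record, keeping the join key once
        yield {**rec, **{k: v for k, v in ref.items() if k != key}}

readings = [{"siteId": "A", "pressure": 101.2}]
reference = [{"siteId": "A", "altitudeM": 1609}]
enriched = list(enrich(readings, reference, "siteId"))
```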

Slide 16

Slide 16 text

COMPONENTS: S3 & ATHENA
• S3, or Amazon Simple Storage Service, provides object storage through a web service interface
• Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run
• Athena databases are configured to read from an S3 bucket
• An Athena table specifies the schema of the files, as well as their format:
  • CSV
  • TSV
  • JSON
  • Parquet
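
Queries can be run outside the console too; a sketch of submitting one via boto3 (table name and output location are placeholders), polling because Athena executes asynchronously:

```python
import time

def preview_query(table: str, limit: int = 10) -> str:
    """Smoke-test query for a freshly created Athena table."""
    return f'SELECT * FROM "{table}" LIMIT {limit}'

def run_athena_query(sql: str, database: str, output_s3: str) -> str:
    """Submit a query and poll until it finishes; result files land in output_s3."""
    import boto3  # real AWS calls confined to this function
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)

sql = preview_query("enriched_temperature")
```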

Slide 17

Slide 17 text

MANUAL STEPS
• Upload reference data
• Start Kinesis Analytics Applications
• Create Athena tables
• Demo data going through Analytics Applications
  • Python data producer
  • View data in raw, enriched, and reference buckets
• Dive into an Analytics Application in the console
  • SQL syntax
• Check out the Athena query interface
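
The "start the Analytics apps" step can also be scripted; a hedged sketch with boto3 (app names are placeholders, and "1.1" is the input id Kinesis Analytics typically assigns to a SQL application's first input):

```python
def starting_position(position="NOW"):
    """Input configuration for StartApplication; "NOW" begins reading
    at the tip of the stream rather than replaying older records."""
    return [{"Id": "1.1",
             "InputStartingPositionConfiguration": {"InputStartingPosition": position}}]

def start_apps(app_names):
    import boto3  # AWS call isolated; starting_position() is testable offline
    analytics = boto3.client("kinesisanalytics")
    for name in app_names:
        analytics.start_application(ApplicationName=name,
                                    InputConfigurations=starting_position())

config = starting_position()
```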

Slide 18

Slide 18 text

DIVE INTO CODED INFRA: TO THE CODE!

Slide 19

Slide 19 text

CONCLUSION
• In this session, we have covered:
  • Benefits of event-driven architectures
  • Overview of the demo architecture
  • Deployment of coded infrastructure
  • Overview of the AWS components used
  • Manual steps to complete setup of Analytics Applications and Athena
• Thank you for coming
• Please talk to me if you have further questions

Slide 20

Slide 20 text

UPCOMING MEETUPS
• Part 2, last Wednesday of May - Kevin Tinn
  • Full Athena impl
  • Data warehousing with Redshift
  • Shared app layer with Elasticsearch
  • Redshift/S3 integration using Redshift Spectrum
  • Other sweet AWS data things…
  • Crowdsourcing ideas welcome!
    • Curiosities
    • Problems
• AppSync, second Wednesday of May - Austin Loveless
• IoT Core, last Wednesday of June - Kevin Tinn

Slide 21

Slide 21 text

WWT does a ton of data pipeline projects on all major platforms.
Reach out on LinkedIn: www.linkedin.com/in/kevin-tinn