aws-data-pipelines-part-1

AWS DENVER DATA PIPELINES – PART 1 AWS Meetup Group
April 29 - Kevin Tinn

• Cloud Application Architect Practice Lead at World Wide Technology
• Over a decade of experience in the industry
• Software Development
  • .NET (C# and VB.NET)
  • JVM (Scala)
  • JavaScript
• Data Streaming Architectures
  • Kafka, Kinesis, Event Hubs
• Application Architecture
  • AWS, Azure, and on-prem solutions
• Incessant traveler with a new-found skiing addiction
• I live in Denver, by way of St Louis, and grew up in TN
INTRO: KEVIN TINN 2

• Overview of data pipelines and event streaming - 6:00 - 10 minutes
• POC Overview and pipeline architecture overview - 20 minutes
• Overview of components and IaC – 30 minutes
  • Kinesis Streams, Firehose, Analytics, S3, and Athena
• Demo and Console tour of components – 20 minutes
  • Manual start of Analytics apps and creation of Athena tables
• Dive into IaC – 20 minutes
• Q&A
AGENDA 3

• Source code for the AWS infrastructure and the Kinesis producer is available at https://github.com/kevasync/aws-meetup-group-data-services
• This link is also in the comments of the Meetup event details: https://www.meetup.com/AWSMeetupGroup/events/269768602/
Please join the Meetup if you haven’t already
REPO INFO 4

In traditional batch architectures, systems move data around in batches, so apps are only as up to date as the most recent batch. Highly dependent systems may even write to each other’s persistence layer.
GOAL: MOVE FROM BATCH TO EVENT 5

As the number of apps grows, the communication mesh becomes increasingly ridiculous.
GOAL: MOVE FROM BATCH TO EVENT 6
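To put rough numbers on the growth of the mesh, here is a minimal sketch. It assumes the worst case, where every app integrates directly with every other app, and compares that with each app connecting once to a shared event stream. The function names are illustrative only and do not come from the talk.

```python
def point_to_point_links(n_apps: int) -> int:
    """Worst case assumption: every app talks directly to every other app."""
    return n_apps * (n_apps - 1) // 2


def event_stream_links(n_apps: int) -> int:
    """Each app connects only to a shared event stream."""
    return n_apps


for n in (3, 10, 50):
    print(f"{n} apps: {point_to_point_links(n)} direct links "
          f"vs {event_stream_links(n)} stream connections")
```

Under that assumption the number of direct integrations grows quadratically, while the event-based approach grows linearly with the number of apps.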

GOAL: MOVE FROM BATCH TO EVENT 7

GOAL: MOVE FROM BATCH TO EVENT 8

• Problem Statement: Simulate the ingestion and transformation of sensor data
  • Produce messages for temperature and pressure readings at a manufacturing site
  • Raw data is stored for compliance purposes
  • Data is split into two streams: one for temperature, the other for pressure
  • Enrich pressure data with altitude reference data
  • Enrich temperature data with ambient weather reference data
  • Store enriched data in an S3 data lake with Athena query capabilities
DEMO DATA PIPELINE OVERVIEW 9
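The first steps of the problem statement can be sketched locally, without any AWS services. This is only an illustration: the message fields (`sensor_id`, `type`, `value`, `timestamp`) are hypothetical, since the slides do not specify the schema the real producer uses.

```python
import json
import random
import time
from typing import Dict, Iterable, Iterator, List


def sensor_readings(count: int, seed: int = 0) -> Iterator[Dict]:
    """Yield simulated readings; the field names are assumptions, not the talk's schema."""
    rng = random.Random(seed)
    for _ in range(count):
        yield {
            "sensor_id": f"sensor-{rng.randint(1, 3)}",
            "type": rng.choice(["temperature", "pressure"]),
            "value": round(rng.uniform(10.0, 100.0), 2),
            "timestamp": time.time(),
        }


def split_by_type(readings: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Separate the raw feed into one stream per reading type."""
    streams: Dict[str, List[Dict]] = {"temperature": [], "pressure": []}
    for reading in readings:
        streams[reading["type"]].append(reading)
    return streams


raw = list(sensor_readings(20))      # the raw copy is what would be kept for compliance
streams = split_by_type(raw)
print(json.dumps(raw[0]))
print({name: len(items) for name, items in streams.items()})
```

In the demo itself, the split and the enrichment are handled by managed AWS components rather than by application code like this.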

DATA PIPELINE ARCHITECTURE 10

Is something missing? Nope. No VPCs, subnets, CIDR blocks, VMs, Security Groups, or Network ACLs.
DATA PIPELINE ARCHITECTURE 11

• Using Pulumi to create Infrastructure as Code
  • Check out repo from my Terraform and Pulumi Meetup: https://github.com/kevasync/aws-meetup-group-terraform
  • Wanted to use Pulumi on this to try out the new v2.0 release
    • Introduces full fidelity between languages, including full C# support
  • Love Terraform too
• Demo
  • Get into repo and take a spin around the project
  • Deploy from Pulumi CLI
INTRO TO CODED INFRA AND DEPLOYMENT 12

• Amazon Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data
• Allows for many-to-many communication with extremely low latency
• Fully managed service
• A stream consists of shards, which allow for parallelism
• Messages are produced with a partition key, which allows time-ordered processing of messages that share a key
• When dealing with stream processing, always make consumption an idempotent process
COMPONENTS: KINESIS STREAM 13
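The last two bullets can be illustrated with a small local sketch. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch assumes the shards split that range evenly. The consumer deduplicates on a producer-assigned `event_id`, which is a hypothetical field chosen for illustration, not something the talk specifies.

```python
import hashlib
from typing import Dict, Set

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the shard index for a key, assuming evenly split hash key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * shard_count // HASH_SPACE


class IdempotentConsumer:
    """Keeps running totals per sensor and ignores messages it has already applied."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.totals: Dict[str, float] = {}

    def handle(self, message: Dict) -> bool:
        """Apply the message once; return False if it was a duplicate delivery."""
        if message["event_id"] in self._seen:
            return False
        self._seen.add(message["event_id"])
        sensor = message["sensor_id"]
        self.totals[sensor] = self.totals.get(sensor, 0.0) + message["value"]
        return True


consumer = IdempotentConsumer()
msg = {"event_id": "evt-1", "sensor_id": "sensor-1", "value": 21.5}
first = consumer.handle(msg)
second = consumer.handle(msg)  # simulated redelivery of the same message
print(shard_for_key("sensor-1", 4), first, second, consumer.totals)
```

Because the same key always hashes to the same shard, readings from one sensor stay in order relative to each other. The in-memory set is only for demonstration; a real consumer would need durable deduplication state or naturally idempotent writes.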

• Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations
• Dump (commonly referred to as sink) data to destinations without writing code
• Mapping transformations allow for light ETL tasks
• Various destinations are supported
  • S3
  • Redshift
  • Elasticsearch
  • Splunk
COMPONENTS: KINESIS FIREHOSE DELIVERY STREAM 14
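One way Firehose supports light transformations is by invoking a Lambda function on each batch of records. The sketch below follows the record contract for that feature: base64-encoded `data` in, and `recordId`, `result`, and `data` out. The slides do not say whether the demo uses a transformation Lambda, and the `type` field being normalized is a hypothetical example.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation sketch: normalize one field and emit newline-delimited JSON."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["type"] = payload["type"].lower()  # hypothetical field
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Hand the record back untouched so Firehose can treat it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


sample = {"records": [{
    "recordId": "1",
    "data": base64.b64encode(b'{"type": "PRESSURE", "value": 101.3}').decode("utf-8"),
}]}
result = handler(sample, None)
print(base64.b64decode(result["records"][0]["data"]).decode("utf-8"))
```

Appending a newline to each record keeps the objects written to S3 as one JSON document per line, which is convenient for querying later.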

• With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics
• Allows for stream input to be combined with other streams, as well as reference data from S3
• Uses SQL syntax that is relatively intuitive
• Apache Flink can be used as well
• Similar to KSQL in the Confluent Platform
COMPONENTS: KINESIS ANALYTICS APPLICATION 15
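Conceptually, combining a stream with reference data is a join between each incoming record and a lookup table. The sketch below mimics that in plain Python as a stand-in for what the SQL application does; it is not the SQL used in the demo. The `site` key and the altitude values are hypothetical placeholders.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference data: altitude in meters keyed by site name.
ALTITUDE_BY_SITE: Dict[str, int] = {"denver": 1609, "st-louis": 142}


def enrich_pressure(readings: Iterable[Dict], reference: Dict[str, int]) -> Iterator[Dict]:
    """Attach altitude to each pressure reading, like an inner join on site."""
    for reading in readings:
        altitude = reference.get(reading["site"])
        if altitude is None:
            continue  # no matching reference row, so the reading is dropped
        yield {**reading, "altitude_m": altitude}


pressure_stream = [
    {"site": "denver", "value": 83.4},
    {"site": "unknown-site", "value": 99.0},
]
enriched = list(enrich_pressure(pressure_stream, ALTITUDE_BY_SITE))
print(enriched)
```

Dropping unmatched readings is a design choice made for the sketch; keeping them with an empty altitude would correspond to a left outer join instead.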

• S3, or Amazon Simple Storage Service, is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface
• Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run
• Athena databases are configured to read from an S3 bucket
• An Athena table specifies the schema of the files, as well as their format
  • CSV
  • TSV
  • JSON
  • Parquet
COMPONENTS: S3 & ATHENA 16
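To show how a table definition carries both schema and format, here is a minimal sketch that renders an Athena `CREATE EXTERNAL TABLE` statement for JSON files. The database, table, column, and bucket names are all hypothetical, and the statement is not taken from the talk's repository.

```python
from typing import Dict


def athena_json_table_ddl(database: str, table: str,
                          columns: Dict[str, str], location: str) -> str:
    """Render DDL for an external table over newline-delimited JSON files in S3."""
    column_lines = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_lines}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


ddl = athena_json_table_ddl(
    database="sensor_db",                              # hypothetical name
    table="enriched_pressure",                         # hypothetical name
    columns={"site": "string", "value": "double", "altitude_m": "int"},
    location="s3://example-enriched-bucket/pressure/",  # hypothetical bucket
)
print(ddl)
```

The column list is the schema, the SerDe line declares the file format, and `LOCATION` points the table at an S3 prefix. Other formats such as CSV or Parquet would use a different row format or storage clause.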

• Upload reference data
• Start Kinesis Analytics Applications
• Create Athena tables
• Demo data going through Analytics Applications
  • Python data producer
  • View data in raw, enriched, and reference buckets
• Dive into Analytics Application in console
  • SQL syntax
• Check out Athena query interface
MANUAL STEPS 17

DIVE INTO CODED INFRA, TO THE CODE! 18

• In this session, we have covered:
  • Benefits of event-driven architectures
  • Overview of demo architecture
  • Deployment of coded infrastructure
  • Overview of AWS components used
  • Manual steps to complete setup of Analytics Applications and Athena
• Thank you for coming
• Please talk to me if you have further questions
CONCLUSION 19

• Part 2, Last Wednesday of May
  • Full Athena impl
  • Data warehousing with Redshift
  • Shared app layer with Elasticsearch
  • Redshift/S3 integration using Redshift Spectrum
  • Other sweet AWS data things…
  • Crowdsourcing ideas welcome!
    • Curiosities
    • Problems
• AppSync Part 2, Second Wednesday of May – Austin Loveless
• IoT Core, Last Wednesday of June – Kevin Tinn
UPCOMING MEETUPS 20

WWT does a ton of data pipeline projects on all major platforms.
Reach out on LinkedIn: www.linkedin.com/in/kevin-tinn
21
