aws-data-pipelines-part-1

Overview of Kinesis Data Streams, Kinesis Data Firehose delivery streams, Kinesis Data Analytics, and an S3 + Athena data lake

Kevin Tinn

April 29, 2020

Transcript

  1. INTRO: KEVIN TINN

    • Cloud Application Architect Practice Lead at World Wide Technology
    • Over a decade of experience in the industry
    • Software Development
      • .NET (C# and VB.NET)
      • JVM (Scala)
      • JavaScript
    • Data Streaming Architectures
      • Kafka, Kinesis, Event Hubs
    • Application Architecture
      • AWS, Azure, and on-prem solutions
    • Incessant traveler with a new-found skiing addiction
    • I live in Denver, by way of St. Louis, and grew up in TN
  2. AGENDA

    • Overview of data pipelines and event streaming - 6:00 - 10 minutes
    • POC overview and pipeline architecture overview - 20 minutes
    • Overview of components and IaC - 30 minutes
      • Kinesis Streams, Firehose, Analytics, S3, and Athena
    • Demo and console tour of components - 20 minutes
      • Manual start of Analytics apps and creation of Athena tables
    • Dive into IaC - 20 minutes
    • Q&A
  3. REPO INFO

    • Source code for the AWS infrastructure and the Kinesis producer is available at https://github.com/kevasync/aws-meetup-group-data-services
    • This link is also available in the comments of the Meetup details: https://www.meetup.com/AWSMeetupGroup/events/269768602/
    • Please join the Meetup if you haven't already
  4. GOAL: MOVE FROM BATCH TO EVENT

    Traditional batch architectures result in systems that move data around in batches, which only allows apps to be as up to date as the most recent batch. Highly dependent systems may even write to each other's persistence layer.
  5. GOAL: MOVE FROM BATCH TO EVENT

    As the number of apps grows, the communication mesh becomes ridiculous.
  6. DEMO DATA PIPELINE OVERVIEW

    • Problem statement: simulate the ingestion and transformation of sensor data
    • Produce messages for temperature and pressure readings at a manufacturing site
    • Raw data is stored for compliance purposes
    • Data is split into two streams: one for temperature, the other for pressure
    • Enrich pressure data with altitude reference data
    • Enrich temperature data with ambient weather reference data
    • Store enriched data in an S3 data lake with Athena query capabilities
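
The flow described on this slide can be sketched in plain Python, with no AWS services involved. This is only an illustration of the split-and-enrich idea, not the code from the talk's repo: the `Reading` type, the site name, and the reference values are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical reference data; in the demo this kind of data lives in S3.
ALTITUDE_BY_SITE = {"site-1": 1609.0}     # metres, illustrative value
AMBIENT_TEMP_BY_SITE = {"site-1": 18.5}   # degrees C, illustrative value


@dataclass
class Reading:
    site: str
    kind: str     # "temperature" or "pressure"
    value: float


def produce(n, seed=0):
    """Simulate n sensor readings, alternating temperature and pressure."""
    rng = random.Random(seed)
    readings = []
    for i in range(n):
        if i % 2 == 0:
            readings.append(Reading("site-1", "temperature", round(rng.uniform(15, 30), 2)))
        else:
            readings.append(Reading("site-1", "pressure", round(rng.uniform(95, 105), 2)))
    return readings


def run_pipeline(readings):
    """Keep a raw copy, split by reading kind, then enrich each stream."""
    raw = list(readings)  # raw data retained for compliance
    temperature = [r for r in readings if r.kind == "temperature"]
    pressure = [r for r in readings if r.kind == "pressure"]
    enriched_temperature = [
        {"site": r.site, "temperature": r.value, "ambient": AMBIENT_TEMP_BY_SITE[r.site]}
        for r in temperature
    ]
    enriched_pressure = [
        {"site": r.site, "pressure": r.value, "altitude": ALTITUDE_BY_SITE[r.site]}
        for r in pressure
    ]
    return raw, enriched_temperature, enriched_pressure


raw, temps, pressures = run_pipeline(produce(6))
print(len(raw), len(temps), len(pressures))
```

In the demo architecture the three returned collections would correspond roughly to the raw bucket and the two enriched outputs.
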
  7. DATA PIPELINE ARCHITECTURE

    Is something missing? Nope. No VPCs, subnets, CIDR blocks, VMs, Security Groups, or Network ACLs.
  8. INTRO TO CODED INFRA AND DEPLOYMENT

    • Using Pulumi to create Infrastructure as Code
    • Check out the repo from my Terraform and Pulumi Meetup: https://github.com/kevasync/aws-meetup-group-terraform
    • Wanted to use Pulumi on this to try out the new v2.0 release
      • Introduces full fidelity between languages, including full C# support
    • Love Terraform too
    • Demo
      • Get into the repo and take a spin around the project
      • Deploy from the Pulumi CLI
  9. COMPONENTS: KINESIS STREAM

    • Amazon Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data
    • Allows for many-to-many communication with extremely low latency
    • Fully managed service
    • A stream consists of shards, which allow for parallelism
    • Messages are produced with a partition key, which determines the shard and allows time-ordered processing of messages that share a key
    • When dealing with stream processing, always make consumption an idempotent process
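
The two ideas at the end of this slide, partition keys and idempotent consumption, can be sketched in a few lines. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer. The sketch below assumes the shards split that hash space evenly, which is a simplification, and the record layout and `IdempotentConsumer` class are hypothetical rather than part of any AWS SDK.

```python
import hashlib


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index.

    Assumption: shards divide the 128-bit MD5 hash-key space evenly.
    """
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    return min(hash_key // range_size, shard_count - 1)


class IdempotentConsumer:
    """Applies each record at most once, even if it is delivered again."""

    def __init__(self):
        self.seen = set()     # (shard, sequence_number) pairs already applied
        self.totals = {}      # running sum per partition key

    def handle(self, record):
        record_id = (record["shard"], record["sequence_number"])
        if record_id in self.seen:
            return False      # duplicate delivery: ignore it
        self.seen.add(record_id)
        key = record["key"]
        self.totals[key] = self.totals.get(key, 0) + record["value"]
        return True


consumer = IdempotentConsumer()
record = {
    "shard": shard_for_key("sensor-42", 4),
    "sequence_number": 1,
    "key": "sensor-42",
    "value": 10,
}
consumer.handle(record)
consumer.handle(record)            # redelivered record has no further effect
print(consumer.totals)
```

Because the same key always hashes to the same shard, readings from one sensor stay in order relative to each other. Tracking seen record IDs in memory is the simplest possible approach; a real consumer would more likely persist that state or make the write itself idempotent.
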
  10. COMPONENTS: KINESIS FIREHOSE DELIVERY STREAM

    • Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to the destinations listed below
    • Dump data to destinations (commonly referred to as sinks) without writing code
    • Mapping transformations allow for light ETL tasks
    • Various destinations are supported
      • S3
      • Redshift
      • Elasticsearch
      • Splunk
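
One way Firehose supports light transformation is by invoking a Lambda function on each batch of records. The slides do not say whether the demo pipeline does this, so the handler below is only a sketch of that mechanism. The record envelope (`recordId`, base64 `data`, and a `result` of `Ok` or `ProcessingFailed`) follows the Firehose transformation contract; the `temperature_c` and `temperature_f` field names are hypothetical.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation Lambda: add a Fahrenheit field to each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: derive Fahrenheit from Celsius.
            payload["temperature_f"] = payload["temperature_c"] * 9 / 5 + 32
            data = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError, TypeError):
            # Malformed input: hand the original data back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


sample = {
    "records": [
        {"recordId": "1", "data": base64.b64encode(b'{"temperature_c": 100}').decode("utf-8")},
        {"recordId": "2", "data": base64.b64encode(b"not json").decode("utf-8")},
    ]
}
for rec in handler(sample, None)["records"]:
    print(rec["recordId"], rec["result"])
```

The trailing newline appended to each payload keeps the delivered objects line-delimited, which tends to make them easier to query later.
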
  11. COMPONENTS: KINESIS ANALYTICS APPLICATION

    • With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics
    • Allows for stream input to be combined with other streams, as well as reference data from S3
    • Uses SQL syntax that is relatively intuitive
    • Apache Flink can be used as well
    • Similar to KSQL in the Confluent Platform
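
To give a feel for the time-series analytics mentioned above, here is a plain-Python imitation of a tumbling-window average, the kind of aggregate a streaming SQL application might compute. It is not Kinesis Data Analytics SQL and does not use the service; the event tuple layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_average(events, window_seconds):
    """Average value per (window_start, key) over fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])   # (window_start, key) -> [total, count]
    for ts, key, value in events:
        window_start = ts - ts % window_seconds
        bucket = sums[(window_start, key)]
        bucket[0] += value
        bucket[1] += 1
    return {k: total / count for k, (total, count) in sorted(sums.items())}


events = [(0, "temp", 10), (5, "temp", 20), (12, "temp", 30)]
print(tumbling_average(events, 10))
```

Each event lands in exactly one window, so the first two readings are averaged together and the third forms its own window.
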
  12. COMPONENTS: S3 & ATHENA

    • S3, or Amazon Simple Storage Service, is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface
    • Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run
    • Athena databases are configured to read from an S3 bucket
    • An Athena database table specifies the schema of the files, as well as their format
      • CSV
      • TSV
      • JSON
      • Parquet
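
As the last bullets say, an Athena table definition pairs a schema and a file format with an S3 location. The helper below builds such a `CREATE EXTERNAL TABLE` statement for JSON files as a string. The database, table, column, and bucket names are hypothetical, and the statement is only generated here, not submitted to Athena.

```python
def create_table_ddl(database, table, columns, s3_location):
    """Build Athena DDL for an external table over JSON files in S3.

    columns: list of (name, type) pairs, e.g. [("site", "string")].
    """
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


ddl = create_table_ddl(
    "sensor_lake",                                   # hypothetical database
    "enriched_temperature",                          # hypothetical table
    [("site", "string"), ("temperature", "double"), ("ambient", "double")],
    "s3://example-enriched-bucket/temperature/",     # hypothetical bucket
)
print(ddl)
```

The resulting statement could be pasted into the Athena query editor. Other formats from the list above, such as CSV or Parquet, would need a different row format or storage clause.
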
  13. MANUAL STEPS

    • Upload reference data
    • Start Kinesis Analytics Applications
    • Create Athena tables
    • Demo data going through Analytics Applications
    • Python data producer
    • View data in raw, enriched, and reference buckets
    • Dive into Analytics Application in console
    • SQL syntax
    • Check out Athena query interface
  14. CONCLUSION

    • In this session, we have covered:
      • Benefits of event-driven architectures
      • Overview of demo architecture
      • Deployment of coded infrastructure
      • Overview of AWS components used
      • Manual steps to complete setup of Analytics Applications and Athena
    • Thank you for coming
    • Please talk to me if you have further questions
  15. UPCOMING MEETUPS

    • Part 2, last Wednesday of May
      • Full Athena implementation
      • Data warehousing with Redshift
      • Shared app layer with Elasticsearch
      • Redshift/S3 integration using Redshift Spectrum
      • Other sweet AWS data things…
    • Crowdsourcing ideas welcome!
      • Curiosities
      • Problems
    • AppSync Part 2, second Wednesday of May - Austin Loveless
    • IoT Core, last Wednesday of June - Kevin Tinn
  16. WWT does a ton of data pipeline projects on all major platforms.

    Reach out on LinkedIn: www.linkedin.com/in/kevin-tinn