Upgrade to Pro — share decks privately, control downloads, hide ads and more …

AWS Data Pipeline Part 2

AWS Data Pipeline Part 2

Continue overview of AWS data pipeline components: Glue, Redshift, Elasticsearch, and Athena.

Kevin Tinn

May 27, 2020
Tweet

More Decks by Kevin Tinn

Other Decks in Technology

Transcript

  1. • Intro • Overview of data pipelines and event streaming

    • Review of Part 1 • Overview of new pipeline use cases • Overview of components • Glue, Redshift, Redshift Spectrum, Elasticsearch, and more Athena • Demo and Console tour of components • Look at Parquet data in s3, create Athena tables/queries • Set ES privs and query from command line • Query Redshift using join of standard and external Spectrum-sourced table • Dive into IaC • Meetup News & Notes • Q&A AGENDA 2
  2. • Cloud Application Architect Practice Lead at World Wide Technology

    • Over a decade of experience in the industry • Software Development • .NET (C# and VB.NET) • JVM (Scala) • JavaScript • Data Streaming Architectures • Kafka, Kinesis, Event Hubs • Application Architecture • AWS, Azure, and on-prem solutions • Incessant traveler with a new-found skiing addiction • I Live in Denver, by way of St Louis, and grew up in TN INTRO: KEVIN TINN 3
  3. Source code for AWS infrastructure and the kinesis producer are

    available at https://github.com/kevasync/aws-meetup-group-data-services This link is available in the comments of the Meetup deets: https://www.meetup.com/AWSMeetupGroup/events/270511655/ Part 1 Meetup w/ useful links https://www.meetup.com/AWSMeetupGroup/events/269768602/ Please join the Meetup if you haven’t already REPO INFO 4
  4. Traditional batch architectures result in systems move data around in

    batches, which only allows apps to be as up to date, highly dependent systems may even write to each other’s persistence layer GOAL: MOVE FROM BATCH TO EVENT 5
  5. As the number of apps grows the communication mesh becomes

    significantly ridiculous GOAL: MOVE FROM BATCH TO EVENT 6
  6. • Using Pulumi to create Infrastructure as Code • Check

    out repo from my Terraform and Pulumi Meetup: https://github.com/kevasync/aws-meetup-group-terraform • Wanted to use Pulumi on this to try out the new v2.0 release • Introduces full fidelity between languages, including full C# support • Love Terraform too INTRO TO CODED INFRA AND DEPLOYMENT 9
  7. • Problem Statement: Simulate the ingestion and transformation of sensor

    data • Produce messages for temperature and pressure readings at a manufacturing site • Raw data is stored for compliance purposes • Data is separated into separate streams of data: one for temp, the other for pressure • Enrich pressure data with altitude reference data • Enrich temperature data with ambient weather reference data • Store enriched data in s3 data lake with Athena query capabilities PART 1 REVIEW: DATA PIPELINE OVERVIEW 10
  8. • Kinesis Data Streams is a massively scalable, highly durable

    data ingestion and processing service optimized for streaming data • Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations • Kinesis Data Analytics allows for transforming, enriching, and analyzing streaming data using a SQL syntax • S3 or Simple Storage Service is a service offered by AWS that provides object storage through a web service interface • Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL PART 1 REVIEW: COMPONENTS 12
  9. • Data transformation is required to fix the issue we

    had with Athena only returning a single result • Adjust Firehose to convert data to Parquet format prior to loading into s3 • Parquet: columnar storage file format • Organizing by column allows for better compression, as data is more homogeneous • Load data into Elasticsearch from data streams for high-performance full- text application queries • Kinesis Data Analytics allows for transforming, enriching, and analyzing streaming data using a SQL syntax • S3 or Simple Storage Service is a service offered by AWS that provides object storage through a web service interface • Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL PART 2 USE CASES 13
  10. • AWS Glue is a fully managed extract, transform, and

    load (ETL) service that makes it easy for customers to prepare and load their data for analytics. • We’re using it for the purpose of transformation to Parquet in our enriched s3 Firehoses • Components/Features • Data Catalog • Schema detection • Code generation • Data cleansing • Job scheduling • Streaming ETL (Hey that’s us!) COMPONENTS: GLUE 15
  11. • Amazon Elasticsearch Service is a fully managed service that

    makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale. • Based on Apache Lucene full-text search (From wiki): • While suitable for any application that requires full text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching. • Lucene includes a feature to perform a fuzzy search based on edit distance. • AWS provides managed version, but with dedicated cluster that constantly incurs cost COMPONENTS: ELASTICSEARCH 16
  12. • Amazon Redshift is a data warehouse product which forms

    part of the larger cloud-computing platform Amazon Web Services. • Wiki: The name means to shift away from Oracle, red being an allusion to Oracle, whose corporate color is red and is informally referred to as "Big Red.” • Industry-leading performance • Efficient Storage • Massive Scalability (Petabyte-scale storage and analytics) • Extremely performant queries against vast columnar storage volumes • Improved query performance over large dataset • Managed EDW • Very competitive costs COMPONENTS: REDSHIFT 17
  13. • Amazon Redshift allows AWS customers to build exabyte*-scale data

    warehouses that unify data from a variety of internal and external sources • In-place queries allow data to be joined with internal Redshift sources while not requiring that data be stored in Redshift • Built-in to the Redshift SQL query language • Only pay for queries/scans that you run • It can still get expensive. If querying the same external data constantly, store it in Redshift * exabyte: 1 EB = 1018bytes = 10006bytes = 1000000000000000000B = 1000 petabytes = 1millionterabytes = 1billiongigabytes COMPONENTS: REDSHIFT SPECTRUM 19
  14. • Upload reference data • Start Kinesis Analytics Applications •

    Create Athena tables • Demo data going through Analytics Applications • Py data producer • View data in raw, enriched, and reference buckets • Dive into Analytics Application in console • SQL Syntax • Check out parquet data • Athena tables and full query capabilities demo CONSOLE STEPS 20
  15. • Set Elasticsearch privs • Query Elasticsearch using curl •

    Create Redshift reference table data • Create Redshift Spectrum schema/database • Join S3 data and Redshift data using Redshift Spectrum queries CONSOLE STEPS 21
  16. • In this session, we have covered: • Benefits of

    event driven architectures • Overview of demo architecture • Deployment of coded infrastructure • Overview of AWS components used • Manual steps to complete setup of Glue, Athena, Redshift, and Elasticsearch • Thank you for coming • Please message to me if you have further questions, or to talk shop CONCLUSION 23
  17. If you stood up the infra, destroy it. Dedicated clusters

    such as Elasticsearch and Redshift get expensive, quickly. A WORD ON COSTS… 24
  18. • We recently had the Meetup’s 1-year anniversary – w00t

    • Upcoming: IoT Core, Last Wednesday of June – Kevin Tinn/Austin Loveless • We are updating the cadence of the Meetup • Going to once a month, the last Wednesday of every month • Hit us up and let us know what you would like to learn about, or if you are interested in speaking MEETUP NEWS AND NOTES 25
  19. WWT does a ton of data pipeline work (Cloud in

    general), and we do it [well] on all major platforms Reach out on LinkedIn: www.linkedin.com/in/kevin-tinn 26