Slide 1

AWS DENVER DATA PIPELINES – PART 2
AWS Meetup Group, May 27 - Kevin Tinn

Slide 2

AGENDA
• Intro
• Overview of data pipelines and event streaming
• Review of Part 1
• Overview of new pipeline use cases
• Overview of components
  • Glue, Redshift, Redshift Spectrum, Elasticsearch, and more Athena
• Demo and console tour of components
  • Look at Parquet data in S3, create Athena tables/queries
  • Set Elasticsearch privileges and query from the command line
  • Query Redshift using a join of a standard table and an external Spectrum-sourced table
• Dive into IaC
• Meetup News & Notes
• Q&A

Slide 3

INTRO: KEVIN TINN
• Cloud Application Architect Practice Lead at World Wide Technology
• Over a decade of experience in the industry
  • Software Development
    • .NET (C# and VB.NET)
    • JVM (Scala)
    • JavaScript
  • Data Streaming Architectures
    • Kafka, Kinesis, Event Hubs
  • Application Architecture
    • AWS, Azure, and on-prem solutions
• Incessant traveler with a new-found skiing addiction
• I live in Denver, by way of St Louis, and grew up in TN

Slide 4

REPO INFO
Source code for the AWS infrastructure and the Kinesis producer is available at https://github.com/kevasync/aws-meetup-group-data-services
This link is also available in the comments of the Meetup deets: https://www.meetup.com/AWSMeetupGroup/events/270511655/
Part 1 Meetup with useful links: https://www.meetup.com/AWSMeetupGroup/events/269768602/
Please join the Meetup if you haven’t already

Slide 5

GOAL: MOVE FROM BATCH TO EVENT
Traditional batch architectures move data between systems in batches, so each application is only as up to date as the last batch run. Highly dependent systems may even write directly to each other’s persistence layers.

Slide 6

GOAL: MOVE FROM BATCH TO EVENT
As the number of apps grows, the point-to-point communication mesh becomes unmanageable.

Slide 7

GOAL: MOVE FROM BATCH TO EVENT

Slide 8

GOAL: MOVE FROM BATCH TO EVENT

Slide 9

INTRO TO CODED INFRA AND DEPLOYMENT
• Using Pulumi to create Infrastructure as Code
• Check out the repo from my Terraform and Pulumi Meetup: https://github.com/kevasync/aws-meetup-group-terraform
• Wanted to use Pulumi on this to try out the new v2.0 release
  • Introduces full fidelity between languages, including full C# support
• Love Terraform too

Slide 10

PART 1 REVIEW: DATA PIPELINE OVERVIEW
• Problem statement: simulate the ingestion and transformation of sensor data
• Produce messages for temperature and pressure readings at a manufacturing site
• Raw data is stored for compliance purposes
• Data is separated into two streams: one for temperature, the other for pressure
• Enrich pressure data with altitude reference data
• Enrich temperature data with ambient weather reference data
• Store enriched data in an S3 data lake with Athena query capabilities
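The producer side of the steps above can be sketched in a few lines. This is an illustrative sketch, not the repo's actual producer: the stream name, field names, and value ranges are assumptions.

```python
import json
import random
import time

def make_reading(site: str = "plant-1") -> dict:
    """Build a fake sensor reading; field names here are illustrative,
    not necessarily the demo repo's exact schema."""
    return {
        "site": site,
        "timestamp": int(time.time()),
        "temperature_f": round(random.uniform(60, 110), 1),
        "pressure_psi": round(random.uniform(10, 30), 2),
    }

def send_reading(stream_name: str = "raw-sensor-data") -> None:
    """Push one reading to Kinesis; requires AWS credentials and boto3."""
    import boto3  # imported lazily so the sketch runs without AWS set up
    kinesis = boto3.client("kinesis")
    record = make_reading()
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["site"],  # readings from one site stay ordered
    )
```

Partitioning by site keeps each site's readings in order within a shard, which matters once downstream Analytics applications window over the stream.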

Slide 11

PART 1 REVIEW: PIPELINE ARCHITECTURE

Slide 12

PART 1 REVIEW: COMPONENTS
• Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data
• Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations
• Kinesis Data Analytics allows for transforming, enriching, and analyzing streaming data using a SQL syntax
• S3 (Simple Storage Service) is an AWS service that provides object storage through a web service interface
• Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL

Slide 13

PART 2 USE CASES
• Data transformation is required to fix the issue we had with Athena only returning a single result
• Adjust Firehose to convert data to Parquet format prior to loading into S3
  • Parquet: columnar storage file format
  • Organizing by column allows for better compression, as data is more homogeneous
• Load data into Elasticsearch from the data streams for high-performance full-text application queries
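The Firehose-to-Parquet conversion is configured, not coded: Firehose's record format conversion feature parses incoming JSON and writes Parquet using a schema it reads from the Glue Data Catalog. A sketch of that configuration block (passed inside `ExtendedS3DestinationConfiguration` to `firehose.create_delivery_stream`), with placeholder database, table, and role names:

```python
# Sketch of Firehose record-format conversion settings (JSON in, Parquet out).
# The Glue database/table and the IAM role ARN below are placeholders.
data_format_conversion = {
    "Enabled": True,
    "InputFormatConfiguration": {
        "Deserializer": {"OpenXJsonSerDe": {}}  # parse incoming JSON records
    },
    "OutputFormatConfiguration": {
        "Serializer": {"ParquetSerDe": {}}      # emit columnar Parquet files
    },
    "SchemaConfiguration": {
        # Firehose reads column names/types from this Glue Data Catalog table
        "DatabaseName": "sensor_db",
        "TableName": "enriched_temperature",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role",
    },
}
```

Because the schema lives in Glue, Athena can query the resulting Parquet files with the same table definition Firehose used to write them.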

Slide 14

UPDATED PIPELINE ARCHITECTURE

Slide 15

COMPONENTS: GLUE
• AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics
• We’re using it for transformation to Parquet in our enriched S3 Firehoses
• Components/Features
  • Data Catalog
  • Schema detection
  • Code generation
  • Data cleansing
  • Job scheduling
  • Streaming ETL (hey, that’s us!)

Slide 16

COMPONENTS: ELASTICSEARCH
• Amazon Elasticsearch Service is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale
• Based on Apache Lucene full-text search (from Wikipedia):
  • "While suitable for any application that requires full text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching."
  • "Lucene includes a feature to perform a fuzzy search based on edit distance."
• AWS provides a managed version, but with a dedicated cluster that constantly incurs cost
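The edit-distance fuzzy search mentioned above maps directly onto Elasticsearch's `fuzzy` query. A sketch of the JSON body you would POST to the `_search` endpoint; the index name, field, and value are assumptions for illustration:

```python
import json

# Sketch of an Elasticsearch fuzzy query: "plnt-1" is deliberately
# misspelled, but edit-distance matching still finds "plant-1".
# Index and field names are placeholders.
fuzzy_query = {
    "query": {
        "fuzzy": {
            "site": {
                "value": "plnt-1",
                "fuzziness": "AUTO",  # edit distance chosen by term length
            }
        }
    }
}
body = json.dumps(fuzzy_query)
```

From the command line this is roughly `curl -XPOST https://<domain>/sensor-readings/_search -H 'Content-Type: application/json' -d "$body"`, assuming your IP or IAM identity has been granted access in the domain's access policy.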

Slide 17

COMPONENTS: REDSHIFT
• Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services
• Wikipedia: the name means to shift away from Oracle, red being an allusion to Oracle, which is informally referred to as "Big Red" because its corporate color is red
• Industry-leading performance
  • Efficient storage
  • Massive scalability (petabyte-scale storage and analytics)
  • Extremely performant queries against vast columnar storage volumes
  • Improved query performance over large datasets
• Managed EDW
• Very competitive costs

Slide 18

COMPONENTS: REDSHIFT, COLUMNAR STORAGE
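A toy sketch of the columnar idea, using only the standard library: the same readings serialized row-by-row versus column-by-column. The columnar layout avoids repeating field names and groups similar values together, which is why formats like Parquet and engines like Redshift compress so well. (The data and numbers here are made up for illustration.)

```python
import json
import zlib

# 1000 fake readings: same content, two layouts.
rows = [{"site": "plant-1", "temperature_f": 70 + (i % 5)} for i in range(1000)]

# Row-oriented: every record repeats the field names.
row_bytes = json.dumps(rows).encode()

# Column-oriented: field names appear once; each column is homogeneous.
col_bytes = json.dumps({
    "site": [r["site"] for r in rows],
    "temperature_f": [r["temperature_f"] for r in rows],
}).encode()

# Deflate both; homogeneous columns tend to compress further still.
row_zip = zlib.compress(row_bytes)
col_zip = zlib.compress(col_bytes)
```

Real columnar engines go further than this sketch: per-column encodings (run-length, delta, dictionary) and the ability to scan only the columns a query touches.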

Slide 19

COMPONENTS: REDSHIFT SPECTRUM
• Amazon Redshift Spectrum allows AWS customers to build exabyte*-scale data warehouses that unify data from a variety of internal and external sources
• In-place queries allow external data to be joined with internal Redshift tables without requiring that the data be stored in Redshift
• Built into the Redshift SQL query language
• Only pay for the queries/scans that you run
  • It can still get expensive: if you query the same external data constantly, store it in Redshift instead

* exabyte: 1 EB = 10^18 bytes = 1000^6 bytes = 1,000,000,000,000,000,000 B = 1000 petabytes = 1 million terabytes = 1 billion gigabytes
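The in-place query flow above takes two statements in Redshift SQL: register the Glue catalog database as an external schema, then join the S3-resident table against a local one. This is a sketch; the schema, table, column, and IAM role names are placeholders for the demo's actual ones:

```sql
-- Register the Glue Data Catalog database as an external schema.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'sensor_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Join S3 data (scanned by Spectrum) with a local Redshift table.
SELECT s.site, r.site_name, AVG(s.temperature_f) AS avg_temp
FROM spectrum.enriched_temperature AS s  -- external, lives in S3
JOIN site_reference AS r                 -- internal Redshift table
  ON s.site = r.site_id
GROUP BY s.site, r.site_name;
```

Spectrum charges by bytes scanned, so the Parquet conversion pays off twice here: columnar files let the scan skip the columns the query never touches.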

Slide 20

CONSOLE STEPS
• Upload reference data
• Start Kinesis Analytics applications
• Create Athena tables
• Demo data going through Analytics applications
  • Python data producer
  • View data in raw, enriched, and reference buckets
• Dive into the Analytics application in the console
  • SQL syntax
• Check out the Parquet data
• Athena tables and full query capabilities demo

Slide 21

CONSOLE STEPS
• Set Elasticsearch privileges
• Query Elasticsearch using curl
• Create Redshift reference table data
• Create a Redshift Spectrum schema/database
• Join S3 data and Redshift data using Redshift Spectrum queries

Slide 22

LET’S CHECK OUT THE CODED INFRA

Slide 23

CONCLUSION
• In this session, we have covered:
  • Benefits of event-driven architectures
  • Overview of the demo architecture
  • Deployment of coded infrastructure
  • Overview of the AWS components used
  • Manual steps to complete the setup of Glue, Athena, Redshift, and Elasticsearch
• Thank you for coming
• Please message me if you have further questions, or to talk shop

Slide 24

A WORD ON COSTS…
If you stood up the infra, destroy it. Dedicated clusters such as Elasticsearch and Redshift get expensive, quickly.

Slide 25

MEETUP NEWS AND NOTES
• We recently had the Meetup’s 1-year anniversary – w00t
• Upcoming: IoT Core, last Wednesday of June – Kevin Tinn/Austin Loveless
• We are updating the cadence of the Meetup
  • Going to once a month, the last Wednesday of every month
• Hit us up and let us know what you would like to learn about, or if you are interested in speaking

Slide 26

WWT does a ton of data pipeline work (and cloud work in general), and we do it well on all major platforms
Reach out on LinkedIn: www.linkedin.com/in/kevin-tinn