AWS Data Pipeline Part 2

AWS DENVER DATA PIPELINES – PART 2 AWS Meetup Group
May 27 - Kevin Tinn

AGENDA
• Intro
• Overview of data pipelines and event streaming
• Review of Part 1
• Overview of new pipeline use cases
• Overview of components
  • Glue, Redshift, Redshift Spectrum, Elasticsearch, and more Athena
• Demo and Console tour of components
  • Look at Parquet data in S3, create Athena tables/queries
  • Set ES privs and query from command line
  • Query Redshift using join of standard and external Spectrum-sourced table
• Dive into IaC
• Meetup News & Notes
• Q&A

INTRO: KEVIN TINN
• Cloud Application Architect Practice Lead at World Wide Technology
• Over a decade of experience in the industry
• Software Development
  • .NET (C# and VB.NET)
  • JVM (Scala)
  • JavaScript
• Data Streaming Architectures
  • Kafka, Kinesis, Event Hubs
• Application Architecture
  • AWS, Azure, and on-prem solutions
• Incessant traveler with a new-found skiing addiction
• I live in Denver, by way of St. Louis, and grew up in TN

REPO INFO
Source code for the AWS infrastructure and the Kinesis producer is available at https://github.com/kevasync/aws-meetup-group-data-services
This link is also available in the comments of the Meetup deets: https://www.meetup.com/AWSMeetupGroup/events/270511655/
Part 1 Meetup w/ useful links: https://www.meetup.com/AWSMeetupGroup/events/269768602/
Please join the Meetup if you haven’t already

GOAL: MOVE FROM BATCH TO EVENT
Traditional batch architectures result in systems that move data around in batches, which only allows apps to be as up to date as the most recent batch. Highly dependent systems may even write to each other’s persistence layer.

GOAL: MOVE FROM BATCH TO EVENT
As the number of apps grows, the communication mesh between them becomes ridiculously complex.

INTRO TO CODED INFRA AND DEPLOYMENT
• Using Pulumi to create Infrastructure as Code
• Check out the repo from my Terraform and Pulumi Meetup: https://github.com/kevasync/aws-meetup-group-terraform
• Wanted to use Pulumi on this to try out the new v2.0 release
  • Introduces full fidelity between languages, including full C# support
• Love Terraform too

PART 1 REVIEW: DATA PIPELINE OVERVIEW
• Problem Statement: Simulate the ingestion and transformation of sensor data
• Produce messages for temperature and pressure readings at a manufacturing site
• Raw data is stored for compliance purposes
• Data is split into two streams: one for temperature, the other for pressure
• Enrich pressure data with altitude reference data
• Enrich temperature data with ambient weather reference data
• Store enriched data in an S3 data lake with Athena query capabilities
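A minimal Python sketch of the producer side of this problem statement. The field names, value ranges, and stream name are illustrative assumptions, not taken from the repo. The only real API assumed is boto3's Kinesis `put_record(StreamName=..., Data=..., PartitionKey=...)`; a small recording stand-in replaces the client so the sketch runs offline.

```python
import json
import random
import time


def make_reading(sensor_type, rng, now=None):
    """Build one simulated reading. Field names and ranges are assumptions."""
    ranges = {"temperature": (60.0, 90.0), "pressure": (95.0, 105.0)}
    low, high = ranges[sensor_type]
    return {
        "sensorType": sensor_type,
        "value": round(rng.uniform(low, high), 2),
        "timestamp": now if now is not None else time.time(),
    }


def send_reading(kinesis_client, stream_name, reading):
    """Serialize a reading as JSON and put it on the raw stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["sensorType"],
    )


class RecordingClient:
    """Offline stand-in for boto3.client("kinesis"); it only records calls."""

    def __init__(self):
        self.records = []

    def put_record(self, **kwargs):
        self.records.append(kwargs)
        return {"SequenceNumber": str(len(self.records))}


client = RecordingClient()
rng = random.Random(42)
for sensor_type in ("temperature", "pressure"):
    # "raw-sensor-stream" is a hypothetical stream name.
    send_reading(client, "raw-sensor-stream", make_reading(sensor_type, rng, now=0))
```

Swapping `RecordingClient()` for a real `boto3.client("kinesis")` should be the only change needed to send to an actual stream, assuming credentials and the stream exist.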

PART 1 REVIEW: PIPELINE ARCHITECTURE

PART 1 REVIEW: COMPONENTS
• Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data
• Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations
• Kinesis Data Analytics allows for transforming, enriching, and analyzing streaming data using a SQL syntax
• S3 (Simple Storage Service) is an AWS service that provides object storage through a web service interface
• Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL

PART 2 USE CASES
• Data transformation is required to fix the issue we had with Athena only returning a single result
• Adjust Firehose to convert data to Parquet format prior to loading into S3
  • Parquet: columnar storage file format
  • Organizing by column allows for better compression, as data is more homogeneous
• Load data into Elasticsearch from data streams for high-performance full-text application queries
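A small Python sketch of why column organization helps compression. It uses plain run-length encoding as a stand-in for the encodings a columnar format applies; it is not Parquet's actual implementation, and the sample rows are made up for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Made-up sample rows: (site, sensor type, reading).
rows = [
    ("site-a", "temperature", 70),
    ("site-a", "temperature", 70),
    ("site-a", "temperature", 71),
    ("site-a", "temperature", 71),
    ("site-a", "temperature", 71),
    ("site-a", "temperature", 72),
]

# Row-oriented layout: the fields of each record sit next to each other.
row_layout = [field for row in rows for field in row]

# Column-oriented layout: all values of one column sit together.
column_layout = [field for column in zip(*rows) for field in column]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)
print(len(row_runs), len(column_runs))  # prints: 18 5
```

In the row layout no two neighbouring values are equal, so nothing collapses. In the column layout the repeated site and sensor type each become a single run.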

UPDATED PIPELINE ARCHITECTURE

COMPONENTS: GLUE
• AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
• We’re using it to transform data to Parquet in our enriched S3 Firehoses
• Components/Features
  • Data Catalog
  • Schema detection
  • Code generation
  • Data cleansing
  • Job scheduling
  • Streaming ETL (Hey, that’s us!)

COMPONENTS: ELASTICSEARCH
• Amazon Elasticsearch Service is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale.
• Based on Apache Lucene full-text search (from Wikipedia):
  • While suitable for any application that requires full text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching.
  • Lucene includes a feature to perform a fuzzy search based on edit distance.
• AWS provides a managed version, but with a dedicated cluster that constantly incurs cost
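To make "fuzzy search based on edit distance" concrete, here is a minimal Python sketch using plain Levenshtein distance. It only illustrates the idea; Lucene's own fuzzy matching is implemented differently and is not reproduced here. The `fuzzy_match` helper and its `max_edits` default are hypothetical names chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(term, candidates, max_edits=2):
    """Return the candidates within max_edits of the (possibly misspelled) term."""
    return [c for c in candidates if edit_distance(term, c) <= max_edits]


matches = fuzzy_match("temprature", ["temperature", "pressure", "altitude"])
print(matches)  # prints: ['temperature']
```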

COMPONENTS: REDSHIFT
• Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.
• Wikipedia: The name means to shift away from Oracle, red being an allusion to Oracle, whose corporate color is red and which is informally referred to as “Big Red.”
• Industry-leading performance
• Efficient storage
• Massive scalability (petabyte-scale storage and analytics)
• Extremely performant queries against vast columnar storage volumes
• Improved query performance over large datasets
• Managed EDW
• Very competitive costs

COMPONENTS: REDSHIFT, COLUMNAR STORAGE

COMPONENTS: REDSHIFT SPECTRUM
• Amazon Redshift allows AWS customers to build exabyte*-scale data warehouses that unify data from a variety of internal and external sources
• In-place queries allow external data to be joined with internal Redshift sources without requiring that the data be stored in Redshift
• Built into the Redshift SQL query language
• Only pay for the queries/scans that you run
  • It can still get expensive. If you query the same external data constantly, store it in Redshift
* exabyte: 1 EB = 10^18 bytes = 1000^6 bytes = 1,000,000,000,000,000,000 B = 1000 petabytes = 1 million terabytes = 1 billion gigabytes
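A rough Python sketch that renders the two kinds of SQL statements involved in an in-place Spectrum query: registering an external schema backed by the Glue Data Catalog, then joining an external table with a regular Redshift table. Every schema, table, column, and role name here is a hypothetical placeholder, not taken from the demo repo. The sketch only builds the statements as strings and does not connect to Redshift.

```python
def external_schema_sql(schema, catalog_database, iam_role_arn):
    """Render a CREATE EXTERNAL SCHEMA statement for a Data Catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def join_sql(external_table, internal_table, key):
    """Render a join between an external (S3-backed) table and a local table."""
    return (
        "SELECT r.site_name, AVG(e.value) AS avg_value\n"
        f"FROM {external_table} AS e\n"
        f"JOIN {internal_table} AS r ON r.{key} = e.{key}\n"
        "GROUP BY r.site_name;"
    )


# All identifiers below are placeholders for illustration only.
schema_statement = external_schema_sql(
    schema="spectrum",
    catalog_database="sensor_lake",
    iam_role_arn="arn:aws:iam::123456789012:role/example-spectrum-role",
)
join_statement = join_sql(
    external_table="spectrum.enriched_temperature",
    internal_table="public.sensor_reference",
    key="sensor_id",
)
print(schema_statement)
print(join_statement)
```

The join reads the enriched data where it sits in S3, while the reference table lives in Redshift itself.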

CONSOLE STEPS
• Upload reference data
• Start Kinesis Analytics Applications
• Create Athena tables
• Demo data going through Analytics Applications
  • Py data producer
  • View data in raw, enriched, and reference buckets
• Dive into Analytics Application in console
  • SQL Syntax
• Check out Parquet data
• Athena tables and full query capabilities demo
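The Athena steps above are done in the console during the demo. As a hedged alternative, this Python sketch shows roughly how a query could be submitted and polled with boto3's Athena client, using `start_query_execution` and `get_query_execution`. The database, table, and results bucket names are assumptions, and a stub client stands in for `boto3.client("athena")` so the sketch runs offline.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, then poll until Athena reports a terminal state."""
    execution_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        response = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return execution_id, state
        time.sleep(poll_seconds)


class StubAthenaClient:
    """Offline stand-in that walks through a canned sequence of states."""

    def __init__(self):
        self.states = ["QUEUED", "RUNNING", "SUCCEEDED"]
        self.submitted = None

    def start_query_execution(self, **kwargs):
        self.submitted = kwargs
        return {"QueryExecutionId": "stub-query-1"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": self.states.pop(0)}}}


client = StubAthenaClient()
query_id, final_state = run_athena_query(
    client,
    "SELECT * FROM enriched_temperature LIMIT 10",  # hypothetical table name
    database="sensor_lake",                          # hypothetical database
    output_location="s3://example-athena-results/",  # hypothetical bucket
    poll_seconds=0,
)
```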

CONSOLE STEPS
• Set Elasticsearch privs
• Query Elasticsearch using curl
• Create Redshift reference table data
• Create Redshift Spectrum schema/database
• Join S3 data and Redshift data using Redshift Spectrum queries
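The demo queries Elasticsearch with curl. For comparison, this Python sketch builds the equivalent `_search` request with the standard library. The domain endpoint, index name, and field name are hypothetical, and the request is only constructed, not sent, so the sketch runs without network access or a configured access policy.

```python
import json
import urllib.request


def build_search_request(endpoint, index, field, text):
    """Build a full-text match query against an index's _search endpoint."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "https://search-example-domain.us-west-2.es.amazonaws.com",  # hypothetical endpoint
    index="temperature",                                         # hypothetical index
    field="sensorType",                                          # hypothetical field
    text="temperature",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch stays offline.
print(request.get_method(), request.full_url)
```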

LET’S CHECK OUT THE CODED INFRA

CONCLUSION
• In this session, we have covered:
  • Benefits of event-driven architectures
  • Overview of demo architecture
  • Deployment of coded infrastructure
  • Overview of AWS components used
  • Manual steps to complete setup of Glue, Athena, Redshift, and Elasticsearch
• Thank you for coming
• Please message me if you have further questions, or to talk shop

A WORD ON COSTS…
If you stood up the infra, destroy it. Dedicated clusters such as Elasticsearch and Redshift get expensive, quickly.

MEETUP NEWS AND NOTES
• We recently had the Meetup’s 1-year anniversary – w00t
• Upcoming: IoT Core, last Wednesday of June – Kevin Tinn/Austin Loveless
• We are updating the cadence of the Meetup
  • Going to once a month, the last Wednesday of every month
• Hit us up and let us know what you would like to learn about, or if you are interested in speaking

WWT does a ton of data pipeline work (and Cloud in general), and we do it [well] on all major platforms
Reach out on LinkedIn: www.linkedin.com/in/kevin-tinn
