Introduction to AWS Glue DataBrew

In this session, I introduced AWS Glue DataBrew, a zero-code visual data preparation tool.

I covered which personas it caters to and how it can simplify data transformation and data quality work and make both more efficient.

AWS Glue DataBrew: https://aws.amazon.com/glue/features/databrew/

Kyle Escosia

May 25, 2021

Transcript

  1. Kyle Escosia. Jr. Data Science Specialist @ Info Alchemy. AWS Community Builder Program. 2+ years experience in AWS Big Data and Analytics. 2x AWS Certifications. linkedin.com/in/kyle-escosia/ dev.to/klescosia
  2. “Over the last 20 years, there’s been a surge in the variety and volume of data that companies can collect”
  3. Preparing data involves a lot of complex tasks: ELT, ETL, orchestration. Needs a lot of heavy lifting to work at scale.
  4. 80% of time is spent preparing data. Data Engineer, ETL Developers, Data Analysts, Data Scientists.
  5. Challenges with traditional data preparation. Manual: needs a lot of code-based heavy lifting for it to work at scale. Siloed: often requires moving large amounts of data into silos, at times out of VPCs. Time consuming: needs the right tools for the right persona that are integrated.
  6. Data preparation made easy. Built for Data Analysts and Data Scientists. Understand data quality: understand patterns and detect anomalies using profiles. Clean and normalize data: over 250 built-in transformations. Visually map data lineage: understand steps that the data has been through. Automate at scale: save transformations and apply to new data as it comes in.
  7. What we’ll see in the demo. Chess Dataset, AWS Glue DataBrew. ELO < 1000; RATED = TRUE; GROUP BY OPENING; Create and Start job.
  8. What we saw: Build a recipe; Profile the data; Run a job; Operationalize at scale; Schedule jobs; Use APIs/SDK; Reuse recipes.
  9. One-time data analysis for business reporting. Diagram labels: AWS Glue DataBrew; Amazon QuickSight; Amazon S3 output bucket; Data catalog data sources (Amazon S3, Amazon Redshift, Amazon RDS); Amazon Simple Storage Service (Amazon S3); Local file.
  10. Set up data quality rules with AWS Lambda. Diagram labels: Recurring raw data feed; Amazon S3; AWS Glue DataBrew; Amazon EventBridge; AWS Lambda; Amazon Simple Notification Service; Email notification.
  11. Data preprocessing for machine learning. Diagram labels: Amazon S3; AWS Glue DataBrew; JupyterLab environment; Model training; Inference; Amazon S3 output bucket.
  12. Orchestrating data preparation in workflows. Diagram labels: AWS Step Functions workflow; AWS Glue DataBrew; AWS Glue Data Catalog; Crawler; Amazon Redshift.
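
Slides 7 and 8 mention creating and starting a job, scheduling jobs, and using the APIs/SDK. A minimal sketch of those steps with the AWS SDK for Python (boto3) might look like the following. It is not taken from the talk: the function name and every resource name passed to it are hypothetical placeholders, and it assumes the dataset and a published recipe version already exist in DataBrew. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def run_recipe_job(client, job_name, dataset_name, recipe_name,
                   recipe_version, role_arn, bucket, cron=None):
    """Create a DataBrew recipe job, optionally schedule it, and start one run.

    `client` is expected to behave like boto3.client("databrew").
    All names and ARNs are supplied by the caller (hypothetical here).
    """
    # Create the job that applies a published recipe to a dataset
    # and writes CSV output to the given S3 bucket.
    client.create_recipe_job(
        Name=job_name,
        DatasetName=dataset_name,
        RecipeReference={"Name": recipe_name, "RecipeVersion": recipe_version},
        RoleArn=role_arn,
        Outputs=[{"Location": {"Bucket": bucket}, "Format": "CSV"}],
    )

    # Optionally attach a schedule so the job also runs on new data later.
    if cron is not None:
        client.create_schedule(
            Name=f"{job_name}-schedule",
            JobNames=[job_name],
            CronExpression=cron,
        )

    # Start a run immediately and hand back its identifier.
    response = client.start_job_run(Name=job_name)
    return response["RunId"]


# Example call with a real client (placeholder names, assumed to exist):
#
#   import boto3
#   run_id = run_recipe_job(
#       boto3.client("databrew"),
#       job_name="chess-openings-job",
#       dataset_name="chess-dataset",
#       recipe_name="chess-recipe",
#       recipe_version="1.0",
#       role_arn="arn:aws:iam::123456789012:role/HypotheticalDataBrewRole",
#       bucket="hypothetical-output-bucket",
#       cron="cron(0 12 * * ? *)",
#   )
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real account.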