ETL jobs with AWS Glue ecosystem

Maher Deeb
February 07, 2019

In this talk, I speak about ETL jobs using the AWS Glue service. I show the motivation behind using Glue, in other words, where Glue fits in the data pipeline. I list the advantages and limitations of using Glue to write ETL jobs. Finally, I summarize the most important lessons to take into account when using Glue.

Transcript

  1. Table of contents
     • Motivation
     • AWS Glue Service
     • ETL Jobs in Glue
       ◦ Advantages
       ◦ Limitations
     • Lessons learned
     • References
     Photo by AbsolutVision on Unsplash
  2. Motivation
     Data Lake: I don’t know and I don’t care!!!
     https://giphy.com/ http://www.mydataspeaks.com/foundation-elements-of-data-lake/ Photo by Lode Lagrainge on Unsplash
  3. Motivation
     Questions to answer:
     1. Is the data dumped correctly?
     2. What data do I have?
     3. Now that I have a use case, which data do I need to use? What can I do with the data?
     4. etc...
     https://giphy.com/ https://www.psychologytoday.com/us/blog/functioning-flourishing/201807/are-you-able-embrace-chaos
  4. AWS Glue Service
     AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
     https://www.topsimages.com/images/aws-data-lake-architecture-1b.html
  5. ETL Jobs in Glue: Advantages
     1. AWS Glue UI: no need to use the API to start using Glue. Click-based!!
     2. It supports Python and Scala.
     3. It uses the schema directly from the Glue catalog.
     4. It provides template scripts (minimum code requirements).
     5. Integrated IDE, or point to your own script.
     6. Direct connection to SageMaker with a DEV endpoint to develop your script.
     7. Logging and metrics monitoring.
     8. Scaling by choosing the number of DPUs. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. The minimum is 2 DPUs.
  6. ETL Jobs in Glue: more advantages
     1. Serverless! No DevOps experience is required.
     2. It supports job parameters (environment variables).
     3. Security.
     4. Job triggering and job scheduling (date- or event-based).
     5. Useful use cases: flattening nested data.
     6. Applying ETL jobs to streaming data.
     7. It supports Spark and Python shell jobs.
     8. Infrastructure as code (supported by Terraform).
  7. ETL Jobs in Glue: Limitations
     1. It supports only Python 2.7.
     2. Some libraries are restricted. For example, if you write an ETL job in Spark mode, you can’t use Pandas.
     3. For small data sets it can be too slow :(
     4. Debugging is a nightmare.
     5. Debugging is expensive.
     6. Limited number of jobs.
     7. Not good for long scripts: refactoring might not be possible.
     8. Not good for multi-level job dependencies.
  8. Lessons learned
     1. NEVER EVER forget to delete the DEV Endpoint.
     2. Optimize the number of DPUs for each job.
     3. There are better solutions if you don’t have a large amount of data (billed for a minimum of 10 minutes of run time).
     https://giphy.com/
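
To make the DPU and billing lessons concrete, here is a minimal sketch of a cost estimate for a single job run. It is not part of the talk. The DPU size (4 vCPU, 16 GB), the 2-DPU minimum, and the 10-minute minimum come from the slides. The price per DPU-hour and per-second billing beyond the minimum are assumptions for illustration, and the names `estimate_job` and `JobEstimate` are hypothetical, so check the current AWS pricing for your region before relying on the numbers.

```python
from dataclasses import dataclass

# Figures stated on the slides.
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16
MIN_DPUS = 2
MIN_BILLED_SECONDS = 10 * 60

# Assumption: illustrative price only, not taken from the talk.
ASSUMED_PRICE_PER_DPU_HOUR = 0.44


@dataclass(frozen=True)
class JobEstimate:
    dpus: int
    vcpus: int
    memory_gb: int
    billed_seconds: int
    cost: float


def estimate_job(dpus, runtime_seconds,
                 price_per_dpu_hour=ASSUMED_PRICE_PER_DPU_HOUR):
    """Estimate resources and cost of one job run (hypothetical helper)."""
    if dpus < MIN_DPUS:
        raise ValueError("a job needs at least %d DPUs" % MIN_DPUS)
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must not be negative")
    # Short runs are still billed for the 10-minute minimum.
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    cost = dpus * (billed_seconds / 3600) * price_per_dpu_hour
    return JobEstimate(
        dpus=dpus,
        vcpus=dpus * VCPUS_PER_DPU,
        memory_gb=dpus * MEMORY_GB_PER_DPU,
        billed_seconds=billed_seconds,
        cost=cost,
    )


if __name__ == "__main__":
    # A 1-minute job and a 10-minute job cost the same at equal DPUs.
    for runtime in (60, 600, 1800):
        print(runtime, estimate_job(dpus=2, runtime_seconds=runtime))
```

Under these assumptions, a run that finishes in one minute costs the same as one that takes the full ten minutes. That is why small data sets are usually better served elsewhere, and why the DPU count is worth tuning per job.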