Slide 1

Slide 1 text

EMR Data Ingestion Apache Hudi, Spark, Glue Catalog, S3 CONFIDENTIAL | © 2023 EPAM Systems, Inc. Alexey Novakov, Data Architect

Slide 2

Slide 2 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Agenda E M R C L U S T E R S E T U P D A T A S E T O V E R V I E W E M R S T E P E X E C U T I O N D A T A A N A L Y S I S I N A T H E N A D E M O C O D E O V E R V I E W W O R K F L O W 2

Slide 3

Slide 3 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. WORKFLOW

Slide 4

Slide 4 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR – AWS Managed Hadoop Cluster. EMR is a very popular platform among Data Teams for Apache Spark workloads with YARN 4

Slide 5

Slide 5 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Workflow 5 - EMR Job (Step): - loads CSV files from S3 path - saves data to Hudi table via Upsert - registers table in Data Catalog - Athena queries raw data from registered table - CLI, Web-Console, Airflow create an EMR cluster and add new steps

Slide 6

Slide 6 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR Cluster Spec Apache Hudi is available on every EMR node, but one can use own version inside EMR Step Integration with Glue Catalog

Slide 7

Slide 7 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR Nodes • Use Spot Instances for Test and Development • Use On-Demand Instances for critical workloads and for Master Node • Use Reserved instead of On- Demand Instances for 1-2 years for Master node 7

Slide 8

Slide 8 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. DATASET OVERVIEW 8

Slide 9

Slide 9 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Dataset Overview • CSV files at S3 bucket C U S TO M E R O R D E R S 9 Source: s3://input-data-etljobs/cdc-orders/ Total number of objects: 5256 Total size: 49.8 MB Total records: 1 321 317 Unique Orders: 533 799 (40.4 %) case class Order( orderId: Int, customerId: Int, itemId: Int, quantity: Int, year: Int, month: Int, day: Int, lastUpdateTime: Long ) Schema: orderId,customerId,itemId,quantity,year,month,day,last_update_time 1,1,1,1,2021,7,21,1626903226641 2,2,2,3,2021,7,21,1626903226642 3,2,2,5,2021,7,21,1626903226643 Sample: partition keys

Slide 10

Slide 10 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Dataset Generation 10 val orderMax = 100000 val orderIdGen = Gen.choose(1, orderMax) val customerIdGen = Gen.choose(1, 100) val itemIdGen = Gen.choose(1, 1000) val quantityGen = Gen.choose(2, 20) val yearGen = Gen.const(2021) val monthGen = Gen.choose(7, 7) val dayMin = 20 val dayMax = 25 val dayGen = Gen.choose(dayMin, dayMax) • records fall into 6 days interval • 100 000 max unique possible order IDs

Slide 11

Slide 11 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR STEP EXECUTION

Slide 12

Slide 12 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Add New Step 12 Spark parameters: - deploy-mode: cluster - Executor: 4 vCPU, 2GB RAM Job parameters: - paths to data - path to input schema - partition keys - different flags

Slide 13

Slide 13 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Completed Step 13 Debugging: - Spark History Server - std error output logs - Yarn Timeline Server

Slide 14

Slide 14 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Cluster Monitoring Tab 14

Slide 15

Slide 15 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. DATA ANALYSIS IN ATHENA

Slide 16

Slide 16 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Query Count 16

Slide 17

Slide 17 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Query Data 17

Slide 18

Slide 18 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Table Structure 18 CREATE EXTERNAL TABLE `orders`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `orderid` int, `customerid` int, `itemid` int, `quantity` int, `last_update_time` bigint, `execution_year` int, `execution_month` int, `execution_day` int) PARTITIONED BY ( `year` int, `month` int, `day` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetH iveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetO utputFormat' LOCATION 's3a://raw-data-etljobs/cdc-orders/orders' TBLPROPERTIES ( 'bucketing_version'='2', 'last_commit_time_sync'='20210726153658', 'transient_lastDdlTime'='1627196865')

Slide 19

Slide 19 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. DEMO CODE OVERVIEW

Slide 20

Slide 20 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. • Spark 3.1.1 (faster releases), Your HUDI version • EMR Notebooks to analyze data via Spark or PySpark API • Debug with SSH to cluster or with Web-Logs in EMR UI • Does not require entry script • No cold start issue • Incremental load via Spark Streaming checkpoints • Requires utilization strategy to be cost-effective (e.g. Spot, Reserved Instances) • Requires external schedular for recurrent jobs (Airflow) • Opportunity to run AWS EMR on EKS or run Spark alone on EKS G LU E J O B S E M R S T E P S 20 • Spark 2.4 (slow releases), Hudi Connector v0.5 • No Notebooks • Hard to debug (CloudWatch) • Require entry script • Cold start takes seconds to minutes • Incremental load via Glue Bookmarks • Pay-as-you-go cost • Scheduler is embedded in Glue API • Glue is proprietary AWS software Feature Comparison (High-Level) easy to switch