EMR Data Ingestion with Apache Hudi

Slide 1

Slide 1 text

Slide 2

Slide 2 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Agenda E M R C L U S T E R S E T U P D A T A S E T O V E R V I E W E M R S T E P E X E C U T I O N D A T A A N A L Y S I S I N A T H E N A D E M O C O D E O V E R V I E W W O R K F L O W 2

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Workflow 5 - EMR Job (Step): - loads CSV files from S3 path - saves data to Hudi table via Upsert - registers table in Data Catalog - Athena queries raw data from registered table - CLI, Web-Console, Airflow create an EMR cluster and add new steps

Slide 6

Slide 6 text

Slide 7

Slide 7 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR Nodes • Use Spot Instances for Test and Development • Use On-Demand Instances for critical workloads and for Master Node • Use Reserved instead of On- Demand Instances for 1-2 years for Master node 7

Slide 8

Slide 8 text

Slide 9

Slide 9 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Dataset Overview • CSV files at S3 bucket C U S TO M E R O R D E R S 9 Source: s3://input-data-etljobs/cdc-orders/ Total number of objects: 5256 Total size: 49.8 MB Total records: 1 321 317 Unique Orders: 533 799 (40.4 %) case class Order( orderId: Int, customerId: Int, itemId: Int, quantity: Int, year: Int, month: Int, day: Int, lastUpdateTime: Long ) Schema: orderId,customerId,itemId,quantity,year,month,day,last_update_time 1,1,1,1,2021,7,21,1626903226641 2,2,2,3,2021,7,21,1626903226642 3,2,2,5,2021,7,21,1626903226643 Sample: partition keys

Slide 10

Slide 10 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Dataset Generation 10 val orderMax = 100000 val orderIdGen = Gen.choose(1, orderMax) val customerIdGen = Gen.choose(1, 100) val itemIdGen = Gen.choose(1, 1000) val quantityGen = Gen.choose(2, 20) val yearGen = Gen.const(2021) val monthGen = Gen.choose(7, 7) val dayMin = 20 val dayMax = 25 val dayGen = Gen.choose(dayMin, dayMax) • records fall into 6 days interval • 100 000 max unique possible order IDs

Slide 11

Slide 11 text

Slide 12

Slide 12 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Add New Step 12 Spark parameters: - deploy-mode: cluster - Executor: 4 vCPU, 2GB RAM Job parameters: - paths to data - path to input schema - partition keys - different flags

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. Table Structure 18 CREATE EXTERNAL TABLE òrders`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, òrderid` int, `customerid` int, ìtemid` int, `quantity` int, `last_update_time` bigint, èxecution_year` int, èxecution_month` int, èxecution_day` int) PARTITIONED BY ( `year` int, `month` int, `day` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetH iveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetO utputFormat' LOCATION 's3a://raw-data-etljobs/cdc-orders/orders' TBLPROPERTIES ( 'bucketing_version'='2', 'last_commit_time_sync'='20210726153658', 'transient_lastDdlTime'='1627196865')

Slide 19

Slide 19 text

Slide 20

Slide 20 text

CONFIDENTIAL | © 2023 EPAM Systems, Inc. • Spark 3.1.1 (faster releases), Your HUDI version • EMR Notebooks to analyze data via Spark or PySpark API • Debug with SSH to cluster or with Web-Logs in EMR UI • Does not require entry script • No cold start issue • Incremental load via Spark Streaming checkpoints • Requires utilization strategy to be cost-effective (e.g. Spot, Reserved Instances) • Requires external schedular for recurrent jobs (Airflow) • Opportunity to run AWS EMR on EKS or run Spark alone on EKS G LU E J O B S E M R S T E P S 20 • Spark 2.4 (slow releases), Hudi Connector v0.5 • No Notebooks • Hard to debug (CloudWatch) • Require entry script • Cold start takes seconds to minutes • Incremental load via Glue Bookmarks • Pay-as-you-go cost • Scheduler is embedded in Glue API • Glue is proprietary AWS software Feature Comparison (High-Level) easy to switch