EMR Data Ingestion with Apache Hudi

EMR Data Ingestion: Apache Hudi, Spark, Glue Catalog, S3
Alexey Novakov, Data Architect
CONFIDENTIAL | © 2023 EPAM Systems, Inc.

Agenda
- Workflow
- EMR Cluster Setup
- Dataset Overview
- EMR Step Execution
- Data Analysis in Athena
- Demo Code Overview

EMR: AWS Managed Hadoop Cluster
EMR is a very popular platform among data teams for running Apache Spark workloads on YARN.

Workflow
- EMR Job (Step):
  - loads CSV files from an S3 path
  - saves data to a Hudi table via upsert
  - registers the table in the Data Catalog
- Athena queries raw data from the registered table
- CLI, Web Console, or Airflow creates an EMR cluster and adds new steps
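The upsert and catalog registration in the workflow above are driven by Hudi write options. The following is a minimal sketch of such an option set, not the demo's actual configuration: the record key (`orderId`), the precombine field (`last_update_time`), and the database name used in the usage note are assumptions, while the option keys themselves are standard Hudi datasource settings.

```scala
// Sketch: builds the option map for a Hudi upsert with catalog (Hive) sync.
// All argument values are supplied by the caller; nothing here is taken
// from the demo code itself.
object HudiUpsertOptions {
  def build(
      tableName: String,
      database: String,
      recordKey: String,
      precombineField: String,
      partitionFields: Seq[String]
  ): Map[String, String] = {
    val partitions = partitionFields.mkString(",")
    Map(
      "hoodie.table.name" -> tableName,
      // upsert: rows with an existing record key replace the stored row
      "hoodie.datasource.write.operation" -> "upsert",
      "hoodie.datasource.write.recordkey.field" -> recordKey,
      // when several incoming rows share a key, the largest value wins
      "hoodie.datasource.write.precombine.field" -> precombineField,
      "hoodie.datasource.write.partitionpath.field" -> partitions,
      "hoodie.datasource.write.keygenerator.class" ->
        "org.apache.hudi.keygen.ComplexKeyGenerator",
      // register or update the table in the metastore after each commit
      "hoodie.datasource.hive_sync.enable" -> "true",
      "hoodie.datasource.hive_sync.database" -> database,
      "hoodie.datasource.hive_sync.table" -> tableName,
      "hoodie.datasource.hive_sync.partition_fields" -> partitions,
      "hoodie.datasource.hive_sync.partition_extractor_class" ->
        "org.apache.hudi.hive.MultiPartKeysValueExtractor"
    )
  }
}
```

With Spark and Hudi on the classpath, the map would typically be passed as `df.write.format("hudi").options(opts).mode(SaveMode.Append).save(targetPath)`, where `opts = HudiUpsertOptions.build("orders", "default", "orderId", "last_update_time", Seq("year", "month", "day"))` and `targetPath` is the table location on S3.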

EMR Cluster Spec
- Apache Hudi is available on every EMR node, but you can use your own version inside an EMR Step
- Integration with Glue Catalog

EMR Nodes
• Use Spot Instances for test and development
• Use On-Demand Instances for critical workloads and for the Master Node
• Use Reserved Instances instead of On-Demand for a Master Node that runs for 1-2 years

Dataset Overview: Customer Orders
• CSV files in an S3 bucket
• Source: s3://input-data-etljobs/cdc-orders/
• Total number of objects: 5256
• Total size: 49.8 MB
• Total records: 1 321 317
• Unique orders: 533 799 (40.4 %)

Schema:
    case class Order(
      orderId: Int,
      customerId: Int,
      itemId: Int,
      quantity: Int,
      year: Int,
      month: Int,
      day: Int,
      lastUpdateTime: Long
    )

Sample:
    orderId,customerId,itemId,quantity,year,month,day,last_update_time
    1,1,1,1,2021,7,21,1626903226641
    2,2,2,3,2021,7,21,1626903226642
    3,2,2,5,2021,7,21,1626903226643

Partition keys: year, month, day
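To make the mapping between the CSV sample and the `Order` schema concrete, here is a small sketch that parses such lines with the Scala standard library only. The `OrderCsv` object and its method names are hypothetical; the demo job most likely reads the files through Spark's CSV reader instead.

```scala
import scala.util.Try

final case class Order(
    orderId: Int,
    customerId: Int,
    itemId: Int,
    quantity: Int,
    year: Int,
    month: Int,
    day: Int,
    lastUpdateTime: Long
)

object OrderCsv {
  val Header = "orderId,customerId,itemId,quantity,year,month,day,last_update_time"

  // Returns None for lines with the wrong column count or non-numeric values.
  def parseLine(line: String): Option[Order] =
    line.split(",", -1).map(_.trim) match {
      case Array(o, c, i, q, y, m, d, t) =>
        Try(Order(o.toInt, c.toInt, i.toInt, q.toInt, y.toInt, m.toInt, d.toInt, t.toLong)).toOption
      case _ => None
    }

  // Skips the header line and silently drops malformed rows.
  def parse(lines: Seq[String]): Seq[Order] =
    lines.filterNot(_.trim == Header).flatMap(parseLine)
}
```

Dropping malformed rows silently keeps the sketch short; a real ingestion job would more likely count or quarantine them.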

Dataset Generation
    import org.scalacheck.Gen

    val orderMax = 100000
    val orderIdGen = Gen.choose(1, orderMax)
    val customerIdGen = Gen.choose(1, 100)
    val itemIdGen = Gen.choose(1, 1000)
    val quantityGen = Gen.choose(2, 20)
    val yearGen = Gen.const(2021)
    val monthGen = Gen.choose(7, 7)
    val dayMin = 20
    val dayMax = 25
    val dayGen = Gen.choose(dayMin, dayMax)

• records fall into a 6-day interval (days 20 to 25)
• at most 100 000 unique order IDs
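The generators above only define value ranges. As an illustration of how whole records could be produced from them, the sketch below swaps ScalaCheck for `scala.util.Random` so that it runs without extra dependencies; it is a stand-in, not the deck's generator. The `OrderGenerator` object, the seed, and the way timestamps are assigned are assumptions.

```scala
import scala.util.Random

final case class Order(
    orderId: Int,
    customerId: Int,
    itemId: Int,
    quantity: Int,
    year: Int,
    month: Int,
    day: Int,
    lastUpdateTime: Long
)

object OrderGenerator {
  val orderMax = 100000
  val dayMin = 20
  val dayMax = 25

  // Inclusive on both ends, like Gen.choose(min, max).
  private def choose(rnd: Random, min: Int, max: Int): Int =
    min + rnd.nextInt(max - min + 1)

  def randomOrder(rnd: Random, timestamp: Long): Order =
    Order(
      orderId = choose(rnd, 1, orderMax),
      customerId = choose(rnd, 1, 100),
      itemId = choose(rnd, 1, 1000),
      quantity = choose(rnd, 2, 20),
      year = 2021,
      month = 7,
      day = choose(rnd, dayMin, dayMax),
      lastUpdateTime = timestamp
    )

  // Assumption: timestamps simply increase by one per generated record.
  def generate(n: Int, seed: Long, startTime: Long): Seq[Order] = {
    val rnd = new Random(seed)
    (0 until n).map(i => randomOrder(rnd, startTime + i))
  }
}
```

Because order IDs are drawn from a bounded range, repeated IDs appear as the record count grows, which is what gives the upsert something to do. For example, `OrderGenerator.generate(1000, 42L, 1626903226641L).map(_.orderId).distinct.size` counts the distinct order IDs in a generated batch.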

Add New Step
Spark parameters:
- deploy-mode: cluster
- Executor: 4 vCPU, 2 GB RAM
Job parameters:
- paths to data
- path to input schema
- partition keys
- various flags
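The parameters above end up as a `spark-submit` argument list, which an EMR step usually runs through `command-runner.jar`. The sketch below assembles such a list. The Spark flags are real `spark-submit` options matching the slide (cluster deploy mode, 4 cores and 2 GB per executor); the job flag names (`--input-path` and so on), the jar location, and the main class are hypothetical placeholders, since the deck does not show them.

```scala
object EmrStepArgs {
  // Hypothetical job parameters; the names are illustrative only.
  final case class JobParams(
      inputPath: String,
      outputPath: String,
      schemaPath: String,
      partitionKeys: Seq[String]
  )

  def sparkSubmit(jarPath: String, mainClass: String, params: JobParams): Seq[String] =
    Seq(
      "spark-submit",
      "--deploy-mode", "cluster",
      "--executor-cores", "4",
      "--executor-memory", "2g",
      "--class", mainClass,
      jarPath,
      // everything after the jar is passed to the job's main method
      "--input-path", params.inputPath,
      "--output-path", params.outputPath,
      "--schema-path", params.schemaPath,
      "--partition-keys", params.partitionKeys.mkString(",")
    )
}
```

Keeping the argument list as a plain `Seq[String]` makes it easy to hand to whichever tool submits the step (CLI, Web Console, or Airflow), since each of them ultimately passes a list of arguments to the step.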

Completed Step
Debugging:
- Spark History Server
- stderr output logs
- YARN Timeline Server

Table Structure
    CREATE EXTERNAL TABLE `orders`(
      `_hoodie_commit_time` string,
      `_hoodie_commit_seqno` string,
      `_hoodie_record_key` string,
      `_hoodie_partition_path` string,
      `_hoodie_file_name` string,
      `orderid` int,
      `customerid` int,
      `itemid` int,
      `quantity` int,
      `last_update_time` bigint,
      `execution_year` int,
      `execution_month` int,
      `execution_day` int)
    PARTITIONED BY (
      `year` int,
      `month` int,
      `day` int)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://raw-data-etljobs/cdc-orders/orders'
    TBLPROPERTIES (
      'bucketing_version'='2',
      'last_commit_time_sync'='20210726153658',
      'transient_lastDdlTime'='1627196865')

Feature Comparison (High-Level)

EMR Steps
• Spark 3.1.1 (faster releases), your own Hudi version
• EMR Notebooks to analyze data via the Spark or PySpark API
• Debug with SSH to the cluster or with web logs in the EMR UI
• Does not require an entry script
• No cold start issue
• Incremental load via Spark Streaming checkpoints
• Requires a utilization strategy to be cost-effective (e.g. Spot, Reserved Instances)
• Requires an external scheduler for recurrent jobs (Airflow)
• Opportunity to run AWS EMR on EKS or run Spark alone on EKS (easy to switch)

Glue Jobs
• Spark 2.4 (slow releases), Hudi Connector v0.5
• No notebooks
• Hard to debug (CloudWatch)
• Requires an entry script
• Cold start takes seconds to minutes
• Incremental load via Glue Bookmarks
• Pay-as-you-go cost
• Scheduler is embedded in the Glue API
• Glue is proprietary AWS software
