Upgrade to Pro — share decks privately, control downloads, hide ads and more …

EMR Data Ingestion with Apache Hudi

EMR Data Ingestion with Apache Hudi

Data Ingestion using Apache Hudi, Spark and AWS EMR

Alexey Novakov

July 27, 2021
Tweet

More Decks by Alexey Novakov

Other Decks in Programming

Transcript

  1. EMR Data Ingestion Apache Hudi, Spark, Glue Catalog, S3 CONFIDENTIAL

    | © 2023 EPAM Systems, Inc. Alexey Novakov, Data Architect
  2. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Agenda E M

    R C L U S T E R S E T U P D A T A S E T O V E R V I E W E M R S T E P E X E C U T I O N D A T A A N A L Y S I S I N A T H E N A D E M O C O D E O V E R V I E W W O R K F L O W 2
  3. CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR – AWS

    Managed Hadoop Cluster. EMR is a very popular platform among Data Teams for Apache Spark workloads with YARN 4
  4. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Workflow 5 -

    EMR Job (Step): - loads CSV files from S3 path - saves data to Hudi table via Upsert - registers table in Data Catalog - Athena queries raw data from registered table - CLI, Web-Console, Airflow create an EMR cluster and add new steps
  5. CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR Cluster Spec

    Apache Hudi is available on every EMR node, but one can use own version inside EMR Step Integration with Glue Catalog
  6. CONFIDENTIAL | © 2023 EPAM Systems, Inc. EMR Nodes •

    Use Spot Instances for Test and Development • Use On-Demand Instances for critical workloads and for Master Node • Use Reserved instead of On- Demand Instances for 1-2 years for Master node 7
  7. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Dataset Overview •

    CSV files at S3 bucket C U S TO M E R O R D E R S 9 Source: s3://input-data-etljobs/cdc-orders/ Total number of objects: 5256 Total size: 49.8 MB Total records: 1 321 317 Unique Orders: 533 799 (40.4 %) case class Order( orderId: Int, customerId: Int, itemId: Int, quantity: Int, year: Int, month: Int, day: Int, lastUpdateTime: Long ) Schema: orderId,customerId,itemId,quantity,year,month,day,last_update_time 1,1,1,1,2021,7,21,1626903226641 2,2,2,3,2021,7,21,1626903226642 3,2,2,5,2021,7,21,1626903226643 Sample: partition keys
  8. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Dataset Generation 10

    val orderMax = 100000 val orderIdGen = Gen.choose(1, orderMax) val customerIdGen = Gen.choose(1, 100) val itemIdGen = Gen.choose(1, 1000) val quantityGen = Gen.choose(2, 20) val yearGen = Gen.const(2021) val monthGen = Gen.choose(7, 7) val dayMin = 20 val dayMax = 25 val dayGen = Gen.choose(dayMin, dayMax) • records fall into 6 days interval • 100 000 max unique possible order IDs
  9. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Add New Step

    12 Spark parameters: - deploy-mode: cluster - Executor: 4 vCPU, 2GB RAM Job parameters: - paths to data - path to input schema - partition keys - different flags
  10. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Completed Step 13

    Debugging: - Spark History Server - std error output logs - Yarn Timeline Server
  11. CONFIDENTIAL | © 2023 EPAM Systems, Inc. Table Structure 18

    CREATE EXTERNAL TABLE `orders`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `orderid` int, `customerid` int, `itemid` int, `quantity` int, `last_update_time` bigint, `execution_year` int, `execution_month` int, `execution_day` int) PARTITIONED BY ( `year` int, `month` int, `day` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetH iveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetO utputFormat' LOCATION 's3a://raw-data-etljobs/cdc-orders/orders' TBLPROPERTIES ( 'bucketing_version'='2', 'last_commit_time_sync'='20210726153658', 'transient_lastDdlTime'='1627196865')
  12. CONFIDENTIAL | © 2023 EPAM Systems, Inc. • Spark 3.1.1

    (faster releases), Your HUDI version • EMR Notebooks to analyze data via Spark or PySpark API • Debug with SSH to cluster or with Web-Logs in EMR UI • Does not require entry script • No cold start issue • Incremental load via Spark Streaming checkpoints • Requires utilization strategy to be cost-effective (e.g. Spot, Reserved Instances) • Requires external schedular for recurrent jobs (Airflow) • Opportunity to run AWS EMR on EKS or run Spark alone on EKS G LU E J O B S E M R S T E P S 20 • Spark 2.4 (slow releases), Hudi Connector v0.5 • No Notebooks • Hard to debug (CloudWatch) • Require entry script • Cold start takes seconds to minutes • Incremental load via Glue Bookmarks • Pay-as-you-go cost • Scheduler is embedded in Glue API • Glue is proprietary AWS software Feature Comparison (High-Level) easy to switch