Upgrade to Pro — share decks privately, control downloads, hide ads and more …

EMR Data Ingestion with Apache Hudi

EMR Data Ingestion with Apache Hudi

Data Ingestion using Apache Hudi, Spark and AWS EMR

Alexey Novakov

July 27, 2021
Tweet

More Decks by Alexey Novakov

Other Decks in Programming

Transcript

  1. EMR Data
    Ingestion
    Apache Hudi, Spark, Glue Catalog, S3
    CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Alexey Novakov, Data Architect

    View Slide

  2. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Agenda
    E M R C L U S T E R S E T U P
    D A T A S E T O V E R V I E W
    E M R S T E P E X E C U T I O N
    D A T A A N A L Y S I S I N A T H E N A
    D E M O C O D E O V E R V I E W
    W O R K F L O W
    2

    View Slide

  3. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    WORKFLOW

    View Slide

  4. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    EMR – AWS Managed Hadoop Cluster. EMR is a very popular
    platform among Data Teams for Apache Spark workloads with
    YARN
    4

    View Slide

  5. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Workflow
    5
    - EMR Job (Step):
    - loads CSV files from S3 path
    - saves data to Hudi table via Upsert
    - registers table in Data Catalog
    - Athena queries raw data from
    registered table
    - CLI, Web-Console, Airflow create an
    EMR cluster and add new steps

    View Slide

  6. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    EMR Cluster Spec
    Apache Hudi is
    available on every
    EMR node,
    but one can use own
    version
    inside EMR Step
    Integration with Glue Catalog

    View Slide

  7. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    EMR Nodes
    • Use Spot Instances for Test and
    Development
    • Use On-Demand Instances for
    critical workloads and for Master
    Node
    • Use Reserved instead of On-
    Demand Instances for 1-2 years for
    Master node
    7

    View Slide

  8. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    DATASET OVERVIEW
    8

    View Slide

  9. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Dataset Overview
    • CSV files at S3 bucket
    C U S TO M E R O R D E R S
    9
    Source: s3://input-data-etljobs/cdc-orders/
    Total number of objects: 5256
    Total size: 49.8 MB
    Total records: 1 321 317
    Unique Orders: 533 799 (40.4 %)
    case class Order(
    orderId: Int,
    customerId: Int,
    itemId: Int,
    quantity: Int,
    year: Int,
    month: Int,
    day: Int,
    lastUpdateTime: Long
    )
    Schema:
    orderId,customerId,itemId,quantity,year,month,day,last_update_time
    1,1,1,1,2021,7,21,1626903226641
    2,2,2,3,2021,7,21,1626903226642
    3,2,2,5,2021,7,21,1626903226643
    Sample:
    partition keys

    View Slide

  10. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Dataset Generation
    10
    val orderMax = 100000
    val orderIdGen = Gen.choose(1, orderMax)
    val customerIdGen = Gen.choose(1, 100)
    val itemIdGen = Gen.choose(1, 1000)
    val quantityGen = Gen.choose(2, 20)
    val yearGen = Gen.const(2021)
    val monthGen = Gen.choose(7, 7)
    val dayMin = 20
    val dayMax = 25
    val dayGen = Gen.choose(dayMin, dayMax)
    • records fall into 6 days interval
    • 100 000 max unique possible
    order IDs

    View Slide

  11. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    EMR STEP EXECUTION

    View Slide

  12. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Add New Step
    12
    Spark parameters:
    - deploy-mode: cluster
    - Executor: 4 vCPU, 2GB RAM
    Job parameters:
    - paths to data
    - path to input schema
    - partition keys
    - different flags

    View Slide

  13. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Completed Step
    13
    Debugging:
    - Spark History Server
    - std error output logs
    - Yarn Timeline Server

    View Slide

  14. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Cluster Monitoring Tab
    14

    View Slide

  15. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    DATA ANALYSIS IN ATHENA

    View Slide

  16. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Query Count
    16

    View Slide

  17. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Query Data
    17

    View Slide

  18. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Table Structure
    18
    CREATE EXTERNAL TABLE `orders`(
    `_hoodie_commit_time` string,
    `_hoodie_commit_seqno` string,
    `_hoodie_record_key` string,
    `_hoodie_partition_path` string,
    `_hoodie_file_name` string,
    `orderid` int,
    `customerid` int,
    `itemid` int,
    `quantity` int,
    `last_update_time` bigint,
    `execution_year` int,
    `execution_month` int,
    `execution_day` int)
    PARTITIONED BY (
    `year` int,
    `month` int,
    `day` int)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetH
    iveSerDe'
    STORED AS INPUTFORMAT
    'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetO
    utputFormat'
    LOCATION
    's3a://raw-data-etljobs/cdc-orders/orders'
    TBLPROPERTIES (
    'bucketing_version'='2',
    'last_commit_time_sync'='20210726153658',
    'transient_lastDdlTime'='1627196865')

    View Slide

  19. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    DEMO CODE OVERVIEW

    View Slide

  20. CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    • Spark 3.1.1 (faster releases), Your HUDI version
    • EMR Notebooks to analyze data via Spark or PySpark API
    • Debug with SSH to cluster or with Web-Logs in EMR UI
    • Does not require entry script
    • No cold start issue
    • Incremental load via Spark Streaming checkpoints
    • Requires utilization strategy to be cost-effective (e.g.
    Spot, Reserved Instances)
    • Requires external schedular for recurrent jobs (Airflow)
    • Opportunity to run AWS EMR on EKS or run Spark alone
    on EKS
    G LU E J O B S
    E M R S T E P S
    20
    • Spark 2.4 (slow releases), Hudi Connector v0.5
    • No Notebooks
    • Hard to debug (CloudWatch)
    • Require entry script
    • Cold start takes seconds to minutes
    • Incremental load via Glue Bookmarks
    • Pay-as-you-go cost
    • Scheduler is embedded in Glue API
    • Glue is proprietary AWS software
    Feature Comparison (High-Level)
    easy to switch

    View Slide