
Productionizing Big Data - stories from the trenches

Presented at ScalaDays 2023 (Madrid, Spain)

Roksolana

September 14, 2023

Transcript

  1. Problem
     Expected column names: XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
     Column names actually delivered: X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)
  2. Solution

     import scala.util.matching.Regex

     // Match the mis-prefixed "X <name>" columns and capture the bare name.
     val colRegex: Regex = """X (.+)""".r

     // Build (oldName, newName) pairs for every column of df (the input
     // DataFrame, defined upstream) that carries the single-"X " prefix.
     val oldNewColumnsMapping = df.schema.fieldNames.collect {
       case oldColName @ colRegex(pattern) => (oldColName, "XX " + pattern)
     }

     // Apply all the renames in one pass over the DataFrame.
     oldNewColumnsMapping.foldLeft(df) {
       case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
     }
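     A quick usage sketch of the rename above on a toy DataFrame; the SparkSession and sample values are assumptions, only the column names come from the slides:

     import org.apache.spark.sql.SparkSession

     val spark = SparkSession.builder.getOrCreate()
     import spark.implicits._

     // Toy input carrying the single-"X " prefixes from the problem slide.
     val df = Seq(("a1", "en", 0.42))
       .toDF("X Advertiser ID", "Language", "X Media Cost (USD)")

     // After the foldLeft above, the columns read:
     // "XX Advertiser ID", "Language", "XX Media Cost (USD)"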
  3. Problem
     • Slow processing
     • Large parquet files
     • Failing job that consumes lots of resources
  4. Solution
     • Write new partitioned state
     • Run downstream jobs with smaller states
     • Generate a seed partition column: xxhash64(fullUrl, domain) (see the sketch below)
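     A minimal sketch of the seed-partitioning idea, assuming Spark's built-in xxhash64 function; the bucket count, function name, and output path are illustrative, not from the talk:

     import org.apache.spark.sql.DataFrame
     import org.apache.spark.sql.functions.{col, lit, pmod, xxhash64}

     // Derive a stable seed from (fullUrl, domain), bucket it into
     // numBuckets values, and write the state partitioned by it, so each
     // downstream job can consume a smaller slice of the state.
     def writePartitionedState(df: DataFrame, numBuckets: Int, path: String): Unit =
       df.withColumn("seed", pmod(xxhash64(col("fullUrl"), col("domain")), lit(numBuckets)))
         .write
         .partitionBy("seed")
         .parquet(path)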
  5. Problem
     • Missing columns from the source
     • Impala-to-Databricks migration speed
     • Dependency on another team
     • Unhappy users
  6. Solution
     Before (26 columns): XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
     After (82 columns): XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD)
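     One hedged way to read the widened data while old 26-column and new 82-column parquet files coexist is Spark's mergeSchema option; this is an assumption for illustration, the slides do not name the mechanism, and the path is made up:

     // Merge the schemas of all parquet files so the newly added columns
     // appear (null-filled for files written before the change).
     val widened = spark.read
       .option("mergeSchema", "true")
       .parquet("s3://bucket/report/")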
  7. Solution
     • Adjust the pipeline
     • Reload data for 3 days on S3
     • Relaunch the Databricks Auto Loader (see the sketch below)
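     A sketch of what relaunching an Auto Loader stream over the reloaded S3 data can look like; the paths, file format, and checkpoint location are all assumptions for illustration:

     import org.apache.spark.sql.SparkSession

     val spark = SparkSession.builder.getOrCreate()

     // Incrementally ingest the reloaded files from S3 via Auto Loader.
     val stream = spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "parquet")
       .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/report")
       .load("s3://bucket/raw/report/")

     // Starting from a fresh checkpoint makes the stream re-discover
     // the reloaded days' files instead of treating them as processed.
     stream.writeStream
       .option("checkpointLocation", "s3://bucket/_checkpoints/report")
       .start("s3://bucket/tables/report/")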
  8. Prevention mechanisms
     1. Set up clear expectations with stakeholders
     2. Observability is the key
     3. Distribute the data transformation load
     4. Plan major changes carefully
  9. Conclusions
     1. The data setup is always changing
     2. Errors can be prevented
     3. There are multiple approaches with different tools
     4. Data evolution is hard