Slide 1

Productionizing big data - stories from the trenches

Slide 2

Roksolana Diachuk • Engineering manager at Captify • Women Who Code Kyiv Data Engineering Lead • Speaker

Slide 3

AdTech
AdTech methodologies deliver the right content at the right time to the right consumer

Slide 4

No content

Slide 5

You have your pipelines in production. What’s next?

Slide 6

Types of issues
• Low performance
• Human errors
• Data source errors

Slide 7

Story #1. Unlucky query

Slide 8

Problem
Dropped 13 months of user profiles

Slide 9

Reporting

Slide 10

Problem
13 months of data, hour=22042001

Slide 11

Loading mechanism
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"

val minTime = currentDay.minus(config.feedPeriod)
listFiles.filter(file => file.eventDateTime isAfter minTime)
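A minimal sketch of how this loading window behaves, assuming the configured period string is parsed with java.time.Period; FeedFile and filesToLoad are hypothetical names:

import java.time.{LocalDateTime, Period}

final case class FeedFile(path: String, eventDateTime: LocalDateTime)

// Keep only the files newer than currentDay minus the configured period
def filesToLoad(listFiles: Seq[FeedFile], currentDay: LocalDateTime, periodToLoad: String): Seq[FeedFile] = {
  val feedPeriod = Period.parse(periodToLoad)   // "P5D" -> 5 days, "P13M" -> 13 months
  val minTime    = currentDay.minus(feedPeriod) // lower bound of the loading window
  listFiles.filter(file => file.eventDateTime.isAfter(minTime))
}

A single period value thus controls how much history the loader touches, which is why "P5D" versus "P13M" makes such a dramatic difference.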

Slide 12

Solution
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P1M"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
…

Slide 13

Story #2. Missing data

Slide 14

Data ingestion
Data from Partner X → Extractor → Data costs attribution

Slide 15

Problem
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)

Slide 16

Solution
• Rename old columns
• Reload data for the week

Slide 17

Solution
import scala.util.matching.Regex

val colRegex: Regex = """X (.+)""".r
// collect the column names that still carry the old "X " prefix,
// paired with their new "XX "-prefixed names
val oldNewColumnsMapping = df.schema.fieldNames.collect {
  case oldColName @ colRegex(pattern) => (oldColName, "XX " + pattern)
}
// apply each rename in turn
oldNewColumnsMapping.foldLeft(df) {
  case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
}
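foldLeft threads the DataFrame through one withColumnRenamed call per matched column, so the same code covers however many columns the partner renamed without listing them by hand.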

Slide 18

Solution
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)

Slide 19

Story #3. Divide and conquer

Slide 20

Problem
processing_time/part-*.parquet → filtering → aggregations → created/part-*.parquet

Slide 21

Problem
• Slow processing
• Large Parquet files
• Failing job that consumes lots of resources

Slide 22

Solution
• Write a new partitioned state
• Run downstream jobs with smaller states
• Generate a seed partition column: xxhash64(fullUrl, domain) (see the sketch after the next slide)

Slide 23

Solution
processing_time/part-*.parquet → created/bucket=0/part-*.parquet … created/bucket=9/part-*.parquet
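A minimal sketch of the seed-partition idea, using Spark's built-in xxhash64 function; the bucket count of 10 matches the bucket=0 … bucket=9 layout above, while df and statePath are assumptions:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, lit, pmod, xxhash64}

// derive a stable bucket in 0..9 from the (fullUrl, domain) pair
val bucketed = df.withColumn("bucket", pmod(xxhash64(col("fullUrl"), col("domain")), lit(10)))

// write the state partitioned by the new seed column
bucketed.write
  .partitionBy("bucket")
  .mode(SaveMode.Overwrite)
  .parquet(statePath)

Downstream jobs can then read one bucket=N slice at a time instead of scanning the whole state.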

Slide 24

Story #4. Catch the evolution train

Slide 25

Data organisation evolution

Slide 26

Problem
• Missing columns from the source
• Impala to Databricks migration speed
• Dependency on another team
• Unhappy users

Slide 27

Log-level data → Mapper → Ingestor → Transformer → Data costs calculator → Data costs attribution

Slide 28

Data costs attribution
Data extractor → Impala loader → Data costs attribution

Slide 29

Data costs attribution
Data extractor → Impala loader → Data costs attribution

Slide 30

Solution
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD): 26 columns
XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD): 82 columns

Slide 31

Solution
Data extractor → New ingestion job

Slide 32

Solution
// final step is writing the data
df.write
  .partitionBy("event_date", "event_hour")
  .mode(SaveMode.Overwrite)
  .parquet(dstPath)

Slide 33

Why this solution doesn’t work
data_feed/: clicks.csv.gz, views.csv.gz, activity.csv.gz
event_date=…/: clicks1.parquet, clicks2.parquet
All event types from one feed directory land in the same event_date partition.

Slide 34

Attribution data source
Impressions, Clicks, Conversions

Slide 35

Solution
views.csv.gz → impressions
clicks.csv.gz → clicks
activity.csv.gz → conversions
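A minimal sketch of that per-event-type split, assuming the feed rows carry event_date and event_hour columns; spark, srcPath, and dstPath are assumptions:

import org.apache.spark.sql.SaveMode

// hypothetical mapping from feed files to event-type datasets
val eventTypes = Map(
  "views.csv.gz"    -> "impressions",
  "clicks.csv.gz"   -> "clicks",
  "activity.csv.gz" -> "conversions"
)

eventTypes.foreach { case (fileName, eventType) =>
  val events = spark.read
    .option("header", "true")
    .csv(s"$srcPath/data_feed/$fileName")  // gzipped CSVs are decompressed on read
  events.write
    .partitionBy("event_date", "event_hour")
    .mode(SaveMode.Overwrite)
    .parquet(s"$dstPath/$eventType")       // one partitioned dataset per event type
}

Writing each event type to its own dataset avoids mixing clicks, views, and activity files inside a single event_date partition.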

Slide 36

Story #5. Cleanup time

Slide 37

Corrupted data
Data from Partner X → Ingestor

Slide 38

Corrupted data
Data from Partner X → Ingestor
IllegalArgumentException: Can't convert value to BinaryType data type

Slide 39

Solution
• Adjust the pipeline
• Reload data for 3 days on S3
• Relaunch the Databricks Auto Loader, as sketched below
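A minimal sketch of such a relaunch with Databricks Auto Loader (the cloudFiles streaming source); the S3 paths, CSV format, and feedSchema are assumptions:

// hypothetical restart of the ingestion stream after the corrupted files were replaced on S3
val stream = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .schema(feedSchema) // an explicit schema guards against casts like the BinaryType error above
  .load("s3://bucket/data-from-partner-x/")

stream.writeStream
  .option("checkpointLocation", "s3://bucket/checkpoints/partner-x/")
  .start("s3://bucket/ingested/partner-x/")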

Slide 40

Current solution
impressions, videoevents, conversions, clicks (diagram: one ingestion flow per event type)

Slide 41

Current solution
impressions, conversions, clicks, videoevents

Slide 42

Better solution
impressions, videoevents, conversions, clicks (diagram: one ingestion flow per event type)

Slide 43

Conclusions

Slide 44

Prevention mechanisms
1. Set up clear expectations with stakeholders
2. Observability is the key
3. Distribute data transformation load
4. Plan major changes carefully

Slide 45

Conclusions
1. Data setup is always changing
2. Errors can be prevented
3. There are multiple approaches with different tools
4. Data evolution is hard

Slide 46

No content

Slide 47

My contact info
dead_flowers22
roksolana-d
roksolanadiachuk
roksolanad