Problem
• Missing columns from the source
• Impala to Databricks migration speed
• Dependency on another team
• Unhappy users
Slide 27
Slide 27 text
Data costs attribution
[Pipeline diagram: log-level data → Mapper → Ingestor → Transformer → data costs calculator]
Slide 28
Slide 28 text
Data costs attribution
• Data extractor
• Impala loader
Slide 29
Slide 29 text
Data costs attribution
• Data extractor
• Impala loader
Slide 30
Slide 30 text
Solution
• XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD) — 26 columns
• XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD) — 82 columns
Slide 31
Slide 31 text
Solution
Data extractor
New ingestion job
Slide 32
Slide 32 text
// Final step: write the data as Parquet, partitioned by date and hour
df.write
  .partitionBy("event_date", "event_hour")
  .mode(SaveMode.Overwrite)
  .parquet(dstPath)
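For context, a minimal end-to-end version of this ingestion job might look as follows. This is a sketch under stated assumptions: the source/destination paths, the CSV options, and the object name are hypothetical and not taken from the slides; only the final write step is shown on the slide.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object IngestionJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-costs-ingestion-sketch")
      .getOrCreate()

    // Hypothetical paths for illustration only
    val srcPath = "s3://bucket/data_feed/"
    val dstPath = "s3://bucket/data_costs/"

    // Read the raw feed; Spark decompresses .csv.gz transparently,
    // but gzip files are not splittable, so each file is one task
    val df = spark.read
      .option("header", "true")
      .csv(srcPath)

    // Final step: write as Parquet, partitioned by date and hour.
    // Assumes event_date and event_hour exist as columns in df.
    df.write
      .partitionBy("event_date", "event_hour")
      .mode(SaveMode.Overwrite)
      .parquet(dstPath)

    spark.stop()
  }
}
```

One caveat worth knowing: with the default settings, `SaveMode.Overwrite` replaces the entire destination path, not just the partitions being written; setting `spark.sql.sources.partitionOverwriteMode` to `dynamic` (Spark 2.3+) limits the overwrite to the partitions present in the incoming data.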
Solution
Slide 33
Slide 33 text
Why this solution doesn’t work
data_feed/
  clicks.csv.gz
  views.csv.gz
  activity.csv.gz
event_date/
  clicks1.parquet
  clicks2.parquet
Slide 34
Slide 34 text
Attribution data source
• Impressions
• Clicks
• Conversions
Prevention mechanisms
1. Set up clear expectations with stakeholders
2. Observability is key
3. Distribute data transformation load
4. Plan major changes carefully
Slide 45
Slide 45 text
Conclusions
1. Data setup is always changing
2. Errors can be prevented
3. There are multiple approaches with different tools
4. Data evolution is hard
Slide 46
Slide 46 text
No content
Slide 47
Slide 47 text
My contact info
dead_flowers22
roksolana-d
roksolanadiachuk
roksolanad