Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Productionizing Big Data - stories from the tre...
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Roksolana
September 14, 2023
Technology
88
0
Share
Productionizing Big Data - stories from the trenches
Presented at ScalaDays 2023 (Madrid, Spain)
Roksolana
September 14, 2023
More Decks by Roksolana
See All by Roksolana
Pain of engineering management
roksolanad
1
110
Alice and the return to the world of pods and higher-order functions
roksolanad
0
200
Modern data pipelines in AdTech - life in the trenches
roksolanad
1
310
Alice and travelling back in time
roksolanad
0
180
Big Data at AdTech
roksolanad
0
360
Alice and the Mad Hatter: Predict or not to predict
roksolanad
0
210
Alice in the world of machine learning
roksolanad
0
130
Alice and the lost pod: practical guide to Kubernetes in Scala
roksolanad
1
360
Scala meets Kubernetes
roksolanad
0
530
Other Decks in Technology
See All in Technology
Proxmox超入門
devops_vtj
0
170
あるアーキテクチャ決定と その結果/architecture-decision-and-its-result
hanhan1978
2
570
BIツール「Omni」の紹介 @Snowflake中部UG
sagara
0
270
ふりかえりを 「あそび」にしたら、 学習が勝手に進んだ / Playful Retros Drive Learning
katoaz
0
450
プロジェクトマネジメントは AIでどう変わるか?
mkg5383
0
210
会社紹介資料 / Sansan Company Profile
sansan33
PRO
16
410k
AI環境整備はどのくらい開発生産性を変えうるか? #AI駆動開発 #AI自走環境
ucchi0909
0
120
20260410 - CNTUG meetup #72 - DiskImage Builder 介紹:以 Kubespray CI 打造 RockyLinux 10 Cloud Image 為例
tico88612
0
120
AgentCore RuntimeからS3 Filesをマウントしてみる
har1101
3
400
申請待ちゼロへ!AWS × Entra IDで実現した「権限付与」のセルフサービス化
mhrtech
1
280
Bluesky Meetup in Tokyo vol.4 - 2023to2026
shinoharata
0
150
ストライクウィッチーズ2期6話のエイラの行動が許せないのでPjMの観点から何をすべきだったのかを考える
ichimichi
1
320
Featured
See All Featured
Designing for humans not robots
tammielis
254
26k
We Have a Design System, Now What?
morganepeng
55
8.1k
SEO for Brand Visibility & Recognition
aleyda
0
4.4k
The Power of CSS Pseudo Elements
geoffreycrofte
82
6.2k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
35k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
49
9.9k
We Are The Robots
honzajavorek
0
210
Why Your Marketing Sucks and What You Can Do About It - Sophie Logan
marketingsoph
0
120
A better future with KSS
kneath
240
18k
Breaking role norms: Why Content Design is so much more than writing copy - Taylor Woolridge
uxyall
0
250
Dominate Local Search Results - an insider guide to GBP, reviews, and Local SEO
greggifford
PRO
0
130
Avoiding the “Bad Training, Faster” Trap in the Age of AI
tmiket
0
120
Transcript
Productionizing big data - stories from the trenches
Roksolana Diachuk •Engineering manager at Captify •Women Who Code Kyiv
Data Engineering Lead •Speaker
AdTech methodologies deliver the right content at the right time
to the right consumer AdTech
None
You have your pipelines in production What’s next?
Types of issues • Low performance • Human errors •
Data source errors
Story #1. Unlucky query
Problem Drop 13 months of user profiles
Reporting
Problem 13 months hour=22042001
Loading mechanism loader.ImpalaLoaderConfig.periodToLoad: “P5D” loader.ImpalaLoaderConfig.periodToLoad: “P13M” val minTime = currentDay.minus(config.feedPeriod)
listFiles.filter(file => file.eventDateTime isAfter minTime)
Solution loader.ImpalaLoaderConfig.periodToLoad: “P5D” loader.ImpalaLoaderConfig.periodToLoad: “P1M” loader.ImpalaLoaderConfig.periodToLoad: “P13M” …
Story #2. Missing data
Data ingestion Data from Partner X Data costs attribution Extractor
Problem XX Advertiser ID, Language, XX Device Type, …, XX
Media Cost (USD) X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)
Solution • Rename old columns • Reload data for the
week
Solution val colRegex: Regex = “””X (.+)“””.r val oldNewColumnsMapping =
df.schema.collect { case oldColdName@colRegex(pattern) => (oldColName.name, (“XX “ + pattern)) } oldNewColumnsMapping.foldLeft(df) { case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName) }
XX Advertiser ID, Language, XX Device Type, …, XX Media
Cost (USD) Solution
Story #3. Divide and conquer
Problem processing_time part-*.parquet filtering aggregations created part-*.parquet
• Slow processing • Large parquet files • Failing job
that consumes lots of resources Problem
• Write new partitioned state • Run downstream jobs with
smaller states • Generate seed partition column - xxhash64(fullUrl, domain) Solution
processing_time part-*.parquet created bucket=0 part-*.parquet part-*.parquet … bucket=9 part-*.parquet part-*.parquet
processing_time part-*.parquet Solution
Story #4. Catch the evolution train
Data organisation evolution
Problem • Missing columns from the source • Impala to
Databricks migration speed • Dependency with another team • Unhappy users
Log-level data Mapper Ingestor Transformer Data costs calculator Data costs
attribution
Data costs attribution Data costs attribution Data extractor Impala loader
Data costs attribution Data extractor Impala loader Data costs attribution
Solution XX Advertiser ID, Language, XX Device Type, …, XX
Partner Currency, XX CPM Fee (USD) XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD) 26 columns 82 columns
Solution Data extractor New ingestion job
//final step is writing the data df.write .partitionBy(“event_date”, “event_hour”) .mode(SaveMode.Overwrite)
.parquet(dstPath) Solution
Why this solution doesn’t work data_feed clicks.csv.gz views.csv.gz activity.csv.gz event_date
clicks1.parquet clicks2.parquet
Impressions Clicks Conversions Attribution data source
Solution impressions clicks conversions clicks.csv.gz views.csv.gz activity.csv.gz
Story #5. Cleanup time
Corrupted data Data from Partner X Ingestor
Corrupted data Data from Partner X Ingestor IllegalArgumentException: Can't convert
value to BinaryType data type
Solution • Adjust pipeline • Reload data for 3 days
on S3 • Relaunch Databricks autoloader
Current solution impressions videoevents conversions impressions conversions Clicks clicks videoevents
Current solution impressions conversions clicks videoevents
Better solution impressions videoevents conversions impressions conversions clicks clicks videoevents
Conclusions
2. Observability is the key 4. Plan major changes carefully
1. Set up clear expectations with stakeholders Prevention mechanisms 3. Distribute data transformation load
2. Errors can be prevented 4. Data evolution is hard
1. Data setup is always changing Conclusions 3. There are multiple approaches with different tools
None
dead_ fl owers22 roksolana-d roksolanadiachuk roksolanad My contact info