Roksolana
September 14, 2023
Productionizing Big Data - stories from the trenches
Presented at ScalaDays 2023 (Madrid, Spain)
Transcript
Productionizing big data - stories from the trenches
Roksolana Diachuk
• Engineering manager at Captify
• Women Who Code Kyiv Data Engineering Lead
• Speaker
AdTech
AdTech methodologies deliver the right content at the right time to the right consumer
You have your pipelines in production. What’s next?
Types of issues
• Low performance
• Human errors
• Data source errors
Story #1. Unlucky query
Problem
Drop 13 months of user profiles
Reporting
Problem
13 months
hour=22042001
Loading mechanism
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
val minTime = currentDay.minus(config.feedPeriod)
listFiles.filter(file => file.eventDateTime isAfter minTime)
Solution
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P1M"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
…
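The loading mechanism on these slides can be sketched in plain Scala. This is a minimal sketch, assuming the period string is an ISO-8601 period parsed with java.time.Period. The names PeriodLoader, FeedFile and filesToLoad are hypothetical and are not taken from the talk.

```scala
import java.time.{LocalDate, Period}

object PeriodLoader {
  // Hypothetical stand-in for a listed file and its event date.
  final case class FeedFile(name: String, eventDate: LocalDate)

  // Keep only files whose event date is strictly after currentDay minus the period.
  def filesToLoad(files: Seq[FeedFile], currentDay: LocalDate, periodToLoad: String): Seq[FeedFile] = {
    val minTime = currentDay.minus(Period.parse(periodToLoad))
    files.filter(_.eventDate.isAfter(minTime))
  }
}
```

Calling filesToLoad with "P5D", then "P1M", then "P13M" widens the window step by step, which is one way to read the staged values on the Solution slide.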
Story #2. Missing data
Data ingestion
[Diagram: Data from Partner X, Extractor, Data costs attribution]
Problem
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)
Solution
• Rename old columns
• Reload data for the week
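Solution
import scala.util.matching.Regex
// Matches old-style names such as "X Advertiser ID"; names already starting with "XX " do not match.
val colRegex: Regex = """X (.+)""".r
// The regex extractor works on strings, so match on field names rather than on StructField values.
val oldNewColumnsMapping = df.schema.fieldNames.collect {
  case oldColName @ colRegex(rest) => (oldColName, "XX " + rest)
}
oldNewColumnsMapping.foldLeft(df) {
  case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
}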
Solution
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
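The renaming rule can be checked without Spark on a handful of header names. This is a small sketch, assuming the old headers start with "X " and the new ones with "XX ", as on the slides. ColumnRename and renamed are hypothetical names.

```scala
object ColumnRename {
  // A Scala regex extractor must match the whole string, so "XX Device Type" does not match here.
  private val OldPrefix = """X (.+)""".r

  // Old-style names get the "XX " prefix; every other name is returned unchanged.
  def renamed(name: String): String = name match {
    case OldPrefix(rest) => "XX " + rest
    case other           => other
  }
}
```

Because already renamed columns and unprefixed columns such as Language pass through untouched, applying the rule twice gives the same result as applying it once.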
Story #3. Divide and conquer
Problem
[Diagram: processing_time (part-*.parquet), filtering, aggregations, created (part-*.parquet)]
Problem
• Slow processing
• Large parquet files
• Failing job that consumes lots of resources
Solution
• Write new partitioned state
• Run downstream jobs with smaller states
• Generate seed partition column - xxhash64(fullUrl, domain)
[Diagram: processing_time (part-*.parquet); created split into bucket=0 … bucket=9, each holding part-*.parquet files]
Solution
[Diagram: processing_time (part-*.parquet)]
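The bucket assignment behind this layout can be sketched as a hash of the two columns reduced to ten buckets. This is a sketch under stated assumptions: the slides name xxhash64(fullUrl, domain), while the code below swaps in scala.util.hashing.MurmurHash3 from the standard library so that it runs without Spark. The bucket count of 10 is inferred from bucket=0 … bucket=9, and Bucketing and bucket are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object Bucketing {
  // Inferred from the bucket=0 … bucket=9 directories on the slide.
  val NumBuckets = 10

  // Stand-in for xxhash64(fullUrl, domain): hash both values together,
  // then map the signed hash to a non-negative bucket index.
  def bucket(fullUrl: String, domain: String): Int = {
    val hash = MurmurHash3.stringHash(fullUrl + "\u0000" + domain)
    Math.floorMod(hash, NumBuckets)
  }
}
```

Math.floorMod is used instead of % because the hash can be negative, and a negative remainder would not be a valid bucket index.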
Story #4. Catch the evolution train
Data organisation evolution
Problem
• Missing columns from the source
• Impala to Databricks migration speed
• Dependency with another team
• Unhappy users
[Diagram: Log-level data, Mapper, Ingestor, Transformer, Data costs calculator, Data costs attribution]
[Diagram: Data costs attribution, Data extractor, Impala loader]
Solution
XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD)
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
26 columns, 82 columns
Solution
[Diagram: Data extractor, New ingestion job]
Solution
// final step is writing the data
df.write
  .partitionBy("event_date", "event_hour")
  .mode(SaveMode.Overwrite)
  .parquet(dstPath)
Why this solution doesn’t work
[Diagram: data_feed with clicks.csv.gz, views.csv.gz, activity.csv.gz; event_date with clicks1.parquet, clicks2.parquet]
Attribution data source: Impressions, Clicks, Conversions
Solution
[Diagram: impressions, clicks, conversions; clicks.csv.gz, views.csv.gz, activity.csv.gz]
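One plausible reading of this Solution slide is that each source file is routed to its own destination per feed type, instead of all three files sharing a single partitioned output. The sketch below illustrates that reading only. The pairing of views.csv.gz with impressions and of activity.csv.gz with conversions is an assumption, and FeedRouter, destination and the path layout are hypothetical.

```scala
object FeedRouter {
  // Assumed pairing of source file names and feed types; only the names themselves come from the slides.
  private val feedByFile: Map[String, String] = Map(
    "views.csv.gz"    -> "impressions",
    "clicks.csv.gz"   -> "clicks",
    "activity.csv.gz" -> "conversions"
  )

  // Destination prefix for a source file, or None when the file name is not recognised.
  def destination(basePath: String, fileName: String): Option[String] =
    feedByFile.get(fileName).map(feed => basePath + "/" + feed)
}
```

Returning None for an unknown file name keeps an unexpected file from being written into an existing feed's directory.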
Story #5. Cleanup time
Corrupted data
[Diagram: Data from Partner X, Ingestor]
IllegalArgumentException: Can't convert value to BinaryType data type
Solution
• Adjust pipeline
• Reload data for 3 days on S3
• Relaunch Databricks autoloader
Current solution
[Diagram: impressions, videoevents, conversions, clicks]
Better solution
[Diagram: impressions, videoevents, conversions, clicks]
Conclusions
Prevention mechanisms
1. Set up clear expectations with stakeholders
2. Observability is the key
3. Distribute data transformation load
4. Plan major changes carefully
Conclusions
1. Data setup is always changing
2. Errors can be prevented
3. There are multiple approaches with different tools
4. Data evolution is hard
My contact info: dead_flowers22, roksolana-d, roksolanadiachuk, roksolanad