Roksolana
September 14, 2023
Productionizing Big Data - stories from the trenches
Presented at ScalaDays 2023 (Madrid, Spain)
Transcript
Productionizing big data - stories from the trenches
Roksolana Diachuk
• Engineering manager at Captify
• Women Who Code Kyiv Data Engineering Lead
• Speaker
AdTech
AdTech methodologies deliver the right content at the right time to the right consumer
You have your pipelines in production. What’s next?
Types of issues
• Low performance
• Human errors
• Data source errors
Story #1. Unlucky query
Problem
Drop 13 months of user profiles
Reporting
Problem 13 months hour=22042001
Loading mechanism
    loader.ImpalaLoaderConfig.periodToLoad: "P5D"
    loader.ImpalaLoaderConfig.periodToLoad: "P13M"

    val minTime = currentDay.minus(config.feedPeriod)
    listFiles.filter(file => file.eventDateTime isAfter minTime)
Solution
    loader.ImpalaLoaderConfig.periodToLoad: "P5D"
    loader.ImpalaLoaderConfig.periodToLoad: "P1M"
    loader.ImpalaLoaderConfig.periodToLoad: "P13M"
    …
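The period strings on these slides are ISO-8601 durations, so the cutoff logic can be sketched with `java.time`. This is a minimal sketch and not the loader from the talk: `PeriodLoader`, `DataFile`, and `filesToLoad` are hypothetical names, and it assumes each file carries a plain event date.

```scala
import java.time.{LocalDate, Period}

object PeriodLoader {
  // Hypothetical stand-in for a listed file and its event date.
  final case class DataFile(path: String, eventDate: LocalDate)

  // Keep only the files that are newer than currentDay minus the configured period.
  def filesToLoad(
      files: Seq[DataFile],
      currentDay: LocalDate,
      periodToLoad: String
  ): Seq[DataFile] = {
    val minTime = currentDay.minus(Period.parse(periodToLoad))
    files.filter(_.eventDate.isAfter(minTime))
  }
}
```

Under these assumptions, each wider period only moves the cutoff further back, so stepping through "P5D", "P1M" and "P13M" lets every run be checked before the next, larger window is loaded.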
Story #2. Missing data
Data ingestion (diagram): Data from Partner X, Extractor, Data costs attribution
Problem
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)
Solution
• Rename old columns
• Reload data for the week
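Solution
    import scala.util.matching.Regex

    // Matches column names with the "X " prefix and captures the rest of the name
    val colRegex: Regex = """X (.+)""".r
    // Match on the column names (strings): a Regex extractor cannot match a StructField
    val oldNewColumnsMapping = df.columns.collect {
      case oldColName @ colRegex(pattern) => (oldColName, "XX " + pattern)
    }
    // Apply every rename to the DataFrame in turn
    oldNewColumnsMapping.foldLeft(df) {
      case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
    }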
Solution
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
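The rename mapping can be checked on plain header names before any DataFrame is involved. The sketch below assumes the headers are available as strings; `ColumnRenames`, `renameMapping`, and `renamed` are hypothetical helper names, not part of the talk's code.

```scala
import scala.util.matching.Regex

object ColumnRenames {
  // Same pattern as on the slide: an "X " prefix followed by the rest of the name.
  val colRegex: Regex = """X (.+)""".r

  // Pairs of (old name, new name) for every header that matches the pattern.
  def renameMapping(columns: Seq[String]): Seq[(String, String)] =
    columns.collect { case old @ colRegex(rest) => (old, "XX " + rest) }

  // Headers after the rename; columns that do not match are left untouched.
  def renamed(columns: Seq[String]): Seq[String] = {
    val mapping = renameMapping(columns).toMap
    columns.map(c => mapping.getOrElse(c, c))
  }
}
```

Because a Scala regex pattern match requires the whole string to match, a header that already starts with "XX " is not matched again, so applying the rename twice gives the same result as applying it once.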
Story #3. Divide and conquer
Problem (diagram): processing_time part-*.parquet, filtering, aggregations, created part-*.parquet
Problem
• Slow processing
• Large parquet files
• Failing job that consumes lots of resources
Solution
• Write new partitioned state
• Run downstream jobs with smaller states
• Generate seed partition column - xxhash64(fullUrl, domain)
Solution (diagram): processing_time with part-*.parquet files; created with bucket=0 … bucket=9, each holding part-*.parquet files
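The bucket layout above (bucket=0 … bucket=9) can be illustrated by hashing the two key fields and taking the result modulo the bucket count. The talk names `xxhash64(fullUrl, domain)`; the Scala standard library has no xxHash, so this sketch swaps in `MurmurHash3` purely as a stand-in. `Bucketing`, `bucketOf`, and the separator are hypothetical, and the bucket numbers will not match what xxhash64 would produce.

```scala
import scala.util.hashing.MurmurHash3

object Bucketing {
  // Ten buckets, matching bucket=0 … bucket=9 on the slide.
  val NumBuckets: Int = 10

  // Stand-in hash (MurmurHash3, not xxhash64) over both key fields.
  // floorMod keeps the result in [0, numBuckets) even for negative hashes.
  def bucketOf(fullUrl: String, domain: String, numBuckets: Int = NumBuckets): Int = {
    val hash = MurmurHash3.stringHash(fullUrl + "|" + domain)
    Math.floorMod(hash, numBuckets)
  }
}
```

The same key always lands in the same bucket, which is what would let a downstream job read one bucket at a time instead of the whole state.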
Story #4. Catch the evolution train
Data organisation evolution
Problem
• Missing columns from the source
• Impala to Databricks migration speed
• Dependency on another team
• Unhappy users
(diagram) Log-level data, Mapper, Ingestor, Transformer, Data costs calculator, Data costs attribution
(diagram) Data costs attribution, Data extractor, Impala loader
Solution
XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD)
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
26 columns, 82 columns
Solution (diagram): Data extractor, New ingestion job
Solution
    // final step is writing the data
    df.write
      .partitionBy("event_date", "event_hour")
      .mode(SaveMode.Overwrite)
      .parquet(dstPath)
Why this solution doesn’t work (diagram): data_feed with clicks.csv.gz, views.csv.gz, activity.csv.gz; event_date with clicks1.parquet, clicks2.parquet
Attribution data source (diagram): Impressions, Clicks, Conversions
Solution (diagram): impressions, clicks, conversions; clicks.csv.gz, views.csv.gz, activity.csv.gz
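One way to read the slide above is that each raw feed file gets its own destination dataset instead of all feeds sharing one output path. The sketch below illustrates that reading only. The slide lists the file names and the dataset names but does not pair them, so the pairing here (views to impressions, activity to conversions) is an assumption, and `FeedRouting`, `destination`, and the base path are hypothetical.

```scala
object FeedRouting {
  // Assumed pairing of raw feed file prefix to destination dataset.
  val feedToDataset: Map[String, String] = Map(
    "views"    -> "impressions",
    "clicks"   -> "clicks",
    "activity" -> "conversions"
  )

  // Destination path for a raw file, or None when the feed is unknown.
  def destination(basePath: String, fileName: String): Option[String] = {
    val feed = fileName.takeWhile(_ != '.') // "clicks.csv.gz" gives "clicks"
    feedToDataset.get(feed).map(dataset => s"$basePath/$dataset")
  }
}
```

Returning `None` for an unrecognised file name makes a new or renamed feed visible to the caller instead of silently writing it into an existing dataset.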
Story #5. Cleanup time
Corrupted data (diagram): Data from Partner X, Ingestor
IllegalArgumentException: Can't convert value to BinaryType data type
Solution
• Adjust pipeline
• Reload data for 3 days on S3
• Relaunch Databricks autoloader
Current solution (diagram): impressions, videoevents, conversions, clicks
Better solution (diagram): impressions, videoevents, conversions, clicks
Conclusions
Prevention mechanisms
1. Set up clear expectations with stakeholders
2. Observability is the key
3. Distribute data transformation load
4. Plan major changes carefully
Conclusions
1. Data setup is always changing
2. Errors can be prevented
3. There are multiple approaches with different tools
4. Data evolution is hard
My contact info: dead_flowers22, roksolana-d, roksolanadiachuk, roksolanad