ML Productivity
Beomjun Shin
January 17, 2018
Research
short talks on productivity of machine learning
Transcript
ML Productivity Ben (Beomjun Shin) 2018-01-17 (Wed) © Beomjun Shin
Productivity is about not waiting
Time Scales
• Immediate: less than 60 seconds.
• Bathroom break: less than 5 minutes.
• Lunch break: less than 1 hour.
• Overnight: less than 12 hours.
WE MUST ESTIMATE TIME BEFORE RUNNING!
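One way to estimate the time before running is to time a few steps and extrapolate. The sketch below is only an illustration: the names `classify_time_scale` and `estimate_total_seconds` are hypothetical (not from the talk), and it assumes every step costs roughly the same.

```python
import time

# Upper bounds (in seconds) for the four time scales listed on the slide.
TIME_SCALES = [
    (60, "immediate"),
    (5 * 60, "bathroom break"),
    (60 * 60, "lunch break"),
    (12 * 60 * 60, "overnight"),
]


def classify_time_scale(seconds):
    """Map an estimated duration to one of the time scales on the slide."""
    for upper_bound, label in TIME_SCALES:
        if seconds < upper_bound:
            return label
    return "longer than overnight"


def estimate_total_seconds(step_fn, total_steps, probe_steps=3):
    """Time a few probe steps and extrapolate linearly to the full run."""
    start = time.perf_counter()
    for _ in range(probe_steps):
        step_fn()
    per_step = (time.perf_counter() - start) / probe_steps
    return per_step * total_steps
```

For example, `classify_time_scale(estimate_total_seconds(train_one_step, 10000))` would say which kind of break the full run fits into, where `train_one_step` stands for whatever single-step function the project has.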
Productivity == Iteration
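    import logging
    import time
    from functools import wraps

    import humanfriendly

    # Assumption: a module-level logger; the slide uses `logger` without defining it.
    logger = logging.getLogger(__name__)


    class timeit(object):
        """Decorator that logs how long each call to the wrapped function takes."""

        def __init__(self, name):
            self.name = name

        def __call__(self, f):
            @wraps(f)
            def wrap(*args, **kw):
                ts = time.time()
                result = f(*args, **kw)
                te = time.time()
                logger.info("%s %s" % (self.name, humanfriendly.format_timespan(te - ts)))
                return result
            return wrap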
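    import contextlib
    import logging
    import time

    from humanfriendly import format_timespan

    # Assumption: a module-level logger; the slide uses `logger` without defining it.
    logger = logging.getLogger(__name__)


    @contextlib.contextmanager
    def timer(name):
        """Log the time spent inside the with-block.

        Example:
            with timer("Some Routines"):
                routine1()
                routine2()
        """
        # time.clock() was removed in Python 3.8; perf_counter() measures elapsed time.
        start = time.perf_counter()
        yield
        end = time.perf_counter()
        duration = end - start
        readable_duration = format_timespan(duration)
        logger.info("%s %s" % (name, readable_duration))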
Use Less Data
• Sampled data
• Various data
• Synthetic data to validate hypotheses
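The two ideas above, sampling and synthetic data, can be sketched with the standard library alone. This is a minimal illustration, not the author's code: `sample_subset`, `make_synthetic_linear`, and `fit_line` are hypothetical names, and the linear relation is just an assumed toy hypothesis whose true answer is known in advance.

```python
import random


def sample_subset(records, fraction, seed=0):
    """Return a reproducible random subset so each iteration runs faster."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)


def make_synthetic_linear(n, slope, intercept, noise=0.01, seed=0):
    """Generate data with a known slope and intercept plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single feature (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x
```

If the fitting code cannot recover the parameters that generated the synthetic data, the bug is in the code rather than in the real data, and it is found in seconds instead of hours.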
Sublinear Debugging
• Prefer a pre-trained model to training from scratch
• Prefer "proven" (open-source) code to coding from scratch
• Prefer "SGD" to a "complex" optimization algorithm
Sublinear Debugging
• Log as much as possible:
  • Track BatchNorm mean/variance for the first N steps
  • Scale of logits and activations
• Rigorously validate data quality, preprocessing, and augmentation
• Spending 2 days on validation is well worth it
• Insert as many assertions as possible
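The logging and assertion advice above might look like the following in plain Python. It is a framework-free sketch under stated assumptions: `check_batch` and `log_stats` are hypothetical helpers, inputs are assumed to be rows of floats scaled to [0, 1], and labels are assumed to be integer class ids.

```python
import logging
import statistics

logger = logging.getLogger(__name__)


def check_batch(inputs, labels, num_classes, low=0.0, high=1.0):
    """Fail fast on a malformed batch instead of debugging a bad model later."""
    assert len(inputs) > 0, "empty batch"
    assert len(inputs) == len(labels), "inputs and labels differ in length"
    for row in inputs:
        assert all(low <= v <= high for v in row), "input value out of range"
    for label in labels:
        assert 0 <= label < num_classes, "label out of range: %r" % (label,)


def log_stats(step, name, values, first_n_steps=100):
    """Log mean and variance of `values` during the first N steps; return both."""
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)
    if step < first_n_steps:
        logger.info("step=%d %s mean=%.4f var=%.4f", step, name, mean, variance)
    return mean, variance
```

The same pattern applies to BatchNorm statistics, logits, or activations: pass the flattened values to `log_stats` under a descriptive name and watch whether the scale drifts in the first steps.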
Linear Feature Engineering
Engineer features for a linear model first, then switch to a more complicated model on the same representation.
Flexible Code
• We can sacrifice "code efficiency" for "flexibility"
• Exchange "raw" data between models, and do the preprocessing in code
• Unlike an API server, a machine learning task has many assumptions that can change
• We should always be prepared to rebuild the whole pipeline from scratch
Reproducible preprocessing
• Every data preprocessing step will fail on the first iteration
• Let's fall in love with the shell
Shell commands
    # Move each directory's files into a subdirectory named dummy;
    # a single mv cannot take that many files at once
    for x in *; do for xx in "$x"/*; do command mv "$xx" "$x/dummy"; done; done

    # Recursively count files in a Linux directory
    find "$DIR" -type f | wc -l

    # Remove whitespace from filenames (using shell substitution);
    # as written, this only prints the old and the new name
    for x in *\ .jpg; do echo "$x" "${x//\ /}"; done

    # Remove the files of a large directory
    find . -name '*.mol' -exec rm {} \;

    # Kill processes whose command line contains a partial string
    ps -ef | grep [some_string] | grep -v grep | awk '{print $2}' | xargs kill -9

    # Parallel ImageMagick preprocessing
    ls *.jpg | parallel -j 48 convert {} -crop 240x320+0+0 {} 2> error.log
How many commands are you familiar with?
• echo, touch, awk, sed, cat, cut, grep, xargs, find
• wait, background (&), redirect (>)
• ps, netstat
• for, if, function
• parallel, imagemagick (convert)
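    #!/bin/zsh
    set -x
    # Kill all child processes when the script is interrupted or exits
    trap 'pkill -P $$' SIGINT SIGTERM EXIT

    # Tail several log files at once, with ANSI colors
    multitailc () {
        # Build an array: zsh does not word-split an unquoted string variable
        local -a args
        for file in "$@"; do
            args+=(-cT ANSI "$file")
        done
        multitail "${args[@]}"
    }

    # Run one experiment per GPU in the background
    export CUDA_VISIBLE_DEVICES=0
    python train.py &> a.log &
    export CUDA_VISIBLE_DEVICES=1
    python train.py &> b.log &

    multitailc *.log
    wait
    echo "Finish Experiments"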
Working Process
1. Prepare "proven" data, model, or idea
2. Data validation
3. Set up evaluation metrics (at least two)
   • one is for model comparison, the other is for humans
4. Code and test whether the model is "well" trained or not
5. Model improvement (iteration)
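Step 3 asks for at least two metrics. As an illustration only, the sketch below pairs log loss (a fine-grained number for comparing models) with accuracy (a number a human reads easily) for binary classification. The choice of these two metrics is an assumption made for the example; the talk does not name specific ones.

```python
import math


def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; sensitive to small changes, good for comparison."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of correct predictions; easy for a human to interpret."""
    correct = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == bool(y))
    return correct / len(labels)
```

Reporting both side by side helps catch cases where one metric improves while the other gets worse.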
Build our best practice
• datawrapper - model - trainer
• data/ folder in project root
• experiment management
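The "datawrapper - model - trainer" split could be laid out roughly as below. This is a toy sketch and not the author's structure: the class names `DataWrapper`, `ConstantModel`, and `Trainer` are hypothetical, and the "model" is a single constant fitted with plain SGD so that the example stays self-contained.

```python
class DataWrapper:
    """Owns the data and yields fixed-size batches."""

    def __init__(self, values, batch_size):
        self.values = list(values)
        self.batch_size = batch_size

    def batches(self):
        for i in range(0, len(self.values), self.batch_size):
            yield self.values[i:i + self.batch_size]


class ConstantModel:
    """Toy model: predicts one learned constant w for every example."""

    def __init__(self):
        self.w = 0.0

    def gradient(self, batch):
        # Derivative of mean((w - y)^2) with respect to w.
        return 2.0 * sum(self.w - y for y in batch) / len(batch)


class Trainer:
    """Knows how to optimize a model on a data wrapper, and nothing else."""

    def __init__(self, model, data, lr=0.1):
        self.model = model
        self.data = data
        self.lr = lr

    def fit(self, epochs):
        for _ in range(epochs):
            for batch in self.data.batches():
                self.model.w -= self.lr * self.model.gradient(batch)
        return self.model
```

Because each part only talks to its neighbor through a small interface, the data source or the model can be swapped without touching the training loop.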
Be aware of ML's technical debt
• Recommended reading: "Machine Learning: The High-Interest Credit Card of Technical Debt" from Google
References
• Productivity is about not waiting
• Machine Learning: The High-Interest Credit Card of Technical Debt
• Patterns for Research in Machine Learning
• Development workflows for Data Scientists