
ML Productivity

Beomjun Shin
January 17, 2018

Short talks on the productivity of machine learning
Transcript

  1. • Immediate: less than 60 seconds.
     • Bathroom break: less than 5 minutes.
     • Lunch break: less than 1 hour.
     • Overnight: less than 12 hours.
     WE MUST ESTIMATE TIME BEFORE RUNNING! © Beomjun Shin
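A minimal way to act on that rule, sketched under the assumption that per-step cost is roughly constant (`estimate_total_time` and `fake_step` are illustrative names, not from the talk): time the first few steps and extrapolate.

```python
import time

def estimate_total_time(step_fn, total_steps, sample_steps=5):
    """Time the first few steps and extrapolate linearly."""
    start = time.perf_counter()
    for _ in range(sample_steps):
        step_fn()
    per_step = (time.perf_counter() - start) / sample_steps
    return per_step * total_steps

# Hypothetical stand-in for one training step.
def fake_step():
    time.sleep(0.01)

print("estimated run: %.1f s" % estimate_total_time(fake_step, total_steps=1000))
```

If the estimate lands in the "overnight" bucket, that is the moment to reach for less data or a smaller model, before launching the run.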
  2. import time
     from functools import wraps

     import humanfriendly  # third-party: human-readable durations

     class timeit(object):
         def __init__(self, name):
             self.name = name

         def __call__(self, f):
             @wraps(f)
             def wrap(*args, **kw):
                 ts = time.time()
                 result = f(*args, **kw)
                 te = time.time()
                 logger.info("%s %s" % (self.name, humanfriendly.format_timespan(te - ts)))
                 return result
             return wrap
  3. @contextlib.contextmanager
     def timer(name):
         """
         Example.
             with timer("Some Routines"):
                 routine1()
                 routine2()
         """
         start = time.perf_counter()  # time.clock() was removed in Python 3.8
         yield
         end = time.perf_counter()
         duration = end - start
         readable_duration = format_timespan(duration)
         logger.info("%s %s" % (name, readable_duration))
  4. Use Less Data
     • Sampled data
     • Various data
     • Synthetic data to validate a hypothesis
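One concrete use of synthetic data, as a pure-Python sketch (the tiny perceptron and the generated dataset are illustrative, not from the talk): build data whose answer is known, then check that the training loop can recover it.

```python
import random

# Synthetic, well-separated data: label = 1 iff x0 + x1 > 1,
# with points too close to the boundary thrown away.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(500)]
data = [(x0, x1) for x0, x1 in points if abs(x0 + x1 - 1) > 0.1][:200]
labels = [1 if x0 + x1 > 1 else 0 for x0, x1 in data]

# A tiny perceptron: on data this easy it should fit almost
# perfectly, so a low score points at a bug in our pipeline.
w0, w1, b = 0.0, 0.0, 0.0
for _ in range(50):
    for (x0, x1), y in zip(data, labels):
        pred = 1 if w0 * x0 + w1 * x1 + b > 0 else 0
        w0 += 0.1 * (y - pred) * x0
        w1 += 0.1 * (y - pred) * x1
        b += 0.1 * (y - pred)

accuracy = sum(
    (1 if w0 * x0 + w1 * x1 + b > 0 else 0) == y
    for (x0, x1), y in zip(data, labels)
) / len(data)
print("sanity accuracy:", accuracy)
```

The same check works with any model: if it cannot fit a small dataset it should trivially solve, the hypothesis is not the problem, the code is.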
  5. Sublinear Debugging
     • Prefer a pre-trained model to training from scratch
     • Prefer "proven" (open-sourced) code to coding from scratch
     • Prefer SGD to "complex" optimization algorithms
  6. Sublinear Debugging
     • Log as much as possible:
       • BatchNorm mean/variance tracking over the first N steps
       • Scale of logits and activations
     • Rigorously validate data quality, preprocessing, and augmentation
       • Even 2 days spent on validation is worth it
     • Insert as many assertions as possible
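A sketch of what "insert as many assertions as possible" can look like for batch data; the specific checks, names, and thresholds here are illustrative, not from the talk.

```python
import math

def validate_batch(images, labels, num_classes=10):
    """Cheap invariants to assert before every training step."""
    assert len(images) == len(labels), "batch size mismatch"
    for img in images:
        assert all(not math.isnan(v) for v in img), "NaN in input"
        assert all(0.0 <= v <= 1.0 for v in img), "input not normalized"
    for y in labels:
        assert 0 <= y < num_classes, "label out of range"

validate_batch([[0.0, 0.5], [1.0, 0.25]], [3, 7])  # passes silently
```

Checks like these cost almost nothing per step and turn a silently wrong run (the expensive kind) into an immediate failure.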
  7. Linear Feature Engineering
     Engineer features for a linear model first, then switch to a more complicated model on the same representation
  8. Flexible Code
     • We can sacrifice "code efficiency" for "flexibility"
     • Exchange "raw" data between models and preprocessing code
     • Unlike an API server, in a machine learning task many assumptions can change
     • We should always be prepared to rebuild the whole pipeline from scratch
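One reading of "exchange raw data between stages", sketched with JSON on disk (the file name and `preprocess` helper are hypothetical): each stage consumes plain data, not the other stage's objects.

```python
import json
import os
import tempfile

def preprocess(records):
    # Keep the output "raw" (plain dicts), so a future model,
    # or a rewritten pipeline, can consume it without this code.
    return [{"text": r.strip().lower()} for r in records]

# Stage 1 writes its result to disk...
path = os.path.join(tempfile.mkdtemp(), "preprocessed.json")
with open(path, "w") as f:
    json.dump(preprocess([" Hello ", "World"]), f)

# ...and stage 2 reads it back with no import of stage 1's code.
with open(path) as f:
    rows = json.load(f)
print(rows)  # [{'text': 'hello'}, {'text': 'world'}]
```

The loose coupling is the point: when an assumption changes, either stage can be rewritten from scratch against the file format alone.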
  9. Reproducible Preprocessing
     • Every data preprocessing step will fail on the first iteration
     • Let's fall in love with the shell
  10. # Move each directory's files into a subdirectory named dummy;
      # mv doesn't support moving too many files at once
      for x in *; do for xx in $x/*; do command mv $xx $x/dummy; done; done

      # Recursively count files in a Linux directory
      find $DIR -type f | wc -l

      # Remove whitespace from filenames (using shell substitution)
      for x in *\ .jpg; do echo $x ${x//\ /}; done

      # Remove many files from a large directory (rm alone hits the argument limit)
      find . -name '*.mol' -exec rm {} \;

      # Kill processes whose command line contains a string
      ps -ef | grep [some_string] | grep -v grep | awk '{print $2}' | xargs kill -9

      # Parallel ImageMagick preprocessing
      ls *.jpg | parallel -j 48 convert {} -crop 240x320+0+0 {} 2> error.log
  11. How many of these commands are you familiar with?
      • echo, touch, awk, sed, cat, cut, grep, xargs, find
      • wait, background (&), redirect (>)
      • ps, netstat
      • for, if, function
      • parallel, imagemagick (convert)
  12. #!/bin/zsh
      set -x
      trap 'pkill -P $$' SIGINT SIGTERM EXIT

      multitailc () {
          args=""
          for file in "$@"; do
              args+="-cT ANSI $file "
          done
          multitail $args
      }

      export CUDA_VISIBLE_DEVICES=0
      python train.py &> a.log &
      export CUDA_VISIBLE_DEVICES=1
      python train.py &> b.log &
      multitailc *.log
      wait
      echo "Finish Experiments"
  13. Working Process
      1. Prepare "proven" data, model, or idea
      2. Data validation
      3. Set up evaluation metrics (at least two)
         • one for model comparison, the other for humans
      4. Code and test whether it is "well" trained or not
      5. Model improvement (iteration)
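A minimal sketch of the "at least two metrics" rule for a binary classifier (the metric pairing is an illustrative choice, not prescribed by the talk): log loss for comparing models, accuracy for communicating with humans.

```python
import math

def log_loss(probs, labels, eps=1e-12):
    # For model comparison: smooth, so small improvements show up.
    return -sum(
        math.log(max(p if y == 1 else 1 - p, eps))
        for p, y in zip(probs, labels)
    ) / len(labels)

def accuracy(probs, labels):
    # For humans: coarse, but immediately interpretable.
    return sum((p > 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(labels)

probs, labels = [0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0]
print("accuracy:", accuracy(probs, labels))   # 0.75
print("log loss:", log_loss(probs, labels))
```

Two models can tie on accuracy while differing clearly on log loss, which is why the comparison metric and the human-facing metric should be tracked separately.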
  14. Build our best practice
      • datawrapper - model - trainer
      • data/ folder in the project root
      • experiment management
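One possible convention for the "experiment management" point, a sketch only (the timestamped layout and `new_experiment_dir` helper are assumptions, not the talk's practice): give every run its own directory with a frozen config snapshot.

```python
import datetime
import json
import os
import tempfile

def new_experiment_dir(root, name, config):
    """Create <root>/<name>-<timestamp>/ holding a config snapshot."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = os.path.join(root, "%s-%s" % (name, stamp))
    os.makedirs(path)
    with open(os.path.join(path, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    return path

run_dir = new_experiment_dir(tempfile.mkdtemp(), "baseline", {"lr": 0.01})
print(os.listdir(run_dir))  # ['config.json']
```

Logs, checkpoints, and metrics for the run can then live next to the config, so any result can be traced back to the exact settings that produced it.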
  15. Be aware of ML's technical debt
      • Recommended reading: "Machine Learning: The High-Interest Credit Card of Technical Debt" from Google
  16. References
      • Productivity is about not waiting
      • Machine Learning: The High-Interest Credit Card of Technical Debt
      • Patterns for Research in Machine Learning
      • Development workflows for Data Scientists