Treasure Data Summer Internship 2016

Takuya Kitazawa
September 30, 2016

Internship final presentation

Blog entry (in Japanese):
http://takuti.me/note/td-intern-2016/

Repositories I created during the internship:
https://github.com/takuti/datadog-anomaly-detector
https://github.com/takuti/norikra-udf-dateformat

Transcript

  1. Treasure Data Summer Internship 2016
    Real-world Machine Learning

  2. $ whoami
    Takuya Kitazawa
    github.com/takuti
    twitter.com/takuti

  3. $ curl takuti.me

  4. Congrats!

  5. Lesson from internship:
    Machine Learning is difficult…

  6. Hivemall UDFs

    1. Evaluation of ranking problems

    2. Anomaly detection
    Datadog anomaly detection — thanks @nahi!
    Very difficult…
    Customer churn prediction on TD
    Random Forest on Hivemall & td-pandas
    Sales/consulting MTGs
    Attend 2 MTGs w/ @myui and other members

  8. Item recommendation
    = Item ranking problem based on scoring function
    [Figure: scoring function f assigns scores (10, 8, 6, 2, 1, 0.5) to items 1-6 for a user; items 1, 2 and 4 are recommended]
    How can we evaluate?
    Which f is better?

  9. Implement 6 ranking measures
    [B. McFee and G. R. Lanckriet. Metric Learning to Rank. ICML’10]

  10. 1. Precision@k
    Proportion of true positives in Y: |X ∩ Y| / |Y|
    2. Recall@k
    Proportion of true positives in X: |X ∩ Y| / |X|
    3. MAP (Mean Average Precision)
    Average from Precision@1 to Precision@k
    X = truth, Y = recommend (use top-k items)
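
The three measures on this slide can be sketched in plain Python. This is a minimal illustration, not Hivemall's implementation: the function names, the `rec` and `truth` lists, and the choice to normalize average precision by the number of truth items are all assumptions made for the example.

```python
def precision_at_k(rec, truth, k):
    """Proportion of the top-k recommended items (Y) that are in the truth set (X)."""
    top_k = rec[:k]
    if not top_k:
        return 0.0
    truth_set = set(truth)
    hits = sum(1 for item in top_k if item in truth_set)
    return hits / len(top_k)


def recall_at_k(rec, truth, k):
    """Proportion of the truth items (X) that appear in the top-k recommendation (Y)."""
    truth_set = set(truth)
    if not truth_set:
        return 0.0
    hits = sum(1 for item in rec[:k] if item in truth_set)
    return hits / len(truth_set)


def average_precision_at_k(rec, truth, k):
    """Sum of Precision@i over the ranks i <= k that hold a true positive,
    divided by the number of truth items (assumed normalization)."""
    truth_set = set(truth)
    if not truth_set:
        return 0.0
    hits = 0
    total = 0.0
    for i, item in enumerate(rec[:k], start=1):
        if item in truth_set:
            hits += 1
            total += hits / i
    return total / len(truth_set)


# Hypothetical inputs: a ranked recommendation list and the truth items.
rec = [1, 3, 2, 6, 4, 5]
truth = [1, 2, 4]

p2 = precision_at_k(rec, truth, 2)           # 0.5
r2 = recall_at_k(rec, truth, 2)              # 1/3
ap2 = average_precision_at_k(rec, truth, 2)  # 1/3
```

With these assumed inputs only the first of the top-2 items is a true positive, so precision is 1/2 and recall is 1/3.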

  11. 4. AUC (Area Under the ROC Curve)
    Scores for truth items must be greater than others
    Proportion of “correct” pairs
    Expected: each of items 1, 2, 4 > each of items 6, 3, 5
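
The "proportion of correct pairs" reading of AUC can be sketched as follows. The function name and the item-to-score mappings are hypothetical, ties are counted as incorrect (an assumption), and this is not Hivemall's `auc` UDF.

```python
def pairwise_auc(scores, truth):
    """Proportion of (truth item, other item) pairs in which the truth item
    has the strictly higher score. `scores` maps item -> score."""
    truth_set = set(truth)
    pos = [s for item, s in scores.items() if item in truth_set]
    neg = [s for item, s in scores.items() if item not in truth_set]
    if not pos or not neg:
        raise ValueError("need at least one truth item and one other item")
    correct = sum(1 for p in pos for n in neg if p > n)
    return correct / (len(pos) * len(neg))


truth = [1, 2, 4]

# Hypothetical scores: every truth item outranks every other item.
perfect = {1: 10, 2: 8, 4: 6, 6: 2, 3: 1, 5: 0.5}

# Hypothetical scores: truth item 4 is scored below items 3 and 6.
imperfect = {1: 10, 2: 8, 3: 6, 6: 2, 4: 1, 5: 0.5}

auc_perfect = pairwise_auc(perfect, truth)      # 9 of 9 pairs correct
auc_imperfect = pairwise_auc(imperfect, truth)  # 7 of 9 pairs correct
```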

  12. 5. MRR (Mean Reciprocal Rank)
    Rank of first true positive
    Best: “First true positive is ranked #1”
    6. nDCG (normalized Discounted Cumulated Gain)
    Where is each true positive ranked?
    Best: truth items (1, 2, 4) ranked ahead of the others (6, 3, 5), from #1 to #6
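
A minimal sketch of the two remaining measures, assuming binary relevance (an item is either in the truth set or not) and a log2 discount for nDCG. The function names and inputs are hypothetical and the code is not taken from Hivemall.

```python
import math


def reciprocal_rank(rec, truth, k):
    """1 / rank of the first true positive within the top-k items, or 0.0 if none."""
    truth_set = set(truth)
    for rank, item in enumerate(rec[:k], start=1):
        if item in truth_set:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(rec, truth, k):
    """DCG of the top-k list divided by the DCG of an ideal top-k list."""
    truth_set = set(truth)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(rec[:k], start=1)
        if item in truth_set
    )
    # In the ideal list, the first min(k, |truth|) ranks are all true positives.
    ideal_hits = min(k, len(truth_set))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical inputs: a ranked recommendation list and the truth items.
rec = [1, 3, 2, 6, 4, 5]
truth = [1, 2, 4]

rr = reciprocal_rank(rec, truth, 2)  # 1.0: the first item is a true positive
ndcg = ndcg_at_k(rec, truth, 2)      # 1 / (1 + 1/log2(3)), about 0.613
```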

  13. Evaluate top-2 rec. on Hivemall
    … — aggregation
    SELECT
      precision(t1.rec, t2.truth, 2),          -- => 0.500
      recall(t1.rec, t2.truth, 2),             -- => 0.333
      average_precision(t1.rec, t2.truth, 2),  -- => 0.333
      auc(t1.rec, t2.truth, 2),                -- => 1.000
      mrr(t1.rec, t2.truth, 2),                -- => 1.000
      ndcg(t1.rec, t2.truth, 2)                -- => 0.613
    … — join
    “higher is better” in [0, 1] range

  14. Q. Which one should I use?

  15. Q. Which one should I use?
    A. It depends
    on your problem


    You can try all of them!

  16. Hivemall UDFs

    1. Evaluation of ranking problems

    2. Anomaly detection
    Datadog anomaly detection — thanks @nahi!
    Very difficult…
    Customer churn prediction on TD
    Random Forest on Hivemall & td-pandas
    Sales/consulting MTGs
    Attend 2 MTGs w/ @myui and other members

  17. Concept behind anomaly detectors
    Find patterns
    from past points
    score “how far from past pattern”
    Data source: http://cl-www.msi.co.jp/reports/changefinder.html

  18. ChangeFinder (CF) by Spring intern
    Outlier score
    &
    Change-point score

  19. Implement additional options for CF
    Parameter estimation logic
    1. Solving Yule-Walker equation
    2. Burg’s method
    Scoring function
    1. Logloss
    2. Hellinger distance
    new!
    new!
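
The two scoring functions can be illustrated for univariate Gaussian models. This is a sketch under assumptions: the Gaussian form, the function names, and the idea of comparing the model before and after an update with the Hellinger distance are illustrative and are not taken from the Hivemall source.

```python
import math


def gaussian_logloss(x, mean, var):
    """Negative log-likelihood of x under N(mean, var): high when x is unexpected."""
    return 0.5 * math.log(2.0 * math.pi * var) + (x - mean) ** 2 / (2.0 * var)


def gaussian_hellinger_sq(mean1, var1, mean2, var2):
    """Squared Hellinger distance between N(mean1, var1) and N(mean2, var2).
    0 means identical distributions; values approach 1 as they separate."""
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    bc = math.sqrt(2.0 * s1 * s2 / (var1 + var2)) * math.exp(
        -((mean1 - mean2) ** 2) / (4.0 * (var1 + var2))
    )
    return max(0.0, 1.0 - bc)


# A point far from the mean gets a larger logloss than a point at the mean.
typical = gaussian_logloss(0.0, mean=0.0, var=1.0)
unusual = gaussian_logloss(3.0, mean=0.0, var=1.0)

# A large shift of the model gives a larger distance than a small shift.
small_shift = gaussian_hellinger_sq(0.0, 1.0, 1.0, 1.0)
large_shift = gaussian_hellinger_sq(0.0, 1.0, 5.0, 1.0)
```

Unlike logloss, the Hellinger-based score is bounded in [0, 1], which can make thresholds easier to choose.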

  20. Try different combinations (1/2)
    [Figure panels: Outlier, Change-point]

  21. Try different combinations (2/2)
    [Figure panels: Outlier, Change-point]

  22. CF also has 4 hyperparameters
    r (float; [0, 1]): discounting rate
    k (int): order (i.e. complexity) of model
    T1 (int): window size for outliers
    T2 (int): window size for change-points
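
To give a feel for the discounting rate r and the window sizes, here is a heavily simplified sketch. It assumes a mean-only Gaussian model (so the order k is not modelled), a logloss score, and a plain moving average for smoothing; the class and function names are hypothetical and this is not the ChangeFinder implementation in Hivemall.

```python
import math
from collections import deque


class DiscountedGaussian:
    """Gaussian whose mean and variance forget old points at rate r."""

    def __init__(self, r, init_mean=0.0, init_var=1.0):
        if not 0.0 < r < 1.0:
            raise ValueError("r must be in (0, 1)")
        self.r = r
        self.mean = init_mean
        self.var = init_var

    def score_and_update(self, x):
        # Score x with the model learned from past points (logloss) ...
        score = 0.5 * math.log(2.0 * math.pi * self.var) + (
            (x - self.mean) ** 2 / (2.0 * self.var)
        )
        # ... then update the model; a larger r forgets the past faster.
        self.mean = (1.0 - self.r) * self.mean + self.r * x
        self.var = (1.0 - self.r) * self.var + self.r * (x - self.mean) ** 2
        self.var = max(self.var, 1e-6)  # floor to keep the log finite
        return score


def smooth(scores, window):
    """Moving average over the last `window` scores (the role of T1 and T2)."""
    buf = deque(maxlen=window)
    out = []
    for s in scores:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out


# Hypothetical series: flat, with a single spike at index 50.
series = [0.0] * 50 + [10.0] + [0.0] * 10
model = DiscountedGaussian(r=0.1)
outlier_scores = [model.score_and_update(x) for x in series]
smoothed = smooth(outlier_scores, window=5)
```

In this toy series the raw score peaks at the spike, while the smoothed score spreads that peak over the following window.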

  23. Alternative change-point detector:
    Implement Singular Spectrum Transform (SST)

  24. SST is much simpler than CF
    Naive: computationally heavy
    Efficient: numerical approximation
    easy-to-use, robust method
    single intuitive hyperparameter: window size w (int)
    (others can be chosen implicitly)
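
A naive SST can be sketched in pure Python as below. Several simplifications are assumptions made for brevity: only the top singular vector of each window matrix is used (rank 1), it is found by power iteration rather than a full SVD, and the number of windows and the lag are both tied to the window size w. All names are hypothetical and this is not Hivemall's `sst` UDF.

```python
import math


def lagged_windows(series, end, w):
    """The w length-w windows whose last one ends at index `end` (exclusive)."""
    return [series[end - w - j:end - j] for j in range(w - 1, -1, -1)]


def top_left_singular_vector(columns, n_iter=100):
    """Power iteration on H H^T, where the given windows are the columns of H."""
    w = len(columns[0])
    start = max(columns, key=lambda c: sum(x * x for x in c))
    norm = math.sqrt(sum(x * x for x in start))
    if norm == 0.0:
        return [1.0 / math.sqrt(w)] * w  # all-zero input: any direction works
    v = [x / norm for x in start]
    for _ in range(n_iter):
        coeffs = [sum(c[i] * v[i] for i in range(w)) for c in columns]  # H^T v
        u = [
            sum(coeffs[j] * columns[j][i] for j in range(len(columns)))
            for i in range(w)
        ]  # H (H^T v)
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            break
        v = [x / norm for x in u]
    return v


def sst_score(series, t, w):
    """Change score at time t: 1 - cos^2 between the dominant directions of
    the windows before t and the windows ending at t + w."""
    if t - 2 * w + 1 < 0 or t + w > len(series):
        raise ValueError("not enough points around t for window size w")
    u = top_left_singular_vector(lagged_windows(series, t, w))
    q = top_left_singular_vector(lagged_windows(series, t + w, w))
    cos = sum(a * b for a, b in zip(u, q))
    return max(0.0, 1.0 - cos * cos)


# Hypothetical series: one stays constant, one turns into a ramp at index 20.
constant = [5.0] * 40
changed = [1.0] * 20 + [float(i) for i in range(1, 21)]

score_constant = sst_score(constant, t=20, w=5)
score_changed = sst_score(changed, t=20, w=5)
```

The score stays near 0 while the local shape of the series is unchanged and grows when the shape after t differs from the shape before t.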

  25. Change-point detection on Hivemall
    time  x
    1     182.478
    2     176.231
    3     183.917
    4     177.798
    5     165.469
    …     …

    SELECT
      time,
      changefinder(x, "-changepoint_threshold 0.005")
    FROM
      timeseries
    ORDER BY time ASC

    SELECT
      time,
      sst(x, "-threshold 0.005")
    FROM
      timeseries
    ORDER BY time ASC

  26. Q. Which method, option and
    hyperparameter should I choose?

  27. Q. Which method, option and
    hyperparameter should I choose?
    A. It depends
    on data and your preference

  28. Hivemall UDFs

    1. Evaluation of ranking problems

    2. Anomaly detection
    Datadog anomaly detection — thanks @nahi!
    Very difficult…
    Customer churn prediction on TD
    Random Forest on Hivemall & td-pandas
    Sales/consulting MTGs
    Attend 2 MTGs w/ @myui and other members

  29. https://github.com/takuti/datadog-anomaly-detector

  30. DD supports (simple) outlier detection
    Set alert by just thresholding outlier scores
    We need to detect from more complex conditions to reduce false positives
    (e.g. check if metric-A AND metric-B show high outlier scores)
    https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/
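
The kind of combined condition mentioned on this slide can be sketched as a small predicate. The metric names, thresholds and function name are hypothetical; in the actual system such conditions were expressed as Norikra queries rather than Python.

```python
def should_alert(outlier_scores, thresholds):
    """Alert only if every monitored metric exceeds its own threshold (logical AND).

    outlier_scores: metric name -> latest outlier score
    thresholds:     metric name -> score above which the metric looks anomalous
    """
    if not thresholds:
        return False
    return all(
        outlier_scores.get(metric, 0.0) > limit
        for metric, limit in thresholds.items()
    )


# Hypothetical thresholds for two metrics.
thresholds = {"metric-A": 5.0, "metric-B": 3.0}

only_a_high = should_alert({"metric-A": 9.1, "metric-B": 0.4}, thresholds)  # False
both_high = should_alert({"metric-A": 9.1, "metric-B": 4.2}, thresholds)    # True
```

Requiring both scores to be high suppresses alerts caused by a single noisy metric, at the cost of missing incidents that show up in only one of them.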

  31. Internship Day1:
    Apply ChangeFinder (Python) for DD metrics
    Aggregate 1 month of points from system.load.norm.5
    successfully detected
    [Figure: change-point score and original points]

  32. Internship Day2-5:
    Construct DD anomaly detection system
    [Diagram labels: Query; fetch; get data points via API; ChangeFinder daemon / CLI tool for replay; send record with anomaly scores; new metric for anomaly scores; stream; notify detected anomalies; notify errors]

  33. Internship Day2-5:
    Construct DD anomaly detection system
    [Same diagram as slide 32]
    Yay! My internship is finished! (?)

  34. EPL: Esper’s fancy query language
    Aggregate metrics (LOOPBACK on Norikra):
    Detection query:
    https://github.com/takuti/norikra-udf-dateformat

  35. Feedback from @nahi
    Usability-related requests for daemon’s behavior
    Supported as soon as possible
    Feasibility of ChangeFinder for DD metrics
    CF works as expected on some metrics
    Hard to figure out useful metrics due to CF’s instability
    Lack of Norikra-side evaluation

  36. Far from conclusion
    incident?
    no problem?

  37. Q. Which method, option and
    hyperparameter should I choose?
    A. It depends
    on data and your preference
    Remark:

  38. 2-month internship was:
    enough to implement algorithms & mock system
    too short to build useful anomaly detector w/ sufficient evaluation
    Future directions
    Continuous discussions w/ metric observers
    More static analysis w/ different methods and options
    re:dash integration
    + can be research project

  39. Hivemall UDFs

    1. Evaluation of ranking problems

    2. Anomaly detection
    Datadog anomaly detection — thanks @nahi!
    Very difficult…
    Customer churn prediction on TD
    Random Forest on Hivemall & td-pandas
    Sales/consulting MTGs
    Attend 2 MTGs w/ @myui and other members

  40. Write tutorial article

  41. Churn prediction
    Focus on subscriber-based services
    (e.g. mobile telephone network)
    Churn rate
    Percentage of individuals who cancelled their contract
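
As a worked example of the definition above, churn rate is a simple ratio. The function name and the subscriber counts are made up for illustration.

```python
def churn_rate(n_cancelled, n_subscribers):
    """Percentage of subscribers who cancelled during the period."""
    if n_subscribers <= 0:
        raise ValueError("n_subscribers must be positive")
    return 100.0 * n_cancelled / n_subscribers


# Hypothetical month: 50 cancellations among 1,000 subscribers.
monthly_churn = churn_rate(50, 1000)  # 5.0 (percent)
```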

  42. Step-by-step tutorial on TD w/ td-pandas
    Preprocessing
    Model training (80% samples)
    Prediction (20% samples)
    Evaluation
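
The 80%/20% split and the evaluation step can be sketched without any library. This only illustrates the workflow: the tutorial trains a Random Forest on Hivemall, whereas the sketch below uses made-up labels and a fixed "never churns" guess in place of a trained model, and all names are hypothetical.

```python
import random


def train_test_split(samples, train_ratio=0.8, seed=42):
    """Shuffle deterministically, then split into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)


# Hypothetical labelled samples: (customer id, churned?) pairs.
samples = [(i, i % 5 == 0) for i in range(100)]
train, test = train_test_split(samples)

# Stand-in for a trained model: always predict "does not churn".
predicted = [False for _ in test]
actual = [churned for _, churned in test]
baseline_accuracy = accuracy(predicted, actual)
```

A baseline like this is a useful sanity check: with imbalanced churn labels, a model has to beat the "nobody churns" guess before its accuracy means much.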

  43. http://tinyurl.com/td-hivemall-churn-draft

  44. Hivemall UDFs

    1. Evaluation of ranking problems

    2. Anomaly detection
    Datadog anomaly detection — thanks @nahi!
    Very difficult…
    Customer churn prediction on TD
    Random Forest on Hivemall & td-pandas
    Sales/consulting MTGs
    Attend 2 MTGs w/ @myui and other members

  45. Q. Which method, option and
    hyperparameter should I choose?
    A. It depends
    on data and your preference
    Customers’ requirements are MOST important

  46. Hivemall UDFs

    1. Evaluation of ranking problems

    2. Anomaly detection
    Datadog anomaly detection — thanks @nahi!
    Very difficult…
    Customer churn prediction on TD
    Random Forest on Hivemall & td-pandas
    Sales/consulting MTGs
    Attend 2 MTGs w/ @myui and other members

  47. Backbone of real-life machine learning
    Engineering
    Wide variety of programming skills
    Integrating numerous middleware
    Science
    Understanding concepts behind equations
    Having practical point of view (e.g. complexity, usability)
    Human factor
    Experience on various data to incorporate heuristics
    Communication skills to get customers’ requirements

  48. Lesson from internship:
    Machine Learning is difficult…

  49. Lesson from internship:
    Machine Learning is difficult…
    fun!

  50. Treasure Data Summer Internship 2016
    Real-world Machine Learning
