Treasure Data Summer Internship 2016

Takuya Kitazawa
September 30, 2016

Internship final presentation

Blog entry (in Japanese):
http://takuti.me/note/td-intern-2016/

Repositories I created during the internship:
https://github.com/takuti/datadog-anomaly-detector
https://github.com/takuti/norikra-udf-dateformat

Transcript

  1. Hivemall UDFs
      1. Evaluation of ranking problems
      2. Anomaly detection

    Datadog anomaly detection
      thanks @nahi! Very difficult…

    Customer churn prediction on TD
      Random Forest on Hivemall & td-pandas

    Sales/consulting MTGs
      Attend 2 MTGs w/ @myui and other members
  2. (Agenda, same as slide 1)
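  3. Item recommendation = Item ranking problem based on scoring function

    [Figure: for a user, scoring function f ranks the items as 1 6 2 3 4 5 (scores: 10 8 6 2 1 0.5) and the top items are recommended; items 1 2 4 are the ones labelled "truth" on slide 6]

    How can we evaluate? Which f is better?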
  4. X: truth, Y: recommend (use top-k items)

    (1) Precision@k
      Portion of true positives in Y: |X ∩ Y| / |Y|
    (2) Recall@k
      Portion of true positives in X: |X ∩ Y| / |X|
    (3) MAP (Mean Average Precision)
      Average from Precision@1 to Precision@k
  5. (4) AUC (Area Under the ROC Curve)
      Scores for truth items must be greater than others
      Portion of “correct” pairs

    Expected (truth: 1 2 4, others: 6 3 5):
      1 > 6, 1 > 3, 1 > 5
      2 > 6, 2 > 3, 2 > 5
      4 > 6, 4 > 3, 4 > 5
  6. (5) MRR (Mean Reciprocal Rank)
      Rank of first true positive
      Best: “First true positive is ranked #1”
    (6) nDCG (normalized Discounted Cumulated Gain)
      Where is each true positive ranked?
      Example ranking (#1 to #6): 1 6 2 3 4 5 (truth: 1 2 4)
      Best ranking (#1 to #6): 1 2 4 6 3 5 (truth first, then others)
  7. Evaluate top-2 rec. on Hivemall

    … -- aggregation
    SELECT
      precision(t1.rec, t2.truth, 2),          -- => 0.500
      recall(t1.rec, t2.truth, 2),             -- => 0.333
      average_precision(t1.rec, t2.truth, 2),  -- => 0.333
      auc(t1.rec, t2.truth, 2),                -- => 1.000
      mrr(t1.rec, t2.truth, 2),                -- => 1.000
      ndcg(t1.rec, t2.truth, 2)                -- => 0.613
    … -- join

    “Higher is better” in [0, 1] range
  8. Q. Which one should I use?
    A. It depends on your problem

    You can try all of them!
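The six measures on slides 4 to 7 can be sketched in plain Python. This is not Hivemall's implementation: the function names are hypothetical, relevance is treated as binary, average precision is normalized by the number of truth items, and the ideal DCG is truncated at k. Those last two choices are assumptions, picked because they reproduce the numbers shown on slide 7 for the ranking 1 6 2 3 4 5 with truth items 1 2 4.

```python
import math


def precision_at(rec, truth, k):
    """Portion of the top-k recommended items that are true positives."""
    top = rec[:k]
    return sum(1 for item in top if item in truth) / len(top)


def recall_at(rec, truth, k):
    """Portion of the truth items that appear in the top-k."""
    return sum(1 for item in rec[:k] if item in truth) / len(truth)


def average_precision_at(rec, truth, k):
    """Sum of Precision@i at each hit, divided by |truth| (assumed normalization)."""
    hits, total = 0, 0.0
    for i, item in enumerate(rec[:k], start=1):
        if item in truth:
            hits += 1
            total += hits / i
    return total / len(truth)


def auc_at(rec, truth, k):
    """Portion of (truth, other) pairs in the top-k where the truth item ranks higher."""
    n_true, n_other, correct = 0, 0, 0
    for item in rec[:k]:
        if item in truth:
            n_true += 1
        else:
            n_other += 1
            correct += n_true  # every truth item seen so far outranks this item
    pairs = n_true * n_other
    return correct / pairs if pairs else 0.5  # assumption: 0.5 when no pair exists


def mrr_at(rec, truth, k):
    """Reciprocal rank of the first true positive in the top-k (0 if none)."""
    for i, item in enumerate(rec[:k], start=1):
        if item in truth:
            return 1.0 / i
    return 0.0


def ndcg_at(rec, truth, k):
    """Binary-relevance DCG of the top-k divided by the best achievable DCG at k."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(rec[:k], start=1) if item in truth)
    idcg = sum(1.0 / math.log2(i + 1)
               for i in range(1, min(k, len(truth)) + 1))
    return dcg / idcg


rec = [1, 6, 2, 3, 4, 5]   # ranking from slide 3
truth = {1, 2, 4}          # truth items from slides 5 and 6
results = {
    "precision": precision_at(rec, truth, 2),
    "recall": recall_at(rec, truth, 2),
    "average_precision": average_precision_at(rec, truth, 2),
    "auc": auc_at(rec, truth, 2),
    "mrr": mrr_at(rec, truth, 2),
    "ndcg": ndcg_at(rec, truth, 2),
}
```

With k = 2 only items 1 and 6 are evaluated, so a single hit at rank #1 is enough for AUC and MRR to reach 1.0 while recall stays at 1/3.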
  9. (Agenda, same as slide 1)
  10. Concept behind anomaly detectors

    Find patterns from past points
    Score “how far from past pattern”

    Data source: http://cl-www.msi.co.jp/reports/changefinder.html
  11. Implement additional options for CF (ChangeFinder)

    Parameter estimation logic
      1. Solving Yule-Walker equation
      2. Burg’s method
    Scoring function
      1. Logloss
      2. Hellinger distance

    (Two of these options are marked “new!” on the slide.)
  12. CF also has 4 hyperparameters

    r  (float; [0, 1])  discounting rate
    k  (int)            order (i.e. complexity) of model
    T1 (int)            window size for outliers
    T2 (int)            window size for change-points
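To make the roles of the discounting rate r and the two scoring functions more concrete, here is a heavily simplified sketch. It is not ChangeFinder: it drops the autoregressive model (order k) and both smoothing windows (T1, T2), and only keeps a Gaussian whose mean and variance are updated with discounting. The function names, the initial variance of 1.0 and the variance floor are all hypothetical choices made for illustration.

```python
import math


def hellinger_sq(mu1, var1, mu2, var2):
    """Squared Hellinger distance between two univariate Gaussians."""
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    coeff = math.sqrt(2.0 * s1 * s2 / (var1 + var2))
    return 1.0 - coeff * math.exp(-((mu1 - mu2) ** 2) / (4.0 * (var1 + var2)))


def discounted_scores(xs, r=0.05, scoring="logloss"):
    """Score each point by how far it is from the pattern learned from past points.

    r in [0, 1] is the discounting rate: larger r forgets the past faster.
    """
    mu, var = xs[0], 1.0  # hypothetical initial state
    scores = []
    for x in xs:
        # Discounted update of the model parameters
        new_mu = (1.0 - r) * mu + r * x
        new_var = max((1.0 - r) * var + r * (x - new_mu) ** 2, 1e-8)
        if scoring == "logloss":
            # Negative log-likelihood of x under the model fitted to past points
            score = 0.5 * math.log(2.0 * math.pi * var) + (x - mu) ** 2 / (2.0 * var)
        elif scoring == "hellinger":
            # How much the model itself moved because of x
            score = hellinger_sq(mu, var, new_mu, new_var)
        else:
            raise ValueError("scoring must be 'logloss' or 'hellinger'")
        scores.append(score)
        mu, var = new_mu, new_var
    return scores


# Toy series with a level shift at index 7
series = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 20.0, 20.1, 19.9]
logloss_scores = discounted_scores(series, r=0.05, scoring="logloss")
hellinger_scores = discounted_scores(series, r=0.05, scoring="hellinger")
```

On this toy series both scoring functions peak at the point where the level shifts; the logloss reacts to the surprising observation, the Hellinger distance to the resulting change in the fitted distribution.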
  13. SST (Singular Spectrum Transformation) is much simpler than CF

    Naive: computationally heavy
    Efficient: numerical approximation

    easy-to-use, robust method
    single intuitive hyperparameter: window size w (int)
    (others can be chosen implicitly)
  14. Change-point detection on Hivemall

    time  x
    1     182.478
    2     176.231
    3     183.917
    4     177.798
    5     165.469
    …     …

    SELECT
      time, changefinder(x, "-changepoint_threshold 0.005")
    FROM
      timeseries
    ORDER BY time ASC

    SELECT
      time, sst(x, "-threshold 0.005")
    FROM
      timeseries
    ORDER BY time ASC
  15. (Agenda, same as slide 1)
  16. DD (Datadog) supports (simple) outlier detection

    Set alert by just thresholding outlier scores

    We need to detect from more complex conditions
      reduce false positives (e.g. check if metric-A AND metric-B show high outlier scores)

    https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/
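The "metric-A AND metric-B" condition on slide 16 could look like the following minimal sketch. It assumes the two score series are already aligned on the same timestamps; the function name and the threshold values are hypothetical and not part of Datadog or of the internship code.

```python
def joint_alerts(scores_a, scores_b, threshold_a, threshold_b):
    """Return the indices where both metrics show a high outlier score."""
    return [
        i
        for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if a > threshold_a and b > threshold_b
    ]


# Toy outlier scores for two metrics over four time steps
metric_a = [0.1, 0.9, 0.8, 0.2]
metric_b = [0.2, 0.1, 0.7, 0.9]
alerts = joint_alerts(metric_a, metric_b, threshold_a=0.5, threshold_b=0.5)
```

Thresholding either metric alone would fire at three of the four time steps here, while the combined condition fires only where both scores are high, which is the intended reduction of false positives.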
  17. Internship Day1: Apply ChangeFinder (Python) for DD metrics

    Aggregate 1 month points from system.load.norm.5
    successfully detected

    [Figure: change-point score plotted against the original points]
  18. Internship Day2-5: Construct DD anomaly detection system

    [Diagram labels:]
      get data points via API
      ChangeFinder daemon / CLI tool for replay
      send record with anomaly scores
      new metric for anomaly scores
      stream / fetch / Query
      notify detected anomalies
      notify errors
  19. Internship Day2-5: Construct DD anomaly detection system

    (Same diagram as slide 18)

    Yay! My internship is finished! (?)
  20. EPL: Esper’s fancy query language

    Aggregate metrics (LOOPBACK on Norikra):
    Detection query:

    https://github.com/takuti/norikra-udf-dateformat
  21. Feedback from @nahi

    Usability-related requests for daemon’s behavior
      Supported as soon as possible

    Feasibility of ChangeFinder for DD metrics
      CF works as expected on some metrics
      Hard to figure out useful metrics due to CF’s instability
      Lack of Norikra-side evaluation
  22. Remark:

    Q. Which method, option and hyperparameter should I choose?
    A. It depends on data and your preference
  23. 2-month intern was:
      enough to implement algorithms & mock system
      too short to build useful anomaly detector w/ sufficient evaluation

    Future directions
      Continuous discussions w/ metric observers
      More static analysis w/ different methods and options
      re:dash integration
      + can be research project
  24. (Agenda, same as slide 1)
  25. Churn prediction

    Focus on subscriber-based services (e.g. mobile telephone network)

    Churn rate
      Percentage of individuals who cancelled contract …
  26. (Agenda, same as slide 1)
  27. Q. Which method, option and hyperparameter should I choose?
    A. It depends on data and your preference

    Customers’ requirements are MOST important
  28. (Agenda, same as slide 1)
  29. Backbone of real-life machine learning

    Engineering
      Wide variety of programming skills
      Integrating numerous middleware

    Science
      Understanding concepts behind equations
      Having practical point of view (e.g. complexity, usability)

    Human factor
      Experience on various data to incorporate heuristics
      Communication skills to get customers’ requirements