Takuya Kitazawa
September 30, 2016

# Treasure Data Summer Internship 2016

Internship final presentation

Blog entry (in Japanese):
http://takuti.me/note/td-intern-2016/

Repository I created during the internship:
https://github.com/takuti/norikra-udf-dateformat

## Transcript

1. Treasure Data Summer Internship 2016
Real-world Machine Learning

2. \$ whoami
Takuya Kitazawa
github.com/takuti

3. \$ curl takuti.me

4. Congrats!

5. Lesson from internship:
Machine Learning is difficult…

6. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

7. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

8. Item recommendation
= Item ranking problem based on scoring function
[Figure: user → f → items 1, 6, 2, 3, 4, 5 with scores 10, 8, 6, 2, 1, 0.5 → recommend; items 1, 2, 4 are shown separately]
How can we evaluate?
Which f is better?

9. Implement 6 ranking measures
[B. McFee and G. R. Lanckriet. Metric Learning to Rank. ICML’10]

10. X = truth, Y = recommend (use top-k items)
1. Precision@k
Portion of true positives in Y: |X ∩ Y| / |Y|
2. Recall@k
Portion of true positives in X: |X ∩ Y| / |X|
3. MAP (Mean Average Precision)
Average of Precision@i over the ranks i (from 1 to k) at which a true positive appears

11. 4. AUC (Area Under the ROC Curve)
Scores for truth items must be greater than others
Portion of “correct” pairs
Expected, for truth items 1, 2, 4 and other items 6, 3, 5:
1 > 6, 1 > 3, 1 > 5
2 > 6, 2 > 3, 2 > 5
4 > 6, 4 > 3, 4 > 5

12. 5. MRR (Mean Reciprocal Rank)
Rank of first true positive
Best: “First true positive is ranked #1”
(e.g. ranking 1 6 2 3 4 5 for truth 1 2 4, positions #1 to #6)
6. nDCG (normalized Discounted Cumulated Gain)
Where is each true positive ranked?
Best: truth items ranked above the others
(e.g. ranking 1 2 4 6 3 5, positions #1 to #6)

13. Evaluate top-2 rec. on Hivemall
… -- aggregation
SELECT
  precision(t1.rec, t2.truth, 2),          -- => 0.500
  recall(t1.rec, t2.truth, 2),             -- => 0.333
  average_precision(t1.rec, t2.truth, 2),  -- => 0.333
  auc(t1.rec, t2.truth, 2),                -- => 1.000
  mrr(t1.rec, t2.truth, 2),                -- => 1.000
  ndcg(t1.rec, t2.truth, 2)                -- => 0.613
… -- join
“higher is better”, in [0, 1] range
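
The six values reported on this slide can be reproduced in plain Python. This is only a sketch using common binary-relevance definitions, not Hivemall's actual implementation. The inputs are assumptions read off the earlier figures: the recommended ranking is taken to be 1, 6, 2, 3, 4, 5 and the truth items to be 1, 2, 4. The function names mirror the UDF names but are otherwise hypothetical.

```python
import math

def precision_at(rec, truth, k):
    top = rec[:k]
    return sum(1 for item in top if item in truth) / len(top)

def recall_at(rec, truth, k):
    return sum(1 for item in rec[:k] if item in truth) / len(truth)

def average_precision(rec, truth, k):
    # Sum Precision@i at every rank i holding a true positive,
    # normalised by the number of truth items (an assumed convention).
    hits, total = 0, 0.0
    for i, item in enumerate(rec[:k], start=1):
        if item in truth:
            hits += 1
            total += hits / i
    return total / len(truth)

def auc(rec, truth, k):
    # Fraction of (truth, non-truth) pairs inside the top-k list
    # in which the truth item is ranked higher.
    top = rec[:k]
    hits, correct = 0, 0
    for item in top:
        if item in truth:
            hits += 1
        else:
            correct += hits
    pairs = hits * (len(top) - hits)
    return correct / pairs if pairs else 0.5  # 0.5 when undefined (assumption)

def mrr(rec, truth, k):
    for i, item in enumerate(rec[:k], start=1):
        if item in truth:
            return 1.0 / i
    return 0.0

def ndcg(rec, truth, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(rec[:k], start=1) if item in truth)
    idcg = sum(1.0 / math.log2(i + 1)
               for i in range(1, min(k, len(truth)) + 1))
    return dcg / idcg

rec = [1, 6, 2, 3, 4, 5]   # assumed ranking from the figure
truth = {1, 2, 4}          # assumed truth items
for f in (precision_at, recall_at, average_precision, auc, mrr, ndcg):
    print(f.__name__, round(f(rec, truth, 2), 3))
```

Under these assumptions the printed values agree with the ones listed on the slide.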

14. Q. Which one should I use?

15. Q. Which one should I use?
A. It depends

You can try all of them!

16. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

17. Concept behind anomaly detectors
Find patterns
from past points
score “how far from past pattern”
Data source: http://cl-www.msi.co.jp/reports/changefinder.html

18. ChangeFinder (CF) by Spring intern
Outlier score
&
Change-point score

19. Implement additional options for CF
Parameter estimation logic
1. Solving Yule-Walker equation
2. Burg’s method
Scoring function
1. Logloss
2. Hellinger distance
new!
new!
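
As a rough illustration of two of the options named above, the sketch below solves the Yule-Walker equations for an AR model with the Levinson-Durbin recursion, and scores points with logloss and Hellinger distance. It assumes univariate Gaussian distributions; all function names are hypothetical and this is not the Hivemall code. Burg's method is not shown.

```python
import math

def autocovariance(xs, max_lag):
    """Biased sample autocovariances c[0..max_lag]."""
    n = len(xs)
    mean = sum(xs) / n
    return [
        sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n)) / n
        for lag in range(max_lag + 1)
    ]

def yule_walker(xs, order):
    """AR coefficients a_1..a_order and residual variance (Levinson-Durbin)."""
    c = autocovariance(xs, order)
    a = []
    err = c[0]
    for m in range(1, order + 1):
        acc = c[m] - sum(a[j] * c[m - 1 - j] for j in range(m - 1))
        refl = acc / err  # reflection coefficient of order m
        a = [a[j] - refl * a[m - 2 - j] for j in range(m - 1)] + [refl]
        err *= 1.0 - refl * refl
    return a, err

def gaussian_logloss(x, mu, var):
    """Negative log-likelihood of x under N(mu, var)."""
    return 0.5 * math.log(2.0 * math.pi * var) + (x - mu) ** 2 / (2.0 * var)

def gaussian_hellinger(mu1, var1, mu2, var2):
    """Hellinger distance between N(mu1, var1) and N(mu2, var2), in [0, 1]."""
    s = var1 + var2
    bc = math.sqrt(2.0 * math.sqrt(var1 * var2) / s) \
        * math.exp(-((mu1 - mu2) ** 2) / (4.0 * s))
    return math.sqrt(max(0.0, 1.0 - bc))

coeffs, resid_var = yule_walker([1.0, 2.0, 3.0, 4.0, 5.0], order=2)
print(coeffs, resid_var)
print(gaussian_logloss(3.0, 0.0, 1.0), gaussian_hellinger(0.0, 1.0, 3.0, 1.0))
```

Logloss grows without bound for surprising points, while the Hellinger distance stays in [0, 1], which may make thresholds easier to choose.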

20. Outlier Change-point
Try different combinations (1/2)

21. Outlier Change-point
Try different combinations (2/2)

22. CF also has 4 hyperparameters
r (float; [0, 1]): discounting rate
k (int): order (i.e. complexity) of model
T1 (int): window size for outliers
T2 (int): window size for change-points
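
To show where r, T1 and T2 enter, here is a deliberately simplified two-stage sketch. It is an assumption-heavy stand-in for ChangeFinder: the AR model is replaced by a plain discounted Gaussian (so the order k is effectively 0 and unused), and all names are hypothetical.

```python
import math
from collections import deque

class DiscountedGaussian:
    """Online Gaussian whose mean and variance forget old points at rate r."""

    def __init__(self, r):
        self.r = r
        self.mu = 0.0
        self.var = 1.0

    def score_then_update(self, x):
        var = max(self.var, 1e-12)  # guard against log(0)
        score = 0.5 * math.log(2.0 * math.pi * var) + (x - self.mu) ** 2 / (2.0 * var)
        self.mu = (1.0 - self.r) * self.mu + self.r * x
        self.var = (1.0 - self.r) * self.var + self.r * (x - self.mu) ** 2
        return score

def moving_average(values, window):
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

def two_stage_scores(xs, r, t1, t2):
    # Stage 1: logloss of each raw point gives the outlier score.
    stage1 = DiscountedGaussian(r)
    outlier = [stage1.score_then_update(x) for x in xs]
    # Smooth outlier scores over T1, score the smoothed series again,
    # then smooth over T2 to obtain the change-point score.
    smoothed = moving_average(outlier, t1)
    stage2 = DiscountedGaussian(r)
    change = moving_average([stage2.score_then_update(y) for y in smoothed], t2)
    return outlier, change

# Synthetic series with a level shift at index 50.
xs = [(-1) ** t * 0.5 for t in range(50)] + [10 + (-1) ** t * 0.5 for t in range(50)]
outlier, change = two_stage_scores(xs, r=0.1, t1=5, t2=5)
print(outlier.index(max(outlier)), change.index(max(change)))
```

A larger r forgets the past faster, and larger windows give smoother but more delayed scores.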

23. Alternative change-point detector:
Implement Singular Spectrum Transform (SST)

24. SST is much simpler than CF
Naive: computationally heavy
Efficient: numerical approximation
easy-to-use, robust method
single intuitive hyperparameter: window size w (int)
(others can be chosen implicitly)
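
A minimal sketch of the naive variant, under simplifying assumptions: only the top left singular vector of the past and future trajectory matrices is compared (rank 1), it is found by power iteration, and the defaults for the number of columns n and the lag are guesses made for illustration. Function names are hypothetical.

```python
import math

def top_left_singular_vector(columns, iters=50):
    """Power iteration on H H^T, where H has the given columns."""
    w = len(columns[0])
    v = [float(i + 1) for i in range(w)]  # arbitrary non-uniform start
    norm = math.sqrt(sum(a * a for a in v))
    v = [a / norm for a in v]
    for _ in range(iters):
        proj = [sum(c[i] * v[i] for i in range(w)) for c in columns]
        nxt = [sum(p * c[i] for p, c in zip(proj, columns)) for i in range(w)]
        norm = math.sqrt(sum(a * a for a in nxt))
        if norm == 0.0:
            break
        v = [a / norm for a in nxt]
    return v

def windows_ending_at(xs, end, w, n):
    """n sliding windows of length w; the j-th one ends just before end - j."""
    return [xs[end - j - w:end - j] for j in range(n)]

def sst_scores(xs, w, n=None, lag=None):
    n = w if n is None else n            # assumed default
    lag = w // 2 if lag is None else lag  # assumed default
    scores = [0.0] * len(xs)
    for t in range(w + n - 1, min(len(xs), len(xs) - lag + 1)):
        u = top_left_singular_vector(windows_ending_at(xs, t, w, n))
        q = top_left_singular_vector(windows_ending_at(xs, t + lag, w, n))
        cos = sum(a * b for a, b in zip(u, q))
        # 0 when past and future share a direction, 1 when orthogonal.
        scores[t] = min(1.0, max(0.0, 1.0 - cos * cos))
    return scores

# Constant signal that switches to an alternating one at index 30.
xs = [1.0] * 30 + [(-1.0) ** t for t in range(30)]
scores = sst_scores(xs, w=4, n=4, lag=7)
print(scores[10], scores[30])
```

Computing a decomposition at every time step is what makes the naive form heavy; only w has to be chosen by hand in this sketch.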

25. Change-point detection on Hivemall

| time | x |
|---|---|
| 1 | 182.478 |
| 2 | 176.231 |
| 3 | 183.917 |
| 4 | 177.798 |
| 5 | 165.469 |
| … | … |

SELECT
  time,
  changefinder(x, "-changepoint_threshold 0.005")
FROM
  timeseries
ORDER BY time ASC

SELECT
  time,
  sst(x, "-threshold 0.005")
FROM
  timeseries
ORDER BY time ASC

26. Q. Which method, option and
hyperparameter should I choose?

27. Q. Which method, option and
hyperparameter should I choose?
A. It depends

28. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

30. DD supports (simple) outlier detection
Set alert by just thresholding outlier scores
We need to detect from more complex conditions
reduce false positives
(e.g. check if metric-A AND metric-B show high outlier scores)
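
The AND condition in the example can be sketched as below. The function name, the thresholds and the sample scores are made up for illustration.

```python
def combined_alerts(scores_a, scores_b, threshold_a, threshold_b):
    """Flag a time step only when both metrics show a high outlier score."""
    return [
        a > threshold_a and b > threshold_b
        for a, b in zip(scores_a, scores_b)
    ]

# Hypothetical outlier scores for metric-A and metric-B at four time steps.
metric_a = [0.1, 5.2, 4.8, 0.3]
metric_b = [0.2, 0.4, 6.1, 7.0]
print(combined_alerts(metric_a, metric_b, threshold_a=3.0, threshold_b=3.0))
```

Thresholding either metric alone would fire at three of the four steps here, while the combined rule fires only where both agree.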

31. Internship Day1:
Apply ChangeFinder (Python) for DD metrics
Aggregate 1 month points from system.load.norm.5
[Figure: original points and change-point score; a change was successfully detected]

32. Internship Day2-5:
Construct DD anomaly detection system
[Diagram labels: get data points via API; new metric for anomaly scores; ChangeFinder daemon / CLI tool for replay; send record with anomaly scores; notify detected anomalies; notify errors. Arrow labels: stream, fetch, Query]

33. Internship Day2-5:
Construct DD anomaly detection system
[Same diagram as on the previous slide]
Yay! My internship is finished! (?)

34. EPL: Esper’s fancy query language
Aggregate metrics (LOOPBACK on Norikra):
Detection query:
https://github.com/takuti/norikra-udf-dateformat

35. Feedback from @nahi
Usability-related requests for daemon’s behavior
Supported as soon as possible
Feasibility of ChangeFinder for DD metrics
CF works as expected on some metrics
Hard to figure out useful metrics due to CF’s instability
Lack of Norikra-side evaluation

36. Far from conclusion
incident?
no problem?

37. Q. Which method, option and
hyperparameter should I choose?
A. It depends
Remark:

38. 2-month internship was:
enough
to implement algorithms & mock system
too short
to build useful anomaly detector w/ sufficient evaluation
Future directions
Continuous discussions w/ metric observers
More static analysis w/ different methods and options
re:dash integration
+ can be research project

39. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

40. Write tutorial article

41. Churn prediction
Focus on subscriber-based services
(e.g. mobile telephone network)
Churn rate
Percentage of individuals who cancelled contract

42. Step-by-step tutorial on TD w/ td-pandas
Preprocessing
Model training (80% samples)
Prediction (20% samples)
Evaluation
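
The 80% / 20% split and a basic evaluation can be sketched in plain Python. This is not the tutorial's td-pandas or Hivemall code; the helper names, the seed and the toy labels are assumptions.

```python
import random

def train_test_split(rows, train_ratio=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def churn_rate(labels):
    """Share of customers whose label is 1 (cancelled contract)."""
    return sum(labels) / len(labels)

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy data: (customer_id, churned) pairs with a 20% churn rate.
rows = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]
train, test = train_test_split(rows)
print(len(train), len(test), churn_rate([label for _, label in rows]))

# Baseline that predicts "no churn" for every test customer.
y_true = [label for _, label in test]
print(accuracy(y_true, [0] * len(y_true)))
```

The all-zero baseline already scores high on imbalanced churn data, which is why accuracy alone may not be enough in the evaluation step.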

43. http://tinyurl.com/td-hivemall-churn-draft

44. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

45. Q. Which method, option and
hyperparameter should I choose?
A. It depends
Customers’ requirements are MOST important

46. Hivemall UDFs
1. Evaluation of ranking problems
2. Anomaly detection
Datadog anomaly detection — thanks @nahi!
Very difficult…
Customer churn prediction on TD
Random Forest on Hivemall & td-pandas
Sales/consulting MTGs
Attend 2 MTGs w/ @myui and other members

47. Backbone of real-life machine learning
Engineering
Wide variety of programming skills
Integrating numerous middleware
Science
Understanding concepts behind equations
Having practical point of view (e.g. complexity, usability)
Human factor
Experience on various data to incorporate heuristics
Communication skills to get customers’ requirements

48. Lesson from internship:
Machine Learning is difficult…

49. Lesson from internship:
Machine Learning is difficult…
fun!

50. Treasure Data Summer Internship 2016
Real-world Machine Learning