superChing
October 01, 2015


## Transcript

1. Kaggle ICDM Challenge
林煒清/Wayne

2. data

3. My imagination of the truth (diagram)

4. There are many methods you could come up with.
As a newbie, I used classification:
classify whether a (device, cookie) pair of nodes belongs to the same user or not.

5. My imagination of what it looks like after classification (diagram)

6. Every score came out as 99.
Puzzled...
sampling!!
What do you do if you want to sample 10%?

7. Sampling training data
What we want is an i.i.d. sample ~ P(device, cookie) = uniform(all possible pairs).
1. Take the Cartesian product of devices and cookies, then sample 10%.
   a. Too slow.
2. Previously I sampled x% of the devices and x% of the cookies and Cartesian-producted them.
   a. WRONG?
3. Take one sample from the devices and one from the cookies; you can prove that the result ~ uniform(all pairs).
   a. The device samples and the cookie samples need to be of equal size; then horizontally concatenate the two.
(diagram: devices and cookies as a fully connected bipartite graph of candidate pairs)
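Method 3 above can be sketched in Python (a minimal sketch with toy ID ranges; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
devices = np.arange(1_000)   # toy device ids
cookies = np.arange(2_000)   # toy cookie ids

# Draw n device samples and n cookie samples independently and
# concatenate them horizontally: each (device, cookie) pair then has
# probability 1 / (|devices| * |cookies|), i.e. uniform over the full
# Cartesian product, without ever materializing that product.
n = 10_000
pairs = np.column_stack([
    rng.choice(devices, size=n, replace=True),
    rng.choice(cookies, size=n, replace=True),
])
```

Because the two draws are independent and each is uniform, the joint distribution factorizes into uniform(devices) × uniform(cookies), which is exactly uniform over all pairs.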

8. Lesson 1
● Define the problem well (what are the subproblems or surrogate problems?)
○ Different definitions determine every step that follows.
○ e.g. what the "-1" label means: different interpretations imply different sampling strategies; if the organizers don't say, you should try them out.
● Sample according to your definition.
○ At first I didn't sample -1, but later I used -1 anyway; clearly I neither defined the problem well nor followed my own problem formulation.

9. We have data:
● total > 10 GB
● devices: 143K labeled (203M total)
● cookies: 1.64M labeled (217M total)
● 180K positives
● 143K * 1.64M - 180K ≈ 235 billion negatives
● positive : negative ~ 1 : 1.2M
● feature anonymous_c2: ~30K categories
● IP feature: > 10M categories
● IP properties may be tree-structured (ignored outright)

10. Dealing with imbalanced labels
● Ryan found an important feature that filters out most of the negatives.
○ After filtering, positive : negative ~ 1 : 529.
● Subsampling, oversampling.
● Decision threshold.
● And there's more...
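The negative-subsampling idea can be sketched like this (toy numbers; the 1:10 target ratio is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy label vector: 100 positives hidden among 50,000 examples
y = np.zeros(50_000, dtype=int)
y[rng.choice(50_000, size=100, replace=False)] = 1

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# keep every positive, subsample negatives down to a 1:10 ratio
keep_neg = rng.choice(neg_idx, size=10 * len(pos_idx), replace=False)
train_idx = np.concatenate([pos_idx, keep_neg])
```

If you train on a rebalanced sample like this, remember that the model's decision threshold (or predicted probabilities) must be recalibrated for the true class ratio.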

11. Baseline (thanks to Chuyu's filtered dataset)
Just the IP Jaccard feature and a threshold learner; threshold > 0 means the IP sets have at least some intersection.
Note: this is not a learning curve; it only shows how the simulated competition score converges.
Scoring (predicted : truth): empty : empty = 1 point; empty : something = 0 points.
Used a 1/10 sample (all my small laptop could hold).
Model with rank: 1
Parameters: {'thres__threshold': 0.12}
validation score: 0.4
If you got 0.3x, you would rank outside the top 50%.
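The baseline described above, an IP Jaccard feature plus a threshold of 0, can be sketched as follows (function names are mine, not from the talk):

```python
def ip_jaccard(device_ips, cookie_ips):
    """Jaccard similarity between a device's and a cookie's IP sets."""
    a, b = set(device_ips), set(cookie_ips)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def predict_same_user(device_ips, cookie_ips, threshold=0.0):
    """Threshold learner: similarity above 0 means the IP sets intersect."""
    return ip_jaccard(device_ips, cookie_ips) > threshold
```

With threshold = 0 this reduces to "predict a match iff the device and cookie share at least one IP", which is why it already beats an empty submission.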

12. The second model
Features = basics += ip0 += all combinations and polynomials (if possible...)
A small step to avoid OOM: drop rare values.
anonymous_c2_d: 23,892 of its 31,689 values occur fewer than 50 times.
Final feature count: roughly tens of thousands.
Important features:
1. ip0 jaccard x others
2. anonymous_c2 vs anonymous_c2
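The rare-value pruning step might look like this in pandas (a sketch; the "RARE" bucket name is mine, and the threshold of 50 follows the slide's numbers):

```python
import pandas as pd

def drop_rare_values(s: pd.Series, min_count: int = 50) -> pd.Series:
    """Collapse categorical values seen fewer than min_count times into
    a single 'RARE' bucket, shrinking the one-hot feature dimension."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), "RARE")

# toy usage: 'b' occurs only 5 times, so it gets bucketed
s = pd.Series(["a"] * 60 + ["b"] * 5)
pruned = drop_rare_values(s)
```

Bucketing instead of deleting keeps row counts intact while still preventing the categorical encoding from exploding in memory.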

13. The third model
Features after filtering by ip0.
Important features:
1. anonymous_c2 vs anonymous_c2
2. ip0 jaccard x others

14. The fourth model
Features after filtering by ip0, plus Jaccard and cosine over the six IP columns (no tf-idf; it should be better with it).
Important features:
1. ip cosine x others (the count-based version matters more)
2. ip jaccard (dropped out of the top 10)
3. anonymous_c2 vs anonymous_c2 (dropped out of the top 10)

15. IP index > 1e8
How do you run all these computations on such a table? groupBy into arrays? Forget it.
Build a sparse matrix with scipy:

```python
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse(df: pd.DataFrame, data_col: str) -> csr_matrix:
    # one row per ID, one column per IP, cell = the chosen count column
    row = df['index'].values
    col = df['ip'].values
    data = df[data_col].values
    return csr_matrix((data, (row, col)))
```

Then do the arithmetic with scipy's sparse operations. There are six IP information columns; compute Jaccard and cosine for each.

| ID  | IP       | count1 | ... | countN |
|-----|----------|--------|-----|--------|
| 123 | 4345123  | 1      | ... | ...    |
| 123 | 12345678 | 200    | ... | ...    |
| 345 | 234      | 0      | ... | ...    |

The IP table is especially interesting: you can interrogate it with any NLP technique (bag of words is a perfect assumption here).
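The per-column Jaccard and cosine computations on those sparse matrices might be sketched as follows (helper names are mine; the Jaccard version assumes binarized matrices):

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csr_matrix

def row_normalize(X):
    """Scale each row of a sparse matrix to unit L2 norm."""
    norms = np.sqrt(X.multiply(X).sum(axis=1)).A.ravel()
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    return sparse.diags(1.0 / norms) @ X

def sparse_cosine(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    return row_normalize(A) @ row_normalize(B).T

def sparse_jaccard(A, B):
    """Pairwise Jaccard similarity between binary rows of A and B."""
    inter = (A @ B.T).toarray().astype(float)      # |a ∩ b|
    union = A.sum(axis=1).A + B.sum(axis=1).A.T - inter
    union[union == 0] = 1.0
    return inter / union
```

Everything stays in sparse matrix products, so the >1e8 IP dimension is never densified.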

16. The fifth model
features += SVD += K-means
Also used tf-idf, so it did better; I don't know which change caused the improvement.
The component counts were picked arbitrarily; no time left to tune.
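The SVD and k-means features might be produced like this (a sketch on random data; the component and cluster counts are arbitrary, matching the slide's admission that they were not tuned):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# stand-in for a tf-idf-weighted entity-by-IP sparse matrix
X = sparse_random(100, 50, density=0.1, format="csr", random_state=0)

# dense low-rank embedding of each row (works directly on sparse input)
svd = TruncatedSVD(n_components=10, random_state=0)
emb = svd.fit_transform(X)            # shape (100, 10)

# cluster membership as an extra categorical feature
km = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_id = km.fit_predict(emb)      # shape (100,)
```

The SVD embedding and the cluster id can then be appended to the pairwise feature set alongside the raw Jaccard/cosine columns.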

17. (chart: simulated scores for the model variants — ip0 jaccard; ip0 jaccard + categorical basics; ip + all basics; ip + numerical; ip + only is_same; ip including svd + k-means + jaccard + cosine; ip - svd - kmeans; ip - kmeans)

18. (same model-comparison chart as the previous slide)

19. Lesson 2
… weeks.
… weeks.
… weeks.

20. IPython Cluster computation

21. IPython Cluster computation

22. Obstacles:
1. Training took forever on my small laptop.
   a. Moved to AWS (thanks to a teammate for telling me about spot instances).
2. Spark: *@##\$^%^*#%
   a. Computing the candidate training set was inexplicably slow.
   b. Crashed inexplicably.
   c. Docker ate all my disk...
   d. Switched to pandas.
3. pandas: \$#%^\$^&%&*
   a. Had to parallelize it myself, ORZ.
4. scikit-learn: @#\$#\$#\$^%*&
   a. OOM

23. Lesson 3
● Set up and get familiar with your personal toolbox.
I spent most of my time exploring and learning tools --- very slow iteration... and a competition has a time limit.
"Responsible Analysis and Quick Iteration Produce High-Performing Predictive Models" -- Nic Kridler, a Kaggle Master

24. Collaborative Data Science?
ENV
GitHub? ... It wasn't created with data science in mind.
Domino ... an awesome Data Science Platform.
TEAM
At first I had no idea how to do teamwork on predictive modeling.
KDD Cup: Team work is crucial – the first four winning teams were 4 to 21 people in size. Within a team, the most important role of every team member is to bring features. Often, team members start working independently not to get exposed to each other's features, and merge their results later. Quite often, the teams are not even planning to work together early during the competition – they decide to run together once they establish their

25. Collaborative Data Science?