superChing
October 01, 2015

Transcript

  1. Kaggle ICDM Challenge
    林煒清/Wayne


  2. data


3. My imagination of the truth
    [figure: devices and cookies with unknown links, marked "?"]


4. There are many methods you could come up with.
    As a newbie, I used classification:
    classify whether two nodes (a device and a cookie) are the same user or not.


5. My imagination of what it looks like after classification
    [figure: predicted device-cookie links, marked "?"]


6. All the scores were 99.
    Puzzled... sampling!!
    What do you do if you want to sample 10% of
    device-cookie pairs?


7. Sampling training data
    What we want is an i.i.d. sample ~ P(device, cookie) = uniform(all possible pairs).
    1. Take the Cartesian product of devices and cookies, then sample 10%.
    a. Too slow.
    2. Previously I sampled x% of devices and x% of cookies and took their
    Cartesian product.
    a. WRONG?
    3. Take one sample from the devices and one from the cookies; you can prove
    that the result ~ uniform(all pairs). (See the sketch after this slide.)
    a. The device samples and the cookie samples need to have equal size;
    then horizontally concatenate the two.
    [figure: a fully connected bipartite graph between devices and cookies,
    restating the question: how do you sample 10% of device-cookie pairs?]
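    A minimal NumPy sketch of option 3 (the ID ranges and pair count here are
    stand-ins, not the real tables): drawing the two sides independently and
    uniformly, with equal sizes, yields i.i.d. draws from uniform(all pairs).

      import numpy as np

      rng = np.random.default_rng(0)
      devices = np.arange(203_000)   # stand-in for the device IDs
      cookies = np.arange(217_000)   # stand-in for the cookie IDs
      n_pairs = 100_000              # however many pairs you want

      # One uniform, independent draw per side (equal sizes), then a
      # horizontal concatenation: each row ~ uniform(all device-cookie pairs).
      d = rng.choice(devices, size=n_pairs)
      c = rng.choice(cookies, size=n_pairs)
      pairs = np.column_stack([d, c])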


8. Lesson 1
    ● Define the problem well (what are the subproblems or surrogate
    problems)
    ○ Different definitions determine every step that follows.
    ○ e.g. what the "-1" label means:
    masked? unknown? unregistered? determined by user
    login?
    Different interpretations call for different sampling strategies;
    if the organizers don't say, you should try them out.
    ● Sample according to your definition
    ○ At first I didn't sample the -1s, yet later I used -1 anyway; clearly I
    neither defined the problem well nor followed my problem formulation.


9. we have data:
    ● total > 10 GB
    ● devices (143K labeled, 203M total)
    ● cookies (1.64M labeled, 217M total)
    ● 180K positives
    ● 143K * 1.64M - 180K ≈ 235 billion negatives
    ● positive : negative ~ 1 : 1.2M
    ● feature anonymous_c2: ~30K categories
    ● IP feature: > 10M categories
    ● IP properties might be a tree structure (simply ignored)


10. Dealing with imbalanced labels
    ● Ryan found an important feature that filters out most of the negatives
    ○ After filtering, positive : negative ~ 1 : 529
    ● subsampling, oversampling
    ● decision threshold
    ● and there are more..... (a toy sketch of these follows below)
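    The slide only names these tricks, so here is a hedged toy sketch of all
    three with scikit-learn; only the heavy imbalance and the idea of a tuned
    threshold come from the deck, everything else is illustrative.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      X = rng.normal(size=(5300, 5))              # toy features
      y = np.zeros(5300, dtype=int); y[:10] = 1   # heavily imbalanced labels

      # Subsample negatives to a workable ratio before training.
      pos = np.flatnonzero(y == 1); neg = np.flatnonzero(y == 0)
      keep = np.concatenate([pos, rng.choice(neg, size=len(pos) * 10, replace=False)])
      clf = LogisticRegression(class_weight='balanced').fit(X[keep], y[keep])

      # Tune the decision threshold instead of using the default 0.5.
      proba = clf.predict_proba(X)[:, 1]
      pred = (proba > 0.12).astype(int)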


11. ! This is not a learning curve; it only shows how the simulated
    competition score converges.
    Scoring a predicted cookie set against the truth:
    empty : empty = 1 point
    empty : something = 0 points
    Using a 1/10 sample (all my little laptop can fit)
    Model with rank: 1
    Parameters: {'thres__threshold': 0.12}
    validation score: 0.4 ?
    If you got 0.3x, you would rank outside the top 50%.
    Baseline (thanks to Chuyu's filtered dataset):
    just the IP Jaccard feature
    and a threshold learner (sketched below); threshold > 0 = at least some overlap.
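    The baseline's "threshold learner" is a one-parameter model: scan cutoffs
    on the single Jaccard feature and keep the best. A toy sketch, with plain
    accuracy standing in for the real competition metric:

      import numpy as np

      def fit_threshold(feature, y, candidates):
          # Pick the cutoff that maximizes the (stand-in) metric.
          return max(candidates, key=lambda t: np.mean((feature > t) == y))

      rng = np.random.default_rng(0)
      y = rng.integers(0, 2, size=1000)
      feature = np.where(y == 1, rng.uniform(0.05, 1.0, 1000),
                         rng.uniform(0.0, 0.3, 1000))
      thres = fit_threshold(feature, y, np.linspace(0, 1, 101))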


12. Second model
    Features = basics += ip0 += all combinations and polynomials (if possible...)
    A small step to avoid OOM: drop rare values (sketched below)
    anonymous_c2_d: 23,892 of its 31,689 values occur fewer than the threshold of 50 times
    Final feature count: a few tens of thousands
    Important features:
    1. ip0 Jaccard x others
    2. anonymous_c2 vs anonymous_c2
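    A minimal pandas sketch of the rare-value pruning step; the column content
    and the 'RARE' bucket are illustrative, only the count threshold of 50
    comes from the slide.

      import pandas as pd

      s = pd.Series(['a'] * 100 + ['b'] * 60 + ['c'] * 3 + ['d'])  # toy column

      counts = s.value_counts()
      rare = counts[counts < 50].index            # values seen < 50 times
      s_pruned = s.where(~s.isin(rare), 'RARE')   # collapse them into one bucket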


13. Third model
    Features after filtering by ip0
    Important features:
    1. anonymous_c2 vs anonymous_c2
    2. ip0 Jaccard x others


14. Fourth model
    Features after filtering by ip0, plus Jaccard and cosine over the six IP
    columns (no tf-idf; it should be better with it)
    Important features:
    1. IP cosine x others (count is more important)
    2. IP Jaccard (dropped out of the top 10)
    3. anonymous_c2 vs anonymous_c2 (dropped out of the top 10)


15. IP index > 1e+8
    How do you run any computation on this table? groupBy into arrays? No way.
    Build a sparse matrix with scipy:
      from scipy.sparse import csr_matrix

      def to_csr(df, data_col):
          # one row per entity index, one column per IP
          row = df['index'].values
          col = df['ip'].values
          data = df[data_col].values
          return csr_matrix((data, (row, col)))
    and do the arithmetic with scipy.
    There are six IP-information columns; compute Jaccard and cosine for each.
      ID  | IP       | count1 | ... | countN
      123 | 4345123  | 1      | ... | ...
      123 | 12345678 | 200    | ... | ...
      345 | 234      | 0      | ... | ...
      ...
    The IP table is especially interesting; you can interrogate it with any
    NLP technique (here bag of words is a perfect assumption).
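    With binary presence matrices in CSR form, both similarities reduce to a
    sparse matrix product; a sketch with tiny illustrative matrices:

      import numpy as np
      from scipy.sparse import csr_matrix

      D = csr_matrix(np.array([[1, 1, 0, 1],
                               [0, 1, 0, 0]]))   # devices x IPs
      C = csr_matrix(np.array([[1, 1, 1, 0],
                               [0, 0, 0, 1]]))   # cookies x IPs

      inter = (D @ C.T).toarray()            # pairwise intersection sizes
      d_sz = np.asarray(D.sum(axis=1))       # device set sizes, shape (n_d, 1)
      c_sz = np.asarray(C.sum(axis=1)).T     # cookie set sizes, shape (1, n_c)

      jaccard = inter / (d_sz + c_sz - inter)   # |A & B| / |A | B|
      cosine = inter / np.sqrt(d_sz * c_sz)     # cosine for binary vectors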


16. Fifth model
    features += SVD += K-means
    tf-idf was also used this time, so the result is better, but I don't know
    which change caused the gain.
    The component counts were picked arbitrarily; no time left to tune them.
    (A sketch follows below.)
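    A sketch of the extra features with scikit-learn; the matrix is a random
    stand-in, and the component/cluster counts are placeholders, matching the
    slide's admission that they weren't tuned.

      from scipy.sparse import random as sparse_random
      from sklearn.feature_extraction.text import TfidfTransformer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.cluster import KMeans

      # Toy stand-in for the entity x IP count matrix.
      X = sparse_random(100, 1000, density=0.01, format='csr', random_state=0)

      X_tfidf = TfidfTransformer().fit_transform(X)   # reweight raw counts
      X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_tfidf)
      cluster_id = KMeans(n_clusters=20, random_state=0).fit_predict(X_svd)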


17. [chart: scores of feature variants: ip0 jaccard; ip0 jaccard + categorical;
    basics + all ip; basics + numerical ip; ip + only is_same;
    ip incl. SVD + k-means + jaccard + cosine; ip - svd - kmeans; ip - kmeans]


18. [the same feature-variant chart as the previous slide]


19. Lesson 2
    ● Remember that Kaggle has two deadlines;
    the first deadline is one week before the
    competition ends.
    ● Remember that Kaggle has two deadlines;
    the first deadline is one week before the
    competition ends.
    ● Remember that Kaggle has two deadlines;
    the first deadline is one week before the
    competition ends.


  20. IPython Cluster computation


  21. IPython Cluster computation
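    These two slides are screenshots; as a minimal sketch of the idea, here is
    ipyparallel (the successor of IPython.parallel), assuming a local cluster
    already started with: ipcluster start -n 4

      import ipyparallel as ipp

      rc = ipp.Client()                 # connect to the running engines
      view = rc.load_balanced_view()

      def feature_chunk(rows):
          # placeholder for a per-chunk feature computation
          return sum(rows)

      chunks = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]
      results = view.map_sync(feature_chunk, chunks)   # fan out, gather back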


22. Obstacles:
    1. Training on my little laptop takes forever
    a. Moved to AWS (thanks to a teammate for telling me about spot instances)
    2. Spark : *@##$^%^*#%
    a. Computing the candidate training set was inexplicably slow
    b. Crashed for no apparent reason
    c. Docker ate up all my disk space...
    d. Switched to pandas
    3. pandas : $#%^$^&%&*
    a. You have to parallelize it yourself, ORZ (a sketch follows below)
    4. scikit-learn : @#$#$#$^%*&
    a. OOM
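    "Parallelize it yourself" usually means splitting the DataFrame across
    worker processes; a minimal multiprocessing sketch with a placeholder
    per-chunk function:

      import pandas as pd
      from multiprocessing import Pool

      def process(chunk):
          # placeholder per-chunk feature computation
          return chunk.assign(x2=chunk['x'] * 2)

      if __name__ == '__main__':
          df = pd.DataFrame({'x': range(10_000)})
          n = 8
          chunks = [df.iloc[i::n] for i in range(n)]   # one slice per worker
          with Pool(n) as pool:
              out = pd.concat(pool.map(process, chunks)).sort_index()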


23. Lesson 3
    ● Set up and get familiar with your personal toolbox.
    I spent most of my time exploring and learning tools --- very slow
    iteration... and the competition has a time limit.
    "Responsible Analysis and Quick Iteration Produce
    High-Performing Predictive Models" -- Nic Kridler, a Kaggle Master


24. Collaborative Data Science?
    ENV
    GitHub ? .... it wasn't created with data science in mind
    Domino .... a seriously cool Data Science Platform
    TEAM
    At first I had no idea how to do teamwork on predictive modeling.
    KDD Cup: Team work is crucial – the first four winning teams were 4 to 21
    people in size. Within a team, the most important role of every team member
    is to bring features. Often, team members start working independently not to
    get exposed to each other's features,???? and merge their results later.
    Quite often, the teams are not even planning to work together early during
    the competition – they decide to run together once they establish their

25. Collaborative Data Science?
    Open a Google Sheet where everyone can
    ● report findings
    ● promote diversity
    Everyone writes down their own approach. Listing them makes it easy to
    pursue different directions and to gather diverse approaches and viewpoints.
    Nobody is forced to be different, of course; if someone else's approach
    interests you, you can do the same thing.
    cuz Ensemble benefits from diversity!
      name | plan                    | score | findings
      a    | focus on table A, ....  | 0     | table A is garbage
      b    | use graphs, blahblah    | 0.1   | tables A+B turn into gold
      c    | use linear, 3$#$20-     | 1     | the data may be linear
      d    | visualization only      |       | link
