Slide 1

Slide 1 text

Kaggle ICDM Challenge 林煒清/Wayne

Slide 2

Slide 2 text

data

Slide 3

Slide 3 text

My imagination of the truth ? ? ?

Slide 4

Slide 4 text

There are many methods you could come up with. As a newbie, I used classification: classify a pair of nodes by whether or not they belong to the same user.

Slide 5

Slide 5 text

My imagination of what it looks like after classification ? ? ?

Slide 6

Slide 6 text

Every score came out as 99, which puzzled me... sampling!! What do you do if you want to sample 10% of the device-cookie pairs?

Slide 7

Slide 7 text

What do you do if you want to sample 10% of the device-cookie pairs? (Figure: cookies and devices as a fully connected bipartite graph.)
Sampling: the training data we want is an I.I.D. sample ~ P(device, cookie) = uniform(all possible pairs).
1. Take the Cartesian product of devices and cookies, then sample 10%.
a. Too slow.
2. Previously I sampled x% of the devices and x% of the cookies and Cartesian-producted them.
a. WRONG?
3. Take one sample from the devices and one from the cookies; you can prove the result ~ uniform(all pairs). (See the sketch below.)
a. The device sample and the cookie sample need to have equal size; then horizontally concatenate the two.
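
A minimal sketch of method 3, assuming the devices and cookies live in two pandas DataFrames (the names device_df and cookie_df are hypothetical): draw n row indices uniformly and independently on each side, then concatenate horizontally. Each pair then has probability 1/|D| * 1/|C|, i.e. uniform over the full Cartesian product, without ever materializing it.

    import numpy as np
    import pandas as pd

    def sample_uniform_pairs(device_df, cookie_df, n, seed=0):
        # Independent uniform draws on each side (with replacement) give
        # P(device, cookie) = 1/|D| * 1/|C|: uniform over all |D|*|C| pairs.
        rng = np.random.default_rng(seed)
        d_idx = rng.integers(0, len(device_df), size=n)  # n device draws
        c_idx = rng.integers(0, len(cookie_df), size=n)  # n cookie draws, equal size
        return pd.concat(
            [device_df.iloc[d_idx].reset_index(drop=True).add_prefix('device_'),
             cookie_df.iloc[c_idx].reset_index(drop=True).add_prefix('cookie_')],
            axis=1)  # horizontal concatenation of the two equal-size samples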

Slide 8

Slide 8 text

Lesson 1
● Define the problem well (what are the subproblems or surrogate problems?)
○ The definition you choose determines every step that follows.
○ e.g. what the "-1" label means: masked? unknown? unregistered? determined by user login? Different readings imply different sampling strategies; if the organizers don't say, you should try each interpretation.
● Sample according to the definition.
○ At first I did not sample -1, yet later I used -1 anyway; clearly I neither defined the problem well nor followed my own problem formulation.

Slide 9

Slide 9 text

We have data:
● total > 10 GB
● devices: 143K labeled (203M total)
● cookies: 1.64M labeled (217M total)
● 180K positives
● 143K * 1.64M - 180K ≈ 235 billion negatives
● positive : negative ~ 1 : 1.3M
● the anonymous_c2 feature has ~30K categories
● the IP feature has > 10M categories
● the IP properties may form a tree structure (ignored outright)

Slide 10

Slide 10 text

Dealing with imbalanced labels
● Ryan found an important feature that filters out most of the negatives.
○ After filtering, positive : negative ~ 1 : 529.
● subsampling, oversampling (see the sketch below)
● decision threshold
● and there are more…..
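
A hedged sketch of the subsampling item, assuming the candidate pairs sit in a DataFrame with a binary label column (the names pairs and label are my own): keep every positive and cap the negatives at a fixed ratio.

    import pandas as pd

    def downsample_negatives(pairs, ratio=10, seed=0):
        # Keep all positives; keep at most `ratio` negatives per positive.
        pos = pairs[pairs['label'] == 1]
        neg = pairs[pairs['label'] == 0]
        n_keep = min(len(neg), ratio * len(pos))
        neg_sample = neg.sample(n=n_keep, random_state=seed)
        # Shuffle so the classifier does not see all positives first.
        return pd.concat([pos, neg_sample]).sample(frac=1, random_state=seed)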

Slide 11

Slide 11 text

! This is not a learning curve; it only shows how the simulated competition score converges.
Cookie-set scoring (predicted : truth): empty : empty = 1 point; empty : something = 0 points.
Using a 1/10 sample (all my little laptop can hold).
Model with rank: 1 Parameters: {'thres__threshold': 0.12} validation score: 0.4 ?
If you got 0.3x, you would rank below the top 50%.
Baseline (thanks to Chuyu's filtered dataset): just the IP Jaccard feature and a threshold learner (sketched below); threshold > 0 = at least a nonempty intersection ?
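
The 'thres__threshold' parameter on the slide suggests a pipeline step named thres; here is a sketch of what such a threshold learner could look like as a scikit-learn estimator tuned with GridSearchCV. The class and the f1 scoring are my own illustration; the competition's actual metric is the cookie-set score described above.

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    class ThresholdLearner(BaseEstimator, ClassifierMixin):
        # Predict positive when the single similarity feature exceeds a threshold.
        def __init__(self, threshold=0.0):
            self.threshold = threshold

        def fit(self, X, y=None):
            return self  # nothing to fit; the threshold comes from grid search

        def predict(self, X):
            return (np.asarray(X)[:, 0] > self.threshold).astype(int)

    pipe = Pipeline([('thres', ThresholdLearner())])
    grid = GridSearchCV(pipe, {'thres__threshold': np.linspace(0.0, 0.5, 26)},
                        scoring='f1', cv=3)
    # grid.fit(X_jaccard, y)  # X_jaccard: one IP-Jaccard column per candidate pair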

Slide 12

Slide 12 text

Second model
Features = basics += ip0 += all combinations and polynomials (where possible...)
Small trick to avoid OOM: drop rare values (see the sketch below). anonymous_c2_d: 23,892 of its 31,689 values occur fewer than the threshold of 50 times.
Final feature count: tens of thousands.
Important features:
1. ip0 Jaccard x others
2. anonymous_c2 vs anonymous_c2
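
A sketch of the rare-value trick, assuming the column sits in a pandas Series (collapse_rare and the '__RARE__' sentinel are hypothetical names): values seen fewer than 50 times collapse into one bucket, which is what shrinks anonymous_c2_d from 31,689 values to the frequent remainder before combination features are built.

    import pandas as pd

    def collapse_rare(series, min_count=50, other='__RARE__'):
        # Map every value with fewer than min_count occurrences to one bucket,
        # so one-hot / combination features stay small enough for RAM.
        counts = series.value_counts()
        rare = counts[counts < min_count].index
        return series.where(~series.isin(rare), other)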

Slide 13

Slide 13 text

Third model
Features after filtering by ip0.
Important features:
1. anonymous_c2 vs anonymous_c2
2. ip0 Jaccard x others

Slide 14

Slide 14 text

Fourth model
Features after filtering by ip0, plus Jaccard and cosine over the six IP columns (no TF-IDF; it should do better with it).
Important features:
1. IP cosine x others (the count version matters more)
2. IP Jaccard (dropped out of the top 10)
3. anonymous_c2 vs anonymous_c2 (dropped out of the top 10)

Slide 15

Slide 15 text

IP index > 1e8. How do you run all these computations on such a table? groupBy into arrays? Forget it. Build a sparse matrix with scipy:

row = df['index'].values
col = df['ip'].values
data = df[data_col].values
return csr_matrix((data, (row, col)))

Then do the add/subtract/multiply/divide with scipy. There are six IP information columns; compute Jaccard and cosine for each of them.

ID | IP | count1 | ... | countN
123 | 4345123 | 1 | ...
123 | 12345678 | 200 | ...
345 | 234 | 0 | ...

The IP table is especially interesting: you can interrogate it with any NLP technique (here bag-of-words is a perfect assumption).
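
A sketch completing the snippet above (the shape arguments and the rowwise_* helpers are my own additions): build one entity-by-IP csr_matrix per IP column, align device rows and cookie rows by candidate pair, then compute cosine and Jaccard row by row with sparse element-wise operations.

    import numpy as np
    from scipy.sparse import csr_matrix

    def to_sparse(df, data_col, n_rows, n_ips):
        # One row per device/cookie, one column per IP id.
        row = df['index'].values
        col = df['ip'].values
        data = df[data_col].values
        return csr_matrix((data, (row, col)), shape=(n_rows, n_ips))

    def rowwise_cosine(A, B):
        # Cosine similarity between A[i] and B[i] for every candidate pair i.
        num = np.asarray(A.multiply(B).sum(axis=1)).ravel()
        na = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
        nb = np.sqrt(np.asarray(B.multiply(B).sum(axis=1)).ravel())
        return num / np.maximum(na * nb, 1e-12)

    def rowwise_jaccard(A, B):
        # Jaccard over the *sets* of IPs: binarize occurrences first.
        Ab = (A != 0).astype(int)
        Bb = (B != 0).astype(int)
        inter = np.asarray(Ab.multiply(Bb).sum(axis=1)).ravel()
        union = (np.asarray(Ab.sum(axis=1)).ravel()
                 + np.asarray(Bb.sum(axis=1)).ravel() - inter)
        return inter / np.maximum(union, 1)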

Slide 16

Slide 16 text

Fifth model
features += SVD += K-means (see the sketch below)
Used TF-IDF this time, so the score is better, but I don't know which of the changes caused the improvement.
The component counts were picked arbitrarily; no time left to tune them.
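
A hedged sketch of the fifth model's extra features, assuming X is the sparse entity-by-IP count matrix from the previous slide; the component and cluster counts below are arbitrary, exactly as the slide admits.

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfTransformer

    def ip_embeddings(X, n_components=100, n_clusters=50, seed=0):
        X_tfidf = TfidfTransformer().fit_transform(X)   # reweight raw counts
        svd = TruncatedSVD(n_components=n_components, random_state=seed)
        X_svd = svd.fit_transform(X_tfidf)              # dense low-rank features
        km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
        cluster_id = km.fit_predict(X_svd)              # cluster id as a feature
        return X_svd, cluster_id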

Slide 17

Slide 17 text

[Chart comparing feature sets. Legend: ip0 jaccard; ip0 jaccard + categorical basics; ip + all basics; ip + numerical; ip + only is_same; ip including svd + k-means + jaccard + cosine; ip - svd - kmeans; ip - kmeans]

Slide 18

Slide 18 text

[A second chart over the same feature sets, with the same legend as the previous slide.]

Slide 19

Slide 19 text

Lesson 2
● Remember: Kaggle has two deadlines, and the first deadline is one week before the competition ends.
● Remember: Kaggle has two deadlines, and the first deadline is one week before the competition ends.
● Remember: Kaggle has two deadlines, and the first deadline is one week before the competition ends.

Slide 20

Slide 20 text

IPython Cluster computation

Slide 21

Slide 21 text

IPython Cluster computation
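
These two slides only show screenshots, so here is a minimal sketch of what IPython cluster computation looks like with ipyparallel (the IPython.parallel package at the time of this competition); the featurize function and file names are hypothetical.

    # Start the engines first, e.g.:  ipcluster start -n 4
    from ipyparallel import Client

    rc = Client()                   # connect to the running cluster
    view = rc.load_balanced_view()  # dynamic load balancing across engines

    def featurize(chunk_path):
        import pandas as pd         # imports run on the engine, not the client
        df = pd.read_csv(chunk_path)
        return len(df)              # stand-in for real per-chunk feature work

    results = view.map_sync(featurize, ['part-00.csv', 'part-01.csv'])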

Slide 22

Slide 22 text

Obstacles:
1. Training on my little laptop takes forever.
a. Moved to AWS (thanks to a teammate for telling me about spot instances).
2. Spark: *@##$^%^*#%
a. Computing the candidate training set was inexplicably slow.
b. Crashed for no apparent reason.
c. Docker ate all of my disk...
d. Switched to pandas.
3. pandas: $#%^$^&%&*
a. You have to parallelize it yourself (see the sketch below), ORZ.
4. scikit-learn: @#$#$#$^%*&
a. OOM
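
The "parallelize it yourself" item usually ends up as something like this sketch (my own illustration, not the author's actual code): split the DataFrame row-wise and fan the chunks out over a multiprocessing pool.

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool

    def apply_parallel(df, func, n_jobs=4):
        # func must be a module-level (picklable) function: chunk -> result.
        idx_chunks = np.array_split(np.arange(len(df)), n_jobs)
        chunks = [df.iloc[idx] for idx in idx_chunks]
        with Pool(n_jobs) as pool:
            return pd.concat(pool.map(func, chunks))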

Slide 23

Slide 23 text

Lesson 3
● Set up and get familiar with your personal toolbox. I spent most of my time exploring and learning tools --- very slow iteration..., and a competition has a time limit.
"Responsible Analysis and Quick Iteration Produce High-Performing Predictive Models" -- Nic Kridler, a Kaggle Master

Slide 24

Slide 24 text

Collaborative Data Science?
ENV
GitHub? .... it was not created with data science in mind.
Domino .... an insanely cool Data Science Platform.
TEAM
At first I had no idea how team work was supposed to happen in predictive modeling.
KDD Cup: "Team work is crucial – the first four winning teams were 4 to 21 people in size. Within a team, the most important role of every team member is to bring features. Often, team members start working independently not to get exposed to each other's features (????), and merge their results later. Quite often, the teams are not even planning to work together early during the competition – they decide to run together once they establish their…"

Slide 25

Slide 25 text

Collaborative Data Science?
Open a Google Sheet where everyone can:
● report findings
● promote diversity
Everyone writes down their own approach. Listing them makes it easy to pursue different high-level directions and adds diverse approaches and viewpoints. Nobody is forced to be different, of course; if someone's approach interests you, you can follow it too. Because ensembles benefit from diversity!

name | what to try | score | findings
a | focus on table A, .... | 0 | table A is garbage
b | use graphs, blalalala | 0.1 | tables A+B turn into gold
c | use linear, 3$#$20- | 1 | the data may be linear
d | visualization only | (link) |