
2

superChing
October 01, 2015


Transcript

  1. There are many methods you could come up with. As a newbie, I used classification: classify whether a pair of nodes belongs to the same user or not.
  2. Sampling: the training data we want is an i.i.d. sample ~ P(device, cookie) = uniform(all possible pairs).
     Devices and cookies form a fully connected bipartite graph. What do you do if you want to sample 10% of the device-cookie pairs?
     1. Take the Cartesian product of devices and cookies, then sample 10%.
        a. Too slow.
     2. Previously I sampled x% of the devices and x% of the cookies and took their Cartesian product.
        a. WRONG?
     3. Take one sample from the devices and one from the cookies. You can prove that the result ~ uniform(all pairs).
        a. The device samples and the cookie samples need to have equal size; then horizontally concatenate the two.
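A minimal sketch of option 3, assuming devices and cookies are held as arrays of ids. The function name `sample_pairs`, its parameters, and the choice to draw with replacement are assumptions for illustration, not the author's code. Each row pairs one independently drawn device with one independently drawn cookie, so every (device, cookie) pair is equally likely; for a 10% sample, `n_pairs` would be set to 0.1 × (number of devices) × (number of cookies).

```python
import numpy as np

def sample_pairs(devices, cookies, n_pairs, seed=0):
    """Draw n_pairs (device, cookie) rows, each uniform over all pairs.

    Hypothetical helper: draws with replacement, so duplicate pairs
    can occur in the output.
    """
    rng = np.random.default_rng(seed)
    # Two independent uniform draws of equal size ...
    d = rng.choice(devices, size=n_pairs, replace=True)
    c = rng.choice(cookies, size=n_pairs, replace=True)
    # ... concatenated horizontally into an (n_pairs, 2) array.
    return np.column_stack([d, c])

pairs = sample_pairs(np.arange(5), np.arange(100, 103), n_pairs=20, seed=1)
```

This never materializes the full Cartesian product, which is what made option 1 too slow.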
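  3. Lesson 1
     • Define the problem well (what are the subproblems or surrogate problems?)
       ◦ Different definitions determine all the steps that follow.
       ◦ e.g. what the “-1” label means: masked? unknown? unregistered? determined by user login? Different interpretations call for different sampling schemes; if they don't say, you should try them out.
     • Sample according to the definition
       ◦ At first I did not sample -1, but later I did use -1. Clearly I had neither defined my problem formulation well nor followed it.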
  4. We have data:
     • total > 10G
     • devices (143K labeled) (203M total)
     • cookies (1.64M labeled) (217M total)
     • 180K positives
     • 143K * 1.64M - 180K = 235 billion negatives
     • positive : negative ~ 1 : 1.2M
     • feature anonymous_c2 has about 30,000 categories
     • the ip feature has > 10 million categories
     • ip properties may be tree-structured (simply ignored)
  5. Dealing with imbalanced labels
     • Ryan found an important feature that filters out most of the negatives
       ◦ labels after filtering: positive : negative ~ 1 : 529
     • subsampling, oversampling
     • decision threshold
     • and there are more...
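One of the options listed above, subsampling the negative class, can be sketched as follows. The helper `subsample_negatives`, the 0/1 label encoding, and the `ratio` parameter are assumptions made for illustration; the slides do not say which ratio was used.

```python
import numpy as np

def subsample_negatives(y, ratio=1.0, seed=0):
    """Return sorted row indices keeping all positives and a random
    subset of negatives, about `ratio` negatives per positive.

    Hypothetical helper; assumes labels are 1 (positive) and 0 (negative).
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # Never ask for more negatives than exist.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    keep_neg = rng.choice(neg, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))

y = np.array([1, 0, 0, 0, 0, 1, 0])
idx = subsample_negatives(y, ratio=1.0)
```

Training on `X[idx], y[idx]` changes the class prior, so the decision threshold usually needs to be tuned on data with the original class ratio.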
  6. ! This is not a learning curve; it only shows how the simulated competition score converges.
     Cookie set, predicted : truth
       empty : empty = 1 point
       empty : something = 0 points
     Uses a 1/10 sample (that is all my small laptop can hold).
     Model with rank: 1 Parameters: {'thres__threshold': 0.12} validation score: 0.4 ?
     If you got 0.3x, then you are ranked at < 50%.
     Baseline (thanks to Chuyu’s filtered dataset): just the IP Jaccard feature and a threshold learner; threshold > 0 = at least some intersection ?
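The baseline above (one IP Jaccard feature plus a learned threshold) can be sketched on plain Python sets. The function names and the idea of passing each node's IPs as an iterable are assumptions for illustration; in the deck the threshold is a tuned parameter (`thres__threshold`), and the default of 0 here only mirrors the "at least some intersection" reading.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of IPs."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets share nothing; define their similarity as 0.
    return len(a & b) / len(union) if union else 0.0

def predict_same_user(device_ips, cookie_ips, threshold=0.0):
    """Hypothetical threshold learner: positive when Jaccard > threshold."""
    return jaccard(device_ips, cookie_ips) > threshold

print(predict_same_user({"ip_a", "ip_b"}, {"ip_b", "ip_c"}))
```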
  7. Second model
     Features = basics += ip0 += all combinations and polynomial (if possible...)
     A small step to avoid OOM: drop rare values
       anonymous_c2_d : #23892 of 31689 values are < threshold 50
     The final number of features is in the tens of thousands.
     Important features:
     1. ip0 jaccard x others
     2. anonymous_c2 vs anonymous_c2
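A minimal sketch of the rare-value step with pandas. The slide only says rare values are dropped; collapsing them into a single placeholder category is an assumption made here, as are the names `collapse_rare`, `min_count`, and `"__rare__"`. The threshold of 50 comes from the slide.

```python
import pandas as pd

def collapse_rare(series, min_count=50, other="__rare__"):
    """Replace values occurring fewer than min_count times with `other`.

    Hypothetical helper: shrinks the number of categories before
    one-hot encoding, which is where memory tends to run out.
    """
    counts = series.value_counts()
    rare = counts.index[counts < min_count]
    # Keep frequent values, overwrite rare ones with the placeholder.
    return series.where(~series.isin(rare), other)

s = pd.Series(["a"] * 3 + ["b"])
print(collapse_rare(s, min_count=2).tolist())
```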
  8. Fourth model
     Features after filtering by ip0, plus Jaccard and cosine on six IP columns (no tf-idf; it should be better with it)
     Important features:
     1. ip cosine x others (count is more important)
     2. ip jaccard (dropped out of the top 10)
     3. anonymous_c2 vs anonymous_c2 (dropped out of the top 10)
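  9. ip index > e+8
     How do you run all these computations on this table? groupBy into an array? Don't be silly.
     Build a sparse matrix with scipy:
         row = df['index'].values
         col = df['ip'].values
         data = df[data_col].values
         return csr_matrix((data, (row, col)))
     Then do the arithmetic with scipy's add, subtract, multiply, and divide.
     There are six IP information columns; Jaccard and cosine are computed for each.
         ID    IP         count1   ...   countN
         123   4345123    1        ...   ...
         123   12345678   200      ...   ...
         345   234        0        ...   ...
     The IP table is especially interesting: you can interrogate it with any NLP technique (here bag of words is a perfect assumption).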
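The sparse-matrix approach above can be sketched end to end as follows. This is an illustration under stated assumptions, not the author's code: the helper names, the explicit `shape` argument (so device and cookie matrices share the same IP columns), and the small example frames are all hypothetical. The column names `index`, `ip`, and `count1` follow the slide.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse(df, data_col, n_rows, n_ips):
    """Rows = node index, columns = IP index, values = df[data_col]."""
    row = df["index"].values
    col = df["ip"].values
    data = df[data_col].values
    return csr_matrix((data, (row, col)), shape=(n_rows, n_ips))

def _row_sums(m):
    # sparse .sum(axis=1) returns a 2-D result; flatten it to 1-D
    return np.asarray(m.sum(axis=1)).ravel()

def pair_jaccard(A, B, a_idx, b_idx):
    """Jaccard of the IP sets of row a_idx[k] of A and row b_idx[k] of B."""
    Ab = A[a_idx].astype(bool).astype(np.int64)  # presence/absence only
    Bb = B[b_idx].astype(bool).astype(np.int64)
    inter = _row_sums(Ab.multiply(Bb))
    union = _row_sums(Ab) + _row_sums(Bb) - inter
    return np.divide(inter, union, out=np.zeros(len(inter)), where=union > 0)

def pair_cosine(A, B, a_idx, b_idx):
    """Cosine similarity of the count vectors of the paired rows."""
    Ar = A[a_idx].astype(np.float64)
    Br = B[b_idx].astype(np.float64)
    dot = _row_sums(Ar.multiply(Br))
    denom = np.sqrt(_row_sums(Ar.multiply(Ar))) * np.sqrt(_row_sums(Br.multiply(Br)))
    return np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)

# Tiny made-up example: 2 devices, 2 cookies, 3 distinct IPs.
dev = pd.DataFrame({"index": [0, 0, 1], "ip": [0, 1, 2], "count1": [1, 2, 1]})
ck = pd.DataFrame({"index": [0, 0, 1], "ip": [1, 2, 0], "count1": [2, 1, 3]})
D = to_sparse(dev, "count1", n_rows=2, n_ips=3)
C = to_sparse(ck, "count1", n_rows=2, n_ips=3)
pairs_d, pairs_c = np.array([0, 1]), np.array([0, 1])
jac = pair_jaccard(D, C, pairs_d, pairs_c)
cos = pair_cosine(D, C, pairs_d, pairs_c)
```

Only the rows for the candidate pairs are touched, so memory stays proportional to the number of nonzero entries rather than to the number of distinct IPs.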
  10. [Chart legend: ip0 jaccard; ip0 jaccard + categorical basics; ip + all basics; ip + numerical; ip + only is_same. Here "ip" includes svd + k-means + jaccard + cosine. Ablations: ip - svd - kmeans; ip - kmeans]
  11. [Chart legend: ip0 jaccard; ip0 jaccard + categorical basics; ip + all basics; ip + numerical; ip + only is_same. Here "ip" includes svd + k-means + jaccard + cosine. Ablations: ip - svd - kmeans; ip - kmeans]
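  12. Obstacles:
     1. Training on my small laptop takes forever
        a. Moved to AWS (thanks to a teammate for telling me about spot instances)
     2. Spark : *@##$^%^*#%
        a. Computing the candidate training set was inexplicably slow
        b. It crashed for no apparent reason
        c. Docker ate up my whole disk...
        d. Switched to pandas
     3. pandas : $#%^$^&%&*
        a. You have to parallelize it yourself, ORZ
     4. scikit-learn : @#$#$#$^%*&
        a. OOM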
  13. Collaborative Data Science?
     ENV
       GitHub ? …. It was not created with the concept of data science in mind
       Domino …. An awesome data science platform
     TEAM
       At first I had no idea how to do teamwork on predictive modeling.
       KDD Cup: Team work is crucial – the first four winning teams were 4 to 21 people in size. Within a team, the most important role of every team member is to bring features. Often, team members start working independently not to get exposed to each other’s features,???? and merge their results later. Quite often, the teams are not even planning to work together early during the competition – they decide to run together once they establish their
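  14. Collaborative Data Science?
     Open a Google Sheet, where you can
     • report findings
     • promote diversity
     Everyone writes down their own approach. Listing them makes it easy to take different broad directions and adds diversity of approaches and perspectives. Of course nobody is forced to be different; if you are interested, you can also do the same thing as someone else. Because ensembles benefit from diversity!
         name   what you want to do          score   findings
         a      focus on table A, ….         0       table A is garbage
         b      use graph, blalalala         0.1     tables A+B turn into gold
         c      use linear, 3$#$20-          1       data may be linear
         d      dedicated to visualization           link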