superChing
October 01, 2015

Transcript

  1. Kaggle ICDM Challenge
    林煒清/Wayne


  2. data


3. My imagination of the truth
    [figure: devices and cookies with unknown links, marked "?"]


4. There are many methods you could come up with.
    As a newbie, I used classification:
    classify whether two nodes (a device and a cookie) are the same user or not.


5. My imagination of what it looks like after classification
    [figure: predicted device-cookie links, marked "?"]


6. All the scores were 99.
    Puzzled... sampling!!
    What do you do if you want to sample 10% of
    device-cookie pairs?


7. Sampling training data
    What we want is an i.i.d. sample ~ P(device, cookie) = uniform(all possible pairs).
    1. Take the Cartesian product of devices and cookies, then sample 10%.
    a. Too slow.
    2. Previously I sampled x% of devices and x% of cookies and took their
    Cartesian product.
    a. WRONG?
    3. Take one sample from the devices and one from the cookies; you can prove
    that the result ~ uniform(all pairs). (See the sketch after this slide.)
    a. The device samples and the cookie samples need to have equal size;
    then horizontally concatenate the two.
    [figure: a fully connected bipartite graph between devices and cookies,
    restating the question: how do you sample 10% of device-cookie pairs?]
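    A minimal NumPy sketch of option 3 (the ID ranges and pair count here are
    stand-ins, not the real tables): drawing the two sides independently and
    uniformly, with equal sizes, yields i.i.d. draws from uniform(all pairs).

      import numpy as np

      rng = np.random.default_rng(0)
      devices = np.arange(203_000)   # stand-in for the device IDs
      cookies = np.arange(217_000)   # stand-in for the cookie IDs
      n_pairs = 100_000              # however many pairs you want

      # One uniform, independent draw per side (equal sizes), then a
      # horizontal concatenation: each row ~ uniform(all device-cookie pairs).
      d = rng.choice(devices, size=n_pairs)
      c = rng.choice(cookies, size=n_pairs)
      pairs = np.column_stack([d, c])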


8. Lesson 1
    ● Define the problem well (what are the subproblems or surrogate
    problems)
    ○ Different definitions determine every step that follows.
    ○ e.g. what the "-1" label means:
    masked? unknown? unregistered? determined by user
    login?
    Different interpretations call for different sampling strategies;
    if the organizers don't say, you should try them out.
    ● Sample according to your definition
    ○ At first I didn't sample the -1s, yet later I used -1 anyway; clearly I
    neither defined the problem well nor followed my problem formulation.


9. we have data:
    ● total > 10 GB
    ● devices (143K labeled, 203M total)
    ● cookies (1.64M labeled, 217M total)
    ● 180K positives
    ● 143K * 1.64M - 180K ≈ 235 billion negatives
    ● positive : negative ~ 1 : 1.2M
    ● feature anonymous_c2: ~30K categories
    ● IP feature: > 10M categories
    ● IP properties might be a tree structure (simply ignored)


10. Dealing with imbalanced labels
    ● Ryan found an important feature that filters out most of the negatives
    ○ After filtering, positive : negative ~ 1 : 529
    ● subsampling, oversampling
    ● decision threshold
    ● and there are more..... (a toy sketch of these follows below)
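    The slide only names these tricks, so here is a hedged toy sketch of all
    three with scikit-learn; only the heavy imbalance and the idea of a tuned
    threshold come from the deck, everything else is illustrative.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      X = rng.normal(size=(5300, 5))              # toy features
      y = np.zeros(5300, dtype=int); y[:10] = 1   # heavily imbalanced labels

      # Subsample negatives to a workable ratio before training.
      pos = np.flatnonzero(y == 1); neg = np.flatnonzero(y == 0)
      keep = np.concatenate([pos, rng.choice(neg, size=len(pos) * 10, replace=False)])
      clf = LogisticRegression(class_weight='balanced').fit(X[keep], y[keep])

      # Tune the decision threshold instead of using the default 0.5.
      proba = clf.predict_proba(X)[:, 1]
      pred = (proba > 0.12).astype(int)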


11. ! This is not a learning curve; it only shows how the simulated
    competition score converges.
    Scoring a predicted cookie set against the truth:
    empty : empty = 1 point
    empty : something = 0 points
    Using a 1/10 sample (all my little laptop can fit)
    Model with rank: 1
    Parameters: {'thres__threshold': 0.12}
    validation score: 0.4 ?
    If you got 0.3x, you would rank outside the top 50%.
    Baseline (thanks to Chuyu's filtered dataset):
    just the IP Jaccard feature
    and a threshold learner (sketched below); threshold > 0 = at least some overlap.
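    The baseline's "threshold learner" is a one-parameter model: scan cutoffs
    on the single Jaccard feature and keep the best. A toy sketch, with plain
    accuracy standing in for the real competition metric:

      import numpy as np

      def fit_threshold(feature, y, candidates):
          # Pick the cutoff that maximizes the (stand-in) metric.
          return max(candidates, key=lambda t: np.mean((feature > t) == y))

      rng = np.random.default_rng(0)
      y = rng.integers(0, 2, size=1000)
      feature = np.where(y == 1, rng.uniform(0.05, 1.0, 1000),
                         rng.uniform(0.0, 0.3, 1000))
      thres = fit_threshold(feature, y, np.linspace(0, 1, 101))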


12. Second model
    Features = basics += ip0 += all combinations and polynomials (if possible...)
    A small step to avoid OOM: drop rare values (sketched below)
    anonymous_c2_d: 23,892 of its 31,689 values occur fewer than the threshold of 50 times
    Final feature count: a few tens of thousands
    Important features:
    1. ip0 Jaccard x others
    2. anonymous_c2 vs anonymous_c2
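    A minimal pandas sketch of the rare-value pruning step; the column content
    and the 'RARE' bucket are illustrative, only the count threshold of 50
    comes from the slide.

      import pandas as pd

      s = pd.Series(['a'] * 100 + ['b'] * 60 + ['c'] * 3 + ['d'])  # toy column

      counts = s.value_counts()
      rare = counts[counts < 50].index            # values seen < 50 times
      s_pruned = s.where(~s.isin(rare), 'RARE')   # collapse them into one bucket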


13. Third model
    Features after filtering by ip0
    Important features:
    1. anonymous_c2 vs anonymous_c2
    2. ip0 Jaccard x others


14. Fourth model
    Features after filtering by ip0, plus Jaccard and cosine over the six IP
    columns (no tf-idf; it should be better with it)
    Important features:
    1. IP cosine x others (count is more important)
    2. IP Jaccard (dropped out of the top 10)
    3. anonymous_c2 vs anonymous_c2 (dropped out of the top 10)


15. IP index > 1e+8
    How do you run any computation on this table? groupBy into arrays? No way.
    Build a sparse matrix with scipy:
      from scipy.sparse import csr_matrix

      def to_csr(df, data_col):
          # one row per entity index, one column per IP
          row = df['index'].values
          col = df['ip'].values
          data = df[data_col].values
          return csr_matrix((data, (row, col)))
    and do the arithmetic with scipy.
    There are six IP-information columns; compute Jaccard and cosine for each.
      ID  | IP       | count1 | ... | countN
      123 | 4345123  | 1      | ... | ...
      123 | 12345678 | 200    | ... | ...
      345 | 234      | 0      | ... | ...
      ...
    The IP table is especially interesting; you can interrogate it with any
    NLP technique (here bag of words is a perfect assumption).
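    With binary presence matrices in CSR form, both similarities reduce to a
    sparse matrix product; a sketch with tiny illustrative matrices:

      import numpy as np
      from scipy.sparse import csr_matrix

      D = csr_matrix(np.array([[1, 1, 0, 1],
                               [0, 1, 0, 0]]))   # devices x IPs
      C = csr_matrix(np.array([[1, 1, 1, 0],
                               [0, 0, 0, 1]]))   # cookies x IPs

      inter = (D @ C.T).toarray()            # pairwise intersection sizes
      d_sz = np.asarray(D.sum(axis=1))       # device set sizes, shape (n_d, 1)
      c_sz = np.asarray(C.sum(axis=1)).T     # cookie set sizes, shape (1, n_c)

      jaccard = inter / (d_sz + c_sz - inter)   # |A & B| / |A | B|
      cosine = inter / np.sqrt(d_sz * c_sz)     # cosine for binary vectors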


16. Fifth model
    features += SVD += K-means
    tf-idf was also used this time, so the result is better, but I don't know
    which change caused the gain.
    The component counts were picked arbitrarily; no time left to tune them.
    (A sketch follows below.)
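    A sketch of the extra features with scikit-learn; the matrix is a random
    stand-in, and the component/cluster counts are placeholders, matching the
    slide's admission that they weren't tuned.

      from scipy.sparse import random as sparse_random
      from sklearn.feature_extraction.text import TfidfTransformer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.cluster import KMeans

      # Toy stand-in for the entity x IP count matrix.
      X = sparse_random(100, 1000, density=0.01, format='csr', random_state=0)

      X_tfidf = TfidfTransformer().fit_transform(X)   # reweight raw counts
      X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_tfidf)
      cluster_id = KMeans(n_clusters=20, random_state=0).fit_predict(X_svd)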


17. [chart: scores of feature variants: ip0 jaccard; ip0 jaccard + categorical;
    basics + all ip; basics + numerical ip; ip + only is_same;
    ip incl. SVD + k-means + jaccard + cosine; ip - svd - kmeans; ip - kmeans]


18. [the same feature-variant chart as the previous slide]


19. Lesson 2
    ● Remember that Kaggle has two deadlines;
    the first deadline is one week before the
    competition ends.
    ● Remember that Kaggle has two deadlines;
    the first deadline is one week before the
    competition ends.
    ● Remember that Kaggle has two deadlines;
    the first deadline is one week before the
    competition ends.


  20. IPython Cluster computation


  21. IPython Cluster computation
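    These two slides are screenshots; as a minimal sketch of the idea, here is
    ipyparallel (the successor of IPython.parallel), assuming a local cluster
    already started with: ipcluster start -n 4

      import ipyparallel as ipp

      rc = ipp.Client()                 # connect to the running engines
      view = rc.load_balanced_view()

      def feature_chunk(rows):
          # placeholder for a per-chunk feature computation
          return sum(rows)

      chunks = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]
      results = view.map_sync(feature_chunk, chunks)   # fan out, gather back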


22. Obstacles:
    1. Training on my little laptop takes forever
    a. Moved to AWS (thanks to a teammate for telling me about spot instances)
    2. Spark : *@##$^%^*#%
    a. Computing the candidate training set was inexplicably slow
    b. Crashed for no apparent reason
    c. Docker ate up all my disk space...
    d. Switched to pandas
    3. pandas : $#%^$^&%&*
    a. You have to parallelize it yourself, ORZ (a sketch follows below)
    4. scikit-learn : @#$#$#$^%*&
    a. OOM
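    "Parallelize it yourself" usually means splitting the DataFrame across
    worker processes; a minimal multiprocessing sketch with a placeholder
    per-chunk function:

      import pandas as pd
      from multiprocessing import Pool

      def process(chunk):
          # placeholder per-chunk feature computation
          return chunk.assign(x2=chunk['x'] * 2)

      if __name__ == '__main__':
          df = pd.DataFrame({'x': range(10_000)})
          n = 8
          chunks = [df.iloc[i::n] for i in range(n)]   # one slice per worker
          with Pool(n) as pool:
              out = pd.concat(pool.map(process, chunks)).sort_index()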


23. Lesson 3
    ● Set up and get familiar with your personal toolbox.
    I spent most of my time exploring and learning tools --- very slow
    iteration... and the competition has a time limit.
    "Responsible Analysis and Quick Iteration Produce
    High-Performing Predictive Models" -- Nic Kridler, a Kaggle Master


24. Collaborative Data Science?
    ENV
    GitHub ? .... it wasn't created with data science in mind
    Domino .... a seriously cool Data Science Platform
    TEAM
    At first I had no idea how to do teamwork on predictive modeling.
    KDD Cup: Team work is crucial – the first four winning teams were 4 to 21
    people in size. Within a team, the most important role of every team member
    is to bring features. Often, team members start working independently not to
    get exposed to each other's features,???? and merge their results later.
    Quite often, the teams are not even planning to work together early during
    the competition – they decide to run together once they establish their

25. Collaborative Data Science?
    Open a Google Sheet where everyone can
    ● report findings
    ● promote diversity
    Everyone writes down their own approach. Listing them makes it easy to
    pursue different directions and to gather diverse approaches and viewpoints.
    Nobody is forced to be different, of course; if someone else's approach
    interests you, you can do the same thing.
    cuz Ensemble benefits from diversity!
      name | plan                    | score | findings
      a    | focus on table A, ....  | 0     | table A is garbage
      b    | use graphs, blahblah    | 0.1   | tables A+B turn into gold
      c    | use linear, 3$#$20-     | 1     | the data may be linear
      d    | visualization only      |       | link
