Cookies and devices form a fully connected bipartite graph. What do you do if you want to sample 10% of the device-cookie pairs, i.e. draw from uniform(all possible pairs)?
1. Take the Cartesian product of devices and cookies, then sample 10%.
   a. Too slow.
2. Previously I sampled x% of the devices and x% of the cookies and took their Cartesian product.
   a. WRONG?
3. Take one sample from the devices and one from the cookies. You can prove that the result ~ uniform(all pairs).
   a. The device sample and the cookie sample need to have equal size; then horizontally concatenate the two.
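Option 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sample_pairs`, the column names, and the toy IDs are made up here, and sampling is done with replacement (the slide does not say which), so duplicate pairs can occur.

```python
import numpy as np
import pandas as pd

def sample_pairs(devices, cookies, frac=0.10, seed=0):
    """Draw about frac * |devices| * |cookies| pairs, each uniform over all pairs.

    Assumption: sampling is with replacement, so the same pair may repeat.
    """
    rng = np.random.default_rng(seed)
    devices = np.asarray(devices)
    cookies = np.asarray(cookies)
    n_pairs = int(round(frac * len(devices) * len(cookies)))
    # Two independent samples of equal size, one per side of the graph.
    dev = rng.choice(devices, size=n_pairs, replace=True)
    coo = rng.choice(cookies, size=n_pairs, replace=True)
    # Horizontal concatenation: row i pairs dev[i] with coo[i].
    return pd.DataFrame({"device": dev, "cookie": coo})

# Toy IDs, purely illustrative.
pairs = sample_pairs(["d1", "d2", "d3", "d4"],
                     ["c1", "c2", "c3", "c4", "c5"], frac=0.5)
```

Because the device and the cookie in each row are drawn independently and uniformly, every one of the |devices| x |cookies| pairs is equally likely in each row, and the full product is never materialized. By contrast, option 2 covers only about x% of x% of the pairs (1% for x = 10), and the pairs it returns share endpoints.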
= 1 point
empty: something = 0 points
Used a 1/10 sample (all that fits on my small laptop).
Model with rank: 1
Parameters: {'thres__threshold': 0.12}
validation score: 0.4 ?
If you got 0.3x, then you rank at < 50%.
Baseline (thanks to Chuyu’s filtered dataset): just the IP Jaccard feature and a threshold learner. threshold > 0 = the two sides share at least one IP?
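A threshold learner over a single feature can be sketched as below. This is not the code behind the reported score: the helper names, the candidate grid, the toy data, and the choice of F1 as the selection metric are all assumptions made for illustration.

```python
import numpy as np

def f1(y_true, y_pred):
    """F1 score for binary labels (assumed metric; the slide does not name one)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def fit_threshold(scores, y_true, candidates):
    """Return the candidate threshold with the best F1 and that F1."""
    best_t, best_f = None, -1.0
    for t in candidates:
        # Predict "same user" when the feature is strictly above the threshold.
        f = f1(y_true, (scores > t).astype(int))
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy IP Jaccard values and labels, purely illustrative.
jaccard = np.array([0.0, 0.05, 0.10, 0.20, 0.50, 0.0])
labels = np.array([0, 0, 0, 1, 1, 0])
threshold, score = fit_threshold(jaccard, labels, candidates=[0.0, 0.12, 0.3])
```

With a candidate of 0.0 the rule reduces to "the two IP sets intersect", which is the reading of threshold > 0 given above.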
(No tf-idf; results should be better with it.)
Important features:
1. ip cosine x others (count is more important)
2. ip jaccard (dropped out of the top 10)
3. anonymous_c2 vs anonymous_c2 (dropped out of the top 10)
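matrix
    row = df['index'].values
    col = df['ip'].values
    data = df[data_col].values
    return csr_matrix((data, (row, col)))
Compute with scipy's sparse arithmetic (add, subtract, multiply, divide).
There are six IP information columns; compute Jaccard and cosine for each.

ID     IP          count1   ...   countN
123    4345123     1        ...   ...
123    12345678    200      ...
345    234         0        ...   ...

The IP table is especially interesting: you can interrogate it with any NLP technique (bag of words is a perfect assumption here).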
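The Jaccard and cosine features can be computed row by row with sparse arithmetic, roughly as follows. This is a sketch under assumptions: the helper names, the explicit `shape` argument (added so both matrices line up), the non-negative counts, and the toy frames are not from the original code.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse(df, data_col, shape):
    """Rows = pair index, columns = IP id, values = one count column."""
    row = df["index"].values
    col = df["ip"].values
    data = df[data_col].values
    return csr_matrix((data, (row, col)), shape=shape)

def _row_sum(m):
    return np.asarray(m.sum(axis=1), dtype=float).ravel()

def rowwise_cosine(a, b):
    dot = _row_sum(a.multiply(b))
    denom = np.sqrt(_row_sum(a.multiply(a))) * np.sqrt(_row_sum(b.multiply(b)))
    return np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)

def rowwise_jaccard(a, b):
    # Binarize: only whether an IP was seen matters (assumes counts >= 0).
    a_bin, b_bin = a.sign(), b.sign()
    inter = _row_sum(a_bin.multiply(b_bin))
    union = _row_sum(a_bin) + _row_sum(b_bin) - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

# Toy data: two candidate pairs, IP ids in 0..5, one count column.
dev = pd.DataFrame({"index": [0, 0, 1], "ip": [1, 2, 3], "count1": [1, 200, 5]})
coo = pd.DataFrame({"index": [0, 1, 1], "ip": [2, 3, 4], "count1": [3, 4, 3]})
A = to_sparse(dev, "count1", shape=(2, 6))
B = to_sparse(coo, "count1", shape=(2, 6))
jac = rowwise_jaccard(A, B)
cos = rowwise_cosine(A, B)
```

Each row is treated as a bag of IPs, which is why the bag-of-words view fits: cosine uses the counts, while Jaccard only looks at which IPs appear on both sides.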
me about spot instance)
2. Spark: *@##$^%^*#%
   a. Computing the candidate training set was inexplicably slow
   b. Crashed for no apparent reason
   c. Docker ate up my disk...
   d. Switched to pandas
3. pandas: $#%^$^&%&*
   a. You have to parallelize it yourself, ORZ
4. scikit-learn: @#$#$#$^%*&
   a. OOM
…. The awesome Data Science Platform TEAM
At first I had no idea how to do team work in predictive modeling.
KDD Cup: Team work is crucial – the first four winning teams were 4 to 21 people in size. Within a team, the most important role of every team member is to bring features. Often, team members start working independently not to get exposed to each other’s features,???? and merge their results later. Quite often, the teams are not even planning to work together early during the competition – they decide to run together once they establish their
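diversity
Everyone writes down their own approach. Listing them makes it easy to take different broad directions and brings in more varied methods and points of view. Nobody is forced to be different, of course; if you are interested, you can also do the same thing as someone else.
Because an ensemble benefits from diversity!

name   what they want to do            score   findings
a      focus on table A, ….            0       table A is garbage
b      use graph, blalalala            0.1     tables A+B turn into gold
c      use linear, 3$#$20-             1       data may be linear
d      dedicated to visualization              link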
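The claim that an ensemble benefits from diversity can be checked with a small simulation. This is a toy illustration, not part of the team's pipeline: the error model (a shared component plus an independent one, mixed by a correlation `rho`), the function name, and all numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=10_000)

def blend_mse(rho, n_models=4):
    """MSE of one model and of the plain average of n_models models.

    Each model's error has unit variance; any two models' errors have
    correlation rho (higher rho = less diverse team).
    """
    shared = rng.normal(size=truth.shape)
    preds = []
    for _ in range(n_models):
        own = rng.normal(size=truth.shape)
        err = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
        preds.append(truth + err)
    single = np.mean((preds[0] - truth) ** 2)
    blended = np.mean((np.mean(preds, axis=0) - truth) ** 2)
    return single, blended

similar_single, similar_blend = blend_mse(rho=0.9)   # everyone does the same thing
diverse_single, diverse_blend = blend_mse(rho=0.1)   # different directions
```

Under this error model the averaged error variance is rho + (1 - rho) / n_models, so with four models it drops from about 0.93 at rho = 0.9 to about 0.33 at rho = 0.1, even though each individual model is equally good in both cases.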