
Python Scikit : Feature Selection

KMKLabs
November 17, 2015


Feature selection is a method for finding which features of a dataset matter most, especially when building models such as classifiers or regressors. It is particularly useful for data with many dimensions (hundreds or even thousands of features). With feature selection we can reduce the computational cost of working with the data, in terms of both space and time.
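As a quick illustration of the idea (not the deck's own experiment), here is a minimal scikit-learn sketch that keeps only the 10 highest-scoring of 100 synthetic columns; the placeholder dataset and the ANOVA F-test scorer are assumptions, not anything taken from the deck.

    # Minimal illustration: reduce 100 synthetic columns to the 10 highest-scoring ones.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Placeholder data: 1000 samples, 100 features, only 5 of them informative
    X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, random_state=0)

    # Keep the 10 features with the highest ANOVA F-scores
    selector = SelectKBest(score_func=f_classif, k=10)
    X_reduced = selector.fit_transform(X, y)

    print(X.shape, "->", X_reduced.shape)  # (1000, 100) -> (1000, 10)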


Transcript

  1. Problems? • Many data → many features • Many features → high-dimensional data
  2. Example : Twitter 1. Status (23): ID, text, author, created_at, geo, retweet_count, ...
     2. User (35): ID, screen_name, name, followers_count, time_zone, … High dimensionality
     leads to the curse of dimensionality.
  3. Curse of Dimensionality • Higher cost • Sparse data: bad for statistical methods
     • Problems with sampling • ...etc
  4. Experiment (2) Algorithm: Naive Bayes (Bernoulli) with 10-fold cross validation.
     Feature selection: based on the coefficients of a linear model (a sketch of this
     setup appears after the transcript).
  5. Experiment (3) Data: 1250. Features (22): status_id, author_id, retweet_count,
     geo, status_day, status_hour, place, in_reply_to_status_id, in_reply_to_user_id,
     verified, followers_count, friends_count, protected, location, statuses_count,
     geo_enabled, lang, favourites_count, listed_count, author_day, author_hour, time_zone
  6. Result

     W0 (Error)  #Features  Accuracy (%)
     1.0         18         73.86
     0.5         17         73.86
     0.1         12         72.50
     0.05         8         72.50
     0.01         4         72.34

     Original (all 22 features, no selection): 73.38

     (A sketch of how such a W0 sweep could be run appears after the transcript.)
  7. Result Without feature selection (22 features): accuracy 73.38 %. With feature
     selection (17 features): accuracy 73.86 %.
  8. Significant Features friends_count, protected, location, statuses_count, geo_enabled,
     lang, favourites_count, listed_count, author_day, author_hour, time_zone, status_id,
     author_id, retweet_count, geo, status_day, status_hour, place, in_reply_to_status_id,
     in_reply_to_user_id, verified, followers_count
  9. Conclusion A smaller number of dimensions doesn't always mean higher accuracy.
     In this experiment, which features were actually selected can't be known (yet).
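A minimal sketch of the setup described in slide 4, under stated assumptions: the deck only names Bernoulli Naive Bayes, 10-fold cross validation, and selection based on the coefficients of a linear model, so the L1-penalized LinearSVC wrapped in SelectFromModel and the synthetic stand-in data (1250 samples, 22 features, mirroring slide 5) are assumptions rather than the deck's actual code.

    # Sketch only: the linear model (L1-penalized LinearSVC) and the synthetic
    # data are assumptions; the deck does not publish its code or dataset.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.svm import LinearSVC

    # Stand-in data with the same shape as slide 5: 1250 samples, 22 features
    X, y = make_classification(n_samples=1250, n_features=22, n_informative=8, random_state=0)

    # Keep the features whose coefficient in the sparse linear model is non-zero
    linear_model = LinearSVC(C=0.5, penalty="l1", dual=False, max_iter=5000)
    selector = SelectFromModel(linear_model)
    X_selected = selector.fit_transform(X, y)
    print("features kept:", X_selected.shape[1])

    # Score Bernoulli Naive Bayes on the reduced data with 10-fold cross validation
    scores = cross_val_score(BernoulliNB(), X_selected, y, cv=10)
    print("mean accuracy: %.2f%%" % (100 * scores.mean()))

SelectFromModel keeps the columns whose fitted coefficient magnitude exceeds its threshold (effectively the non-zero coefficients for an L1-penalized model), which matches the "based on coefficients of a linear model" criterion named in the slide.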
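The table in slide 6 can be read as a sweep over W0. One guess at what W0 means, given the label "(Error)" and the fact that more features survive at larger values, is the error/regularization parameter C of the L1-penalized linear model; that reading is an assumption, and the synthetic data means the feature counts and accuracies below will not reproduce the deck's numbers.

    # Hedged sketch of a W0 sweep; treating W0 as the C parameter of an
    # L1-penalized LinearSVC is an assumption, not something the deck states.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1250, n_features=22, n_informative=8, random_state=0)

    print("W0     #features  accuracy(%)")
    for w0 in [1.0, 0.5, 0.1, 0.05, 0.01]:
        # Smaller C means stronger L1 regularization, hence fewer non-zero coefficients
        selector = SelectFromModel(LinearSVC(C=w0, penalty="l1", dual=False, max_iter=5000))
        X_sel = selector.fit_transform(X, y)
        if X_sel.shape[1] == 0:
            print("%-6s %9d  (no features survive)" % (w0, 0))
            continue
        acc = cross_val_score(BernoulliNB(), X_sel, y, cv=10).mean()
        print("%-6s %9d  %.2f" % (w0, X_sel.shape[1], acc * 100))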