Hackathon of Pixnet.net 2015

Erica Li
October 19, 2015

Our result for spam detection.

Transcript

  1. Spam Detection @ PIXNET Hackathon 2015

    Members: Erical, Gil, Ryan, Wayne, Chuyu, Bryan. We’re Bang! Bang!! Bang!!!
  2. Step 01, Input: PIXNET data.

    Step 02, Preprocessing: dropping HTML tags, word replacement, stemming.
    Step 03, Feature Extraction.
    Step 04, Ensemble Selection.
    Step 05, Output: submission.
    Features listed on the slide: action time, action type ratio, article links, article HTML tags.
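The preprocessing step named on this slide (dropping HTML tags, word replacement, stemming) might look roughly like the sketch below. The slides do not show the team's code, so everything here is an assumption: the `<URL>` placeholder token, the helper names, and especially `crude_stem`, which is only a stand-in for a real stemmer.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def drop_html_tags(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.parts).split())


# Hypothetical word replacement: collapse every URL into one token.
URL_RE = re.compile(r"https?://\S+")


def replace_words(text):
    return URL_RE.sub("<URL>", text)


def crude_stem(word):
    # Stand-in for a real stemmer: strips a few common English suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    text = replace_words(drop_html_tags(html))
    return [crude_stem(tok.lower()) for tok in text.split()]
```

Splitting on whitespace only suits space-delimited text; Chinese articles would need a word segmenter in its place.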
  3. Venn diagram: all domains in spam, all domains in non-spam, and their intersection. Values shown: 68.3% (7,928), 1.5% (170), 30.2% (3,505).

    Our observation: the proportion of URLs that recur across spam and non-spam articles is used to tell the two apart.
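A domain-overlap breakdown of the kind this slide reports could be computed as in the following sketch. The function names and the idea of passing two plain lists of URLs are assumptions made for illustration; the slide does not say how the domains were extracted.

```python
from urllib.parse import urlparse


def domains(urls):
    """Return the set of host names found in an iterable of URLs."""
    hosts = (urlparse(u).netloc.lower() for u in urls)
    return {h for h in hosts if h}


def overlap_report(spam_urls, non_spam_urls):
    """Count and share of domains seen only in spam, in both, and only in non-spam."""
    spam, ham = domains(spam_urls), domains(non_spam_urls)
    total = len(spam | ham)
    groups = {
        "spam_only": spam - ham,
        "both": spam & ham,
        "non_spam_only": ham - spam,
    }
    return {
        name: (len(group), len(group) / total if total else 0.0)
        for name, group in groups.items()
    }
```

A small intersection means that most domains occur on only one side, which is what makes a domain a usable signal.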
  4. Our observation: differences in article features can be used to identify spam users.

    Features: url, style tag, hits, length of article.
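The four features named on this slide could be derived per article along these lines. This is a sketch under assumptions: "style tag" is read here as `<style>` elements plus inline `style=` attributes, length is counted in non-whitespace characters after tag removal, and `hits` is taken as an already-known page-view count. None of these definitions is confirmed by the slides.

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
STYLE_RE = re.compile(r"<style\b|\bstyle\s*=", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def article_features(html, hits):
    """Build a small feature dict for one article's raw HTML."""
    text = TAG_RE.sub(" ", html)  # visible text with tags blanked out
    return {
        "url": len(URL_RE.findall(html)),
        "style_tag": len(STYLE_RE.findall(html)),
        "hits": hits,
        # Character count (ignoring whitespace) also works for Chinese text.
        "length": len("".join(text.split())),
    }
```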
  5. Our observation: spam users are most active at lunch and dinner time.

    Panels: Normal User, Spam User.
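One way to turn the time-of-day observation into a feature is an hourly activity profile per user, as sketched below. The ISO-style timestamp strings and the particular mealtime hours are assumptions chosen for illustration, not values taken from the slides.

```python
from collections import Counter
from datetime import datetime


def hourly_profile(timestamps):
    """Fraction of a user's actions falling in each hour of the day (0-23)."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts.get(h, 0) / total if total else 0.0 for h in range(24)]


def mealtime_share(timestamps, hours=(11, 12, 13, 18, 19, 20)):
    """Share of actions inside the (hypothetical) lunch and dinner hours."""
    profile = hourly_profile(timestamps)
    return sum(profile[h] for h in hours)
```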
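  6. Observation: use the gap between the spam and non-spam duplicate-count distributions to improve discriminative power. Panels: highest duplicate-article counts for spam; highest duplicate-article counts for non-spam.

    Idea: spam often copies other people's articles, so the number of duplicates of an article in the data may be a feature for identifying spam. We first used the simplest built-in hash to bucket the articles and counted the hits in every bucket, which gave us the duplicate count for each article.

    Result: on comparison, this feature appears to have some discriminative power; the duplicate-count distributions of spam and non-spam differ slightly. Using locality hashing to add clustering ability might improve discrimination further.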
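The bucket-and-count idea described on this slide can be sketched with Python's built-in `hash`. The whitespace normalisation is an added assumption, and exact hashing only catches identical copies; near-duplicates would need the locality hashing the slide mentions.

```python
from collections import Counter


def duplicate_counts(articles):
    """Map each article to the number of articles sharing its hash bucket."""
    # Whitespace is normalised first so trivial reformatting still collides.
    keys = [hash(" ".join(a.split())) for a in articles]
    buckets = Counter(keys)
    return [buckets[k] for k in keys]
```

The built-in string hash is randomised per Python process, so the bucket keys should not be stored and compared across runs.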
  7. Observation: spam users' actions follow fixed patterns at particular times.

    The order of spam actions shows some fixed patterns. The figure on the left shows the action records at the start of the month for spam users No. 94 to No. 96.
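A fixed ordering of actions can be measured by counting n-grams over a user's action log, as in this sketch. The function names, the choice of n = 3, and the "rigidity" score are assumptions made for illustration; the slides only show the pattern in a figure.

```python
from collections import Counter


def action_ngrams(actions, n=3):
    """Count every run of n consecutive actions in one user's action log."""
    return Counter(tuple(actions[i:i + n]) for i in range(len(actions) - n + 1))


def pattern_rigidity(actions, n=3):
    """Share of all n-grams taken up by the single most common one."""
    grams = action_ngrams(actions, n)
    total = sum(grams.values())
    return max(grams.values()) / total if total else 0.0
```

A user who repeats the same short sequence scores high, while a varied log scores low.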
  8. Observation: titles commonly used by spam users.

  9. Use multiple models to vote on spam users.

    Data input, then Model 1, Model 2 and Model 3, then the voting model, then the result.
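The voting stage on this slide could be as simple as a per-user majority vote over the three models' labels. The sketch below assumes that each model outputs a list of 0/1 labels in the same user order. The slides do not say whether the vote was weighted, so plain majority voting is an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one list of 0/1 labels per model, all of equal length."""
    voted = []
    for labels in zip(*predictions):
        # most_common(1) returns the label chosen by the most models.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

With three models and binary labels a tie cannot occur, so no tie-breaking rule is needed here.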