
Sotoy Spam Detection White Paper

Parlinggoman Hasibuan

May 19, 2016

Transcript

  1. Outline • Introduction ◦ About Liputan6.com ◦ About vidio.com ◦

    UGC content • Related Work • Our work ◦ Methodology ◦ Results ◦ Implementation • Conclusions ◦ Contributions
  2. Introduction (liputan6.com) 1. A product of KMKLabs (member of the Emtek Group)

    2. A news site covering news in Indonesia 3. The No. 2 online news portal in Indonesia 4. Started in August 2014 5. Around 3.5 million website visitors per day 6. 1.2 million articles in total 7. An average of 550 comments per day
  3. Introduction (vidio.com) 1. A product of KMKLabs (member of the Emtek Group)

    2. A video-sharing platform focused on local content creators 3. The No. 1 local video-sharing site in Indonesia 4. Started in August 2014 5. Around 1 million visitors per day 6. 300 thousand videos in total 7. An average of 550 video uploads per day 8. An average of 200 comments per day
  4. UGC Content • Every day we receive roughly ±1,000 pieces of content from

    users • Nobody curates this content • Content administrators manually check comments and videos for spam
  5. Related Work 1. M. Taufiq Nuruzzaman et al.: Independent

    and Personal SMS Spam Filtering. Feature: word occurrence. Classifiers: Naive Bayes, SVM. 2. Tiago A. Almeida et al.: Contributions to the Study of SMS Spam Filtering: New Collection and Results. Feature: word count. Classifiers: Naive Bayes, k-Nearest Neighbors, Decision Tree, SVM. 3. Sin-Eon Kim et al.: SMS Spam Filtering Using Keyword Frequency Ratio. Feature: frequency ratio. Classifiers: Naive Bayes, Decision Tree, Logistic Regression
  6. Naive Bayes: compare P(C1 | F1, F2, …, Fn) with P(C2 | F1, F2, …, Fn)
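
For reference, the slide's comparison is the standard Naive Bayes decision rule; a minimal write-up (notation mine, with the usual conditional-independence assumption so that the shared evidence term cancels):

```latex
% Choose the class (spam or ham) with the larger posterior probability.
\hat{C} = \arg\max_{k \in \{1,2\}} P(C_k \mid F_1, \ldots, F_n)
        = \arg\max_{k \in \{1,2\}} P(C_k) \prod_{i=1}^{n} P(F_i \mid C_k)
```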
  7. Dataset 1. Comments collected from liputan6.com and video metadata

    (titles and descriptions) from vidio.com 2. All data were collected before September 2015 3. Composition of the 284,976 items: a. 135,160 (47.45%) are spam (labeled 0) b. 148,816 (52.55%) are ham (labeled 1) 4. The dataset is available to the public on request
  8. Dataset Cleaning Steps 1. Convert each word to

    hexadecimal and remove duplicate hex sequences 2. Use regex to strip \n and \r 3. Convert extended Latin characters to plain UTF-8 equivalents (the slide shows a mapping from accented Latin-1 characters such as Á À Ã Ç É Ñ Ö Ü to their plain counterparts A A A C E N O U)
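
A minimal sketch of this cleaning step in Python (function names and the use of Unicode NFKD normalization are my assumptions; the deck does not show code):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Step 2: strip \n and \r.
    text = re.sub(r"[\r\n]+", " ", text)
    # Step 3: map accented Latin characters to plain equivalents.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

def drop_duplicates(comments):
    # Approximation of step 1: represent each comment as hex and keep
    # only the first occurrence of each hex key.
    seen, unique = set(), []
    for c in comments:
        key = c.encode("utf-8").hex()
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```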
  9. Dataset Cleaning Steps 4. Cormack [1]

    and Zhang et al. [2] report that stemming and stop-word removal can degrade classifier performance, so this study compares the results obtained with and without stemming and stop-word removal 5. The stop words used are the stop-word list from Tala [3] [1] Gordon V. Cormack. 2008. Email Spam Filtering: A Systematic Review. Found. Trends Inf. Retr. 1, 4 (April 2008), 335-455. DOI=http://dx.doi.org/10.1561/1500000006 [2] Le Zhang, Jingbo Zhu, and Tianshun Yao. 2004. An Evaluation of Statistical Spam Filtering Techniques. ACM Trans. Asian Lang. Inf. Process. 3, 4 (December 2004), 243-269. DOI=http://dx.doi.org/10.1145/1039621.1039625 [3] Tala, F. 2003. A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia. M.S. thesis, University of Amsterdam
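
A sketch of the stop-word removal and stemming being compared, assuming the Tala list is stored locally as tala_stopwords.txt (hypothetical filename) and using PySastrawi as one possible Indonesian stemmer (the deck does not name a stemmer):

```python
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Tala stop-word list [3], assumed to be a plain-text file, one word per line.
with open("tala_stopwords.txt", encoding="utf-8") as f:
    STOPWORDS = {line.strip() for line in f if line.strip()}

stemmer = StemmerFactory().create_stemmer()

def remove_stopwords_and_stem(text: str) -> str:
    # Drop Tala stop words, then stem the remaining Indonesian tokens.
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return stemmer.stem(" ".join(tokens))
```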
  10. Feature Extraction 1. Using Term Frequency and Bi-Grams as features

    2. The cleaned Bi-Gram feature set contains 731,874 features 3. The uncleaned Bi-Gram feature set contains 900,927 features
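
A sketch of this feature extraction using scikit-learn's CountVectorizer (library choice is mine; the deck does not name one). Unigram and bigram term frequencies correspond to the Term Frequency + Bi-Gram features above:

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "klik link ini untuk hadiah gratis",  # toy examples; the real corpus comes
    "berita yang sangat menarik",         # from liputan6.com and vidio.com
]

# Term-frequency counts over unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(comments)
print(X.shape)  # (n_documents, n_features); ~731k-900k features on the full corpus
```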
  11. Evaluation • We split the data into 70% training data

    (199,483 contents) and 30% test data (85,943 contents). • No technique is used to reduce the dimensionality of the training data. • To compare the results we use Accuracy (A), Recall (R), Precision (P), F1-Score (F1), and the Matthews Correlation Coefficient (MCC) [1]. [1] Tiago A. Almeida, Akebo Yamakami, and Jurandy Almeida. 2009. Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters. In Proceedings of the 2009 International Conference on Machine Learning and Applications (ICMLA '09). IEEE Computer Society, Washington, DC, USA, 517-522. DOI=http://dx.doi.org/10.1109/ICMLA.2009.22
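
A sketch of the 70/30 split and the five measures, continuing from the feature-extraction sketch above (X is the term-frequency matrix, y the 0/1 spam/ham labels); again only an illustration with scikit-learn, not the authors' code:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, matthews_corrcoef)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

def evaluate(model, X_test, y_test):
    # The five measures from the slide: A, R, P, F1, MCC.
    y_pred = model.predict(X_test)
    return {
        "A":   accuracy_score(y_test, y_pred),
        "R":   recall_score(y_test, y_pred),
        "P":   precision_score(y_test, y_pred),
        "F1":  f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred),
    }
```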
  12. Comparing Different Algorithms Table 1. Algorithms used in these

    experiments. Classifiers: Linear SVM (SVM), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), Multinomial Naive Bayes (MN-NB)
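
A sketch instantiating the five classifiers from Table 1 with scikit-learn defaults (hyperparameters are not given in the deck), reusing the evaluate helper and train/test split from the previous sketch:

```python
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

classifiers = {
    "SVM":   LinearSVC(),
    "RF":    RandomForestClassifier(),
    "LR":    LogisticRegression(max_iter=1000),
    "DT":    DecisionTreeClassifier(),
    "MN-NB": MultinomialNB(),
}

# Fit each classifier on the training split and collect A, R, P, F1, MCC.
results = {name: evaluate(clf.fit(X_train, y_train), X_test, y_test)
           for name, clf in classifiers.items()}
```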
  13. Comparing Different Algorithms Table 2. Results of the classifiers using

    Term Frequency and Bi-Gram features, with stop-word cleaning and stemming:
    Classifiers   A%      R%      P%      F1%     MCC
    LR            90.77   90.58   91.75   91.16   0.82
    SVM           90.74   90.06   92.15   91.09   0.81
    MN-NB         88.14   87.52   89.67   88.58   0.76
    RF            90.73   90.16   92.05   91.10   0.81
    DT            89.85   86.83   92.48   90.10   0.80
  15. Comparing Different Algorithms Table 3. Results of the classifiers using

    Term Frequency and Bi-Gram features, without stop-word cleaning and stemming:
    Classifiers   A%      R%      P%      F1%     MCC
    LR            90.76   90.51   91.80   91.15   0.82
    SVM           90.70   90.06   92.07   91.05   0.81
    MN-NB         87.95   86.92   89.83   88.35   0.76
    RF            90.76   90.55   91.76   91.15   0.81
    DT            89.98   88.33   92.28   90.26   0.80
  16. Conclusions • Logistic Regression gives the best results (accuracy 90.77%, F1-Score 91.16%)

    • Cleaning stop words and stemming slightly improves performance
  17. Implementation The sotoy system is an API / web service.

    Every incoming comment or video is checked by this service
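
A minimal sketch of such a checking endpoint using Flask (framework, route, and payload shape are my assumptions; vectorizer and model stand for the trained artifacts from the sketches above, assumed to be loaded at startup):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# vectorizer and model: trained artifacts, assumed loaded elsewhere (e.g. joblib).

@app.route("/check", methods=["POST"])
def check():
    # Classify an incoming comment or a video title/description.
    text = request.get_json().get("text", "")
    features = vectorizer.transform([text])
    label = int(model.predict(features)[0])  # 0 = spam, 1 = ham
    return jsonify({"text": text, "label": label, "is_spam": label == 0})

if __name__ == "__main__":
    app.run()
```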
  18. Automation Training - Every change made by the content admins is

    polled - Runs automatically at 06:00 - Performs a 70% / 30% cross-validation split - Compares the predictions of the new model, the old model, and thresholds obtained from observation
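
A sketch of the comparison step in this automated retraining, under my own assumptions about the promotion rule (the deck only says that the new model, the old model, and an observed threshold are compared), reusing the evaluate helper from the evaluation sketch:

```python
def should_promote(new_model, old_model, X_test, y_test, threshold=0.90):
    # Hypothetical rule: the retrained model replaces the old one only if it
    # beats the old model's F1-Score and clears the observed threshold.
    new_f1 = evaluate(new_model, X_test, y_test)["F1"]
    old_f1 = evaluate(old_model, X_test, y_test)["F1"]
    return new_f1 >= old_f1 and new_f1 >= threshold
```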
  19. Conclusions • Logistic Regression gives the best results (accuracy 90.77%, F1-Score 91.16%)

    • Cleaning stop words and stemming slightly improves performance • Successfully detected 47,694 spam items (26.4%) out of all incoming content between November 2015 and April 2016
  20. Contributions • A new Indonesian spam dataset • A feature-extraction

    technique for Indonesian-language spam detection • Future work: add the UCI spam dataset and use Instagram comments