Sotoy Spam Detection White Paper

Sotoy: Spam Detection for Indonesian Text Based Content
Parlinggoman Hasibuan, Yahya Eru C

Outline
• Introductions
  ◦ About Liputan6.com
  ◦ About vidio.com
  ◦ UGC content
• Related Work
• Our work
  ◦ Methodology
  ◦ Results
  ◦ Implementation
• Conclusions
  ◦ Contributions

Introductions (liputan6.com)
1. A product of KMKLabs (Member of Emtek Group)
2. A site covering news in Indonesia
3. No. 2 online news portal in Indonesia
4. Launched in August 2014
5. 3.5 million website visitors per day
6. 1.2 million articles in total
7. An average of 550 comments per day

Introductions (vidio.com)
1. A product of KMKLabs (Member of Emtek Group)
2. A video sharing platform focused on local content creators
3. No. 1 local video sharing site in Indonesia
4. Launched in August 2014
5. 1 million visitors per day
6. 300 thousand videos in total
7. An average of 550 video uploads per day
8. An average of 200 comments per day

UGC Content
• Every day we handle roughly 1,000 pieces of content from users
• Nothing regulates that content
• Content administrators manually check comments and videos for spam

Related Work
1. M. Taufiq Nuruzzaman et al.: Independent and Personal SMS Spam Filtering
   Feature: word occurrence. Classifiers: Naive Bayes, SVM
2. Tiago A. Almeida et al.: Contributions to the Study of SMS Spam Filtering: New Collection and Results
   Feature: word count. Classifiers: Naive Bayes, k-Nearest Neighbors, Decision Tree, SVM
3. Sin-Eon Kim et al.: SMS Spam Filtering Using Keyword Frequency Ratio
   Feature: frequency ratio. Classifiers: Naive Bayes, Decision Tree, Logistic Regression

Logistic Regression

Linear SVM (Source: www.mblondel.org)

Naive Bayes
Compare $P(C_1 \mid F_1, F_2, \ldots, F_n)$ to $P(C_2 \mid F_1, F_2, \ldots, F_n)$
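The class comparison above can be sketched in a few lines. This is a minimal illustration, assuming a multinomial Naive Bayes with Laplace smoothing and whitespace tokenization; the toy documents, the function names, and the smoothing constant are hypothetical and are not taken from the Sotoy system. Labels follow the dataset convention (0 = spam, 1 = ham).

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    model = {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(tok for d in class_docs for tok in d.split())
        total = sum(counts.values())
        model[c] = {
            "log_prior": math.log(len(class_docs) / len(docs)),
            "log_lik": {
                t: math.log((counts[t] + alpha) / (total + alpha * len(vocab)))
                for t in vocab
            },
        }
    return model


def log_posteriors(model, text):
    """Unnormalized log P(C | F_1..F_n) per class; unseen tokens are skipped."""
    scores = {}
    for c, params in model.items():
        scores[c] = params["log_prior"] + sum(
            params["log_lik"][t] for t in text.split() if t in params["log_lik"]
        )
    return scores


def classify(model, text):
    """Pick the class whose posterior is larger."""
    scores = log_posteriors(model, text)
    return max(scores, key=scores.get)


# Hypothetical toy data: 0 = spam, 1 = ham.
docs = [
    "promo murah klik link",
    "klik link promo gratis",
    "berita bagus terima kasih",
    "video bagus sekali",
]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)
```

Calling `classify(model, "promo gratis klik")` compares the two posteriors and returns the label with the larger one.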

Decision Tree Source: http://www.refactorthis.net

Random Forest Source: http://kazoo04.hatenablog.com

Our Method

Dataset
1. We collected comments from liputan6 and video metadata (titles and descriptions) from vidio.com
2. All data was collected before September 2015
3. Composition of the 284,976 items:
   a. 135,160 (47.45%) are spam (labeled 0)
   b. 148,816 (52.55%) are ham (labeled 1)
4. The dataset is available to the public on request

Dataset

Dataset Cleaning Steps
1. Convert the text to hexadecimal and remove duplicate hex values
2. Use a regex to strip \n and \r
3. Convert extended Latin characters to UTF-8, mapping each to a plain-letter equivalent:
   £ → L, ¢ → c, ¥ → Y, ¤ → o, § → S, © → c, ª → a, ® → R, ° → o, µ → u, ¶ → P, º → o
   Á À Ã Å Ä → A, Ç → C, Æ → Ae, É È Ë Ê → E, Í Ì Ï Î → I, Ñ → N, Ð → D, Ó Ò Õ Ô Ö Ø → O, Ù Û Ú Ü → U, Ý → Y, ß → b, Þ → p
   á à ã â å ä → a, ç → c, æ → ae, é è ë ê → e, í ì ï î → i, ñ → n, ð → d, ó ò õ ô ö ø → o, ù û ú ü → u, ý ÿ → y, þ → p

Dataset Cleaning Steps
4. Cormack [1] and Zhang et al. [2] report that stemming and stop-word removal reduce classifier performance. In this study we therefore compare results with and without stemming and stop-word removal
5. The stop words used are those from Tala [3]

[1] Gordon V. Cormack. 2008. Email Spam Filtering: A Systematic Review. Found. Trends Inf. Retr. 1, 4 (April 2008), 335-455. DOI=http://dx.doi.org/10.1561/1500000006
[2] Le Zhang, Jingbo Zhu, and Tianshun Yao. 2004. An evaluation of statistical spam filtering techniques. ACM Trans. Asian Lang. Inf. Process. 3, 4 (December 2004), 243-269. DOI=http://dx.doi.org/10.1145/1039621.1039625
[3] Tala, F. 2003. A study of stemming effects on information retrieval in Bahasa Indonesia. M.S. thesis, University of Amsterdam.
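The first three cleaning steps (duplicate removal via hexadecimal, newline stripping, and character conversion) might look roughly like the sketch below. It assumes that duplicates are detected on the hex encoding of the whole text, and it carries only a partial copy of the character mapping; the function names are hypothetical. Stemming and stop-word removal are left out because they depend on the Tala resources.

```python
import re

# Partial transliteration table, copied from the mapping on the cleaning slide.
TRANSLIT = {
    "£": "L", "¢": "c", "¥": "Y", "©": "c", "®": "R",
    "Á": "A", "À": "A", "Ç": "C", "Æ": "Ae", "É": "E", "Ñ": "N",
    "á": "a", "à": "a", "ç": "c", "æ": "ae", "é": "e", "ñ": "n",
    "ö": "o", "ü": "u", "ß": "b", "þ": "p",
}
_TRANSLIT_TABLE = str.maketrans(TRANSLIT)


def remove_duplicates(texts):
    """Step 1: drop texts whose UTF-8 hex encoding was already seen."""
    seen = set()
    unique = []
    for text in texts:
        key = text.encode("utf-8").hex()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def clean_text(text):
    """Steps 2 and 3: strip \\r and \\n, then map special characters to plain letters."""
    text = re.sub(r"[\r\n]+", " ", text)
    text = text.translate(_TRANSLIT_TABLE)
    # Collapse the whitespace left behind by removed line breaks.
    return re.sub(r"\s+", " ", text).strip()


def clean_dataset(texts):
    """Apply the steps in the order listed on the slide."""
    return [clean_text(t) for t in remove_duplicates(texts)]
```

For example, `clean_dataset(["Promo\r\nmurah café", "Promo\r\nmurah café"])` keeps a single entry and returns it as `"Promo murah cafe"`.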

Feature Extraction
1. Term Frequency and Bi-Gram are used as features
2. The cleaned Bi-Gram set has 731,874 features
3. The uncleaned Bi-Gram set has 900,927 features
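As a rough illustration of the feature extraction above, the sketch below counts single terms and adjacent word pairs in a text. It assumes lowercase whitespace tokenization and that unigram and bi-gram counts are combined into one feature set; the slides do not spell out either detail, and the function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, not the Sotoy tokenizer)."""
    return text.lower().split()


def tf_bigram_features(text):
    """Term frequencies for single words plus adjacent word pairs (bi-grams)."""
    tokens = tokenize(text)
    features = Counter(tokens)
    features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


def vocabulary(texts):
    """All distinct features seen in a corpus; its size is the feature count."""
    vocab = set()
    for text in texts:
        vocab.update(tf_bigram_features(text))
    return vocab
```

The size of `vocabulary(corpus)` corresponds to the feature counts reported above, which is why removing stop words and stemming shrinks the feature space.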

Evaluation
• We divide the data into 70% training data (199,483 contents) and 30% test data (85,943 contents).
• No technique is used to reduce the dimensionality of the training data.
• To compare the results we use Accuracy (A), Recall (R), Precision (P), F1-Score (F1), and the Matthews Correlation Coefficient (MCC) [1].

[1] Tiago A. Almeida, Akebo Yamakami, and Jurandy Almeida. 2009. Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters. In Proceedings of the 2009 International Conference on Machine Learning and Applications (ICMLA '09). IEEE Computer Society, Washington, DC, USA, 517-522. DOI=http://dx.doi.org/10.1109/ICMLA.2009.22
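The five measures can be computed directly from the confusion matrix. The sketch below is one way to do it, assuming spam (label 0) is treated as the positive class and that the 70/30 split is a shuffled hold-out split; the slides state neither, and the function names and seed are hypothetical.

```python
import math
import random


def split_70_30(items, seed=0):
    """Shuffle and split into 70% train and 30% test (integer arithmetic)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, positive=0):
    """Accuracy, recall, precision, F1 and MCC for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"A": accuracy, "R": recall, "P": precision, "F1": f1, "MCC": mcc}
```

MCC uses all four cells of the confusion matrix, so it stays informative even when the two classes are not equally frequent.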

Comparing Different Algorithms
Table 1. Algorithms used in these experiments

Classifier                 Abbreviation
Linear SVM                 SVM
Random Forest              RF
Logistic Regression        LR
Decision Tree              DT
Multinomial Naive Bayes    MN-NB

Comparing Different Algorithms
Table 2. Results of classifiers using Term Frequency and Bi-Gram features, with stop-word removal and stemming

Classifier   A%      R%      P%      F1%     MCC
LR           90.77   90.58   91.75   91.16   0.82
SVM          90.74   90.06   92.15   91.09   0.81
MN-NB        88.14   87.52   89.67   88.58   0.76
RF           90.73   90.16   92.05   91.1    0.81
DT           89.85   86.83   92.48   90.1    0.8


Comparing Different Algorithms
Table 3. Results of classifiers using Term Frequency and Bi-Gram features, without stop-word removal and stemming

Classifier   A%      R%      P%      F1%     MCC
LR           90.76   90.51   91.8    91.15   0.82
SVM          90.7    90.06   92.07   91.05   0.81
MN-NB        87.95   86.92   89.83   88.35   0.76
RF           90.76   90.55   91.76   91.15   0.81
DT           89.98   88.33   92.28   90.26   0.8

Conclusions
• Logistic Regression gives the best F1-Score, 91.16% (with 90.77% accuracy)
• Stop-word removal and stemming can improve performance, though the gain here is small (LR F1-Score 91.16% with cleaning versus 91.15% without)

Implementations
Sotoy is an API / web service. Every incoming comment or video is checked.

Automation Training
- Every change made by the content admins is pooled
- Training runs automatically at 06:00
- It performs cross validation with a 70% / 30% split
- It compares the prediction results of the new model, the old model, and thresholds derived from observation
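The comparison in the last step could be expressed as a small decision rule. The slides do not say which metric or rule is used, so the sketch below is purely an assumption: it compares F1-Scores and keeps the old model unless the new one clears the observed threshold and beats the old score. The function name and parameters are hypothetical.

```python
def choose_model(new_f1, old_f1, threshold):
    """Decide which model to serve after the daily retraining run.

    Assumed rule (not confirmed by the source): promote the new model only
    if it reaches the observed threshold AND outperforms the old model.
    """
    if new_f1 >= threshold and new_f1 > old_f1:
        return "new"
    return "old"
```

Keeping the old model as the default means a bad batch of admin corrections cannot silently degrade the live service.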

Conclusions
• Logistic Regression gives the best F1-Score, 91.16% (with 90.77% accuracy)
• Stop-word removal and stemming can improve performance
• Sotoy detected 47,694 spam items (26.4%) among all incoming content from November 2015 to April 2016

Contributions
• A new Indonesian spam dataset
• A feature extraction technique for Indonesian-language spam detection
• Future work: add the spam dataset from UCI and use Instagram comments
