Endless War with Spammers - Inside Our Spam Detection System

KMKLabs
March 22, 2018

In this Tech Talk, we discuss how the Spam Detection System that protects vidio.com and BBM works, particularly against video spammers. The system, named SOTOY, also helps 'clean up' spam in livestreaming comments on vidio.com and in BBM group comments. For video spam specifically, SOTOY has 5 layers, each prepared to detect spammers. This video explains how those 5 layers detect spammers.

Transcript

  1. What is spammer content?

    • Videos containing only text or a link to another website
    • Static videos containing only text
    • Repetitive videos
    • But A LOT of PORN
  2. "It's not okay to post large amounts of untargeted, unwanted, or repetitive content in videos. If the main purpose of your content is to drive people off of YouTube and onto another site, it will likely violate YouTube’s spam policy."

    Richie Holland

    WE HATE THEM!
    • They are throwing trash videos onto our platform
    • It's expensive to transcode
    • We must build barricades to protect this!
  3. But they have a weakness: a pattern

    Most of them write the same kind of title & description. Yes, everybody has one …
  4. Then... we created our Spam Detection System

    SOTOY
    • Spam Detector Video: Forbidden Text Detector, Spam Video Text Detector, Forbidden Link Detector, Spam Video Metadata Detector, Full Movie Short Duration Detector
    • Spam Detector Comment: Vidio Livestreaming, BBM
  5. Our source of ground-truth data …

    • Video Pooling
    • Video Text Pooling
    • Spam/Misleading or Sexual Content or Sotoy Detector
    • [ Title + Description ]
  6. Spam Detector Video 1.0

    • Forbidden Text Detector: containing phone number, ASCII ratio
    • Spam Video Text Detector: TF x IDF, Linear SVC
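The two Forbidden Text Detector signals named on slide 6 (phone number presence and ASCII ratio) could be computed roughly as follows. This is only a sketch: the phone pattern, the function names, and the sample number are assumptions for illustration, and the slides do not say how the signals are thresholded or combined.

```python
import re

# Hypothetical pattern for Indonesian-style mobile numbers such as
# 0812-3456-7890 or +62 812 3456 7890. The real rule is not given in the slides.
PHONE_RE = re.compile(r"(?:\+62|62|0)[\s.-]?8\d(?:[\s.-]?\d){7,10}")


def contains_phone_number(text: str) -> bool:
    """Return True if the text contains something that looks like a phone number."""
    return PHONE_RE.search(text) is not None


def ascii_ratio(text: str) -> float:
    """Share of characters in the text that are plain ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(1 for ch in text if ord(ch) < 128) / len(text)
```

A caller would feed `title + " " + description` into both functions and apply its own cut-offs.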
  7. SPAM VIDEO TEXT DETECTOR

    Input: video_text_pooling
    • Data Preprocessing: Tokenize, Remove Stop Words, Stemming, Remove Digit
    • Feature Extraction: N_Grams(1, 2), TF x IDF
    • Classifier: Linear SVC
    • Evaluation Method: Cross Validation
  8. Data Preprocessing

    Steps: Tokenize, Remove Stop Words, Stemming, Remove Digit

    Sample text (Indonesian): Penyelundupan Beras Singapura Media : iNews TV Rubrikasi : First News Waktu/Tgl : 4:41 WIB 7 September 2016 Narasumber : -

    Intermediate token lists, in the order shown on the slide:
    • penyelundupan, beras, singapura, media, :, iNews, TV, Rubrikasi, :, First, News, Waktu/Tgl, :, 4:41, WIB, 7, September, 2016, Narasumber, :, -
    • penyelundupan, beras, singapura, media, :, iNews, TV, Rubrikasi, :, First, News, Waktu/Tgl, :, 4:41, WIB, 7, September, 2016, Narasumber, :, -
    • penyelundupan, beras, singapura, media, iNews, TV, Rubrikasi, First, News, Waktu/Tgl, 4:41, WIB, 7, September, 2016, Narasumber
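The preprocessing steps on slide 8 can be approximated in a few lines. This is a sketch under stated assumptions: the stop-word set is a tiny hypothetical sample, everything is lowercased for simplicity, punctuation-only tokens are dropped in a separate filter, and stemming is left out because it would need an Indonesian stemmer that the slides do not name.

```python
import re

# Hypothetical stop-word sample; the real list is not shown in the slides.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari"}


def preprocess(text: str) -> list:
    """Tokenize, drop stop words and punctuation-only tokens, drop tokens with digits."""
    tokens = text.lower().split()                             # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]       # remove stop words
    tokens = [t for t in tokens if re.search(r"\w", t)]       # drop ":" and "-"
    tokens = [t for t in tokens if not re.search(r"\d", t)]   # remove digit tokens
    return tokens


sample = ("Penyelundupan Beras Singapura Media : iNews TV Rubrikasi : First News "
          "Waktu/Tgl : 4:41 WIB 7 September 2016 Narasumber : -")
print(preprocess(sample))
```

On the sample text this leaves the same thirteen tokens that slide 9 starts from, apart from letter case.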
  9. Feature Extraction

    Tokens after preprocessing: penyelundupan, beras, singapura, media, iNews, TV, Rubrikasi, First, News, Waktu/Tgl, WIB, September, Narasumber

    • N_Grams(1, 2): {'penyelundupan', 'beras', 'singapura', …, 'Tgl WIB', 'WIB September', 'September Narasumber'}
    • TF x IDF

    "The main point is to form every problem into a set of vectors." – Vector Space Models

    The TF score (Term Frequency) measures how frequently a term occurs in a document. If a document of 100 terms contains the term “beras” 10 times, then TF(beras) = 10 / 100 = 0.1
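To make the N_Grams(1, 2) and TF x IDF steps of slide 9 concrete, here is a small pure-Python sketch. The slide only defines TF; the IDF used here is the common unsmoothed variant log(N / df), which is an assumption and differs from the smoothed formula some libraries apply. All function names are illustrative.

```python
import math


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (as space-joined strings) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tf(term, doc_terms):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_terms.count(term) / len(doc_terms)


def idf(term, corpus):
    """Unsmoothed inverse document frequency; the term must occur in the corpus."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def tfidf(term, doc_terms, corpus):
    """TF x IDF weight of one term in one document."""
    return tf(term, doc_terms) * idf(term, corpus)
```

A document becomes a vector by evaluating `tfidf` for every term in a fixed vocabulary built from the corpus.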
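  10. Evaluation: Cross Validation

    Confusion matrix (rows: actual, columns: prediction):
    • Actual Spam: predicted Spam = TP, predicted Not Spam = FN
    • Actual Not Spam: predicted Spam = FP, predicted Not Spam = TN

    • Precision is the proportion of videos we predict as spam that are actually spam
    • Recall is the proportion of videos that are actually spam that we predict as spam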
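The precision and recall definitions on slide 10 translate directly into code. The sketch below counts the four confusion-matrix cells from two label lists; the label strings and function names are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (TP, FP, FN, TN) treating `positive` as the spam class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of predicted-spam videos that are actually spam."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actually-spam videos that were predicted as spam."""
    return tp / (tp + fn) if tp + fn else 0.0
```

In a cross-validation setup, these two scores would be computed on each held-out fold and then averaged.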
  11. New problem …

    Sample video title and description (Indonesian):

    Imam Hasan Memerdekakan Budak SAFINAH TV - Apakah yang menyebabkan Imam Hasan as memerdekakan seorang budak? Berikut adalah salah satu kisah ketika perjalanan menuju masjid, Imam Hasan as melewati seorang budak kulit hitam yang sedang duduk di pojok gang. Di tangan budak itu ada sepotong roti. Acapkali menyantap roti itu, sang budak memberi sepotong roti untuk... Selamat menyaksikan. Lebih banyak Kisah Teladan Imam Hasan al-Mujtaba a.s. di https://Safinah-Online.com dengan mengeklik tautan berikut: Imam Hasan a.s. dan Seorang Badui [https://goo.gl/YBfxbp] Baju Lebaran Imam Hasan dan Imam Husain a.s. [https://goo.gl/y32bhy] Selamat Membaca! Anda dapat mengikuti kami di: Instagram: https://instagram.com/SafinahOnline Facebook: https://Facebook.com/SafinahOnline Telegram: https://Telegram.me/SafinahOnline Twitter: https://Twitter.com/SafinahOnline Youtube: https://goo.gl/PFHkBR
  12. Spam Detector Video 2.0

    • Forbidden Text Detector
    • Spam Video Text Detector
    • Forbidden Link Detector: contains > 2 external links: SPAM, otherwise HAM
      [VIDIO, LIPUTAN6, BINTANG, BOLA, KLIKDOKTER, BBM]
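The Forbidden Link Detector rule on slide 12 could look roughly like this. The sketch reads the bracketed list as a whitelist of in-house sites whose links do not count as external, which is an interpretation rather than something the slide states; the URL pattern, the hostname matching, and the function names are likewise assumptions.

```python
import re
from urllib.parse import urlparse

# Site names taken from the bracketed list on the slide; treating them as
# "internal" hostname labels is an assumption.
INTERNAL_SITES = ("vidio", "liputan6", "bintang", "bola", "klikdokter", "bbm")
URL_RE = re.compile(r"https?://[^\s\]\)]+")


def external_links(text: str) -> list:
    """Return the URLs in the text whose hostname contains no internal site label."""
    links = []
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        labels = host.split(".")
        if not any(site in labels for site in INTERNAL_SITES):
            links.append(url)
    return links


def is_link_spam(text: str, max_external: int = 2) -> bool:
    """Flag the text as spam when it contains more than `max_external` external links."""
    return len(external_links(text)) > max_external
```

With the default threshold, a description carrying one vidio.com link and two other links stays HAM, and a third outside link tips it to SPAM.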
  13. Spam Detector Video 3.0

    • Forbidden Text Detector
    • Spam Video Text Detector
    • Forbidden Link Detector
    • Spam Video Metadata Detector: Video Text Extraction, Standard Scaler, SVC
  14. SPAM VIDEO METADATA DETECTOR

    Input: video_text_pooling
    • Feature Extraction: Price Count, URL Count, Digit Count, Repeated Count, Lexicon Count (together forming the Feature Vector)
    • Feature Scaling: Z-Score Normalization a.k.a. StandardScaler
    • SVC (RBF Kernel)
    • Evaluation Method

    Lexicon words (slide column header "kata", Indonesian for "word"): sexy hot kontol bokep telanjang bugil onani masturbasi sex seks harga alat bantu sex cewek nikmat sex toys ejakulasi disfungsi ereksi
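The five counts and the z-score scaling from slide 14 might be sketched as below. Everything beyond the feature names is an assumption: the price pattern, the reading of "Repeated Count" as the number of repeated tokens, and the small lexicon subset are illustrative only. The scaling divides by the population standard deviation, and constant columns are left at zero.

```python
import re
from collections import Counter
from statistics import mean, pstdev

LEXICON = {"sexy", "hot", "harga", "sex", "seks"}              # subset of the slide's list
PRICE_RE = re.compile(r"\brp\.?\s?\d[\d.,]*", re.IGNORECASE)   # hypothetical price pattern
URL_RE = re.compile(r"https?://\S+")


def extract_features(text: str) -> list:
    """Return [price count, URL count, digit count, repeated count, lexicon count]."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [
        len(PRICE_RE.findall(text)),
        len(URL_RE.findall(text)),
        sum(ch.isdigit() for ch in text),
        sum(c - 1 for c in counts.values() if c > 1),   # extra occurrences of repeated tokens
        sum(1 for t in tokens if t in LEXICON),
    ]


def standard_scale(rows: list) -> list:
    """Z-score each column: subtract the mean, divide by the population std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]      # avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

The scaled rows are what an RBF-kernel classifier would receive, since that kernel is sensitive to features on very different scales.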
  15. Spam Detector Video 4.0 (Latest Version)

    • Forbidden Text Detector
    • Spam Video Text Detector
    • Forbidden Link Detector
    • Spam Video Metadata Detector
    • Full Movie Short Duration Detector: does Title+Description match /full.{0,2}movie/g? Is duration <= 60? Result: SPAM or HAM
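The Full Movie Short Duration Detector on slide 15 reduces to a regex check plus a duration check. In this sketch the two conditions are combined with AND, the duration is assumed to be in seconds, and matching is case-insensitive; none of these three details is stated on the slide, and the function name is illustrative.

```python
import re

# Pattern from the slide; IGNORECASE is an added assumption.
FULL_MOVIE_RE = re.compile(r"full.{0,2}movie", re.IGNORECASE)


def is_full_movie_short(title: str, description: str, duration_seconds: float) -> bool:
    """Flag videos that claim to be a full movie but run 60 seconds or less."""
    text = f"{title} {description}"
    return bool(FULL_MOVIE_RE.search(text)) and duration_seconds <= 60
```

The `.{0,2}` gap lets the rule catch spellings such as "full movie", "full-movie", and "fullmovie".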
  16. We train our classifiers every day at 07:00 A.M.

    Diagram labels: video_text_pooling, SPAM_VIDEO_TEXT_DETECTOR, SPAM_VIDEO_METADATA_DETECTOR, MODEL PREDICTION, Eager Learner, MODEL TRAINING, video_statuses
  17. Problems we are still facing …

    • They upload a video with a normal description & title, then after some time they edit the video description & title
    • They upload a video with a normal title but no description at all
    • Porn
  18. The future strategy of this war …

    • We create a User Behavior Video Spam Detection System …
    • Video Processing (?)