vixentael
September 05, 2015

Building profanity filters on mobile: clbuttic sh!t

Open the PDF to be able to tap on links.

-------------------
- profanity filters: why do we need them on mobile at all?
- handling tricky cases: what is wrong with the word 'classic'
- how to filter fast (strings vs sets)
- gentle filtering that does not scare users

Transcript

  1. WHY FILTERING IS BAD
    lack of trust in your users; their willingness to break rules →
    Framework Days. IT Saturday. 5.09.2015
  2. LET’S BUILD A FILTER!
    filter = list of dirty words + list of replacements + filter rule
  3. “Seven Words You Can Never Say on Television”
    Shit, piss, fuck, cunt, cocksucker, motherfucker, and tits.
    – George Carlin, 1972
  4. RANGE OF WORD
    NSRange range = [text rangeOfString:badWord options:NSCaseInsensitiveSearch];
    BOOL hasDirtyWord = [text localizedCaseInsensitiveContainsString:badWord];
  5. RANGE OF WORD
    - (NSArray *)rangesOfBadWordsWithSpaceInString:(NSString *)text {
        NSMutableArray * result = [NSMutableArray array];
        [self.listOfBadWordsWithSpace enumerateObjectsUsingBlock:^(NSString * badWord, NSUInteger idx, BOOL * stop) {
            // Find every occurrence of badWord, including overlapping ones.
            NSRange range = [text rangeOfString:badWord options:NSCaseInsensitiveSearch];
            while (range.location != NSNotFound) {
                [result addObject:[NSValue valueWithRange:range]];
                // Continue searching one character past the start of the last match.
                NSRange nextRange = NSMakeRange(range.location + 1, [text length] - range.location - 1);
                range = [text rangeOfString:badWord options:NSCaseInsensitiveSearch range:nextRange];
            }
        }];
        return result;
    }
  6. SEARCH BY ENTRY
    Get your ass down here! The grass around the creek was new, giving it a velvety look. Dusty, his heartless assassin, had found his mate.
  9. FALSE POSITIVES
    Get your ass down here! The grass around the creek was new, giving it a velvety look. Dusty, his heartless assassin, had found his mate.
  10. ‘ASS’ WORDS
    assart, assault, association, assurance, harassment, hassel, hourglass, impassable, pass, passion, piassaba, preassign
    1250 words found: http://www.morewords.com/contains/ass/
  11. AND FAILS AGAIN…
    Constitution → Consbreastution
    medieval → medireview
    Tyson Gay → Tyson Homosexual
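
The first failure above can be reproduced with a few lines. This is a minimal sketch, not the code from the talk: `NaiveReplace` and the replacement pairs are hypothetical names chosen for illustration, and the helper deliberately applies the "search by entry" rule to replacement.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the talk): replaces every occurrence of each
// dirty word, even inside longer words. This is what produces "clbuttic".
static NSString *NaiveReplace(NSString *text, NSDictionary *replacements) {
    NSMutableString *result = [text mutableCopy];
    for (NSString *badWord in replacements) {
        // Case-insensitive substring replacement over the whole string.
        [result replaceOccurrencesOfString:badWord
                                withString:replacements[badWord]
                                   options:NSCaseInsensitiveSearch
                                     range:NSMakeRange(0, result.length)];
    }
    return result;
}
```

With the pair `@{@"ass": @"butt"}`, `NaiveReplace(@"classic", ...)` returns `clbuttic`; with `@{@"tit": @"breast"}`, `Constitution` becomes `Consbreastution`.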
  12. FILTER RULES
    1. search by entry
    2. search whole word
    don’t u know me?
  13. SEARCH WHOLE WORD
    NSSet * badWordsSet = [NSSet setWithArray:self.listOfBadWords];
    NSScanner * scanner = [NSScanner scannerWithString:text];
    NSCharacterSet * wordCharacters = [NSCharacterSet alphanumericCharacterSet];

    // Scan one whole word and look it up in the set; repeat until the end of the text.
    NSString * scanned;
    if ([scanner scanCharactersFromSet:wordCharacters intoString:&scanned]) {
        if ([badWordsSet containsObject:[scanned lowercaseString]]) {
            // The scanner now sits just past the word, so step back by its length.
            NSRange range = NSMakeRange(scanner.scanLocation - scanned.length, scanned.length);
            [result addObject:[NSValue valueWithRange:range]];
        }
    }
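
The slide shows the check for a single word. A complete loop might look like the sketch below. It assumes a free function named `RangesOfWholeBadWords` (a hypothetical name) instead of the method and properties used in the talk, and it turns off the scanner's default whitespace skipping so that punctuation and spaces are stepped over explicitly.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the talk): returns NSValue-wrapped ranges of
// whole words in `text` that are members of `badWords` (expected lowercase).
static NSArray *RangesOfWholeBadWords(NSString *text, NSSet *badWords) {
    NSMutableArray *result = [NSMutableArray array];
    NSCharacterSet *wordCharacters = [NSCharacterSet alphanumericCharacterSet];
    NSScanner *scanner = [NSScanner scannerWithString:text];
    scanner.charactersToBeSkipped = nil; // separators are handled explicitly below

    while (!scanner.isAtEnd) {
        NSString *scanned = nil;
        if ([scanner scanCharactersFromSet:wordCharacters intoString:&scanned]) {
            // Set lookup on the lowercased word.
            if ([badWords containsObject:scanned.lowercaseString]) {
                NSRange range = NSMakeRange(scanner.scanLocation - scanned.length, scanned.length);
                [result addObject:[NSValue valueWithRange:range]];
            }
        } else {
            // Not at a word character: skip whitespace and punctuation up to the next word.
            [scanner scanUpToCharactersFromSet:wordCharacters intoString:NULL];
        }
    }
    return result;
}
```

Because each word is compared as a whole, "grass" and "assassin" no longer match a dictionary that contains only "ass".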
  14. SEARCH WHOLE WORD
    Get your ass down here! The grass around the creek was new, giving it a velvety look. Dusty, his heartless assassin, had found his mate.
  15. PUNCTUATION
    Get your a s s down here! You'd probably fire my a.s.s the first day on the job. You've covered my a_s_s every time I screwed up.
  17. FILTER RULES
    1. search by entry
    2. search whole word
    3. handle punctuation
    don’t tell anyone…
  18. BITCH B!TCH B1TCH 8ITCH ßITCH 13ITCH L3ITCH BI7CH BI+CH BI†CH
    BIT[H BIT¢H BIT<H BITC# BITC: B1T¢H 8!†C# 8ITC/-/ 817[# (3][+(:
  19. FILTER RULES
    1. search by entry
    2. search whole word
    3. handle punctuation
    4. handle l33t speak
    my name is…
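
One possible way to cover simple l33t variants without listing each of them in the dictionary is to normalize a word before the lookup. This is an alternative sketch rather than the approach from the talk: `NormalizeLeet` and its substitution table are assumptions made for illustration, the table handles only single-character swaps, and it will also rewrite genuine digits.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the talk): lowercases `word` and maps a few
// common single-character l33t substitutions back to letters.
static NSString *NormalizeLeet(NSString *word) {
    // Illustrative table only; a real filter would need a longer, tested list.
    NSDictionary *substitutions = @{ @"!": @"i", @"1": @"i", @"0": @"o", @"3": @"e",
                                     @"5": @"s", @"7": @"t", @"+": @"t", @"8": @"b" };
    NSString *lowercase = word.lowercaseString;
    NSMutableString *result = [NSMutableString stringWithCapacity:lowercase.length];
    for (NSUInteger i = 0; i < lowercase.length; i++) {
        NSString *character = [lowercase substringWithRange:NSMakeRange(i, 1)];
        NSString *replacement = substitutions[character];
        // Keep the character as is when the table has no entry for it.
        [result appendString:(replacement != nil ? replacement : character)];
    }
    return result;
}
```

Multi-character spellings such as `13ITCH` or `8ITC/-/` are out of reach for a table like this and would still need dictionary entries.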
  20. NICE TITS
    In 2007, the Royal Society for the Protection of Birds blocked ornithological terms such as cock (male bird), tit, shag and booby from its discussion forums.
  21. FILTER RULES
    1. search by entry
    2. search whole word
    3. handle punctuation
    4. handle l33t speak
    5. remember about exceptions
    blue-footed booby!
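
Rule 5 could be sketched as a post-processing step: find where whitelisted phrases occur and drop any bad-word range that lies entirely inside one of them. The name `RangesOutsideExceptions` and its parameters are hypothetical; the talk does not show how exceptions were implemented.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the talk): keeps only those ranges from
// `badRanges` that are not fully covered by an occurrence of an exception phrase.
static NSArray *RangesOutsideExceptions(NSString *text, NSArray *badRanges, NSArray *exceptions) {
    // Collect the ranges of every whitelisted phrase in the text.
    NSMutableArray *allowed = [NSMutableArray array];
    for (NSString *phrase in exceptions) {
        NSRange search = NSMakeRange(0, text.length);
        NSRange found = [text rangeOfString:phrase options:NSCaseInsensitiveSearch range:search];
        while (found.location != NSNotFound) {
            [allowed addObject:[NSValue valueWithRange:found]];
            NSUInteger next = NSMaxRange(found);
            search = NSMakeRange(next, text.length - next);
            found = [text rangeOfString:phrase options:NSCaseInsensitiveSearch range:search];
        }
    }
    // Keep a bad-word range unless some whitelisted range contains it completely.
    NSMutableArray *result = [NSMutableArray array];
    for (NSValue *badValue in badRanges) {
        NSRange bad = badValue.rangeValue;
        BOOL covered = NO;
        for (NSValue *allowedValue in allowed) {
            if (NSEqualRanges(NSIntersectionRange(bad, allowedValue.rangeValue), bad)) {
                covered = YES;
                break;
            }
        }
        if (!covered) {
            [result addObject:badValue];
        }
    }
    return result;
}
```

With the exception phrase "blue-footed booby", the bird name passes while a standalone "booby" elsewhere in the same text is still reported.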
  22. TEXT FILTERING ON IOS
    words dictionary (boobs, b00bs, b00b5); whole word scan; NSScanner, NSSet
  23. TEXT FILTERING ON IOS
    phrases dictionary (b o o b s, b.o.o.b.s, b!o!o!bs); substring scan; rangeOfString
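
Entries for a phrases dictionary like the one above could be generated from the plain words instead of being typed by hand. The sketch below is an assumption about how that might be done, not something shown in the talk; `SpacedVariants` and the separator list are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the talk): builds variants of `word` with each
// separator placed between its letters, e.g. "a s s", "a.s.s", "a_s_s".
static NSArray *SpacedVariants(NSString *word, NSArray *separators) {
    // Split the word into one-character strings.
    NSMutableArray *letters = [NSMutableArray array];
    for (NSUInteger i = 0; i < word.length; i++) {
        [letters addObject:[word substringWithRange:NSMakeRange(i, 1)]];
    }
    // Join the letters once per separator.
    NSMutableArray *variants = [NSMutableArray array];
    for (NSString *separator in separators) {
        [variants addObject:[letters componentsJoinedByString:separator]];
    }
    return variants;
}
```

The generated phrases would then be searched with a substring scan such as `rangeOfString:options:`, since a whole-word scan splits them at every separator.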
  24. TEXT FILTERING ON IOS
    words dictionary (boobs, b00bs, b00b5); whole word scan; NSScanner, NSSet
    +
    phrases dictionary (b o o b s, b.o.o.b.s, bo!obs); substring scan; rangeOfString
  25. HOW FAST IS IT?
    [Chart: filtering time in seconds (axis from 0 to 0.4) against user text length (1000, 5000, 10000 and 20000 characters) for three series: range, scanner, both.]
    dirty words dictionary contains 455 words
  26. LIVE FILTERING
    use RAC (ReactiveCocoa) and run the filter every time the user inputs a character
  27. IMPROVE FILTERING
    • Keep dictionary up to date
    • Whitelist
    • Levenshtein distance
    • Soundex functions (where a word sounds like another)
    • Naive Bayesian inference filtering of phrases/terms
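
As an illustration of the Levenshtein idea, the sketch below computes the edit distance between two strings with the standard two-row dynamic programming scheme. `LevenshteinDistance` is a hypothetical helper name; how close a word must be to a dictionary entry before it gets flagged is left open here, and any threshold would be an assumption to tune.

```objc
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Hypothetical helper (not from the talk): number of single-character
// insertions, deletions and substitutions needed to turn `a` into `b`.
static NSUInteger LevenshteinDistance(NSString *a, NSString *b) {
    NSUInteger n = a.length, m = b.length;
    if (n == 0) return m;
    if (m == 0) return n;

    // Two rows of the edit-distance table are enough.
    NSUInteger *previous = malloc((m + 1) * sizeof(NSUInteger));
    NSUInteger *current = malloc((m + 1) * sizeof(NSUInteger));
    for (NSUInteger j = 0; j <= m; j++) previous[j] = j;

    for (NSUInteger i = 1; i <= n; i++) {
        current[0] = i;
        for (NSUInteger j = 1; j <= m; j++) {
            NSUInteger cost = ([a characterAtIndex:i - 1] == [b characterAtIndex:j - 1]) ? 0 : 1;
            NSUInteger deletion = previous[j] + 1;
            NSUInteger insertion = current[j - 1] + 1;
            NSUInteger substitution = previous[j - 1] + cost;
            current[j] = MIN(MIN(deletion, insertion), substitution);
        }
        // The row just filled becomes the previous row for the next iteration.
        NSUInteger *swap = previous;
        previous = current;
        current = swap;
    }

    NSUInteger distance = previous[m];
    free(previous);
    free(current);
    return distance;
}
```

For example, "b1tch" is at distance 1 from "bitch", so a lookup that tolerates one edit would catch it even when the variant is missing from the dictionary, at the cost of more false positives on short words.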
  28. DIRTY WORDS
    • list of dirty words in different languages: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
    • list of dirty words I’ve used: https://gist.github.com/vixentael/5ce4168e3e94d9686405