Natural Language Processing in Objective-C

* Audio of this talk available here: http://soundcloud.com/mattt-thompson/objc-nlp

Apple has provided some truly remarkable language APIs in its frameworks. It's almost unfair how good they are, considering how most languages struggle just to handle Unicode correctly. From tokenizers and part-of-speech taggers to transcription, data detectors, and document classification using latent semantic analysis, this session will cover the APIs as well as the linguistic theory behind them, so that you may leverage these insanely powerful technologies in your application.

Mattt Thompson

October 27, 2012

Transcript

  1. Natural Language Processing in Objective-C Mattt Thompson CocoaConf PDX 2012

  2. There are two indicators that can tell you (with startling accuracy) how nice a language is to use:
  3. None
  4. • API Consistency

  5. • API Consistency • Quality of String Implementation

  6. Ruby

  7. None
  8. PHP

  9. None
  10. Objective-C

  11. None
  12. NSString

  13. Objective-C Linguistic APIs

  14. Objective-C Linguistic APIs • CFStringTransform

  15. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer

  16. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger

  17. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector

  18. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector • LatentSemanticMapping

  19. CFStringTransform

  20. CFStringTransform • Strip Accents and Diacritics • Name Unicode Characters • Transliterate Between Orthographies
  21. None
  22. @"I wîsh the Énġlišh långuãge hađ mørē iñteŕêßţing çharäčtèrş"

  23. NSMutableString *string = [@"I wîsh the Énġlišh långuãge hađ mørē iñteŕêßţing çharäčtèrş" mutableCopy];
      NSLog(@"Before: %@", string);
      CFStringTransform((__bridge CFMutableStringRef)string, NULL, kCFStringTransformStripCombiningMarks, NO);
      NSLog(@"After: %@", string);
  24. @"I wish the English language hađ møre intereßting characters"

  25. • đ - d with stroke • ø - o with stroke • ß - eszett
  26. CFStringTransform • Strip Accents and Diacritics • Name Unicode Characters • Transliterate Between Orthographies
  27. None
  28. Emoji Dick

  29. None
  30. NSMutableString *string = [@"🐷" mutableCopy];
      NSLog(@"Emoji: %@", string);
      CFStringTransform((__bridge CFMutableStringRef)string, NULL, kCFStringTransformToUnicodeName, NO);
      NSLog(@"Unicode Name: %@", string);
  31. @"{PIG FACE}"

  32. CFStringTransform • Strip Accents and Diacritics • Name Unicode Characters • Transliterate Between Orthographies
  33. 오빤 강남 스타일

  34. NSMutableString *string = [@"오빤 강남 스타일" mutableCopy];
      NSLog(@"Before: %@", string);
      CFStringTransform((__bridge CFMutableStringRef)string, NULL, kCFStringTransformToLatin, NO);
      NSLog(@"After: %@", string);
  35. oppan gangnam seutail

  36. None
  37. Transformation                      Input            Output
      kCFStringTransformLatinArabic       mrḥbạ            مرحبا
      kCFStringTransformLatinCyrillic     privet           привет
      kCFStringTransformLatinGreek        geiá sou         γειά σου
      kCFStringTransformLatinHangul       annyeonghaseyo   안녕하세요
      kCFStringTransformLatinHebrew       şlwm             שלום
      kCFStringTransformLatinHiragana     hiragana         ひらがな
      kCFStringTransformLatinKatakana     katakana         カタカナ
      kCFStringTransformLatinThai         s̄wạs̄dī          สวัสดี
      kCFStringTransformHiraganaKatakana  にほんご         ニホンゴ
      kCFStringTransformMandarinLatin     中文             zhōng wén
  38. CFStringTokenizer

  39. In English, words are space-delimited

  40. In English, words are space-delimited 日本語の言葉ですべてが一緒にいる!!!!!!

  41. [string componentsSeparatedByCharactersInSet: [NSCharacterSet whitespaceCharacterSet]]

  42. [string componentsSeparatedByCharactersInSet: [NSCharacterSet whitespaceCharacterSet]] x

  43. NSString *string = @"日本語の言葉ですべてが一緒にいる";
      NSMutableArray *mutableTokens = [NSMutableArray array];
      CFLocaleRef locale = CFLocaleCopyCurrent();
      CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(NULL, (__bridge CFStringRef)(string), CFRangeMake(0, [string length]), kCFStringTokenizerUnitWord, locale);
      CFStringTokenizerTokenType tokenType = kCFStringTokenizerTokenNone;
      while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
          CFRange tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
          CFStringRef token = CFStringCreateWithSubstring(kCFAllocatorDefault, (__bridge CFStringRef)(string), tokenRange);
          // Transfer ownership to ARC so the created substring is not leaked
          [mutableTokens addObject:(__bridge_transfer NSString *)(token)];
      }
      // Release the Core Foundation objects created above
      CFRelease(tokenizer);
      CFRelease(locale);
      NSLog(@"Tokens: %@", mutableTokens);
  44. 日本, 語, の, 言葉, で, すべて, が, 一緒, に, いる

    ರЧ∽Ƒ࿽∋ƊżƜƉů ၂⇞ƎŧƮ
  45. • kCFStringTokenizerUnitWord • kCFStringTokenizerUnitSentence • kCFStringTokenizerUnitParagraph • kCFStringTokenizerUnitLineBreak • kCFStringTokenizerUnitWordBoundary
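
The unit constants listed on slide 45 can be swapped into the same tokenizer loop shown on slide 43. The following is a minimal sketch, not code from the talk: `TokensInString` is a hypothetical helper name, and it assumes ARC and the current locale.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the slides): splits `string` into tokens
// of the requested unit, e.g. kCFStringTokenizerUnitWord or
// kCFStringTokenizerUnitSentence.
static NSArray *TokensInString(NSString *string, CFOptionFlags unit) {
    NSMutableArray *tokens = [NSMutableArray array];
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                (__bridge CFStringRef)string,
                                CFRangeMake(0, (CFIndex)[string length]),
                                unit,
                                locale);

    // Advance until the tokenizer reports that no tokens remain.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        NSRange tokenRange = NSMakeRange((NSUInteger)range.location, (NSUInteger)range.length);
        [tokens addObject:[string substringWithRange:tokenRange]];
    }

    // Core Foundation objects created here are not managed by ARC.
    CFRelease(tokenizer);
    CFRelease(locale);
    return tokens;
}
```

Calling `TokensInString(text, kCFStringTokenizerUnitSentence)` should yield one entry per sentence, while `kCFStringTokenizerUnitWord` reproduces the word-level behavior from slide 43.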

  46. NSLinguisticTagger

  47. NSLinguisticTagger

  48. NSLinguisticTagger • Tokenize

  49. NSLinguisticTagger • Tokenize • Part of Speech

  50. NSLinguisticTagger • Tokenize • Part of Speech • Word Stem

  51. NSLinguisticTagger • Tokenize • Part of Speech • Word Stem

    • Named Entity Recognition
  52. NSLinguisticTagger • Tokenize • Part of Speech • Word Stem

    • Named Entity Recognition • Language & Script Detection
  53. None
  54. None
  55. How is the weather in Portland?

  56. NSString *question = @"How is the weather in Portland?";
      NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation | NSLinguisticTaggerJoinNames;
      NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"] options:options];
      tagger.string = question;

  57. [tagger enumerateTagsInRange:NSMakeRange(0, [question length])
                            scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                           options:options
                        usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
          NSString *token = [question substringWithRange:tokenRange];
          NSLog(@"%@: %@", token, tag);
      }];
  58. How       Adverb
      is        Verb
      the       Determiner
      weather   Noun
      in        Preposition
      Portland  PlaceName

  59. How       Adverb
      is        Verb
      the       Determiner
      weather   Noun
      in        Preposition
      Portland  PlaceName
  60. None
  61. None
  62. None
  63. • NSLinguisticTagSchemeTokenType • NSLinguisticTagSchemeLexicalClass • NSLinguisticTagSchemeNameType

  64. NSLinguisticTagSchemeTokenType • NSLinguisticTagWord • NSLinguisticTagPunctuation • NSLinguisticTagWhitespace • NSLinguisticTagOther

  65. NSLinguisticTagSchemeLexicalClass • NSLinguisticTagNoun • NSLinguisticTagVerb • NSLinguisticTagAdjective • NSLinguisticTagAdverb • NSLinguisticTagPronoun • NSLinguisticTagDeterminer -- Snip 20 Other Parts of Speech --

  66. NSLinguisticTagSchemeNameType • NSLinguisticTagPersonalName • NSLinguisticTagPlaceName • NSLinguisticTagOrganizationName
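
The slides list word stems and language detection among the tagger's capabilities but only demonstrate lexical classes and name types. As a rough sketch of the other two, assuming the `NSLinguisticTagSchemeLemma` and `NSLinguisticTagSchemeLanguage` schemes are available for the text's language on the running system (the helper names `LemmasInString` and `DominantLanguageOfString` are made up for illustration):

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns one lemma (word stem) per word token.
// Falls back to the token itself when the tagger has no lemma for it.
static NSArray *LemmasInString(NSString *text) {
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLemma]
                                               options:0];
    tagger.string = text;

    NSMutableArray *lemmas = [NSMutableArray array];
    NSLinguisticTaggerOptions options =
        NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, [text length])
                          scheme:NSLinguisticTagSchemeLemma
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
        NSString *token = [text substringWithRange:tokenRange];
        [lemmas addObject:(tag ?: token)];
    }];
    return lemmas;
}

// Hypothetical helper: returns a language code for the text at index 0,
// or nil if the string is empty or the tagger cannot determine one.
static NSString *DominantLanguageOfString(NSString *text) {
    if ([text length] == 0) {
        return nil;
    }
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLanguage]
                                               options:0];
    tagger.string = text;
    return [tagger tagAtIndex:0
                       scheme:NSLinguisticTagSchemeLanguage
                   tokenRange:NULL
                sentenceRange:NULL];
}
```

The exact lemmas and language codes returned depend on the linguistic data installed with the OS, so treat the results as best-effort rather than guaranteed.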

  67. NSDataDetector

  68. None
  69. NSString *string = @"Speak at CocoaConf at 7900 82nd Avenue Portland, OR 97220 starting 4:00 on October 27, 2012";
      NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingAllSystemTypes error:nil];
      [detector enumerateMatchesInString:string
                                 options:0
                                   range:NSMakeRange(0, [string length])
                              usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
          NSLog(@"Result: %@", [string substringWithRange:result.range]);
      }];
  70. • 7900 82nd Avenue Portland, OR 97220 • 4:00 on October 27, 2012
  71. NSRegularExpression Subclass

  72. NSTextCheckingResult

  73. NSTextCheckingType • NSTextCheckingTypeOrthography • NSTextCheckingTypeSpelling • NSTextCheckingTypeGrammar • NSTextCheckingTypeQuote • NSTextCheckingTypeDash • NSTextCheckingTypeReplacement • NSTextCheckingTypeCorrection

  74. NSTextCheckingType • NSTextCheckingTypeDate • NSTextCheckingTypeAddress • NSTextCheckingTypeLink • NSTextCheckingTypePhoneNumber • NSTextCheckingTypeTransitInformation
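
Each match is an `NSTextCheckingResult` whose `resultType` says which of the types above was found, and which typed property (`date`, `addressComponents`, `URL`, `phoneNumber`) carries the parsed value. A minimal sketch of branching on that type follows; `DescribeMatches` is a hypothetical helper, and restricting the detector to four types is a choice made here, not something the slides prescribe.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns a short description of each detected item,
// using the typed property that corresponds to the match's resultType.
static NSArray *DescribeMatches(NSString *string) {
    NSError *error = nil;
    NSTextCheckingTypes types = NSTextCheckingTypeDate | NSTextCheckingTypeAddress |
                                NSTextCheckingTypeLink | NSTextCheckingTypePhoneNumber;
    NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:types error:&error];
    if (!detector) {
        NSLog(@"Could not create detector: %@", error);
        return @[];
    }

    NSMutableArray *descriptions = [NSMutableArray array];
    [detector enumerateMatchesInString:string
                               options:0
                                 range:NSMakeRange(0, [string length])
                            usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
        switch (result.resultType) {
            case NSTextCheckingTypeDate:
                [descriptions addObject:[NSString stringWithFormat:@"Date: %@", result.date]];
                break;
            case NSTextCheckingTypeAddress:
                [descriptions addObject:[NSString stringWithFormat:@"Address: %@", result.addressComponents]];
                break;
            case NSTextCheckingTypeLink:
                [descriptions addObject:[NSString stringWithFormat:@"Link: %@", result.URL]];
                break;
            case NSTextCheckingTypePhoneNumber:
                [descriptions addObject:[NSString stringWithFormat:@"Phone: %@", result.phoneNumber]];
                break;
            default:
                break;
        }
    }];
    return descriptions;
}
```

Working from the typed properties rather than the matched substring means a date arrives as an `NSDate` and an address as a dictionary of components, so no further parsing of the raw text is needed.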
  75. OS X NSDataDetector iOS UITextView -dataDetectorTypes

  76. Latent Semantic Mapping

  77. None
  78. None
  79. None
  80. None
  81. None
  82. None
  83. Spam Legit ?

  84. None
  85. News Technology ? Sports Food Cats

  86. None
  87. None
  88. Latent Semantic Mapping

  89. None
  90. lsm(1)

  91. None
  92. WWDC 2011 Session 136

  93. Objective-C Linguistic APIs

  94. Objective-C Linguistic APIs • CFStringTransform

  95. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer

  96. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger

  97. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector

  98. Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector • LatentSemanticMapping

  99. Search Engine

  100. None
  101. • Tokenize

  102. • Tokenize • Normalize

  103. • Tokenize • Normalize • Capitalize ($$$)
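
The tokenize and normalize steps for a search index can be sketched by combining the pieces shown earlier in the talk. This is an illustration under assumptions, not the speaker's implementation: `NormalizedSearchTerms` is a hypothetical name, word splitting uses `-enumerateSubstringsInRange:options:usingBlock:` in place of a hand-written `CFStringTokenizer` loop, and "normalize" is taken to mean transliterating to Latin, stripping combining marks, and lowercasing.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: turns free text into normalized index terms.
static NSArray *NormalizedSearchTerms(NSString *text) {
    NSMutableArray *terms = [NSMutableArray array];

    // Tokenize: enumerate word-level substrings (locale-aware word breaking).
    [text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange, NSRange enclosingRange, BOOL *stop) {
        NSMutableString *term = [word mutableCopy];

        // Normalize: transliterate to Latin script, then drop accents and diacritics.
        CFStringTransform((__bridge CFMutableStringRef)term, NULL, kCFStringTransformToLatin, NO);
        CFStringTransform((__bridge CFMutableStringRef)term, NULL, kCFStringTransformStripCombiningMarks, NO);

        // Fold case so that lookups are case-insensitive.
        [terms addObject:[term lowercaseString]];
    }];
    return terms;
}
```

With the accent-stripping behavior shown on slide 24, an input such as `@"Énġlišh långuãge"` should come out as the terms `english` and `language`. Applying the same function to both indexed documents and incoming queries keeps the two sides comparable.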

  104. Advanced Data Detection

  105. Natural Language Input Processing

  106. ???

  107. Thanks! @mattt github.com/mattt NSHipster.com