Slide 1

Slide 1 text

Natural Language Processing in Objective-C Mattt Thompson CocoaConf PDX 2012

Slide 2

Slide 2 text

There are two indicators that can tell you (with startling accuracy) how nice a language is to use:

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

• API Consistency

Slide 5

Slide 5 text

• API Consistency • Quality of String Implementation

Slide 6

Slide 6 text

Ruby

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

PHP

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Objective-C

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

NSString

Slide 13

Slide 13 text

Objective-C Linguistic APIs

Slide 14

Slide 14 text

Objective-C Linguistic APIs • CFStringTransform

Slide 15

Slide 15 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer

Slide 16

Slide 16 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger

Slide 17

Slide 17 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector

Slide 18

Slide 18 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector • LatentSemanticMapping

Slide 19

Slide 19 text

CFStringTransform

Slide 20

Slide 20 text

CFStringTransform • Strip Accents and Diacritics • Name Unicode Characters • Transliterate Between Orthographies

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

@"I wîsh the Énġlišh långuãge hađ mørē iñteŕêßţing çharäčtèrş"

Slide 23

Slide 23 text

NSMutableString *string = [@"I wîsh the Énġlišh långuãge hađ mørē iñteŕêßţing çharäčtèrş" mutableCopy];
NSLog(@"Before: %@", string);
CFStringTransform((__bridge CFMutableStringRef)string, NULL, kCFStringTransformStripCombiningMarks, NO);
NSLog(@"After: %@", string);

Slide 24

Slide 24 text

@"I wish the English language hađ møre intereßting characters"

Slide 25

Slide 25 text

• đ - d with stroke • ø - o with stroke • ß - eszett

Slide 26

Slide 26 text

CFStringTransform • Strip Accents and Diacritics • Name Unicode Characters • Transliterate Between Orthographies

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Emoji Dick

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

NSMutableString *string = [@"🐷" mutableCopy];
NSLog(@"Emoji: %@", string);
CFStringTransform((__bridge CFMutableStringRef)string, NULL, kCFStringTransformToUnicodeName, NO);
NSLog(@"Unicode Name: %@", string);

Slide 31

Slide 31 text

@"{PIG FACE}"

Slide 32

Slide 32 text

CFStringTransform • Strip Accents and Diacritics • Name Unicode Characters • Transliterate Between Orthographies

Slide 33

Slide 33 text

오빤 강남 스타일

Slide 34

Slide 34 text

NSMutableString *string = [@"오빤 강남 스타일" mutableCopy];
NSLog(@"Before: %@", string);
CFStringTransform((__bridge CFMutableStringRef)string, NULL, kCFStringTransformToLatin, NO);
NSLog(@"After: %@", string);

Slide 35

Slide 35 text

oppan gangnam seutail

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Transformation                        Input            Output
kCFStringTransformLatinArabic         mrḥbạ            مرحبا
kCFStringTransformLatinCyrillic       privet           привет
kCFStringTransformLatinGreek          geiá sou         γειά σου
kCFStringTransformLatinHangul         annyeonghaseyo   안녕하세요
kCFStringTransformLatinHebrew         şlwm             שלום
kCFStringTransformLatinHiragana       hiragana         ひらがな
kCFStringTransformLatinKatakana       katakana         カタカナ
kCFStringTransformLatinThai           s̄wạs̄dī          สวัสดี
kCFStringTransformHiraganaKatakana    にほんご         ニホンゴ
kCFStringTransformMandarinLatin       中文             zhōng wén
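The transforms in the table above all go through the same CFStringTransform call, and the final Boolean argument requests the inverse direction where one exists. A minimal sketch of a reusable wrapper follows. The helper name MTTransformedString is hypothetical and not part of the talk or of any Apple framework; only CFStringTransform and the kCFStringTransform constants are real API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration, not from the talk).
// Returns a transformed copy of `input`, or nil if the transform failed.
static NSString *MTTransformedString(NSString *input, CFStringRef transform, BOOL reverse) {
    if (input == nil) {
        return nil;
    }

    // CFStringTransform mutates in place, so work on a mutable copy.
    NSMutableString *mutableString = [input mutableCopy];

    Boolean success = CFStringTransform((__bridge CFMutableStringRef)mutableString,
                                        NULL,       // NULL range: transform the whole string
                                        transform,
                                        reverse);   // YES requests the inverse transform

    return success ? [mutableString copy] : nil;
}
```

Under these assumptions, MTTransformedString(@"privet", kCFStringTransformLatinCyrillic, NO) corresponds to the Latin-to-Cyrillic row of the table, and passing YES for reverse asks for the Cyrillic-to-Latin direction.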

Slide 38

Slide 38 text

CFStringTokenizer

Slide 39

Slide 39 text

In English, words are space-delimited

Slide 40

Slide 40 text

In English, words are space-delimited 日本語の言葉ですべてが一緒にいる!!!!!!

Slide 41

Slide 41 text

[string componentsSeparatedByCharactersInSet: [NSCharacterSet whitespaceCharacterSet]]

Slide 42

Slide 42 text

[string componentsSeparatedByCharactersInSet: [NSCharacterSet whitespaceCharacterSet]] x

Slide 43

Slide 43 text

NSString *string = @"日本語の言葉ですべてが一緒にいる";
NSMutableArray *mutableTokens = [NSMutableArray array];
CFLocaleRef locale = CFLocaleCopyCurrent();
CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(NULL, (__bridge CFStringRef)(string), CFRangeMake(0, [string length]), kCFStringTokenizerUnitWord, locale);
CFStringTokenizerTokenType tokenType = kCFStringTokenizerTokenNone;
while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
    CFRange tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
    CFStringRef token = CFStringCreateWithSubstring(kCFAllocatorDefault, (__bridge CFStringRef)(string), tokenRange);
    // __bridge_transfer hands ownership of the created substring to ARC
    [mutableTokens addObject:(__bridge_transfer NSString *)(token)];
}
// Objects obtained from Create/Copy functions must be released
CFRelease(tokenizer);
CFRelease(locale);
NSLog(@"Tokens: %@", mutableTokens);

Slide 44

Slide 44 text

日本, 語, の, 言葉, で, すべて, が, 一緒, に, いる

Slide 45

Slide 45 text

• kCFStringTokenizerUnitWord • kCFStringTokenizerUnitSentence • kCFStringTokenizerUnitParagraph • kCFStringTokenizerUnitLineBreak • kCFStringTokenizerUnitWordBoundary
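The unit constants listed above are passed as the options argument of CFStringTokenizerCreate, so the same loop can split text into words, sentences, or paragraphs. The following is a sketch only: MTTokens is a hypothetical helper name, and it assumes the current locale is appropriate for the text being tokenized.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): splits `string`
// into tokens of the requested unit, e.g. kCFStringTokenizerUnitSentence.
static NSArray *MTTokens(NSString *string, CFOptionFlags unit) {
    NSMutableArray *tokens = [NSMutableArray array];
    if ([string length] == 0) {
        return tokens;
    }

    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                (__bridge CFStringRef)string,
                                CFRangeMake(0, (CFIndex)[string length]),
                                unit,
                                locale);

    // Advance until the tokenizer reports that no token is left.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        NSRange tokenRange = NSMakeRange((NSUInteger)range.location, (NSUInteger)range.length);
        [tokens addObject:[string substringWithRange:tokenRange]];
    }

    // Release everything obtained from Create/Copy functions.
    CFRelease(tokenizer);
    CFRelease(locale);

    return tokens;
}
```

Calling MTTokens(text, kCFStringTokenizerUnitSentence) would yield one entry per sentence, while kCFStringTokenizerUnitWord reproduces the word-level loop from the earlier slide.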

Slide 46

Slide 46 text

NSLinguisticTagger

Slide 47

Slide 47 text

NSLinguisticTagger

Slide 48

Slide 48 text

NSLinguisticTagger • Tokenize

Slide 49

Slide 49 text

NSLinguisticTagger • Tokenize • Part of Speech

Slide 50

Slide 50 text

NSLinguisticTagger • Tokenize • Part of Speech • Word Stem

Slide 51

Slide 51 text

NSLinguisticTagger • Tokenize • Part of Speech • Word Stem • Named Entity Recognition

Slide 52

Slide 52 text

NSLinguisticTagger • Tokenize • Part of Speech • Word Stem • Named Entity Recognition • Language & Script Detection
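Two of the capabilities listed above, word stems and language detection, can be sketched with the NSLinguisticTagSchemeLemma and NSLinguisticTagSchemeLanguage schemes. The helper names MTLemmas and MTDominantLanguage are hypothetical, and the quality of the results depends on the linguistic data installed on the system, so treat this as an illustration rather than a definitive implementation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): returns one entry
// per word, using the lemma when the tagger provides one.
static NSArray *MTLemmas(NSString *text) {
    NSMutableArray *lemmas = [NSMutableArray array];
    if ([text length] == 0) {
        return lemmas;
    }

    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation;
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLemma]
                                               options:options];
    tagger.string = text;

    [tagger enumerateTagsInRange:NSMakeRange(0, [text length])
                          scheme:NSLinguisticTagSchemeLemma
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
        // The tag is nil when no lemma is known; fall back to the raw token.
        NSString *token = [text substringWithRange:tokenRange];
        [lemmas addObject:(tag != nil ? tag : token)];
    }];

    return lemmas;
}

// Hypothetical helper: returns a language tag such as "en" for the start
// of the text, or nil if the tagger cannot decide.
static NSString *MTDominantLanguage(NSString *text) {
    if ([text length] == 0) {
        return nil;
    }

    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLanguage]
                                               options:0];
    tagger.string = text;

    return [tagger tagAtIndex:0
                       scheme:NSLinguisticTagSchemeLanguage
                   tokenRange:NULL
                sentenceRange:NULL];
}
```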

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

How is the weather in Portland?

Slide 56

Slide 56 text

NSString *question = @"How is the weather in Portland?";
NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation | NSLinguisticTaggerJoinNames;
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"] options:options];
tagger.string = question;

Slide 57

Slide 57 text

[tagger enumerateTagsInRange:NSMakeRange(0, [question length])
                      scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                     options:options
                  usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
    NSString *token = [question substringWithRange:tokenRange];
    NSLog(@"%@: %@", token, tag);
}];

Slide 58

Slide 58 text

How: Adverb
is: Verb
the: Determiner
weather: Noun
in: Preposition
Portland: PlaceName

Slide 59

Slide 59 text

How: Adverb
is: Verb
the: Determiner
weather: Noun
in: Preposition
Portland: PlaceName

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

• NSLinguisticTagSchemeTokenType • NSLinguisticTagSchemeLexicalClass • NSLinguisticTagSchemeNameType

Slide 64

Slide 64 text

NSLinguisticTagSchemeTokenType • NSLinguisticTagWord • NSLinguisticTagPunctuation • NSLinguisticTagWhitespace • NSLinguisticTagOther

Slide 65

Slide 65 text

NSLinguisticTagSchemeLexicalClass • NSLinguisticTagNoun • NSLinguisticTagVerb • NSLinguisticTagAdjective • NSLinguisticTagAdverb • NSLinguisticTagPronoun • NSLinguisticTagDeterminer -- Snip 20 Other Parts of Speech --

Slide 66

Slide 66 text

NSLinguisticTagSchemeNameType • NSLinguisticTagPersonalName • NSLinguisticTagPlaceName • NSLinguisticTagOrganizationName

Slide 67

Slide 67 text

NSDataDetector

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

NSString *string = @"Speak at CocoaConf at 7900 82nd Avenue Portland, OR 97220 starting 4:00 on October 27, 2012";
NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingAllSystemTypes error:nil];
[detector enumerateMatchesInString:string
                           options:0
                             range:NSMakeRange(0, [string length])
                        usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
    NSLog(@"Result: %@", [string substringWithRange:result.range]);
}];

Slide 70

Slide 70 text

• 7900 82nd Avenue Portland, OR 97220 • 4:00 on October 27, 2012

Slide 71

Slide 71 text

NSRegularExpression Subclass

Slide 72

Slide 72 text

NSTextCheckingResult

Slide 73

Slide 73 text

NSTextCheckingType • NSTextCheckingTypeOrthography • NSTextCheckingTypeSpelling • NSTextCheckingTypeGrammar • NSTextCheckingTypeQuote • NSTextCheckingTypeDash • NSTextCheckingTypeReplacement • NSTextCheckingTypeCorrection

Slide 74

Slide 74 text

NSTextCheckingType • NSTextCheckingTypeDate • NSTextCheckingTypeAddress • NSTextCheckingTypeLink • NSTextCheckingTypePhoneNumber • NSTextCheckingTypeTransitInformation
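Rather than matching every system type, a detector can be limited to the checking types of interest, and each NSTextCheckingResult then exposes a typed value (such as date) instead of just a range. A minimal sketch, assuming only dates are wanted; MTDetectedDates is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): returns the NSDate
// values that NSDataDetector finds in `string`.
static NSArray *MTDetectedDates(NSString *string) {
    NSMutableArray *dates = [NSMutableArray array];
    if ([string length] == 0) {
        return dates;
    }

    NSError *error = nil;
    NSDataDetector *detector =
        [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeDate error:&error];
    if (detector == nil) {
        NSLog(@"Could not create detector: %@", error);
        return dates;
    }

    [detector enumerateMatchesInString:string
                               options:0
                                 range:NSMakeRange(0, [string length])
                            usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
        // Only date results carry a non-nil `date` property.
        if (result.resultType == NSTextCheckingTypeDate && result.date != nil) {
            [dates addObject:result.date];
        }
    }];

    return dates;
}
```

The same pattern would apply to NSTextCheckingTypeAddress with the addressComponents property, or to NSTextCheckingTypeLink with URL.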

Slide 75

Slide 75 text

OS X: NSDataDetector
iOS: UITextView -dataDetectorTypes

Slide 76

Slide 76 text

Latent Semantic Mapping

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

Spam Legit ?

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

News Technology ? Sports Food Cats

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

No content

Slide 88

Slide 88 text

Latent Semantic Mapping

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

lsm(1)

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

WWDC 2011 Session 136

Slide 93

Slide 93 text

Objective-C Linguistic APIs

Slide 94

Slide 94 text

Objective-C Linguistic APIs • CFStringTransform

Slide 95

Slide 95 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer

Slide 96

Slide 96 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger

Slide 97

Slide 97 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector

Slide 98

Slide 98 text

Objective-C Linguistic APIs • CFStringTransform • CFStringTokenizer • NSLinguisticTagger • NSDataDetector • LatentSemanticMapping

Slide 99

Slide 99 text

Search Engine

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

• Tokenize

Slide 102

Slide 102 text

• Tokenize • Normalize

Slide 103

Slide 103 text

• Tokenize • Normalize • Capitalize ($$$)
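The tokenize and normalize steps for a search index can be sketched by combining the two Core Foundation APIs covered earlier. This is one possible arrangement under stated assumptions, not the talk's own implementation: MTSearchTokens is a hypothetical helper, and the choice to transliterate to Latin, strip combining marks, and lowercase is an assumption about what "normalize" should mean for a given index.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): turns free text
// into lowercase, accent-free word tokens suitable for indexing.
static NSArray *MTSearchTokens(NSString *text) {
    NSMutableArray *tokens = [NSMutableArray array];
    if ([text length] == 0) {
        return tokens;
    }

    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                (__bridge CFStringRef)text,
                                CFRangeMake(0, (CFIndex)[text length]),
                                kCFStringTokenizerUnitWord,
                                locale);

    // Step 1: tokenize into words.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        NSRange tokenRange = NSMakeRange((NSUInteger)range.location, (NSUInteger)range.length);
        NSMutableString *token = [[text substringWithRange:tokenRange] mutableCopy];

        // Step 2: normalize. Transliterate to Latin, then strip diacritics.
        CFStringTransform((__bridge CFMutableStringRef)token, NULL, kCFStringTransformToLatin, NO);
        CFStringTransform((__bridge CFMutableStringRef)token, NULL, kCFStringTransformStripCombiningMarks, NO);

        [tokens addObject:[token lowercaseString]];
    }

    CFRelease(tokenizer);
    CFRelease(locale);

    return tokens;
}
```

Running both the indexed documents and the user's query through the same function means that accented and unaccented spellings can be compared on equal terms.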

Slide 104

Slide 104 text

Advanced Data Detection

Slide 105

Slide 105 text

Natural Language Input Processing

Slide 106

Slide 106 text

???

Slide 107

Slide 107 text

Thanks! @mattt github.com/mattt NSHipster.com