Classifying text in iOS without CoreML: how and why?

Talk by Viacheslav Volodko.

While working on SMS Filter we have to solve a number of text-classification problems, and I'd like to share our approaches to some of them. What we will do:
➡️ Talk about text-classification tasks and the tools built into the iOS SDK for solving them: NLLanguageRecognizer, MLTextClassifier.
➡️ Find out some limitations of NLLanguageRecognizer and MLTextClassifier.
➡️ Try to work around these limitations by building our own text classifier.
➡️ Look at some techniques that let us embed this classifier into an App Extension.
➡️ Evaluate the effectiveness of our solution.

This talk was made for CocoaHeads Kyiv #15 which took place Jul 28, 2019. (https://cocoaheads.org.ua/cocoaheadskyiv/15)

Video: https://youtu.be/LKS0Ewm1mMQ


CocoaHeads Ukraine

July 28, 2019

Transcript

  1. Viacheslav Volodko. Classifying text in iOS without CoreML: how and why?
  2. SMS Filter: • Filters SMS spam • Freemium model • ML-based checks on server side • 4 localizations: Ukrainian, English, German, Russian
  3. Why Language Detection? 1. It's a preliminary step in SMS spam detection. 2. We can't claim we filter spam for languages we don't know.
  4. NLLanguageRecognizer

     NLLanguageRecognizer.dominantLanguage(for: "Hello, how are you doing?")?.name // English
     NLLanguageRecognizer.dominantLanguage(for: "Привіт, як твої справи")?.name // Українська
     NLLanguageRecognizer.dominantLanguage(for: "Привет, как твои дела?")?.name // Русский
     NLLanguageRecognizer.dominantLanguage(for: "Hallo, wie geht es dir?")?.name // Deutsch

     let realWorldSMS = """
     VITAEMO Kompiuternum vidbirom na nomer,vipav pryz:AUTO-MAZDA SX-5
     Detali: +38(095)857-58-64 abo na saiti: www.mir-europay.com.ua
     """
     NLLanguageRecognizer.dominantLanguage(for: realWorldSMS)?.name // Hrvatski
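The one-shot `dominantLanguage(for:)` calls above can also be done with an `NLLanguageRecognizer` instance, which additionally supports constraining the candidate set to the languages the app cares about and returning ranked hypotheses. A minimal sketch (not the speaker's code; probabilities vary by OS version):

```swift
import NaturalLanguage

// Sketch: constrain recognition to the four supported languages and
// ask for ranked hypotheses instead of a single dominant language.
let recognizer = NLLanguageRecognizer()
recognizer.languageConstraints = [.ukrainian, .english, .german, .russian]
recognizer.processString("Привіт, як твої справи?")
print(recognizer.dominantLanguage?.rawValue ?? "unknown") // "uk" for the slide's example
let hypotheses = recognizer.languageHypotheses(withMaximum: 2)
// e.g. [.ukrainian: 0.9..., .russian: ...] — a dictionary of NLLanguage to probability
```

Constraining the candidate languages is one way to nudge the recognizer away from implausible answers like Croatian for a transliterated Ukrainian SMS, though it does not solve the transliteration problem itself.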
  5. Why not NSStringTransform?

     let detransliteratedString = realWorldSMS.applyingTransform(
         StringTransform.latinToCyrillic, reverse: false)
     // ВИТАЕМО Компиутернум видбиром на номер,випав
     // прыз:АУТО-МАЗДА СКС-5
     // Детали: +38(095)857-58-64
     // або на саити: ууу.мир-еуропаы.цом.уа
     NLLanguageRecognizer.dominantLanguage(for: detransliteratedString)?.name // Русский
  6. Why not Detransliteration?

     [Chart: detection results — Ukrainian 35%, Russian 21%, None 19%, English 11%, Bulgarian 4%, Other 10%]

     let transliteratedUk = realWorldSMS.transliterate(to: "uk") // Вітаємо...
     let transliteratedRu = realWorldSMS.transliterate(to: "ru") // Витаемо...
     let predictionUk = nlLanguageRecognizer.languageHypothese(for: transliteratedUk)
     let predictionRu = nlLanguageRecognizer.languageHypothese(for: transliteratedRu)
     let predictionEn = nlLanguageRecognizer.languageHypothese(for: realWorldSMS)
     return [predictionEn, predictionUk, predictionRu]
         .sorted(by: \Prediction.probability)
         .last
  7. Why not Detransliteration?

     let ukrainianTranslitText = "Privit, jak tvoji spravy?"
     let detransliteredUkrText = ukrainianTranslitText
         .applyingTransform(StringTransform.latinToCyrillic, reverse: false) ?? ""
     // Привит, йак твойи справы?

     let englishText = "Hello, how are you doing?"
     let detransliteredEngText = englishText
         .applyingTransform(StringTransform.latinToCyrillic, reverse: false) ?? ""
     // Хелло, хоу аре ыоу доинг?
  8. So what now? 8 Language detection = Text classification

  9. Let's use Core ML + Create ML. A. Text classification models included: • maximum entropy model • conditional random field. B. It's a ready-made solution.
  10. Core ML + Create ML 10 Create ML

  11. Prepare dataset

     func testPreprocessText() {
         // GIVEN
         let text = """
         Вітаємо, dear@friend.com! Ми заборгували вам 5.00 гривень,
         і хотіли б повернути їх до 21.03.2019.
         Зателефонуйте нам на +38 (012) 345-67-89 або відвідайте example.com,
         щоб дізнатись деталі!
         """
         // WHEN
         let preprocessedText = testedPreprocessor.preprocessedText(for: text)
         // THEN
         XCTAssertEqual(preprocessedText,
             "Вітаємо Ми заборгували вам гривень і хотіли б повернути їх " +
             "Зателефонуйте нам на або відвідайте щоб дізнатись деталі")
     }
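The preprocessing the test above describes (stripping emails, dates, amounts, phone numbers, and URLs before training) can be sketched with Foundation regular expressions. The struct name and the exact patterns below are my assumption, not the speaker's implementation:

```swift
import Foundation

// Illustrative preprocessor: strip emails, URLs, phone numbers, remaining
// numbers, and punctuation, then collapse the leftover whitespace.
struct SketchPreprocessor {
    private let patterns = [
        "[\\w.+-]+@[\\w-]+\\.[\\w.]+",          // emails
        "(https?://)?[\\w-]+(\\.[\\w-]+)+\\S*", // URLs / bare domains
        "[+\\d][\\d\\s()-]{4,}\\d",             // phone numbers
        "\\d+([.,]\\d+)?",                      // remaining numbers (amounts, dates)
        "[[:punct:]]"                           // punctuation
    ]

    func preprocessedText(for text: String) -> String {
        var result = text
        for pattern in patterns {
            result = result.replacingOccurrences(
                of: pattern, with: " ", options: .regularExpression)
        }
        // Collapse runs of whitespace left behind by the removals.
        return result
            .components(separatedBy: .whitespacesAndNewlines)
            .filter { !$0.isEmpty }
            .joined(separator: " ")
    }
}
```

The order of the patterns matters: emails and URLs must be removed before the punctuation pass, which would otherwise split them into innocent-looking words.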
  12. Training Core ML model

     public struct DatasetItem {
         let text: String
         let label: String
     }

     public protocol Dataset {
         var items: [DatasetItem] { get }
     }

     public static func trainCoreMLClassifier(with preprocessor: Preprocessor,
                                              on dataset: Dataset) throws -> MLTextClassifier {
         let data: [String: MLDataValueConvertible] = [
             "text": dataset.items.map { preprocessor.preprocessedText(for: $0.text) },
             "label": dataset.items.map { $0.label },
         ]
         let trainingDataTable = try MLDataTable(dictionary: data)
         let mlClassifier = try MLTextClassifier(trainingData: trainingDataTable,
                                                 textColumn: "text",
                                                 labelColumn: "label")
         return mlClassifier
     }
  13. Using Core ML Model

     public func predictedLabel(for string: String) -> String? {
         guard let input = try? MLDictionaryFeatureProvider(dictionary: ["text": string]),
               let prediction = try? mlModel.prediction(from: input) else {
             return nil
         }
         return prediction.featureValue(for: "label")?.stringValue
     }

     let language = predictedLabel(for: "Hello, how are you?") // English
  14. Evaluating Core ML Model. [Diagram: dataset split into train data 80% / test data 20%]
  15. Evaluating Core ML Model

     func testAccuracy() {
         // GIVEN
         let preprocessor = TrivialPreprocessor()
         let (trainDataset, testDataset) = self.testDatasets.languagesDataset
             .splitTestDataset(startPersentage: 0.8, endPersentage: 1.0)
         let classifier = CoreMLClassifier.train(with: preprocessor, on: trainDataset)
         // WHEN
         let testResults = classifier.test(on: testDataset)
         // THEN
         XCTAssertGreaterThan(testResults.accuracy, 1.0)
         // failed: ("0.9463667820069204") is not greater than ("1.0")
     }
  16. Problem. [Diagram: the 80%/20% train/test split drawn over a dataset ordered by language — Ukrainian, English, Russian, German]
  17. Cross Validation. Step 1. [Diagram: dataset of Ukrainian, English, Russian, German samples (0–120) split into folds; one fold held out as test data, the rest used as train data]
  18. Cross Validation. Step 2. [Diagram: the next fold held out as test data]
  19. Cross Validation. Step 3. [Diagram: the next fold held out as test data]
  20. Cross Validation. Step 3. [Repeated animation frame of the fold diagram]
  21. Cross Validation. Step 3. [Repeated animation frame of the fold diagram]
  22. Cross Validation. Step 3. [Repeated animation frame of the fold diagram]
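The scheme pictured in the slides above is k-fold cross-validation: each fold is held out once as test data while the remaining folds train the model. An illustrative pure-Swift split, not the project's `crossValidate` API:

```swift
// Split `items` into k folds; each element of the result pairs the
// held-out test fold with the concatenation of the remaining folds.
func crossValidationFolds<T>(of items: [T], folds k: Int) -> [(train: [T], test: [T])] {
    // Fold size rounded up so every item lands in exactly one test fold.
    let foldSize = (items.count + k - 1) / k
    return (0..<k).map { i in
        let start = min(i * foldSize, items.count)
        let end = min(start + foldSize, items.count)
        let test = Array(items[start..<end])
        let train = Array(items[..<start]) + Array(items[end...])
        return (train, test)
    }
}
```

Averaging the accuracy over all k train/test pairs gives every sample a turn in the test set, which avoids the skew of a single positional 80/20 split over a language-ordered dataset.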
  23. Cross Validation

     func testCrossvalidateAdvancedPreprocessor() {
         // GIVEN
         let dataset = testDatasets.languagesDataset
         // WHEN
         let results = CoreMLClassifier.crossValidate(on: dataset, with: AdvancedPreprocessor())
         // THEN
         XCTAssertGreaterThan(results.accuracy, 1.0)
         // failed: ("0.9661251296232285") is not greater than ("1.0")
     }
  24. Core ML vs NLLanguageRecognizer

     func testAccuracy() {
         // GIVEN
         let testDataset = testDatasets.languagesDataset
         // WHEN
         let results = nlLanguageRecognizerClassifier.test(on: testDataset)
         // THEN
         XCTAssertGreaterThan(results.accuracy, 1.0)
         // failed: ("0.8022435526772291") is not greater than ("1.0")
     }

     NL Language Recognizer: 80.2% — Core ML: 96.6%
  25. Happy end

  26. What could go wrong? 26

  27. RAM Problem. • Core ML model file size: ~500 KB • Loading it breaks the App Extension's 6 MB RAM limit
  28. Memory-Mapped File. [Diagram: virtual memory pages backed partly by physical RAM, partly by memory-mapped files]
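The trick the diagram illustrates is mapping the model file into virtual memory instead of copying it into the heap, so its pages are faulted in on demand and can be evicted by the kernel rather than counting against the extension's dirty-memory budget. A minimal Foundation sketch (the file path is illustrative):

```swift
import Foundation

// Read a model file memory-mapped: the bytes live in the page cache,
// not in the process's heap, which is what keeps the extension under
// its tight RAM limit.
let modelURL = URL(fileURLWithPath: "/path/to/model.bin") // illustrative path
let mappedData = try? Data(contentsOf: modelURL, options: .mappedIfSafe)
// `mappedData` behaves like ordinary Data, but accessing its bytes
// pages them in lazily from the file on disk.
```

In Objective-C the equivalent is `NSDataReadingMappedAlways`, which is exactly what the `MMStringIntDictionary` initializer later in the talk uses.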
  29. Core ML + Memory-Mapped file 29

  30. Building our own classifier. • Max Entropy • Conditional random field • Naive Bayes • Decision Tree • Many others
  31. Naive Bayes classifier. Based on Bayes' Theorem (Thomas Bayes, 1701–1761):

     P(A|B) = P(B|A)P(A) / P(B)
  32. Naive Bayes classifier.

     Classes (languages): C = {c1, c2, ..., cr}
     Features (words):    F = {f1, f2, ..., fq}
     Text samples:        D = {d1, d2, ..., dm}
  33. Naive Bayes classifier. Based on Bayes' Theorem:

     Cmax = argmax_{c∈C} P(c|d)
          = argmax_{c∈C} P(d|c)P(c) / P(d)
          = argmax_{c∈C} P(d|c)P(c)
          = argmax_{c∈C} ln(P(d|c)P(c))
  34. Naive Bayes classifier. Assumptions: • no word depends on other words: P(fi ∩ fj | c) = P(fi|c)P(fj|c) • order of words does not matter
  35. Naive Bayes classifier.

     Cmax = argmax_{c∈C} ln(P(d|c)P(c))
          = argmax_{c∈C} ln(P(f1, f2, ..., fn | c)P(c))
          = argmax_{c∈C} ln(P(c) ∏_{i=1}^{n} P(fi|c))
          = argmax_{c∈C} (ln P(c) + Σ_{i=1}^{n} ln P(fi|c))
  36. Naive Bayes classifier.

     P(fi|cj) = count(fi, cj) / Σ_{k=1}^{n} count(fk, cj)

     Laplace smoothing:

     P(fi|cj) = (count(fi, cj) + z) / (Σ_{k=1}^{n} count(fk, cj) + nz)
  37. Naive Bayes classifier. Building model:

     typealias Model = [String: [String: Int]]
     var model: Model = [
         "uk": ["Вітаю": 1, "вас": 2, ...],
         "en": ["Hello": 1, "dear": 2, ...],
         ...
     ]
  38. Naive Bayes classifier. Building model:

     for label in labels {
         for text in trainTextsForLabel[label] ?? [] {
             let words = preprocessor.preprocess(text: text)
             for word in words {
                 model[label, default: [:]][word, default: 0] += 1
             }
         }
     }
  39. Naive Bayes classifier. Predicting label of text:
     A. Preprocess text: "Зателефонуйте нам на +38 (012) 345-67-89" → "Зателефонуйте нам на"
     B. Split into words: ["Зателефонуйте", "нам", "на"]
     C. Calculate probability of each word in each label:
        ["uk": ["Зателефонуйте": 0.84, "нам": 0.1, "на": 0.1],
         "ru": ["Зателефонуйте": 0.0, "нам": 0.1, "на": 0.1], …]
  40. Naive Bayes classifier. Predicting label of text:
     D. Calculate probability of each label: ["uk": -180.3, "ru": -234.5, "en": -2004.3, ...]
     E. Return the label with max probability: "uk"
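Steps A–E, combined with the Laplace-smoothed formula from slide 36, can be sketched in pure Swift. The uniform class prior and the helper names are my simplification, not the project's `NaiveBayesClassifier`:

```swift
import Foundation

// Model shape from slide 37: label -> word -> count.
typealias Model = [String: [String: Int]]

// Score each label by ln P(c) + Σ ln P(fᵢ|c) with Laplace smoothing
// (constant z), then return the argmax. Assumes a uniform prior P(c).
func predictedLabel(for words: [String], model: Model, z: Double = 1.0) -> String? {
    let vocabularySize = Double(Set(model.values.flatMap { $0.keys }).count)
    let labelCount = Double(model.count)

    var scores: [String: Double] = [:]
    for (label, counts) in model {
        let totalCount = Double(counts.values.reduce(0, +))
        var score = log(1.0 / labelCount) // ln P(c), uniform prior assumed here
        for word in words {
            let count = Double(counts[word] ?? 0)
            // Laplace-smoothed ln P(word | label): never -∞ for unseen words.
            score += log((count + z) / (totalCount + vocabularySize * z))
        }
        scores[label] = score
    }
    return scores.max { $0.value < $1.value }?.key
}
```

Working in log-space is what produces the negative scores on slide 40 (e.g. "uk": -180.3): multiplying hundreds of small probabilities would underflow a Double, while summing their logarithms does not.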
  41. Naive Bayes classifier. Cross Validation:

     func testCrossvalidate() {
         // GIVEN
         let dataset = self.testDatasets.testDataset
         // WHEN
         let results = NaiveBayesClassifier.crossValidate(on: dataset, with: TrivialPreprocessor())
         // THEN
         XCTAssertGreaterThan(results.accuracy, 1.0) // 0.9782382220164371
     }

     NL Language Recognizer: 80.2% — Core ML: 96.6% — Naive Bayes: 97.8%
  42. Naive Bayes + FlatBuffers. [Diagram: schema file → FlatBuffers schema compiler → C++ file → Objective-C wrapper framework]
  43. Naive Bayes + FlatBuffers. schema.fbs:

     namespace flatcollections;
     table StringIntDictionary { entries: [StringIntDictionaryEntry]; }
     table StringIntDictionaryEntry { key: string (key); value: int64; }
     root_type StringIntDictionary;

     var model: Model = [
         "uk": ["Вітаю": 1, "вас": 2, ...],
         "en": ["Hello": 1, "dear": 2, ...],
         ...
     ]
  44. FlatBuffers: Create Dictionary

     #import "schema_generated.h"
     ...
     @property (nonatomic, copy) NSDictionary<NSString *, NSNumber *> *dictionary;
     ...
     - (NSData *)serialize {
         // 1. Alloc a 10MB buffer
         FlatBufferBuilder builder(1024 * 1024 * 10);
         // 2. Iterate NSDictionary keys and values, converting them into
         //    flatcollections::StringIntDictionaryEntry structs
         std::vector<Offset<StringIntDictionaryEntry>> entries;
         for (NSString *key in self.dictionary.allKeys) {
             int64_t value = (int64_t)[self.dictionary objectForKey:key].integerValue;
             auto entry = CreateStringIntDictionaryEntryDirect(builder, key.UTF8String, value);
             entries.push_back(entry);
         }
         // 3. Create flatcollections::StringIntDictionary
         auto vector = builder.CreateVectorOfSortedTables(&entries);
         auto dictionary = CreateStringIntDictionary(builder, vector);
         // 4. Return flatbuffer as NSData
         builder.Finish(dictionary);
         NSData *data = [NSData dataWithBytes:builder.GetBufferPointer()
                                       length:builder.GetSize()];
         return data;
     }
  45. FlatBuffers: using Dictionary

     #import "MMStringIntDictionary.h"
     #import "schema_generated.h"

     using namespace flatcollections;

     @interface MMStringIntDictionary ()
     @property (nonatomic, unsafe_unretained) const StringIntDictionary *dict;
     @end

     @implementation MMStringIntDictionary

     - (instancetype)initWithFileURL:(NSURL *)fileURL
                               error:(NSError *__autoreleasing *)error {
         NSData *data = [NSData dataWithContentsOfURL:fileURL
                                              options:NSDataReadingMappedAlways
                                                error:error];
         if (nil == data) {
             return nil;
         }
         return [self initWithData:data];
     }
  46. Naive Bayes + FlatBuffers

     var model: Model = [
         "uk": ["Вітаю": 1, "вас": 2, ...],
         "en": ["Hello": 1, "dear": 2, ...],
         ...
     ]

     typealias Model = [String: [String: Int]]
     // becomes
     typealias Model = [String: MMStringIntDictionary]
  47. Results

     | Classifier                | Accuracy | Fits in 6 MB RAM |
     | NL Language Recognizer    | ❌ 80.2% | ✅               |
     | Core ML                   | ✅ 96.6% | ❌               |
     | Naive Bayes + FlatBuffers | ✅ 97.8% | ✅               |
  48. Core ML. Pros: • Dramatically simple • Reliable • Fast. Cons: • No flexibility • Limited ML tasks/algorithms
  49. Machine learning. • Not rocket science in 2019 • Great competitive advantage • Must-have skill for software engineers in the future
  50. One last thing 50 Don’t be afraid to build your

    own bicycles
  51. Thanks. Viacheslav Volodko — killobatt@gmail.com — t.me/killobatt

     Attributions:
     Create ML docs: developer.apple.com/documentation
     Naive Bayes classifier: habr.com/ru/post/184574/
     FlatBuffers: google.github.io/flatbuffers/flatbuffers_guide_tutorial.html
     Code samples: github.com/killobatt/TextClassification