
Classifying text in iOS without CoreML: how and why?

Talk by Vjacheslav Volodjko.

While working on SMS Filter we have to solve a number of text classification problems, and I would like to share our approaches to some of them. What we will do:
➡️ Talk about text classification problems and the tools built into the iOS SDK for solving them: NLLanguageRecognizer and MLTextClassifier.
➡️ Look at some limitations of NLLanguageRecognizer and MLTextClassifier.
➡️ Try to work around these limitations by building our own text classifier.
➡️ Go through some techniques that let us embed this classifier in an AppExtension.
➡️ Evaluate how well our solution performs.

This talk was made for CocoaHeads Kyiv #15 which took place Jul 28, 2019. (https://cocoaheads.org.ua/cocoaheadskyiv/15)

Video: https://youtu.be/LKS0Ewm1mMQ

CocoaHeads Ukraine

July 28, 2019

Transcript

  1. Viacheslav Volodko
    Classifying text in iOS without CoreML:
    how and why?

  2. SMS Filter
    • Filters SMS spam
    • Freemium model
    • ML-based checks on the server side
    • 4 localizations:
      • Ukrainian
      • English
      • German
      • Russian

  3. Why Language Detection?
    1. It’s a preliminary step in SMS spam detection.
    2. We can’t claim we filter spam for languages we don’t know.

  4. NLLanguageRecognizer
    NLLanguageRecognizer.dominantLanguage(for: "Hello, how are you doing?")?.name
    // English
    NLLanguageRecognizer.dominantLanguage(for: "Привіт, як твої справи")?.name
    // Українська
    NLLanguageRecognizer.dominantLanguage(for: "Привет, как твои дела?")?.name
    // Русский
    NLLanguageRecognizer.dominantLanguage(for: "Hallo, wie geht es dir?")?.name
    // Deutsch
    let realWorldSMS =
    """
    VITAEMO Kompiuternum vidbirom na nomer,vipav
    pryz:AUTO-MAZDA SX-5
    Detali:
    +38(095)857-58-64
    abo na saiti:
    www.mir-europay.com.ua
    """
    NLLanguageRecognizer.dominantLanguage(for: realWorldSMS)?.name
    // Hrvatski

  5. Why not NSStringTransform?
    // applyingTransform returns an optional, so fall back to an empty string
    let detransliteratedString = realWorldSMS.applyingTransform(
        StringTransform.latinToCyrillic,
        reverse: false) ?? ""
    // ВИТАЕМО Компиутернум видбиром на номер,випав
    // прыз:АУТО-МАЗДА СКС-5
    // Детали:
    // +38(095)857-58-64
    // або на саити:
    // ууу.мир-еуропаы.цом.уа
    NLLanguageRecognizer.dominantLanguage(for: detransliteratedString)?.name
    // Русский

  6. Why not Detransliteration?
    Chart: Ukrainian 35%, Russian 21%, None 19%, English 11%, Other 10%, Bulgarian 4%
    let transliteratedUk = realWorldSMS.transliterate(to: "uk") // Вітаємо...
    let transliteratedRu = realWorldSMS.transliterate(to: "ru") // Витаемо...
    let predictionUk = nlLanguageRecognizer.languageHypothese(for: transliteratedUk)
    let predictionRu = nlLanguageRecognizer.languageHypothese(for: transliteratedRu)
    let predictionEn = nlLanguageRecognizer.languageHypothese(for: realWorldSMS)
    return [predictionEn, predictionUk, predictionRu]
        .sorted(by: \Prediction.probability)
        .last

  7. Why not Detransliteration?
    let ukrainianTranslitText = "Privit, jak tvoji spravy?"
    let detransliteredUkrText = ukrainianTranslitText
    .applyingTransform(StringTransform.latinToCyrillic, reverse: false) ?? ""
    // Привит, йак твойи справы?

    let englishText = "Hello, how are you doing?"
    let detransliteredEngText = englishText
    .applyingTransform(StringTransform.latinToCyrillic, reverse: false) ?? ""
    // Хелло, хоу аре ыоу доинг?


  8. So what now?
    Language detection = Text classification

  9. Let’s use Core ML + Create ML
    A. Text classification models included:
    • maximum entropy model
    • conditional random field
    B. It’s a ready-made solution

  10. Core ML + Create ML
    Create ML

  11. Prepare dataset
    func testPreprocessText() {
    // GIVEN
    let text = """
    Вітаємо, [email protected]!
    Ми заборгували вам 5.00 гривень,
    і хотіли б повернути їх до 21.03.2019.
    Зателефонуйте нам на +38 (012) 345-67-89
    або відвідайте example.com, щоб дізнатись деталі!
    """
    // WHEN
    let preprocessedText = testedPreprocessor.preprocessedText(for: text)
    // THEN
    XCTAssertEqual(preprocessedText,
    "Вітаємо Ми заборгували вам гривень і хотіли б повернути їх " +
    "Зателефонуйте нам на або відвідайте щоб дізнатись деталі")
    }
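
A minimal Swift sketch of what such a preprocessor might do. It is not the `Preprocessor` from the talk: the function name mirrors the slide, but the token rules (drop anything containing a digit or `@`, trim surrounding punctuation, drop dotted tokens that look like domain names) are assumptions made for illustration, and they will not reproduce the slide's expected string exactly.

```swift
/// Hypothetical preprocessor: keeps only plain words from an SMS text.
/// The filtering rules are illustrative assumptions, not the talk's implementation.
func preprocessedText(for text: String) -> String {
    let words = text
        .split(whereSeparator: { $0.isWhitespace })
        .compactMap { token -> String? in
            // Tokens with digits or "@" look like amounts, dates, phones, or emails.
            if token.contains(where: { $0.isNumber || $0 == "@" }) { return nil }
            // Trim punctuation and symbols from both ends of the token.
            var word = token
            while let first = word.first, !first.isLetter { word.removeFirst() }
            while let last = word.last, !last.isLetter { word.removeLast() }
            // A dot left inside the token suggests a domain name such as example.com.
            if word.isEmpty || word.contains(".") { return nil }
            return String(word)
        }
    return words.joined(separator: " ")
}

let cleaned = preprocessedText(for: "Call us at +38 (012) 345-67-89 or visit example.com, today!")
print(cleaned) // Call us at or visit today
```

Stripping such tokens keeps the classifier's vocabulary focused on words that actually carry language information.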

  12. Training Core ML model
    import CreateML
    public struct DatasetItem {
        let text: String
        let label: String
    }
    public protocol Dataset {
        var items: [DatasetItem] { get }
    }
    public static func trainCoreMLClassifier(with preprocessor: Preprocessor,
                                             on dataset: Dataset) throws -> MLTextClassifier {
        // Two columns of the training table: preprocessed text and its label
        let data: [String: MLDataValueConvertible] = [
            "text": dataset.items.map { preprocessor.preprocessedText(for: $0.text) },
            "label": dataset.items.map { $0.label },
        ]
        let trainingDataTable = try MLDataTable(dictionary: data)
        let mlClassifier = try MLTextClassifier(trainingData: trainingDataTable,
                                                textColumn: "text",
                                                labelColumn: "label")
        return mlClassifier
    }

  13. Using CoreML Model
    public func predictedLabel(for string: String) -> String? {
        // Unwrap both steps: building the input and running the prediction can fail
        guard let input = try? MLDictionaryFeatureProvider(dictionary: ["text": string]),
              let prediction = try? mlModel.prediction(from: input) else {
            return nil
        }
        return prediction.featureValue(for: "label")?.stringValue
    }
    let language = predictedLabel(for: "Hello, how are you?") // English

  14. Evaluating CoreML Model
    Dataset split: train data 80%, test data 20%

  15. Evaluating CoreML Model
    func testAccuracy() {
        // GIVEN
        let preprocessor = TrivialPreprocessor()
        let (trainDataset, testDataset) =
            self.testDatasets.languagesDataset.splitTestDataset(startPersentage: 0.8,
                                                                endPersentage: 1.0)
        let classifier = CoreMLClassifier.train(with: preprocessor, on: trainDataset)
        // WHEN
        let testResults = classifier.test(on: testDataset)
        // THEN
        XCTAssertGreaterThan(testResults.accuracy, 1.0)
        // failed: ("0.9463667820069204") is not greater than ("1.0")
    }
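
The evaluation above boils down to two small operations: cutting the dataset into a train part and a test part, and counting the share of correct predictions. A hedged Swift sketch follows; `splitDataset` and `accuracy` are hypothetical helpers written for this illustration and are not the `splitTestDataset` or `test(on:)` methods used on the slide.

```swift
/// Splits items into a leading train part and a trailing test part.
func splitDataset<T>(_ items: [T], trainFraction: Double) -> (train: [T], test: [T]) {
    precondition((0.0...1.0).contains(trainFraction), "fraction must be within 0...1")
    let trainCount = Int((Double(items.count) * trainFraction).rounded())
    return (train: Array(items[..<trainCount]), test: Array(items[trainCount...]))
}

/// Share of predictions that match the expected labels.
func accuracy(predicted: [String], expected: [String]) -> Double {
    precondition(predicted.count == expected.count && !expected.isEmpty)
    let correct = zip(predicted, expected).filter { pair in pair.0 == pair.1 }.count
    return Double(correct) / Double(expected.count)
}

let parts = splitDataset(Array(1...10), trainFraction: 0.8)
print(parts.train.count, parts.test)  // 8 [9, 10]

let score = accuracy(predicted: ["en", "uk", "ru", "de"],
                     expected: ["en", "uk", "uk", "de"])
print(score)  // 0.75
```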

  16. Problem
    Diagram: the dataset is split into train data (80%) and test data (20%); its samples span Ukrainian, English, Russian, and German.

  17. Cross Validation
    Step 1
    Chart: Ukrainian, English, Russian, and German samples (axis 0 to 120), each divided into Train Data, Test Data, Train Data.

  18. Cross Validation
    Step 2
    Chart: Ukrainian, English, Russian, and German samples (axis 0 to 120), each divided into Train Data, Test Data, Train Data.

  19. Cross Validation
    Step 3
    Chart: Ukrainian, English, Russian, and German samples (axis 0 to 120), each divided into Train Data, Test Data, Train Data.

  20. Cross Validation
    Step 3
    Chart: Ukrainian, English, Russian, and German samples (axis 0 to 120), each divided into Train Data, Test Data, Train Data.

  21. Cross Validation
    Step 3
    Chart: Ukrainian, English, Russian, and German samples (axis 0 to 120), each divided into Train Data, Test Data, Train Data.

  22. Cross Validation
    Step 3
    Chart: Ukrainian, English, Russian, and German samples (axis 0 to 120), each divided into Train Data, Test Data, Train Data.

  23. Cross Validation
    func testCrossvalidateAdvancedPreprocessor() {
    // GIVEN
    let dataset = testDatasets.languagesDataset
    // WHEN
    let results =
    CoreMLClassifier.crossValidate(on: dataset,
    with: AdvancedPreprocessor())
    // THEN
    XCTAssertGreaterThan(results.accuracy, 1.0)
    // failed: ("0.9661251296232285") is not greater than ("1.0")
    }

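
A rough Swift sketch of how the folds for such a cross-validation could be produced. It shows plain k-fold splitting; `crossValidationFolds` is a hypothetical helper, not the `crossValidate(on:with:)` method from the slide, and it assumes the items were shuffled beforehand, since a dataset ordered by language would otherwise put whole languages into a single fold.

```swift
/// Produces k (train, test) pairs; every item lands in the test part exactly once.
func crossValidationFolds<T>(_ items: [T], k: Int) -> [(train: [T], test: [T])] {
    precondition(k > 1 && items.count >= k, "need at least one item per fold")
    return (0..<k).map { fold -> (train: [T], test: [T]) in
        // Integer arithmetic spreads any remainder across the folds.
        let start = items.count * fold / k
        let end = items.count * (fold + 1) / k
        let test = Array(items[start..<end])
        let train = Array(items[..<start]) + Array(items[end...])
        return (train: train, test: test)
    }
}

let folds = crossValidationFolds(Array(0..<10), k: 5)
print(folds[2].test)   // [4, 5]
print(folds[2].train)  // [0, 1, 2, 3, 6, 7, 8, 9]
```

Training on each `train` part, testing on the matching `test` part, and averaging the accuracies gives a single figure that depends far less on where one particular split happened to fall.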

  24. CoreML vs NLLanguageRecognizer
    func testCrossvalidateAdvancedPreprocessor() {
    // GIVEN
    let dataset = testDatasets.languagesDataset
    // WHEN
    let results =
    CoreMLClassifier.crossValidate(on: dataset,
    with: AdvancedPreprocessor())
    // THEN
    XCTAssertGreaterThan(results.accuracy, 1.0)
    // failed: ("0.9661251296232285") is not greater than ("1.0")
    }
    func testAccuracy() {
    // GIVEN
    let testDataset = testDatasets.languagesDataset
    // WHEN
    let results = nlLanguageRecognizerClassifier.test(on: testDataset)
    // THEN
    XCTAssertGreaterThan(results.accuracy, 1.0)
    // failed: ("0.8022435526772291") is not greater than ("1.0")
    }
    NL Language Recognizer 80.2%
    Core ML 96.6%


  25. Happy end


  26. What could go wrong?

  27. RAM Problem
    • CoreML model file size: ~500 KB
    • Loading it breaks the 6 MB RAM limit

  28. Memory-Mapped File
    Virtual Memory
    Physical Memory: RAM
    Physical Memory: Memory mapped files
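
In Swift, a file can be opened this way through Foundation's `Data(contentsOf:options:)` with the `.alwaysMapped` option, which asks for the file to be mapped into virtual memory rather than copied into the heap. The sketch below is only an illustration: `mappedData(at:)` and the file name `model-sketch.bin` are made up for the example.

```swift
import Foundation

/// Opens a file as memory-mapped data, so its pages can be loaded from disk
/// on demand instead of being read into the process heap up front.
func mappedData(at url: URL) throws -> Data {
    try Data(contentsOf: url, options: .alwaysMapped)
}

// Usage with a throwaway file (hypothetical name) in the temporary directory.
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("model-sketch.bin")
try? Data([1, 2, 3, 4]).write(to: url)
let mapped = try? mappedData(at: url)
print(mapped?.count ?? 0)
```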

  29. Core ML + Memory-Mapped file

  30. Building our own classifier
    • Max Entropy
    • Conditional random field
    • Naive Bayes
    • Decision Tree
    • Many others

  31. Naive Bayes classifier
    • Based on Bayes’ Theorem:
    $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
    Thomas Bayes, 1701-1761

  32. Naive Bayes classifier
    Text samples: $D = \{d_1, d_2, \ldots, d_m\}$
    Features (words): $F = \{f_1, f_2, \ldots, f_q\}$
    Classes (languages): $C = \{c_1, c_2, \ldots, c_r\}$

  33. Naive Bayes classifier
    • Based on Bayes’ Theorem:
    $$C_{max} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d \mid c)\,P(c) = \arg\max_{c \in C} \ln\bigl(P(d \mid c)\,P(c)\bigr)$$


  34. Naive Bayes classifier
    Assumptions:
    • no word depends on other words, and the order of words does not matter:
    $$P(f_i \cap f_j \mid c) = P(f_i \mid c)\,P(f_j \mid c)$$

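  35. Naive Bayes classifier
    $$\begin{aligned}
    C_{max} &= \arg\max_{c \in C} \ln\bigl(P(d \mid c)\,P(c)\bigr) \\
    &= \arg\max_{c \in C} \ln\bigl(P(f_1, f_2, \ldots, f_n \mid c)\,P(c)\bigr) \\
    &= \arg\max_{c \in C} \ln\Bigl(P(c) \prod_{i=1}^{n} P(f_i \mid c)\Bigr) \\
    &= \arg\max_{c \in C} \Bigl(\ln P(c) + \sum_{i=1}^{n} \ln P(f_i \mid c)\Bigr)
    \end{aligned}$$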

  36. Naive Bayes classifier
    $$P(f_i \mid c_j) = \frac{\mathrm{count}(f_i, c_j)}{\sum_{k=1}^{n} \mathrm{count}(f_k, c_j)}$$
    Laplace smoothing:
    $$P(f_i \mid c_j) = \frac{\mathrm{count}(f_i, c_j) + z}{\sum_{k=1}^{n} \mathrm{count}(f_k, c_j) + nz}$$
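
A small worked example of the smoothing formula, using made-up counts: suppose there are only $n = 2$ words, class $c_j$ contains $f_1$ eight times and $f_2$ never, and $z = 1$.

```latex
\begin{aligned}
P(f_1 \mid c_j) &= \frac{8 + 1}{(8 + 0) + 2 \cdot 1} = \frac{9}{10} = 0.9 \\
P(f_2 \mid c_j) &= \frac{0 + 1}{(8 + 0) + 2 \cdot 1} = \frac{1}{10} = 0.1
\end{aligned}
```

Without smoothing, $P(f_2 \mid c_j)$ would be $0$ and $\ln 0$ is undefined, so a single word never seen in a class would rule that class out entirely. With smoothing the two probabilities still sum to $1$.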

  37. Naive Bayes classifier
    Building model:
    typealias Model = [String: [String: Int]]
    var model: Model = [
    "uk": [
    "Вітаю": 1,
    "вас": 2,
    ...
    ],
    "en": [
    "Hello": 1,
    "dear": 2,
    ...
    ]
    ...
    ]


  38. Naive Bayes classifier
    Building model:
    for label in labels {
        for text in trainTextsForLabel[label] ?? [] {
            let words = preprocessor.preprocess(text: text)
            for word in words {
                // Count how often each word occurs in texts with this label
                model[label, default: [:]][word, default: 0] += 1
            }
        }
    }

  39. Naive Bayes classifier
    Predicting label of text:
    A. Preprocess text
       "Зателефонуйте нам на +38 (012) 345-67-89" → "Зателефонуйте нам на"
    B. Split into words
       ["Зателефонуйте", "нам", "на"]
    C. Calculate probability of each word in label
       ["uk": ["Зателефонуйте": 0.84,
               "нам": 0.1,
               "на": 0.1],
        "ru": ["Зателефонуйте": 0.0,
               "нам": 0.1,
               "на": 0.1],
       ]

  40. Naive Bayes classifier
    Predicting label of text:
    D. Calculate probability of label:
       [
       "uk": -180.3,
       "ru": -234.5,
       "en": -2004.3,
       ...
       ]
    E. Return label with max probability:
       "uk"
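
Steps A to E can be put together in a compact Swift sketch. This is an illustration written from the formulas on the earlier slides, not the classifier from the talk's repository: the type name `NaiveBayesSketch`, its methods, and the default `z = 1` are assumptions, and the input is expected to be already preprocessed and split into words.

```swift
import Foundation

/// Illustrative Naive Bayes text classifier with Laplace smoothing.
struct NaiveBayesSketch {
    private(set) var wordCounts: [String: [String: Int]] = [:]  // label -> word -> count
    private(set) var textCounts: [String: Int] = [:]            // label -> number of texts
    private(set) var vocabulary: Set<String> = []

    /// Adds one training text (as words) for the given label.
    mutating func train(words: [String], label: String) {
        textCounts[label, default: 0] += 1
        for word in words {
            wordCounts[label, default: [:]][word, default: 0] += 1
            vocabulary.insert(word)
        }
    }

    /// ln P(c) + sum of ln P(f_i | c) for every known label.
    func logScores(for words: [String], smoothing z: Double = 1.0) -> [String: Double] {
        let totalTexts = Double(textCounts.values.reduce(0, +))
        let n = Double(vocabulary.count)
        var scores: [String: Double] = [:]
        for (label, textCount) in textCounts {
            let counts = wordCounts[label] ?? [:]
            let totalWords = Double(counts.values.reduce(0, +))
            var score = log(Double(textCount) / totalTexts)  // ln P(c)
            for word in words {
                let count = Double(counts[word] ?? 0)
                score += log((count + z) / (totalWords + n * z))  // smoothed ln P(f_i | c)
            }
            scores[label] = score
        }
        return scores
    }

    /// Label with the highest log score, or nil if nothing was trained.
    func predictedLabel(for words: [String]) -> String? {
        logScores(for: words).max { $0.value < $1.value }?.key
    }
}

var classifier = NaiveBayesSketch()
classifier.train(words: ["hello", "how", "are", "you"], label: "en")
classifier.train(words: ["привіт", "як", "справи"], label: "uk")
let label = classifier.predictedLabel(for: ["як", "справи"])
print(label ?? "none")  // uk
```

Summing logarithms instead of multiplying raw probabilities is why the scores on the slide are large negative numbers: the product of many small probabilities would quickly underflow a `Double`.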

  41. Naive Bayes classifier
    Cross Validation:
    func testCrossvalidate() {
    // GIVEN
    let dataset = self.testDatasets.testDataset
    // WHEN
    let results = NaiveBayesClassifier.crossValidate(on: dataset,
    with: TrivialPreprocessor())
    // THEN
    XCTAssertGreaterThan(results.accuracy, 1.0)
    // 0.9782382220164371
    }
    NL Language Recognizer 80.2%
    Core ML 96.6%
    Naive Bayes 97.8%


  42. Naive Bayes + FlatBuffers
    Diagram: Schema File → FlatBuffers schema compiler → C++ File → Objective-C Wrapper Framework

  43. Naive Bayes + FlatBuffers
    var model: Model = [
        "uk": [
            "Вітаю": 1,
            "вас": 2,
            ...
        ],
        "en": [
            "Hello": 1,
            "dear": 2,
            ...
        ]
        ...
    ]
    schema.fbs:
    namespace flatcollections;
    table StringIntDictionary {
        entries: [StringIntDictionaryEntry];
    }
    table StringIntDictionaryEntry {
        key: string (key);
        value: int64;
    }
    root_type StringIntDictionary;

  44. FlatBuffers: Create Dictionary
    #import "schema_generated.h"
    ...
    @property (nonatomic, copy) NSDictionary<NSString *, NSNumber *> *dictionary;
    ...
    - (NSData *)serialize {
        // 1. Create a builder with a 10 MB initial buffer
        FlatBufferBuilder builder(1024 * 1024 * 10);
        // 2. Iterate NSDictionary keys and values, converting them into
        //    flatcollections::StringIntDictionaryEntry tables
        std::vector<Offset<StringIntDictionaryEntry>> entries;
        for (NSString *key in self.dictionary.allKeys) {
            int64_t value = (int64_t)[self.dictionary objectForKey:key].integerValue;
            auto entry = CreateStringIntDictionaryEntryDirect(builder,
                                                              key.UTF8String,
                                                              value);
            entries.push_back(entry);
        }
        // 3. Create flatcollections::StringIntDictionary
        auto vector = builder.CreateVectorOfSortedTables(&entries);
        auto dictionary = CreateStringIntDictionary(builder, vector);
        // 4. Return flatbuffer as NSData
        builder.Finish(dictionary);
        NSData *data = [NSData dataWithBytes:builder.GetBufferPointer()
                                      length:builder.GetSize()];
        return data;
    }

  45. FlatBuffers: using Dictionary
    #import "MMStringIntDictionary.h"
    #import "schema_generated.h"
    using namespace flatcollections;
    @interface MMStringIntDictionary ()
    @property (nonatomic, unsafe_unretained) const StringIntDictionary *dict;
    @end
    @implementation MMStringIntDictionary
    - (instancetype)initWithFileURL:(NSURL *)fileURL
                              error:(NSError *__autoreleasing *)error {
        NSData *data = [NSData dataWithContentsOfURL:fileURL
                                             options:NSDataReadingMappedAlways error:error];
        if (nil == data) {
            return nil;
        }
        return [self initWithData:data];
    }
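
Because the entries are written with `CreateVectorOfSortedTables` and the schema marks `key` with the `(key)` attribute, they are stored sorted by key, so a value can be found by binary search directly in the mapped buffer without building an in-memory hash table. The Swift sketch below illustrates that lookup idea on a plain array; `SortedEntries` is a hypothetical type, not part of FlatBuffers or of the talk's code, and comparing keys byte by byte is an assumption of this sketch.

```swift
/// Illustrative read-only dictionary backed by an array sorted by key.
struct SortedEntries {
    private let entries: [(key: String, value: Int)]

    init(_ dictionary: [String: Int]) {
        // Sort once, by the UTF-8 bytes of the key.
        entries = dictionary
            .map { (key: $0.key, value: $0.value) }
            .sorted { $0.key.utf8.lexicographicallyPrecedes($1.key.utf8) }
    }

    /// Binary search: O(log n) comparisons and no extra allocations.
    func value(forKey key: String) -> Int? {
        var low = 0
        var high = entries.count - 1
        while low <= high {
            let mid = (low + high) / 2
            let candidate = entries[mid].key.utf8
            if candidate.elementsEqual(key.utf8) { return entries[mid].value }
            if candidate.lexicographicallyPrecedes(key.utf8) {
                low = mid + 1
            } else {
                high = mid - 1
            }
        }
        return nil
    }
}

let counts = SortedEntries(["Вітаю": 1, "вас": 2, "Hello": 3])
print(counts.value(forKey: "вас") ?? -1)      // 2
print(counts.value(forKey: "missing") ?? -1)  // -1
```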

  46. Naive Bayes + FlatBuffers
    var model: Model = [
        "uk": [
            "Вітаю": 1,
            "вас": 2,
            ...
        ],
        "en": [
            "Hello": 1,
            "dear": 2,
            ...
        ]
        ...
    ]
    // Before: nested in-memory dictionaries
    typealias Model = [String: [String: Int]]
    // After: one memory-mapped dictionary per label
    typealias Model = [String: MMStringIntDictionary]

  47. Results
                                  Accuracy    Fits in 6 MB RAM    Overall
    NL Language Recognizer        ❌ 80.2%    ✅
    Core ML                       ✅ 96.6%    ❌
    Naive Bayes + FlatBuffers     ✅ 97.8%    ✅

  48. Core ML
    Pros:
    • Dramatically simple
    • Reliable
    • Fast
    Cons:
    • No flexibility
    • Limited ML tasks/algorithms

  49. Machine learning
    • Not rocket science in 2019
    • A great competitive advantage
    • A must-have skill for software engineers in the future

  50. One last thing
    Don’t be afraid to build your own bicycles

  51. Thanks
    Viacheslav Volodko
    [email protected]
    t.me/killobatt
    Attributions:
    • Create ML Docs: developer.apple.com/documentation
    • Naive Bayes Classifier: habr.com/ru/post/184574/
    • FlatBuffers: google.github.io/flatbuffers/flatbuffers_guide_tutorial.html
    Code samples:
    github.com/killobatt/TextClassification