
Detecting the right "Apples" and "Oranges" - 1 hour talk on Python for Brand Disambiguation using scikit-learn at BrightonPython June 2013

ianozsvald

June 12, 2013

Transcript

  1. Detecting the right Apples and Oranges – social media brand
    disambiguation using Python and scikit-learn
    Ian Ozsvald @IanOzsvald MorConsulting.com

  2. [email protected] @IanOzsvald
    BrightonPython June 2013
    Goal
    • Word Sense Disambiguation
    − Apple, Orange
    − Homeland
    − Elite, Valve
    − 38 Degrees
    − Cold

    A 1-month, MIT-licensed GitHub project:
    ianozsvald/social_media_brand_disambiguator

  3. [email protected] @IanOzsvald
    BrightonPython June 2013
    About Ian Ozsvald
    • “Applying AI in Industry”
    • Teach: PyCon, EuroSciPy, EuroPython
    • MorConsulting.com
    • ShowMeDo.com
    • IanOzsvald.com
    • StartupChile for Computer Vision 2012

  4. [email protected] @IanOzsvald
    BrightonPython June 2013
    Why another disambiguator?
    • 400 million tweets per day
    • AlchemyAPI, OpenCalais, Spotlight
    – Not trained on social media
    – Cannot adapt to new brands (Johnny Coke)
    • Rule building often “by hand”
    (Radian6/BrandWatch)
    • Can we build an auto-updating, adaptable,
    automatic disambiguator?

  5. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example is-brand

  6. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example is-brand?

  7. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example not-brand?

  8. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example not-brand

  9. [email protected] @IanOzsvald
    BrightonPython June 2013
    Scikit-learn (learn1.py)
    train_set = [u"The Daily Apple...", ...]
    target = np.array([1, ...])
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    train_set_dense = vectorizer.fit_transform(train_set).toarray()
    vectorizer.get_feature_names()
    '00', '01gzw6l7h8', '2nite', '40gb', 'applenews',
    'co',  # no 't' (single-character tokens are dropped)
    'jam', 'sauce', 'iphone', 'mac', ...
    'would', 'wouldn', ... 'ya', 'yay', 'yaaaay', ...
    # what about hashtags, http:// links, @users?

  10. [email protected] @IanOzsvald
    BrightonPython June 2013
    train_set_dense (sparse matrix)

  11. [email protected] @IanOzsvald
    BrightonPython June 2013
    Scikit-learn (learn1.py)
    clf = LogisticRegression()
    clfl = clf.fit(train_set_dense, target)
    clfl.score(train_set_dense, target)
    twt_vector = vectorizer.transform(
        [u'i like my apple, eating it makes me happy']).todense()
    clfl.predict(twt_vector)
    [0]
    clfl.predict_proba(twt_vector)
    [[ 0.94366966  0.05633034]]
    # Cross Validation
    # Feature Extraction to debug
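    A runnable sketch of this fit/predict step, using a tiny hypothetical
    training set rather than the talk's real tweet data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy tweets: 1 = is-brand (Apple Inc.), 0 = not-brand (the fruit).
train_set = [u"apple ceo tim cook announces the new ipad",
             u"eating an apple and drinking orange juice",
             u"new apple iphone rumours from cupertino",
             u"baked an apple pie with cinnamon"]
target = np.array([1, 0, 1, 0])

vectorizer = CountVectorizer(ngram_range=(1, 1))
train_set_dense = vectorizer.fit_transform(train_set).toarray()

clf = LogisticRegression().fit(train_set_dense, target)
print(clf.score(train_set_dense, target))  # training-set accuracy

# Vectorize an unseen tweet with the SAME fitted vectorizer, then classify.
twt_vector = vectorizer.transform(
    [u"i like my apple, eating it makes me happy"]).toarray()
print(clf.predict(twt_vector))        # predicted class label
print(clf.predict_proba(twt_vector))  # [P(not-brand), P(is-brand)]
```

    Note that words absent from the training vocabulary ("happy", "makes")
    are silently dropped at transform time, so out-of-vocabulary tweets
    carry little signal.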

  12. [email protected] @IanOzsvald
    BrightonPython June 2013
    Scikit-learn (learn1.py)
    ch2 = SelectKBest(chi2, k=30)
    X_train = ch2.fit_transform(trainVectorizerArray, target)
    ch2.get_support()  # boolean mask over all features
    np.where(ch2.get_support())  # ids of the 30 'best' features
    # ordered by highest significance:
    vectorizer.get_feature_names()[feature_ids...]
    Some extracted features:
    'co', 'http' (300 in vs 118 out of class)
    'cook', 'ceo' (81 in vs 10 out of class)
    'juice', 'pie' (2 in vs 80 out of class)
    'ipad' (56 in vs 1 out of class)

  13. [email protected] @IanOzsvald
    BrightonPython June 2013
    Feature prevalence in training

  14. [email protected] @IanOzsvald
    BrightonPython June 2013
    Results for “apple”
    • Gold Standard: 2014 in & out of class
    • 2/3 is-brand, 1/3 not-brand (684 tweets)
    • Test/train: balanced 584 tweets, cross-validated
    • Validation set: balanced 200 tweets
    • Reuters OpenCalais on validation set:
    – 92.5% Precision (2 wrong)
    – 25% Recall

  15. [email protected] @IanOzsvald
    BrightonPython June 2013
    Results
    • Reuters OpenCalais:
    – 92.5% Precision (2 wrong)
    – 25% Recall
    • This tool:
    – 100% Precision
    – 51% Recall
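    As a reminder of what these two numbers mean, precision and recall can
    be computed with scikit-learn's metrics (illustrative labels only, not
    the talk's actual validation data):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # gold labels: 1 = is-brand
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]  # a cautious classifier: few positives

# Precision: of the tweets flagged as brand, how many really were?
print(precision_score(y_true, y_pred))  # 1.0 - every flagged tweet is right
# Recall: of the real brand tweets, how many did we find?
print(recall_score(y_true, y_pred))     # 0.5 - half the brand tweets missed
```

    This is the trade-off on the slide: the tool makes no false-positive
    claims (100% precision) at the cost of missing about half the brand
    mentions (51% recall).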

  16. [email protected] @IanOzsvald
    BrightonPython June 2013
    Status
    • Not generalised (needs more work!)
    • Github repo, data to follow
    • Progress: IanOzsvald.com
    • Ready for collaboration
    • Python 2.7 (Py3.3 compatible?)
    • Want:
    – Collaborations
    – Real use cases

  17. [email protected] @IanOzsvald
    BrightonPython June 2013
    twitter-text-python
    • Easily extract tweet-specific terms
    • github.com/ianozsvald/twitter-text-python
    result = p.parse("@ianozsvald, you now support #IvoWertzel's tweet parser! https://github.com/ianozsvald/")
    result.users  # ['ianozsvald']
    result.tags   # ['IvoWertzel']
    result.urls   # ['https://github.com/iano...']

  18. [email protected] @IanOzsvald
    BrightonPython June 2013
    Future?
    • NLP/ML Pub Meet (2 weeks – email me)
    • Bootstrap to larger data sets
    • CMU Tweet Parser (“Stanford-for-tweets”)
    • Features: Stems, WordNet, ConceptNet

  19. [email protected] @IanOzsvald
    BrightonPython June 2013
    Future?
    • Model the user's disambiguated history?
    • What does a user's friend group talk
    about?
    • Compression using character n-grams
    only?
    • Annotate.io future service?

  20. [email protected] @IanOzsvald
    BrightonPython June 2013
    Thank You
    [email protected]
    • @IanOzsvald
    • MorConsulting.com
    • Annotate.io
    • GitHub/IanOzsvald
