
Detecting the right "Apples" and "Oranges" - 1 hour talk on Python for Brand Disambiguation using scikit-learn at BrightonPython June 2013

ianozsvald

June 12, 2013

Transcript

  1. Detecting the right Apples and Oranges – social media brand
    disambiguation using Python and scikit-learn
    Ian Ozsvald @IanOzsvald MorConsulting.com

  2. [email protected] @IanOzsvald
    BrightonPython June 2013
    Goal
    • Word Sense Disambiguation
    − Apple, Orange
    − Homeland
    − Elite, Valve
    − 38 Degrees
    − Cold

    A 1-month, MIT-licensed GitHub project:
    ianozsvald/social_media_brand_disambiguator

  3. [email protected] @IanOzsvald
    BrightonPython June 2013
    About Ian Ozsvald
    • “Applying AI in Industry”
    • Teach: PyCon, EuroSciPy, EuroPython
    • MorConsulting.com
    • ShowMeDo.com
    • IanOzsvald.com
    • StartupChile for Computer Vision 2012

  4. [email protected] @IanOzsvald
    BrightonPython June 2013
    Why another disambiguator?
    • 400 million tweets per day
    • AlchemyAPI, OpenCalais, Spotlight
    – Not trained on social media
    – Cannot adapt to new brands (Johnny Coke)
    • Rule building often “by hand”
    (Radian6/BrandWatch)
    • Can we build an auto-updating, adaptable,
    automatic disambiguator?

  5. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example is-brand

  6. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example is-brand?

  7. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example not-brand?

  8. [email protected] @IanOzsvald
    BrightonPython June 2013
    Example not-brand

  9. [email protected] @IanOzsvald
    BrightonPython June 2013
    Scikit-learn (learn1.py)
    train_set = [u"The Daily Apple...", ...]
    target = np.array([1, ...])
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    train_set_dense = vectorizer.fit_transform(train_set).toarray()
    vectorizer.get_feature_names()
    '00', '01gzw6l7h8', '2nite', '40gb', 'applenews',
    'co',  # no 't' (single-character tokens are dropped)
    'jam', 'sauce', 'iphone', 'mac', ...
    'would', 'wouldn', ... 'ya', 'yay', 'yaaaay', ...
    # what about hashtags, http:// links, @users?

  10. [email protected] @IanOzsvald
    BrightonPython June 2013
    train_set_dense (sparse matrix)

  11. [email protected] @IanOzsvald
    BrightonPython June 2013
    Scikit-learn (learn1.py)
    clf = LogisticRegression()
    clfl = clf.fit(train_set_dense, target)
    clfl.score(train_set_dense, target)
    twt_vector = vectorizer.transform(
        [u'i like my apple, eating it makes me happy']).todense()
    clfl.predict(twt_vector)
    [0]
    clfl.predict_proba(twt_vector)
    [[ 0.94366966  0.05633034]]
    # Cross Validation
    # Feature Extraction to debug
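    A runnable sketch of this fit/predict step, using a tiny hypothetical
    training set rather than the talk's real tweet data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy tweets: 1 = is-brand (Apple Inc.), 0 = not-brand (the fruit).
train_set = [u"apple ceo tim cook announces the new ipad",
             u"eating an apple and drinking orange juice",
             u"new apple iphone rumours from cupertino",
             u"baked an apple pie with cinnamon"]
target = np.array([1, 0, 1, 0])

vectorizer = CountVectorizer(ngram_range=(1, 1))
train_set_dense = vectorizer.fit_transform(train_set).toarray()

clf = LogisticRegression().fit(train_set_dense, target)
print(clf.score(train_set_dense, target))  # training-set accuracy

# Vectorize an unseen tweet with the SAME fitted vectorizer, then classify.
twt_vector = vectorizer.transform(
    [u"i like my apple, eating it makes me happy"]).toarray()
print(clf.predict(twt_vector))        # predicted class label
print(clf.predict_proba(twt_vector))  # [P(not-brand), P(is-brand)]
```

    Note that words absent from the training vocabulary ("happy", "makes")
    are silently dropped at transform time, so out-of-vocabulary tweets
    carry little signal.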

  12. [email protected] @IanOzsvald
    BrightonPython June 2013
    Scikit-learn (learn1.py)
    ch2 = SelectKBest(chi2, k=30)
    X_train = ch2.fit_transform(trainVectorizerArray, target)
    ch2.get_support()  # boolean mask over all features
    np.where(ch2.get_support())  # ids of the 30 'best' features
    # ordered by highest significance:
    vectorizer.get_feature_names()[feature_ids...]
    Some extracted features:
    'co', 'http' (300 in vs 118 out of class)
    'cook', 'ceo' (81 in vs 10 out of class)
    'juice', 'pie' (2 in vs 80 out of class)
    'ipad' (56 in vs 1 out of class)

  13. [email protected] @IanOzsvald
    BrightonPython June 2013
    Feature prevalence in training

  14. [email protected] @IanOzsvald
    BrightonPython June 2013
    Results for “apple”
    • Gold Standard: 2014 in & out of class
    • 2/3 is-brand, 1/3 not-brand (684 tweets)
    • Test/train: balanced 584 tweets, cross-validated
    • Validation set: balanced 200 tweets
    • Reuters OpenCalais on validation set:
    – 92.5% Precision (2 wrong)
    – 25% Recall

  15. [email protected] @IanOzsvald
    BrightonPython June 2013
    Results
    • Reuters OpenCalais:
    – 92.5% Precision (2 wrong)
    – 25% Recall
    • This tool:
    – 100% Precision
    – 51% Recall
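    As a reminder of what these two numbers mean, precision and recall can
    be computed with scikit-learn's metrics (illustrative labels only, not
    the talk's actual validation data):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # gold labels: 1 = is-brand
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]  # a cautious classifier: few positives

# Precision: of the tweets flagged as brand, how many really were?
print(precision_score(y_true, y_pred))  # 1.0 - every flagged tweet is right
# Recall: of the real brand tweets, how many did we find?
print(recall_score(y_true, y_pred))     # 0.5 - half the brand tweets missed
```

    This is the trade-off on the slide: the tool makes no false-positive
    claims (100% precision) at the cost of missing about half the brand
    mentions (51% recall).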

  16. [email protected] @IanOzsvald
    BrightonPython June 2013
    Status
    • Not generalised (needs more work!)
    • Github repo, data to follow
    • Progress: IanOzsvald.com
    • Ready for collaboration
    • Python 2.7 (Py3.3 compatible?)
    • Want:
    – Collaborations
    – Real use cases

  17. [email protected] @IanOzsvald
    BrightonPython June 2013
    twitter-text-python
    • Easily extract tweet-specific terms
    • github.com/ianozsvald/twitter-text-python
    result = p.parse("@ianozsvald, you now support #IvoWertzel's tweet parser! https://github.com/ianozsvald/")
    result.users  # ['ianozsvald']
    result.tags   # ['IvoWertzel']
    result.urls   # ['https://github.com/iano...']

  18. [email protected] @IanOzsvald
    BrightonPython June 2013
    Future?
    • NLP/ML Pub Meet (2 weeks – email me)
    • Bootstrap to larger data sets
    • CMU Tweet Parser (“Stanford-for-tweets”)
    • Features: Stems, WordNet, ConceptNet

  19. [email protected] @IanOzsvald
    BrightonPython June 2013
    Future?
    • Model the user's disambiguated history?
    • What does a user's friend group talk
    about?
    • Compression using character n-grams
    only?
    • Annotate.io future service?

  20. [email protected] @IanOzsvald
    BrightonPython June 2013
    Thank You
    [email protected]
    • @IanOzsvald
    • MorConsulting.com
    • Annotate.io
    • GitHub/IanOzsvald
