Detecting the right Apples and Oranges – social media brand disambiguation using Python and scikit-learn

by Ian Ozsvald, Data Scientist at MorConsulting. Talk at Data Science London, 12/06/13.


Data Science London

July 13, 2013


Transcript

  1. Detecting the right Apples and Oranges – social media brand disambiguation using Python and scikit-learn
    Ian Ozsvald, @IanOzsvald, MorConsulting.com
  2. [email protected] @IanOzsvald Data Science London June 2013
    Goal
    • Word Sense Disambiguation
      − Apple, Orange
      − Homeland
      − Elite, Valve
      − 38 Degrees
      − Cold
    1 month, MIT licensed GitHub project: ianozsvald/social_media_brand_disambiguator
  3. About Ian Ozsvald
    • “Applying AI in Industry”
    • Teach: PyCon, EuroSciPy, EuroPython
    • MorConsulting.com
    • ShowMeDo.com
    • IanOzsvald.com
    • StartupChile for Computer Vision 2012
  4. Why another disambiguator?
    • 400 million tweets per day
    • AlchemyAPI, OpenCalais, Spotlight
      – Not trained on social media
      – Cannot adapt to new brands (Johnny Coke)
    • Rule building often “by hand” (Radian6/BrandWatch)
    • Can we build an auto-updating, adaptable, automatic disambiguator?
  5. Scikit-learn (learn1.py)
    train_set = [u"The Daily Apple...", …]
    target = np.array([1, ...])
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    train_set_dense = vectorizer.fit_transform(train_set).toarray()
    vectorizer.get_feature_names()
    '00', '01gzw6l7h8', '2nite', '40gb', 'applenews', 'co', # no 't'
    'jam', 'sauce', 'iphone', 'mac', … 'would', 'wouldn', … 'ya', 'yay', 'yaaaay', …
    # hashtags? http:// @users?
  6. Scikit-learn (learn1.py)
    clf = LogisticRegression()
    clfl = clf.fit(train_set_dense, target)
    clfl.score(train_set_dense, target)
    twt_vector = vectorizer.transform([u'i like my apple, eating it makes me happy']).todense()
    clfl.predict(twt_vector)
    [0]
    clfl.predict_proba(twt_vector)
    [[ 0.94366966 0.05633034]]
    # Cross Validation
    # Feature Extraction to debug
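The slide only names cross-validation as a next step. One possible way to do it is sketched below, under stated assumptions: the twelve tweets are invented, the 3-fold split is an arbitrary choice, and wrapping the vectoriser and classifier in a pipeline is a design choice of this sketch (so the vocabulary is learned from each training fold only), not something the talk describes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets, invented for illustration only.
brand_tweets = [
    "new apple iphone announced by the ceo",
    "apple ipad review http://t.co/abc",
    "apple mac sales are up says tim cook",
    "my apple iphone screen cracked again",
    "apple store queue for the new ipad",
    "apple ceo cook talks mac and iphone",
]
fruit_tweets = [
    "eating an apple with my lunch",
    "apple pie and custard for pudding",
    "fresh apple juice this morning",
    "baked an apple pie with cinnamon sauce",
    "apple and blackberry jam on toast",
    "an apple a day, eating it makes me happy",
]
tweets = brand_tweets + fruit_tweets
target = np.array([1] * len(brand_tweets) + [0] * len(fruit_tweets))

# Vectoriser + classifier in one estimator, refitted inside every fold.
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())

# Stratified 3-fold cross-validation; each score is accuracy on a held-out fold.
scores = cross_val_score(model, tweets, target, cv=3)
print("fold accuracies:", scores)

# Fit on everything, then score the slide's example tweet.
model.fit(tweets, target)
proba = model.predict_proba(["i like my apple, eating it makes me happy"])
print("P(not brand), P(brand):", proba[0])
```

With so few examples the fold scores are noisy; the point is only the shape of the workflow.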
  7. Scikit-learn (learn1.py)
    ch2 = SelectKBest(chi2, k=30)
    X_train = ch2.fit_transform(train_set_dense, target)
    ch2.get_support()
    # feature ids of 30 'best' features
    np.where(ch2.get_support())
    # do a touch of math
    # ordered by highest significance:
    vectorizer.get_feature_names()[feature_ids…]
    Some extracted features:
    'co', 'http' (300 in vs 118 out of class)
    'cook', 'ceo' (81 in vs 10 out of class)
    …
    'juice', 'pie' (2 in vs 80 out of class)
    'ipad' (56 in vs 1 out of class)
  8. Results for “apple”
    • Gold Standard: 2014 in & out of class
    • 2/3 is-brand, 1/3 not-brand (684 tweets)
    • Test/train: balanced 584 tweets, cross-validation
    • Validation set: balanced 200 tweets
    • Reuters OpenCalais on validation set:
      – 92.5% Precision (2 wrong)
      – 25% Recall
  9. Results
    • Reuters OpenCalais:
      – 92.5% Precision (2 wrong)
      – 25% Recall
    • This tool:
      – 100% Precision
      – 51% Recall
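As a reminder of how the two figures are computed, here is a small worked sketch. The counts are assumptions reconstructed from the slides, not reported numbers: it supposes the balanced 200-tweet validation set holds 100 in-class tweets, and that "2 wrong" means 2 false positives.

```python
def precision(true_pos, false_pos):
    """Fraction of tweets flagged as brand that really are brand."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of real brand tweets that were flagged."""
    return true_pos / (true_pos + false_neg)


# Assumed counts (see lead-in): 100 in-class tweets in the validation set.
in_class = 100

# OpenCalais: 25% recall -> 25 found, 75 missed; 2 false positives.
opencalais_tp, opencalais_fp = 25, 2
print(round(precision(opencalais_tp, opencalais_fp), 3))  # 0.926
print(recall(opencalais_tp, in_class - opencalais_tp))    # 0.25

# This tool: 51% recall -> 51 found, 49 missed; no false positives.
tool_tp, tool_fp = 51, 0
print(precision(tool_tp, tool_fp))           # 1.0
print(recall(tool_tp, in_class - tool_tp))   # 0.51
```

Under these assumptions the OpenCalais precision comes out at about 92.6%, roughly in line with the 92.5% quoted on the slide.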
  10. Status
    • Not generalised (needs more work!)
    • Github repo, data to follow
    • Progress: IanOzsvald.com
    • Ready for collaboration
    • Python 2.7 (Py3.3 compatible?)
    • Want:
      – Collaborations
      – Real use cases
  11. twitter-text-python
    • Easily extract tweet-specific terms
    • github.com/ianozsvald/twitter-text-python
    result = p.parse("@ianozsvald, you now support #IvoWertzel's tweet parser! https://github.com/ianozsvald/")
    result.users # ['ianozsvald']
    result.tags # ['IvoWertzel']
    result.urls # ['https://github.com/iano...']
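To show the kind of extraction meant here without depending on the library, the sketch below uses plain regular expressions. It is a simplified stand-in, not twitter-text-python's implementation or API: `extract_terms` is a hypothetical helper, and the patterns ignore many edge cases that a real tweet parser handles.

```python
import re

URL_PATTERN = r"https?://\S+"


def extract_terms(tweet):
    """Hypothetical helper: pull @users, #tags and URLs out of a tweet."""
    urls = re.findall(URL_PATTERN, tweet)
    # Strip URLs first so an '@' or '#' inside a link is not miscounted.
    text = re.sub(URL_PATTERN, " ", tweet)
    users = re.findall(r"@(\w+)", text)
    tags = re.findall(r"#(\w+)", text)
    return {"users": users, "tags": tags, "urls": urls}


result = extract_terms(
    "@ianozsvald, you now support #IvoWertzel's tweet parser! "
    "https://github.com/ianozsvald/"
)
print(result["users"])  # ['ianozsvald']
print(result["tags"])   # ['IvoWertzel']
print(result["urls"])   # ['https://github.com/ianozsvald/']
```

Terms extracted this way could be fed to the classifier as separate features, which addresses the "hashtags? http:// @users?" question raised on the vectoriser slide.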
  12. Future?
    • NLP/ML Pub Meet (2 weeks – email me)
    • Bootstrap to larger data sets
    • CMU Tweet Parser (“Stanford-for-tweets”)
    • Features: Stems, WordNet, ConceptNet
  13. Future?
    • Model the user's disambiguated history?
    • What does a user's friend group talk about?
    • Compression using character n-grams only?
    • Annotate.io future service?
  14. Thank You
    • [email protected]
    • @IanOzsvald
    • MorConsulting.com
    • Annotate.io
    • GitHub/IanOzsvald