Slide 1

Slide 1 text

Detecting the right Apples and Oranges – social media brand disambiguation using Python and scikit-learn
Ian Ozsvald, @IanOzsvald, MorConsulting.com

Slide 2

Slide 2 text

[email protected] @IanOzsvald PyConUK September 2013
Goal
• Word Sense Disambiguation
  − Apple, Orange
  − Homeland, Lost, Defiance
  − Elite, Valve
  − Cold, Stuffy
• MIT licensed Github project: ianozsvald/social_media_brand_disambiguator

Slide 3

Slide 3 text

About Ian Ozsvald
• “Applying Parallel/NLP/ML in Industry”
• MorConsulting.com
• Teach: PyCon, EuroSciPy, EuroPython
• Authoring “High Performance Python”
• ShowMeDo.com
• IanOzsvald.com
• Several prior startups

Slide 4

Slide 4 text

Why another disambiguator?
• 400 million tweets per day
• AlchemyAPI, OpenCalais, Spotlight
  – Not trained on social media
  – Cannot adapt to new brands (Johnny Coke)
• Rule building often “by hand” (Radian6/BrandWatch)
• Can we build an auto-updating, adaptable, automatic disambiguator?

Slide 5

Slide 5 text

Example is-brand

Slide 6

Slide 6 text

Example is-brand?

Slide 7

Slide 7 text

Example not-brand?

Slide 8

Slide 8 text

Example not-brand

Slide 9

Slide 9 text

Scikit-learn (learn1.py)
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    train_set = [u"The Daily Apple...", …]
    target = np.array([1, ...])
    # Unigram bag-of-words features
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    train_set_dense = vectorizer.fit_transform(train_set).toarray()
    vectorizer.get_feature_names()
    # '00', '01gzw6l7h8', '2nite', '40gb', 'applenews', 'co',  # no 't'
    # 'jam', 'sauce', 'iphone', 'mac', …
    # 'would', 'wouldn', …
    # 'ya', 'yay', 'yaaaay', …
    # hashtags? http:// @users?
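The slide's questions about 't', hashtags, links and @users can be explored with a small sketch. The tweets below are hypothetical toy strings, not data from the project; the behaviour shown comes from CountVectorizer's default token pattern, which keeps only runs of two or more word characters.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, not taken from the project's data set
train_set = [
    u"new iphone from apple looks great",
    u"apple sauce with my pork tonight",
]
target = np.array([1, 0])  # 1 = is-brand, 0 = not-brand

# Unigram bag-of-words, as on the slide
vectorizer = CountVectorizer(ngram_range=(1, 1))
train_set_dense = vectorizer.fit_transform(train_set).toarray()

# One row per tweet, one column per distinct token
print(train_set_dense.shape)

# The default analyzer lowercases and keeps tokens of 2+ word characters,
# so '#' and '@' are stripped and the 't' of 't.co' disappears
analyzer = vectorizer.build_analyzer()
tokens = analyzer(u"#apple http://t.co/abc @user")
print(tokens)  # ['apple', 'http', 'co', 'abc', 'user']
```

Keeping hashtags, mentions or whole URLs as features would need a custom token_pattern or tokenizer; the slide leaves that as an open question.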

Slide 10

Slide 10 text

train_set_dense (sparse matrix)

Slide 11

Slide 11 text

Scikit-learn (learn1.py)
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()
    clfl = clf.fit(train_set_dense, target)
    # Accuracy on the training data itself
    clfl.score(train_set_dense, target)
    twt_vector = vectorizer.transform([u'i like my apple, eating it makes me happy']).todense()
    clfl.predict(twt_vector)
    # [0]
    clfl.predict_proba(twt_vector)
    # [[ 0.94366966 0.05633034]]
    # Cross Validation
    # Feature Extraction to debug
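The last two comments on this slide (cross validation, feature extraction to debug) are not shown in code. A minimal sketch of both follows, assuming a current scikit-learn where cross_val_score lives in sklearn.model_selection; the tweets, labels and the choice of two folds are hypothetical and this is not the project's own code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets: 1 = is-brand, 0 = not-brand
tweets = [
    u"apple unveils the new iphone today",
    u"my mac from apple just arrived",
    u"apple store queue for the iphone launch",
    u"updating my apple mac tonight",
    u"eating an apple with my lunch",
    u"apple sauce and pork for dinner",
    u"baked an apple pie for lunch",
    u"apple jam on toast for dinner",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# A pipeline refits the vectorizer inside each fold, so held-out
# tweets never influence the vocabulary
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())
scores = cross_val_score(pipeline, tweets, labels, cv=2)  # accuracy per fold

# Refit on everything, then pair each token with its learned weight
pipeline.fit(tweets, labels)
vectorizer = pipeline.named_steps["countvectorizer"]
clf = pipeline.named_steps["logisticregression"]
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = sorted(zip(clf.coef_[0], feature_names))
print("pull towards not-brand:", weights[:3])
print("pull towards is-brand:", weights[-3:])
```

Reading the extreme weights is a quick way to spot features that the classifier relies on for the wrong reasons.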

Slide 12

Slide 12 text

Results for “apple”
• Gold Standard: 2014 in & out of class
• 2/3 is-brand, 1/3 not-brand (684 tweets)
• Test/train: balanced 584 tweets, cross-validated
• Validation set: balanced 200 tweets
• Reuters OpenCalais on validation set:
  – 92.5% Precision (2 wrong)
  – 25% Recall
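The slides do not say how the balanced sets were drawn. One possible approach, sketched here on hypothetical labels with the same 2/3 to 1/3 imbalance, is to downsample the larger class and then hold out a stratified validation split; the sizes, the random seeds and the 20% hold-out are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 20 is-brand (1) and 10 not-brand (0)
labels = np.array([1] * 20 + [0] * 10)
indices = np.arange(len(labels))

rng = np.random.RandomState(0)
brand_idx = indices[labels == 1]
not_brand_idx = indices[labels == 0]

# Downsample both classes to the size of the smaller one
n_per_class = min(len(brand_idx), len(not_brand_idx))
balanced_idx = np.concatenate([
    rng.choice(brand_idx, n_per_class, replace=False),
    rng.choice(not_brand_idx, n_per_class, replace=False),
])

# Hold out a validation set that stays balanced via stratification
train_idx, validation_idx = train_test_split(
    balanced_idx,
    test_size=0.2,
    stratify=labels[balanced_idx],
    random_state=0,
)
print(len(train_idx), len(validation_idx))
```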

Slide 13

Slide 13 text

Results
• Reuters OpenCalais:
  – 92.5% Precision (2 wrong)
  – 25% Recall
• This tool:
  – 100% Precision
  – 51% Recall
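For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN), with is-brand as the positive class. The sketch below computes both with scikit-learn on hypothetical labels; the numbers are illustrative only and are not the talk's results.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predictions (1 = is-brand, 0 = not-brand)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

# Every predicted is-brand tweet is correct: 2 / (2 + 0)
precision = precision_score(y_true, y_pred)
# Only half of the true is-brand tweets are found: 2 / (2 + 2)
recall = recall_score(y_true, y_pred)

print(precision, recall)  # 1.0 0.5
```

A cautious classifier like this one never mislabels a not-brand tweet but misses half of the brand mentions, which is the same trade-off shape as high precision with moderate recall.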

Slide 14

Slide 14 text

Status
• Not generalised (needs more work!)
• Github repo, data to follow
• Progress: IanOzsvald.com
• Ready for collaboration
• Python 2.7 (Py3.3 compatible?)
• Want:
  – Collaborations (thanks Sarwar Bhulyan)
  – Real use cases

Slide 15

Slide 15 text

Future?
• Build NLP meet in London?
• Bootstrap to larger data sets
• CMU Tweet Parser (“Stanford-for-tweets”)
• Features: Stems, WordNet, ConceptNet
• Annotate.io future service?

Slide 16

Slide 16 text

“High Performance Python”
• Book is in the works...
• Please join the mailing list via IanOzsvald.com

Slide 17

Slide 17 text

Thank You
• [email protected]
• @IanOzsvald
• MorConsulting.com
• Annotate.io
• GitHub/IanOzsvald