• 400 million tweets per day
• AlchemyAPI, OpenCalais, Spotlight:
  – Not trained on social media
  – Cannot adapt to new brands (Johnny Coke)
• Rule building often “by hand” (Radian6/BrandWatch)
• Can we build an auto-updating, adaptable, automatic disambiguator?
ch2 = SelectKBest(chi2, k=30)
X_train = ch2.fit_transform(trainVectorizerArray, target)
ch2.get_support()  # boolean mask over features, True for the 30 'best'
np.where(ch2.get_support())  # indices of the selected features
# ordered by highest significance:
vectorizer.get_feature_names()[feature_ids…]
Some extracted features:
• 'co', 'http' (300 in vs 118 out of class)
• 'cook', 'ceo' (81 in vs 10 out of class)
• …
• 'juice', 'pie' (2 in vs 80 out of class)
• 'ipad' (56 in vs 1 out of class)
generalised (needs more work!)
• GitHub repo, data to follow
• Progress: IanOzsvald.com
• Ready for collaboration
• Python 2.7 (Py3.3 compatible?)
• Want:
  – Collaborations
  – Real use cases
the user's disambiguated history?
• What does a user's friend group talk about?
• Compression using character n-grams only?
• Annotate.io future service?