
Ben Healey
September 19, 2013

Text Analytics in Python and R with examples from Tobacco Control

Ben has been doing data sciencey work since 1999 for organisations in the banking, retailing, health and education industries. He is currently on contracts with Pharmac and Aspire2025 (a Tobacco Control research collaboration) where, happily, he gets to use his data-wrangling powers for good.

This presentation focuses on analysing text, with Tobacco Control as the context. Examples include monitoring mentions of NZ's smokefree goal by politicians and examining media uptake of BATNZ's Agree/Disagree PR campaign. It covers common obstacles during data extraction, cleaning and analysis, along with the key Python and R packages you can use to help clear them.

Transcript

  1. • Scrapy (http://scrapy.org): Spiders → Items → Pipelines
     • R: readLines; the XML / RCurl / scrapeR packages; the tm package (Factiva plugin); twitteR
     • Python: Beautiful Soup; Pandas (eg, financial data)
     • http://blog.siliconstraits.vn/building-web-crawler-scrapy/
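
A minimal sketch of the Spiders → Items → Pipelines pattern named above; the target URL, CSS selectors and item fields are illustrative assumptions, not code from the talk:

      import scrapy

      class PressReleaseSpider(scrapy.Spider):
          name = "press_releases"
          # Hypothetical listing page; swap in the real source being monitored.
          start_urls = ["https://www.example.govt.nz/releases"]

          def parse(self, response):
              # Each yielded item (a plain dict here) flows on to any configured
              # pipelines, which can clean, deduplicate and store the text.
              for release in response.css("div.release"):
                  yield {
                      "title": release.css("h2 a::text").get(),
                      "url": response.urljoin(release.css("h2 a::attr(href)").get()),
                      "body": " ".join(release.css("div.body ::text").getall()),
                  }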
  2. • Translating text to consistent form
        – Scrapy returns unicode strings
        – Māori → Maori
        – SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
        – translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
        – cleaned_content = html_content.translate(translation_table)
     • Or…
        – test = u'Māori' (you already have unicode)
        – unidecode(test) (returns 'Maori')
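
The same two approaches in a short runnable sketch, assuming Python 3 (where str replaces the Python 2 unicode type) and the Unidecode package:

      from unidecode import unidecode   # pip install Unidecode

      SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
      translation_table = {ord(k): v for k, v in SWAPSET}

      html_content = u"Māori"
      print(html_content.translate(translation_table))   # -> "Maori"
      print(unidecode(u"Māori"))                          # -> "Maori", no table needed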
  3. • Dealing with non-Unicode
        – http://nedbatchelder.com/text/unipain.html
        – Some scraped html will be in latin1 (mismatch with UTF-8)
        – Have your datastore default to UTF-8
        – Learn to love whack-a-mole
     • Dealing with too many spaces
        – newstring = ' '.join(mystring.split())
        – Or… use re (see the sketch below)
     • Don't forget the metadata!
        – Define a common data structure early if you have multiple sources
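
A small sketch of the whitespace fix plus one common encoding fallback; the UTF-8-then-latin-1 decode order is an assumption about typical mis-labelled pages, not a rule from the talk:

      import re

      def normalise_space(text):
          # Collapse runs of whitespace (spaces, tabs, newlines) to single spaces.
          return re.sub(r"\s+", " ", text).strip()

      def to_unicode(raw_bytes):
          # Try UTF-8 first, then fall back to latin-1 for mis-labelled pages.
          try:
              return raw_bytes.decode("utf-8")
          except UnicodeDecodeError:
              return raw_bytes.decode("latin-1")

      print(normalise_space("too   many\t spaces \n here"))  # -> "too many spaces here"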
  4. Text Standardisation
     • Stopwords
        – "a, about, above, across, ... yourself, yourselves, you've, z"
     • Stemmers
        – "some sample stemmed words" → "some sampl stem word"
     • Tokenisers (eg, for bigrams)
        – BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
        – tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
        – 'and said', 'and security'
     • Tools: Natural Language Toolkit (Python), tm package (R)
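
A rough NLTK counterpart to the R/Weka example above, covering stopword removal, Porter stemming and bigram tokens; assumes nltk with the punkt and stopwords data already downloaded:

      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import PorterStemmer
      from nltk.util import ngrams

      text = "Some sample stemmed words about the smokefree goal"
      tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

      # Drop stopwords, then stem what remains.
      stops = set(stopwords.words("english"))
      tokens = [t for t in tokens if t not in stops]

      stemmer = PorterStemmer()
      stems = [stemmer.stem(t) for t in tokens]            # eg "sample" -> "sampl"
      bigrams = [" ".join(b) for b in ngrams(stems, 2)]    # cf the Weka bigram tokeniser
      print(stems)
      print(bigrams)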
  5. Text Standardisation

     libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")
     …
     cleanCorpus = function(corpus) {
       corpus.tmp = tm_map(corpus, tolower)                          # lower-case all terms
       corpus.tmp = tm_map(corpus.tmp, removePunctuation)
       corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
       corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
       return(corpus.tmp)
     }
     posts.corpus = cleanCorpus(posts.corpus)
     posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
  6. Text Standardisation
     • Using dictionaries for stem completion

     politi.tdm <- TermDocumentMatrix(politi.corpus)
     politi.tdm = removeSparseTerms(politi.tdm, 0.99)
     politi.tdm = as.matrix(politi.tdm)
     # get word counts in decreasing order, put these into a plain text doc.
     word_freqs = sort(rowSums(politi.tdm), decreasing = TRUE)
     length(word_freqs)
     smalldict = PlainTextDocument(names(word_freqs))
     politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion,
                                  dictionary = smalldict, type = "first")
  7. Deduplication
     • Python sets
        – shingles1 = set(get_shingles(record1['standardised_content']))
     • Shingling and Jaccard similarity (see the sketch below)
        – (a, rose, is, a, rose, is, a, rose)
        – {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
        – as a set: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
     • http://infolab.stanford.edu/~ullman/mmds/ch3.pdf → a free text
       http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
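
A sketch of shingling plus Jaccard similarity; get_shingles is not defined on the slide, so the 4-token shingler below is an assumed implementation:

      def get_shingles(text, k=4):
          # Slide every k-token window across the text to build shingles.
          tokens = text.lower().split()
          return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

      def jaccard(set_a, set_b):
          # |A ∩ B| / |A ∪ B|; 1.0 means identical shingle sets.
          if not set_a and not set_b:
              return 1.0
          return len(set_a & set_b) / len(set_a | set_b)

      record1 = {"standardised_content": "a rose is a rose is a rose"}
      record2 = {"standardised_content": "a rose is a rose"}

      shingles1 = set(get_shingles(record1["standardised_content"]))
      shingles2 = set(get_shingles(record2["standardised_content"]))
      print(jaccard(shingles1, shingles2))  # high score -> likely duplicates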
  8. Frequency Analysis
     • Document-Term Matrix
        – politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths = c(4, Inf)))
     • Frequent and co-occurring terms
        – findFreqTerms(politi.dtm, 5000)
           [1] "2011" "also" "announc" "area" "around"
           [6] "auckland" "better" "bill" "build" "busi"
        – findAssocs(politi.dtm, "smoke", 0.5)
           smoke  tobacco   quit  smokefre  smoker   2025  cigarett
            1.00     0.74   0.68      0.62    0.62   0.58      0.57
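
A rough pandas analogue of the findFreqTerms / findAssocs step, built on a tiny hand-made document-term matrix; the three example documents are invented, not data from the talk:

      import pandas as pd

      docs = [
          "smokefree 2025 goal smoke tobacco",
          "quit smoke quit tobacco smoker",
          "auckland transport announc build",
      ]
      vocab = sorted({w for d in docs for w in d.split()})
      # Rows = documents, columns = terms, cells = raw counts.
      dtm = pd.DataFrame([[d.split().count(w) for w in vocab] for d in docs],
                         columns=vocab)

      # Frequent terms (cf findFreqTerms): total counts across all documents.
      print(dtm.sum().sort_values(ascending=False).head(5))

      # Associated terms (cf findAssocs): correlation of the "smoke" column with the rest.
      print(dtm.corr()["smoke"].sort_values(ascending=False).head(5))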
  9. Classification
     • Exploration and feature extraction
        – Metadata gathered at time of collection (eg, Scrapy)
        – RODBC or MySQLdb with plain ol' SQL
        – Native or package functions for length of strings, sna, etc.
     • Unsupervised
        – nltk.cluster
        – tm, topicmodels, as.matrix(dtm) → kmeans, etc.
     • Supervised (see the sketch below)
        – First hurdle: Training set
        – nltk.classify
        – tm, e1071, others…
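
A toy sketch of the supervised route with nltk.classify; the labelled posts below are invented stand-ins for a real hand-coded training set:

      import nltk

      def word_features(text):
          # Simple bag-of-words presence features.
          return {w: True for w in text.lower().split()}

      train = [
          (word_features("I want to quit smoking this week"), "quit_intent"),
          (word_features("patches really helped with cravings"), "quit_intent"),
          (word_features("the rugby was great last night"), "other"),
          (word_features("what a sunny day in auckland"), "other"),
      ]

      classifier = nltk.NaiveBayesClassifier.train(train)
      print(classifier.classify(word_features("thinking about quitting smoking")))
      classifier.show_most_informative_features(5)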
  10. Post counts by user group
      • 2 posts or fewer: 846 users (41.0% of users), 1,157 posts (1.3% of posts)
      • More than 750 posts: 23 users (1.1% of users), 45,499 posts (50.1% of posts)
  11. • LDA (topicmodels)
      – New users
            Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
            good      smoke     just      smoke     feel
            day       time      day       quit      day
            thank     week      get       can       dont
            well      patch     realli    one       like
            will      start     think     will      still
      – Highly active users
            Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
            quit      good      day       like      feel
            smoke     one       well      day       thing
            can       take      great     your      just
            will      stay      done      now       get
            luck      strong    awesom    get       time
  12. • LDA (topicmodels) – Highly active users (HAU)
            Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
            quit      good      day       like      feel
            smoke     one       well      day       thing
            can       take      great     your      just
            will      stay      done      now       get
            luck      strong    awesom    get       time

      Topic proportions by user:
                               Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
            HAU1 (F, 38, PI)      18%      14%      40%       8%      20%
            HAU2 (F, 33, NZE)     31%      21%      27%       6%      16%
            HAU3 (M, 48, NZE)     16%       9%      21%      49%       5%
  13. Recap
      • Your text will probably be messy
         – Python, R-based tools reduce the pain
      • Simple analyses can generate useful insight
      • Combine with data of other types for context
         – source, quantities, dates, network position, history
      • May surface useful features for classification
      Slides, Code: [email protected]