
Ben Healey
September 19, 2013

Text Analytics in Python and R with examples from Tobacco Control

Ben has been doing data sciencey work since 1999 for organisations in the banking, retailing, health and education industries. He is currently on contracts with Pharmac and Aspire2025 (a Tobacco Control research collaboration) where, happily, he gets to use his data-wrangling powers for good.

This presentation focuses on analysing text, with Tobacco Control as the context. Examples include monitoring mentions of NZ's smokefree goal by politicians and examining media uptake of BATNZ's Agree/Disagree PR campaign. It covers common obstacles during data extraction, cleaning and analysis, along with the key Python and R packages you can use to help clear them.

Transcript

  1. • Scrapy (http://scrapy.org): Spiders → Items → Pipelines
     • R: readLines; the XML / RCurl / scrapeR packages; the tm package (Factiva plugin); twitteR
     • Python: Beautiful Soup; Pandas (eg, financial data)
     • http://blog.siliconstraits.vn/building-web-crawler-scrapy/
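
A minimal sketch of the Spiders → Items → Pipelines pattern named above; the target URL, CSS selectors and item fields are illustrative assumptions, not code from the talk:

      import scrapy

      class PressReleaseSpider(scrapy.Spider):
          name = "press_releases"
          # Hypothetical listing page; swap in the real source being monitored.
          start_urls = ["https://www.example.govt.nz/releases"]

          def parse(self, response):
              # Each yielded item (a plain dict here) flows on to any configured
              # pipelines, which can clean, deduplicate and store the text.
              for release in response.css("div.release"):
                  yield {
                      "title": release.css("h2 a::text").get(),
                      "url": response.urljoin(release.css("h2 a::attr(href)").get()),
                      "body": " ".join(release.css("div.body ::text").getall()),
                  }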
  2. • Translating text to consistent form
        – Scrapy returns unicode strings
        – Māori → Maori
        – SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
        – translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
        – cleaned_content = html_content.translate(translation_table)
     • Or…
        – test = u'Māori' (you already have unicode)
        – unidecode(test) (returns 'Maori')
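
The same two approaches in a short runnable sketch, assuming Python 3 (where str replaces the Python 2 unicode type) and the Unidecode package:

      from unidecode import unidecode   # pip install Unidecode

      SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
      translation_table = {ord(k): v for k, v in SWAPSET}

      html_content = u"Māori"
      print(html_content.translate(translation_table))   # -> "Maori"
      print(unidecode(u"Māori"))                          # -> "Maori", no table needed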
  3. • Dealing with non-Unicode
        – http://nedbatchelder.com/text/unipain.html
        – Some scraped html will be in latin1 (mismatch with UTF-8)
        – Have your datastore default to UTF-8
        – Learn to love whack-a-mole
     • Dealing with too many spaces
        – newstring = ' '.join(mystring.split())
        – Or… use re (see the sketch below)
     • Don't forget the metadata!
        – Define a common data structure early if you have multiple sources
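
A small sketch of the whitespace fix plus one common encoding fallback; the UTF-8-then-latin-1 decode order is an assumption about typical mis-labelled pages, not a rule from the talk:

      import re

      def normalise_space(text):
          # Collapse runs of whitespace (spaces, tabs, newlines) to single spaces.
          return re.sub(r"\s+", " ", text).strip()

      def to_unicode(raw_bytes):
          # Try UTF-8 first, then fall back to latin-1 for mis-labelled pages.
          try:
              return raw_bytes.decode("utf-8")
          except UnicodeDecodeError:
              return raw_bytes.decode("latin-1")

      print(normalise_space("too   many\t spaces \n here"))  # -> "too many spaces here"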
  4. Text Standardisation
     • Stopwords
        – "a, about, above, across, ... yourself, yourselves, you've, z"
     • Stemmers
        – "some sample stemmed words" → "some sampl stem word"
     • Tokenisers (eg, for bigrams)
        – BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
        – tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
        – 'and said', 'and security'
     • Tools: Natural Language Toolkit (Python), tm package (R)
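
A rough NLTK counterpart to the R/Weka example above, covering stopword removal, Porter stemming and bigram tokens; assumes nltk with the punkt and stopwords data already downloaded:

      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import PorterStemmer
      from nltk.util import ngrams

      text = "Some sample stemmed words about the smokefree goal"
      tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

      # Drop stopwords, then stem what remains.
      stops = set(stopwords.words("english"))
      tokens = [t for t in tokens if t not in stops]

      stemmer = PorterStemmer()
      stems = [stemmer.stem(t) for t in tokens]            # eg "sample" -> "sampl"
      bigrams = [" ".join(b) for b in ngrams(stems, 2)]    # cf the Weka bigram tokeniser
      print(stems)
      print(bigrams)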
  5. Text Standardisation

     libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")
     …
     cleanCorpus = function(corpus) {
       corpus.tmp = tm_map(corpus, tolower)                          # lower-case all terms
       corpus.tmp = tm_map(corpus.tmp, removePunctuation)
       corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
       corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
       return(corpus.tmp)
     }
     posts.corpus = cleanCorpus(posts.corpus)
     posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
  6. Text Standardisation
     • Using dictionaries for stem completion

     politi.tdm <- TermDocumentMatrix(politi.corpus)
     politi.tdm = removeSparseTerms(politi.tdm, 0.99)
     politi.tdm = as.matrix(politi.tdm)
     # get word counts in decreasing order, put these into a plain text doc.
     word_freqs = sort(rowSums(politi.tdm), decreasing = TRUE)
     length(word_freqs)
     smalldict = PlainTextDocument(names(word_freqs))
     politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion,
                                  dictionary = smalldict, type = "first")
  7. Deduplication
     • Python sets
        – shingles1 = set(get_shingles(record1['standardised_content']))
     • Shingling and Jaccard similarity (see the sketch below)
        – (a, rose, is, a, rose, is, a, rose)
        – {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
        – as a set: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
     • http://infolab.stanford.edu/~ullman/mmds/ch3.pdf → a free text
       http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
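
A sketch of shingling plus Jaccard similarity; get_shingles is not defined on the slide, so the 4-token shingler below is an assumed implementation:

      def get_shingles(text, k=4):
          # Slide every k-token window across the text to build shingles.
          tokens = text.lower().split()
          return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

      def jaccard(set_a, set_b):
          # |A ∩ B| / |A ∪ B|; 1.0 means identical shingle sets.
          if not set_a and not set_b:
              return 1.0
          return len(set_a & set_b) / len(set_a | set_b)

      record1 = {"standardised_content": "a rose is a rose is a rose"}
      record2 = {"standardised_content": "a rose is a rose"}

      shingles1 = set(get_shingles(record1["standardised_content"]))
      shingles2 = set(get_shingles(record2["standardised_content"]))
      print(jaccard(shingles1, shingles2))  # high score -> likely duplicates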
  8. Frequency Analysis
     • Document-Term Matrix
        – politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths = c(4, Inf)))
     • Frequent and co-occurring terms
        – findFreqTerms(politi.dtm, 5000)
           [1] "2011" "also" "announc" "area" "around"
           [6] "auckland" "better" "bill" "build" "busi"
        – findAssocs(politi.dtm, "smoke", 0.5)
           smoke  tobacco   quit  smokefre  smoker   2025  cigarett
            1.00     0.74   0.68      0.62    0.62   0.58      0.57
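
A rough pandas analogue of the findFreqTerms / findAssocs step, built on a tiny hand-made document-term matrix; the three example documents are invented, not data from the talk:

      import pandas as pd

      docs = [
          "smokefree 2025 goal smoke tobacco",
          "quit smoke quit tobacco smoker",
          "auckland transport announc build",
      ]
      vocab = sorted({w for d in docs for w in d.split()})
      # Rows = documents, columns = terms, cells = raw counts.
      dtm = pd.DataFrame([[d.split().count(w) for w in vocab] for d in docs],
                         columns=vocab)

      # Frequent terms (cf findFreqTerms): total counts across all documents.
      print(dtm.sum().sort_values(ascending=False).head(5))

      # Associated terms (cf findAssocs): correlation of the "smoke" column with the rest.
      print(dtm.corr()["smoke"].sort_values(ascending=False).head(5))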
  9. Classification
     • Exploration and feature extraction
        – Metadata gathered at time of collection (eg, Scrapy)
        – RODBC or MySQLdb with plain ol' SQL
        – Native or package functions for length of strings, sna, etc.
     • Unsupervised
        – nltk.cluster
        – tm, topicmodels, as.matrix(dtm) → kmeans, etc.
     • Supervised (see the sketch below)
        – First hurdle: Training set
        – nltk.classify
        – tm, e1071, others…
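
A toy sketch of the supervised route with nltk.classify; the labelled posts below are invented stand-ins for a real hand-coded training set:

      import nltk

      def word_features(text):
          # Simple bag-of-words presence features.
          return {w: True for w in text.lower().split()}

      train = [
          (word_features("I want to quit smoking this week"), "quit_intent"),
          (word_features("patches really helped with cravings"), "quit_intent"),
          (word_features("the rugby was great last night"), "other"),
          (word_features("what a sunny day in auckland"), "other"),
      ]

      classifier = nltk.NaiveBayesClassifier.train(train)
      print(classifier.classify(word_features("thinking about quitting smoking")))
      classifier.show_most_informative_features(5)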
  10. Post counts by user group
      • 2 posts or fewer: 846 users (41.0% of users), 1,157 posts (1.3% of posts)
      • More than 750 posts: 23 users (1.1% of users), 45,499 posts (50.1% of posts)
  11. • LDA (topicmodels)
      – New users
            Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
            good      smoke     just      smoke     feel
            day       time      day       quit      day
            thank     week      get       can       dont
            well      patch     realli    one       like
            will      start     think     will      still
      – Highly active users
            Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
            quit      good      day       like      feel
            smoke     one       well      day       thing
            can       take      great     your      just
            will      stay      done      now       get
            luck      strong    awesom    get       time
  12. • LDA (topicmodels) – Highly active users (HAU)
            Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
            quit      good      day       like      feel
            smoke     one       well      day       thing
            can       take      great     your      just
            will      stay      done      now       get
            luck      strong    awesom    get       time

      Topic proportions by user:
                               Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
            HAU1 (F, 38, PI)      18%      14%      40%       8%      20%
            HAU2 (F, 33, NZE)     31%      21%      27%       6%      16%
            HAU3 (M, 48, NZE)     16%       9%      21%      49%       5%
  13. Recap
      • Your text will probably be messy
         – Python, R-based tools reduce the pain
      • Simple analyses can generate useful insight
      • Combine with data of other types for context
         – source, quantities, dates, network position, history
      • May surface useful features for classification
      Slides, Code: [email protected]