Slide 1

Visual Pipelines for Text Analysis
Benjamin Bengfort
District Data Labs

Slide 2

Visual Steering of Text ML Models
- Language Aware Data Products
- Pipelines are essential to text models.
- Yellowbrick for:
  - Topic Models
  - Sentiment Analysis

Slide 3

Language Aware Data Products Are Not Necessarily:
- Natural Language Understanding (AI): models for semantic understanding, reasoning, and generation of natural languages for human-computer interaction.
- Computational Linguistics (NLP): approaches that demonstrate how humans interpret and understand language and show how languages evolve.

Slide 4

Sidebar: An Academic Context
- Language Models and Perplexity
- Information Extraction/Retrieval
[Diagram: a neural language model with input, projection, hidden, and output layers]

Slide 5

Text Models

Slide 6

Instances = Documents or Utterances (no matter their size)

Slide 7

Text as Features in Machine Learning

Is “Bag of Words” a bad thing?
- Discourse Features
- Syntactic Features
- Morphological Features
- N-Grams
- Grammar Extracted Phrases
- Named Entities
- Chunks
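
N-gram features, for example, can be extracted directly during vectorization. A minimal sketch (not from the slides):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) emits both single tokens and adjacent token pairs.
vect = CountVectorizer(ngram_range=(1, 2))
X = vect.fit_transform(["Bats can see via echolocation."])

# get_feature_names() in older scikit-learn; get_feature_names_out() in newer releases.
print(vect.get_feature_names())
# ['bats', 'bats can', 'can', 'can see', 'echolocation', 'see', 'see via', 'via', 'via echolocation']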

Slide 8

Vectorization

Example documents:
- The elephant sneezed at the sight of potatoes.
- Bats can see via echolocation. See the bat sight sneeze!
- Wondering, she opened the door to the studio.

Token counts for the second document:
at 0, bat 2, can 1, door 0, echolocation 1, elephant 0, of 0, open 0, potato 0, see 2, she 0, sight 1, sneeze 1, studio 0, the 1, to 0, via 1, wonder 0
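
A minimal sketch (not from the slides) of producing such count vectors with scikit-learn; note that a raw CountVectorizer does not stem or lemmatize, so "bats" and "bat" remain separate terms:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

# Each document becomes one row of token counts over the shared vocabulary.
vect = CountVectorizer()
X = vect.fit_transform(corpus)

print(vect.get_feature_names())
print(X.toarray())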

Slide 9

High Dimensional Space!

The Baleen sample contains 2,021 files in 6 categories, structured as:
- 36,816 paragraphs (18.217 mean paragraphs per file)
- 61,597 sentences (1.673 mean sentences per paragraph)
- Word count of 1,365,829 with a vocabulary of 51,227 (26.662 lexical diversity)
- Corpus scan took 2.7726128101348877 seconds

Reductions:
- Lemmatization, Stemming
- Truncation
- Dimensionality Reduction
- Embeddings
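
For the lemmatization and stemming reductions listed above, a minimal NLTK sketch (an illustration, not the Baleen code):

from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()   # requires the NLTK 'wordnet' corpus

print(stemmer.stem('running'))           # 'run'
print(lemmatizer.lemmatize('potatoes'))  # 'potato'
print(lemmatizer.lemmatize('geese'))     # 'goose'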

Slide 10

We’re doing feature engineering!

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD()),
    ('model', SGDClassifier()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'svd__n_components': (500, 1000, 2000, 5000),
    'model__alpha': (0.00001, 0.000001),
    'model__penalty': ('l2', 'elasticnet'),
}

search = GridSearchCV(pipeline, parameters)
search.fit(X, y)
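
Once the search finishes, the winning combination can be inspected; a brief usage note, not from the slide:

print(search.best_score_)
print(search.best_params_)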

Slide 11

Text Pipelines

[Diagram: a simple text pipeline: Data Loader → Text Normalization → Text Vectorization → Feature Decomposition → Estimator]
[Diagram: a feature union pipeline: Data Loader → Feature Union combining Text Normalization/Document Features, Text Extraction with Summary Vectorization, Article Vectorization and Concept Features, and Metadata Features via a Dict Vectorizer → Estimator]
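
A minimal sketch of the feature union pattern in the second diagram, with a hypothetical MetadataFeatures transformer standing in for the talk's document, concept, and metadata extractors:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

class MetadataFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: simple per-document metadata as dicts."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [
            {"n_chars": len(doc), "n_words": len(doc.split())}
            for doc in X
        ]

model = Pipeline([
    ('features', FeatureUnion([
        ('text', TfidfVectorizer()),
        ('meta', Pipeline([
            ('extract', MetadataFeatures()),
            ('vect', DictVectorizer()),
        ])),
    ])),
    ('clf', SGDClassifier()),
])

# model.fit(docs, labels)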

Slide 12

Commonly Used Models
- Naive Bayes (FAST)
- Maximum Entropy (Logistic Regression)
- SVMs (usually linear) or SGD
- RBMs
- Perceptrons
- Hidden Markov Models
- LDA/LSA
- DBSCAN/Birch
- KMeans

Slide 13

The Model Selection Triple (Arun Kumar, http://bit.ly/2abVNrI)
- Feature Analysis
- Algorithm Selection
- Hyperparameter Tuning

Slide 14

Visual Steering

Slide 15

The basic text visualization is …

from yellowbrick.text import FrequencyVisualizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
docs = vect.fit_transform(docs)

viz = FrequencyVisualizer(
    features=vect.get_feature_names()
)
viz.fit(docs)
viz.poof()

Slide 16

The Visual Analytics Mantra: Overview First, Zoom and Filter, Details on Demand

J. Heer and B. Shneiderman, “Interactive dynamics for visual analysis,” Queue, vol. 10, no. 2, p. 30, 2012.

Slide 17

Conditional Frequency by Label

from yellowbrick.text import FrequencyVisualizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
docs = vect.fit_transform(docs)

viz = FrequencyVisualizer(
    features=vect.get_feature_names()
)
viz.fit(docs, labels)
viz.poof()

Slide 18

“TED Word Flows” Santiago Ortiz

Slide 19

Topic Modeling

Slide 20

Topic Modeling

[Diagram: training text/documents → feature vectors → clustering algorithm → topics; a new document → feature vector → similarity to Topic A, Topic B, Topic C]

Slide 21

The Basic Topic Model - LDA

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

# TextNormalizer and identity are custom helpers defined elsewhere in the talk.
# n_topics was renamed n_components in later scikit-learn releases.
model = Pipeline([
    ('norm', TextNormalizer()),
    ('tfidf', TfidfVectorizer(
        tokenizer=identity, preprocessor=None, lowercase=False
    )),
    ('lda', LDA(n_topics=50)),
])

Sample topics from the corpus:
Topic #0: jim abenomics japan japanese rickards institution reckoning q please
Topic #1: favoriting weidenhammer udacity daydream allo marks thrun raby seo aquafaba
Topic #2: kearny secaucus crowhurst 2015 stringer loopholes skyline billowing newark hoboken
Topic #3: megayachts mischa lovekin superyacht cw_anderson enchanted flickrallen sevigny zeppelin chloe
Topic #4: yotel hubspot unplugged disrupted benioff maranhão lençóis maranhenses vitormarigoduring quora
...
Topic #10: github zube dewalt combinator jira scrum 42floors zendesk logins zenhub
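
For reference, a minimal sketch of how topic word lists like these can be printed from a fitted pipeline (assuming the model pipeline above):

import numpy as np

tfidf = model.named_steps['tfidf']
lda = model.named_steps['lda']
terms = tfidf.get_feature_names()   # vocabulary in column order

for idx, weights in enumerate(lda.components_):
    # the ten highest-weighted terms for this topic
    top = np.argsort(weights)[::-1][:10]
    print("Topic #{}: {}".format(idx, " ".join(terms[i] for i in top)))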

Slide 22

Topic feature importance distributions!

Slide 23

Visualizing Topic Space - Projections

[Plots: 2D projections of the corpus as TF-IDF vectors and as frequency vectors]
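
One way to build such a projection (a sketch assuming docs holds the vectorized corpus, not the exact code behind the plots):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# Project the sparse document-term matrix down to two components for plotting.
svd = TruncatedSVD(n_components=2)
points = svd.fit_transform(docs)

plt.scatter(points[:, 0], points[:, 1], alpha=0.5)
plt.title("Documents projected with TruncatedSVD")
plt.show()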

Slide 24

Comparing Topics: pyLDAvis for Interactive Exploration

Slide 25

Visualizing Document Space - TSNE

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

vect = Pipeline([
    ('norm', TextNormalizer()),
    ('tfidf', TfidfVectorizer(
        tokenizer=identity, preprocessor=None, lowercase=False
    )),
])

docs = vect.fit_transform(docs)

tsne = TSNEVisualizer()
tsne.fit(docs, labels)
tsne.poof()

Slide 26

Spherical Topics with K-Means

Text is typically very sparse. However, with a different distance metric that is spherical, we can regain density:
- Cosine
- Jaccard
- Soundex

This opens up the possibility of using other clustering algorithms effectively (see the sketch below).
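
A minimal sketch of the cosine case: L2-normalizing TF-IDF vectors places documents on the unit hypersphere, where ordinary Euclidean K-Means approximates spherical (cosine-based) clustering:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

spherical = Pipeline([
    ('tfidf', TfidfVectorizer()),      # norm='l2' by default
    ('norm', Normalizer(norm='l2')),   # explicit, in case the norm is changed
    ('kmeans', KMeans(n_clusters=12)),
])

# spherical.fit(docs)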

Slide 27

Selecting K: Elbow Curves

from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import KElbowVisualizer

viz = KElbowVisualizer(
    MiniBatchKMeans(), k=20
)
viz.fit(Xt.toarray())
viz.poof()

Slide 28

Selecting K - Silhouette Scores

from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

viz = SilhouetteVisualizer(KMeans(12))
viz.fit(Xt.toarray())
viz.poof()

Slide 29

Sentiment Analysis

Slide 30

Visualizing Tokenization and Segmentation

Slide 31

Sentiment Analysis

[Diagram: supervised classification workflow: training instances and labels → feature vectors → classification algorithm → predictive model; a new instance → feature vector → predicted label]

Slide 32

“Edging is one of those necessary evils if you want a great looking house. Blistered hands and time wasted, I decided there had to be a better way and this is surely it.”

Slide 33

Part of Speech Tagging

Sample review text:
"I used to use a really primitive manual edger that I inherited from my father. Blistered hands and time wasted, I decided there had to be a better way and this is surely it. Edging is one of those necessary evils if you want a great looking house. I don't edge every time I mow. Usually I do it every other time. The first time out after a long winter, edging usually takes a little longer. After that, edging is a snap because you are basically in maintanence mode. I also use this around my landscaping and flower beds with equally great results. The blade on the Edge Hog is easily replaceable and the tell tale sign to replace it is when the edge starts to look a little rough and the machine seems slower."

import nltk
from yellowbrick.text import PosTagVisualizer

tokens = list(
    nltk.pos_tag(
        nltk.word_tokenize(text)
    )
)

viz = PosTagVisualizer()
print(viz.transform(tokens))

Slide 34

Syntax Analysis

Edging is one of those necessary evils if you want a great looking house.

[Parse tree: part-of-speech tags (NNP VBZ CD IN DT JJ NNS IN PRP VBP DT NN JJ VBG) and phrase labels (NP, ADJP, PP, VP, SBAR, S) for the sentence above]
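
A minimal NLTK sketch (an illustration, not the parser behind the slide) that tags the sentence and chunks noun phrases with a toy grammar:

import nltk

sentence = "Edging is one of those necessary evils if you want a great looking house."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# A toy noun phrase grammar; a real shallow parser uses much richer rules.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))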

Slide 35

Class Balance

from sklearn.naive_bayes import MultinomialNB
from yellowbrick.classifier import ClassBalance

model = MultinomialNB()
visualizer = ClassBalance(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()

Slide 36

Classification Reports

Slide 37

Confusion Matrices

Slide 38

ROC/AUC Curves
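
The classification report, confusion matrix, and ROC/AUC visualizers on the last three slides all follow the same fit/score/poof pattern as ClassBalance; a minimal sketch, assuming the X_train/X_test split used earlier:

from sklearn.naive_bayes import MultinomialNB
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix, ROCAUC

for Viz in (ClassificationReport, ConfusionMatrix, ROCAUC):
    visualizer = Viz(MultinomialNB())
    visualizer.fit(X_train, y_train)
    visualizer.score(X_test, y_test)
    visualizer.poof()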

Slide 39

Yellowbrick

Slide 40

Yellowbrick at a Glance
- Extend the Scikit-Learn API.
- Enhance the model selection process.
- Tools for feature visualization, visual diagnostics, and visual steering.
- Visualize the model space.

Slide 41

The Scikit-Learn API

Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).

class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of self
        return self

    def predict(self, X):
        """Predict response of X."""
        # compute predictions pred
        return pred


class Transformer(Estimator):

    def transform(self, X):
        """Transforms the input data."""
        # transform X to X_prime
        return X_prime


class Pipeline(Transformer):

    @property
    def named_steps(self):
        """Returns a sequence of estimators."""
        return self.steps

    @property
    def _final_estimator(self):
        """Terminating estimator."""
        return self.steps[-1]

Slide 42

The Matplotlib API

The matplotlib API combines two primary components:
- pyplot: procedural descriptions of visualization.
- artists: object-oriented construction of visual elements.

Drawing occurs globally and renders on demand.
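
A minimal illustration of the two styles side by side (toy data, standard matplotlib calls):

import matplotlib.pyplot as plt

# Procedural (pyplot) style: draws on an implicit, global figure.
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("pyplot style")

# Object-oriented (artist) style: explicit Figure and Axes objects.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
ax.set_title("Axes/artist style")

plt.show()   # rendering happens on demand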

Slide 43

Yellowbrick

The trick: combine functional/procedural matplotlib with object-oriented Scikit-Learn.

Slide 44

Visualizers

A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to allow human insight into the modeling process.

class Visualizer(Estimator):

    def draw(self):
        """Draw the data."""
        self.ax.plot()

    def finalize(self):
        """Complete the figure."""
        self.ax.set_title()

    def poof(self):
        """Show the figure."""
        plt.show()

Slide 45

Scikit-Learn Pipelines: fit() and predict()

[Diagram: pipelines as a Data Loader followed by chained Transformers and a final Estimator, shown for both fit() and predict()]

Slide 46

Yellowbrick Visual Pipelines

[Diagram: Data Loader → Transformer(s) → Feature Visualization → Estimator, annotated with fit(), draw(), and predict()]
[Diagram: Data Loader → Transformer(s) → EstimatorCV → Evaluation Visualization, annotated with fit(), predict(), score(), and draw()]

Slide 47

Model Selection Pipelines

[Diagram: Data Loader → Transformer(s) → multiple Estimators, each under Cross Validation, feeding a Multi-Estimator Visualization]

Slide 48

Contributions Welcome!