Slide 1

Slide 1 text

Institute for Software Integrated Systems, Vanderbilt University
A Blocks-Based Introduction to Text Analysis
Brian Broll [email protected]
Clifford Anderson, Sarah Burriss, Corey Brady, Mark Schoenfield

Slide 2

Slide 2 text

Meet the Team
Sarah Burriss, Doctoral Student, Department of Teaching & Learning, Vanderbilt University
Mark Schoenfield, Professor of English, Vanderbilt University
And me
Corey Brady, Assistant Professor, Learning Sciences, Vanderbilt
Brian Broll, Research Scientist, Vanderbilt University

Slide 3

Slide 3 text

Research Questions: an origin
A Culture of Litigation: 1765-1835
A project underway on desktops (wooden and virtual) and on library shelves

Slide 4

Slide 4 text

Research Questions

Slide 5

Slide 5 text

Research Questions
▪ Could a systematic computer-assisted analysis of the 4 million articles in BP supplement a close-reading analysis of a subsection?
▪ Could it improve the selection of that subsection?
▪ Could it clarify or help discover new questions?
1. Cross-examination as a popularized term, as well as evidence of public attitudes toward it
2. Developing respect for juries, and the tension between law/facts paralleling that between judge/jury
3. Generalization and discussions of the term jury and its relationship to citizenship and the public versus the people argument
4. Judges, lawyers, and other agents of the court as celebrity figures. As a subsection: celebrity figures as a phenomenon; the wit and witticisms of lawyers; their public reputations

Slide 6

Slide 6 text

Institutional Context

Slide 7

Slide 7 text

Motivating Question
Can we use blocks-based programming (via NetsBlox) to aid both these research questions and students’ access to text-analysis concepts?

Slide 8

Slide 8 text

Brief Intro to NetsBlox
▪ NetsBlox is an extension of Snap! that provides many new features, such as:
  ▪ Networking Capabilities
  ▪ Undo Capabilities
  ▪ Collaborative Editing
  ▪ Shared Projects
  ▪ Sharing libraries
▪ One of the new networking concepts is Remote Procedure Calls (RPCs), which enable users to invoke code implemented remotely. Examples include:
  ▪ Google Maps
  ▪ Cloud Variables
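
To make the Remote Procedure Call idea concrete, here is a minimal Python sketch. It is not how NetsBlox implements RPCs; it only simulates the pattern in a single process. The caller names a service, a procedure, and its arguments; the request is serialized to JSON as it would be on the wire; and a dispatcher on the "remote" side looks up and runs the code. The service name `Echo`, its procedures, and the helpers `handle_request` and `call_rpc` are all hypothetical.

```python
import json

# Hypothetical "remote" side: a registry of services and their procedures.
SERVICES = {
    "Echo": {
        "shout": lambda text: text.upper() + "!",
        "length": lambda text: len(text),
    }
}


def handle_request(raw_request: str) -> str:
    """Server side: decode a request, run the named procedure, encode the reply."""
    request = json.loads(raw_request)
    try:
        procedure = SERVICES[request["service"]][request["procedure"]]
        reply = {"ok": True, "result": procedure(**request["args"])}
    except KeyError as missing:
        reply = {"ok": False, "error": f"unknown name: {missing}"}
    return json.dumps(reply)


def call_rpc(service: str, procedure: str, **args):
    """Client side: package the call, 'send' it, and unpack the reply."""
    raw_request = json.dumps(
        {"service": service, "procedure": procedure, "args": args}
    )
    raw_reply = handle_request(raw_request)  # stands in for a network round trip
    reply = json.loads(raw_reply)
    if not reply["ok"]:
        raise RuntimeError(reply["error"])
    return reply["result"]
```

Calling `call_rpc("Echo", "shout", text="hello")` returns `"HELLO!"`. The caller never sees the procedure's code, only its name, its arguments, and its result, which is the property that lets a blocks-based environment expose services such as maps or cloud variables as ordinary blocks.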

Slide 9

Slide 9 text

Text Analysis in NetsBlox
▪ We explored a number of different questions within NetsBlox pertaining to learning text analysis concepts:
  ▪ Can we enable the students to interactively probe machine learning models to learn about their strengths and weaknesses? Can we hypothesize about the causes based on this interaction?
    ▪ Sentimental Writer Example
  ▪ Can we introduce students to word embeddings?
    ▪ Word Embeddings Example
  ▪ Can we enable students to train their own word embeddings?
    ▪ Training Word Embeddings on Middlemarch

Slide 10

Slide 10 text

Thank you!

Slide 11

Slide 11 text

Appendix
Additional information about presented topics can be found in this section.

Slide 12

Slide 12 text

FAQ
▪ Why can’t I find the “TextAnalysis” category in my services?
  ▪ These are private, auxiliary services and must be explicitly enabled for individual users or groups
  ▪ For more information, check out Services Overview

Slide 13

Slide 13 text

Text Analysis Exercises
This subsection contains more details about the motivation, goals, and discussion topics for each of the presented projects.

Slide 14

Slide 14 text

Probing Existing Models
▪ Motivation: Interacting with machine learning models for text analysis could facilitate a better understanding of their strengths and weaknesses.
▪ Question: Can we enable students to interactively probe machine learning models to form hypotheses about their shortcomings?
▪ Approach: Using the ParallelDots service, create a typewriter that color-codes the text based on its sentiment. Then explore the predictions made by ParallelDots by writing with the typewriter!
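
The color-coding step of the typewriter can be sketched in Python as follows. This is an illustration under stated assumptions, not the NetsBlox project itself: `toy_sentiment` is a hypothetical word-list classifier that merely stands in for the remote sentiment service (it is not the ParallelDots API), and the word lists and color names are invented for the example.

```python
import re

# Hypothetical stand-in for a remote sentiment service; NOT the ParallelDots API.
POSITIVE = {"love", "great", "happy", "wonderful"}
NEGATIVE = {"hate", "awful", "sad", "terrible"}

# Assumed mapping from sentiment label to display color.
COLORS = {"positive": "green", "negative": "red", "neutral": "gray"}


def toy_sentiment(sentence: str) -> str:
    """Label a sentence by counting words from two tiny hand-made lists."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def color_code(text: str) -> list[tuple[str, str]]:
    """Split text into sentences and pair each one with a color."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, COLORS[toy_sentiment(s)]) for s in sentences]
```

For the input `"I love this book. The ending was awful! It has pages."`, `color_code` pairs the three sentences with `green`, `red`, and `gray`. A stand-in this simple is easy to fool (negation such as "not great" still scores as positive), which is the kind of weakness the probing exercise asks students to look for in a real model.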

Slide 15

Slide 15 text

Probing Existing Models
▪ Student Questions:
  ▪ Can I fool the model?
  ▪ What if I use historical text?
  ▪ What if I use long sentences?
  ▪ What if I use unnatural punctuation?
  ▪ Does it care if I use ALL CAPS?
▪ Discussion Questions:
  ▪ Why do I think ____ fools the model (or doesn’t)?
  ▪ What is the impact of training data on the resultant model? Is this an error with the training itself or the data?

Slide 16

Slide 16 text

Word Embeddings
▪ Motivation: Word embeddings can be trained in an unsupervised way (i.e., we don’t need to label the training data by hand), so they are a good candidate for exploring the 4+ million documents of interest.
▪ Question: Can we introduce students to word embeddings so they can understand how embeddings might be used to support or reject hypotheses?
▪ Approach: Use the WordEmbeddings service (only available to members of the class) to enable students to retrieve pre-trained word embeddings.
▪ Important Concepts: Vectors, Vector Spaces, Cosine Similarity, Euclidean Distance
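
The two similarity measures named under Important Concepts can be sketched in a few lines of Python. The three-dimensional vectors in `EMBEDDINGS` are hypothetical values chosen for illustration (real embeddings typically have hundreds of dimensions and would come from a pre-trained model), and `most_similar` is an assumed helper name, not part of any service.

```python
import math


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Length (Euclidean norm) of a vector."""
    return math.sqrt(dot(u, u))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    return dot(u, v) / (norm(u) * norm(v))


def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy embeddings, invented for this example.
EMBEDDINGS = {
    "judge": [0.9, 0.8, 0.1],
    "jury": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def most_similar(word, k=1):
    """Return the k other words whose vectors have the highest cosine similarity."""
    target = EMBEDDINGS[word]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in EMBEDDINGS.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]
```

With these toy vectors, `most_similar("judge")` returns `["jury"]`. Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both, so the two measures can rank neighbors differently when vectors differ greatly in magnitude.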

Slide 17

Slide 17 text

Training Word Embeddings
▪ Motivation: Training word embeddings could enable students to find quantitative evidence for hypotheses about the data.
▪ Question: Can we enable students to train their own word embeddings from scratch within NetsBlox?
▪ Approach: Using the Word2Vec and Datasets services, students can incrementally build datasets and then train word embeddings on them.
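
As a rough illustration of what "building a dataset" for Word2Vec-style training involves, the Python sketch below prepares skip-gram training pairs, in which each word is paired with the words that appear near it. It covers only data preparation, not the training itself, and it is not the implementation behind the Word2Vec or Datasets services. The function names, the simple tokenizer, and the default window size are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def skipgram_pairs(tokens, window=2):
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def build_dataset(chunks, window=2):
    """Incrementally grow a dataset by adding pairs from one text chunk at a time."""
    dataset = []
    for chunk in chunks:
        dataset.extend(skipgram_pairs(tokenize(chunk), window))
    return dataset
```

For example, `skipgram_pairs(["the", "jury", "decides"], window=1)` yields `[("the", "jury"), ("jury", "the"), ("jury", "decides"), ("decides", "jury")]`. A skip-gram model is then trained to predict the second word of each pair from the first, which is why words that share contexts end up with similar vectors.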

Slide 18

Slide 18 text

Training Word Embeddings
▪ Student Questions:
  ▪ Can I train a language model using the Middlemarch text?
  ▪ Should I trust the model?
▪ Discussion Questions:
  ▪ Can I trust the results of the model?
  ▪ Did the model have enough data? How can I check that the results are not due to the random initialization of the model?
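
One possible way to approach the random-initialization question is to train the model several times with different seeds and measure how much the nearest neighbors of a query word agree across runs. The Python sketch below shows only the comparison step, under the assumption that each run has already produced a list of nearest neighbors. The function names are hypothetical, and the neighbor lists in `PLACEHOLDER_RUNS` are placeholder tokens, not output from any real model.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def neighbor_stability(runs):
    """Average pairwise Jaccard overlap of neighbor lists from different seeds.

    `runs` holds one neighbor list per training run for the same query word.
    Values near 1.0 suggest the neighbors do not depend much on the seed;
    values near 0.0 suggest they are mostly an artifact of initialization.
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder neighbor lists for three imagined runs (not real results).
PLACEHOLDER_RUNS = [
    ["alpha", "beta", "gamma"],
    ["alpha", "beta", "delta"],
    ["alpha", "gamma", "delta"],
]
```

Here `neighbor_stability(PLACEHOLDER_RUNS)` is 0.5, since every pair of runs shares two of four distinct neighbors. On a small corpus such as a single novel, low stability would be one sign that the model did not have enough data.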