A Blocks-Based Introduction to Text Analysis

In this talk, we will present our ongoing work introducing computational thinking to humanists as part of the Computational Thinking and Learning Initiative (CTLI) at Vanderbilt University. Our approach was specifically tailored toward text analysis and toward exploring how quantitative approaches can complement existing qualitative techniques in literary scholarship. We found that blocks-based programming supported a powerful paradigm of interaction and facilitated a deep understanding of the content. Furthermore, we explored the use of a blocks-based environment to integrate a diverse set of related tools, including data storage and exploration.

For more information, check out https://medium.com/swlh/teaching-text-mining-online-dfd94926b18e

Brian Broll

July 31, 2020

Transcript

  1. A Blocks-Based Introduction to Text Analysis

    Institute for Software Integrated Systems, Vanderbilt University
    Brian Broll, Clifford Anderson, Sarah Burriss, Corey Brady, Mark Schoenfield
  2. Meet the Team

    Sarah Burriss, Doctoral Student, Department of Teaching & Learning, Vanderbilt University
    Mark Schoenfield, Professor of English, Vanderbilt University
    And me,
    Corey Brady, Assistant Professor, Learning Sciences, Vanderbilt
    Brian Broll, Research Scientist, Vanderbilt University
  3. Research Questions: an origin

    A Culture of Litigation: 1765-1835
    A project underway on desktops (wooden and virtual) and on library shelves
  4. Research Questions

    ▪ Could a systematic computer-assisted analysis of the 4m articles in BP supplement a close-reading analysis of a subsection?
    ▪ Could it improve the selection of that subsection?
    ▪ Could it clarify or help discover new questions?
    1. Cross-examination as a popularized term, as well as evidence of public attitudes toward it
    2. Developing respect for juries, and the tension between law/facts paralleling that between judge/jury
    3. Generalization and discussions of the term jury and its relationship to citizenship and the public versus the people argument
    4. Judges, lawyers, and other agents of the court as celebrity figures. As a subsection: celebrity figures as a phenomenon. The wit and witticisms of lawyers, their public reputations
  5. Motivating Question

    Can we use blocks-based programming (via NetsBlox) to aid both in these research questions and students’ access to text analytical concepts?
  6. Brief Intro to NetsBlox

    ▪ NetsBlox is an extension of Snap! which provides many new features such as:
      ▪ Networking Capabilities
      ▪ Undo Capabilities
      ▪ Collaborative Editing
      ▪ Shared Projects
      ▪ Sharing libraries
    ▪ One of the new networking concepts is Remote Procedure Calls, which enable users to invoke code implemented remotely. Examples include:
      ▪ Google Maps
      ▪ Cloud Variables
  7. Text Analysis in NetsBlox

    ▪ We explored a number of different questions within NetsBlox pertaining to learning text analysis concepts:
      ▪ Can we enable the students to interactively probe machine learning models to learn about their strengths and weaknesses? Can we hypothesize about the causes based on this interaction?
        ▪ Sentimental Writer Example
      ▪ Can we introduce students to word embeddings?
        ▪ Word Embeddings Example
      ▪ Can we enable students to train their own word embeddings?
        ▪ Training Word Embeddings on Middlemarch
  8. FAQ

    ▪ I can’t find the “TextAnalysis” category in my services?
      ▪ These are private, auxiliary services and must be explicitly enabled for individual users or groups
      ▪ For more information, check out Services Overview
  9. Text Analysis Exercises

    This subsection contains more details about the motivation, goals, and discussion topics for each of the presented projects.
  10. Probing Existing Models

    ▪ Motivation: Interaction with machine learning models for text analysis could facilitate a better understanding of their strengths and weaknesses.
    ▪ Question: Can we enable the students to interactively probe machine learning models to form hypotheses about their shortcomings?
    ▪ Approach: Using the ParallelDots Service, create a typewriter which color-codes the text based on the sentiment. Then explore the predictions made by ParallelDots by writing with the typewriter!
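
The color-coding idea behind the typewriter can be sketched outside NetsBlox. The following Python is only an illustration under stated assumptions: `toy_sentiment` is a hypothetical word-counting stand-in, not the ParallelDots service, and both the score format (one number each for positive, neutral, and negative) and the color choices are assumptions made for this sketch.

```python
# Hypothetical word lists for the stand-in model (invented for illustration).
POSITIVE_WORDS = {"good", "happy", "love", "wonderful"}
NEGATIVE_WORDS = {"bad", "sad", "hate", "terrible"}

# Assumed mapping from sentiment label to display color.
COLORS = {"positive": "green", "neutral": "gray", "negative": "red"}


def toy_sentiment(text):
    """Stand-in for a real sentiment service: scores text by counting listed words."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if pos == 0 and neg == 0:
        return {"positive": 0.0, "neutral": 1.0, "negative": 0.0}
    total = pos + neg
    return {"positive": pos / total, "neutral": 0.0, "negative": neg / total}


def sentiment_color(scores):
    """Return the color for the highest-scoring sentiment label."""
    label = max(scores, key=scores.get)
    return COLORS.get(label, "black")


if __name__ == "__main__":
    # Probe the stand-in model the way students probe the real one.
    for probe in ["I love this wonderful day", "i LOVE this WONDERFUL day!!!", "This is terrible."]:
        print(probe, "->", sentiment_color(toy_sentiment(probe)))
```

Because the stand-in lowercases its input and strips punctuation, it is insensitive to ALL CAPS and extra exclamation marks by construction. A trained model offers no such guarantee, which is what makes probing it informative.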
  11. Probing Existing Models

    ▪ Student Questions:
      ▪ Can I fool the model?
      ▪ What if I use historic text?
      ▪ What if I use long sentences?
      ▪ What if I use unnatural punctuation?
      ▪ Does it care if I use ALL CAPS?
    ▪ Discussion Questions:
      ▪ Why do I think ____ fools the model (or doesn’t)?
      ▪ What is the impact of training data on the resultant model? Is this an error with the training itself or the data?
  12. Word Embeddings

    ▪ Motivation: Word embeddings can be trained in an unsupervised way (i.e., we don’t need to label the training data by hand), so they are a good candidate for exploring the 4+ million documents of interest.
    ▪ Question: Can we introduce students to word embeddings so they can understand how embeddings might be used to support or reject hypotheses?
    ▪ Approach: Use the WordEmbeddings service (only available to members of the class) to enable students to retrieve pre-trained word embeddings.
    ▪ Important Concepts: Vectors, Vector Spaces, Cosine Similarity, Euclidean Distance
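
The two comparison measures named under Important Concepts can be shown in a few lines of Python. This is a minimal sketch: the three-dimensional vectors are invented toy values rather than output of the WordEmbeddings service, and `most_similar` is a helper name introduced here for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean_distance(u, v):
    """Straight-line distance between u and v (0.0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "judge": [0.9, 0.1, 0.3],
    "jury": [0.8, 0.2, 0.4],
    "picnic": [0.1, 0.9, 0.2],
}


def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    print(most_similar("judge", toy_embeddings))
    print(euclidean_distance(toy_embeddings["judge"], toy_embeddings["picnic"]))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both, so the two measures can rank neighbors differently.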
  13. Training Word Embeddings

    ▪ Motivation: Training word embeddings could enable students to find quantitative evidence for hypotheses about the data.
    ▪ Question: Can we enable students to train their own word embeddings from scratch within NetsBlox?
    ▪ Approach: Using the Word2Vec and Datasets services, students can incrementally build datasets and then train word embeddings on the dataset.
  14. Training Word Embeddings

    ▪ Student Questions:
      ▪ Can I train a language model using the Middlemarch text?
      ▪ Should I trust the model?
    ▪ Discussion Questions:
      ▪ Can I trust the results of the model?
      ▪ Did the model have enough data? How can I check that the results are not due to the random initialization of the model?
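
One possible way to approach the random-initialization question is to train the model several times with different seeds and measure how much the nearest neighbors of a query word agree across runs. The sketch below covers only the comparison step. The neighbor lists are invented placeholders, not results from any trained model, and `neighbor_stability` is a name introduced here for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def neighbor_stability(neighbor_lists):
    """Mean pairwise Jaccard overlap of top-k neighbor lists from separate runs."""
    pairs = list(combinations(neighbor_lists, 2))
    if not pairs:
        raise ValueError("need neighbor lists from at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder top-3 neighbors of one query word from three hypothetical runs.
runs = [
    ["alpha", "beta", "gamma"],
    ["alpha", "beta", "delta"],
    ["beta", "alpha", "gamma"],
]

if __name__ == "__main__":
    print(round(neighbor_stability(runs), 3))
```

A score near 1.0 suggests the neighbors do not depend much on the seed. A low score suggests the result may be an artifact of initialization or of too little training data.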