A Blocks-Based Introduction to Text Analysis

In this talk, we will present our ongoing work introducing computational thinking to humanists as part of the Computational Thinking and Learning Initiative (CTLI) at Vanderbilt University. Our approach was specifically tailored toward text analysis and toward exploring how quantitative approaches can complement existing qualitative techniques in literary scholarship. We found that blocks-based programming supported a powerful paradigm of interaction and facilitated a deep understanding of the content. Furthermore, we explored the use of a blocks-based environment to integrate a diverse set of related tools, including tools for data storage and exploration.

For more information, check out https://medium.com/swlh/teaching-text-mining-online-dfd94926b18e

Brian Broll

July 31, 2020

Transcript

  1. Institute for Software Integrated Systems
    Vanderbilt University
    A Blocks-Based Introduction
    to Text Analysis
    Brian Broll
    [email protected]
    Clifford Anderson, Sarah Burriss, Corey Brady,
    Mark Schoenfield

  2. Meet the Team
    Sarah Burriss
    Doctoral Student
    Department of
    Teaching & Learning,
    Vanderbilt University
    Mark Schoenfield,
    Professor of English,
    Vanderbilt University
    And me,
    Corey Brady
    Assistant Professor
    Learning Sciences, Vanderbilt
    Brian Broll
    Research Scientist,
    Vanderbilt University

  3. Research Questions: an origin
    A Culture of Litigation:
    1765-1835
    A project underway
    on desktops (wooden
    and virtual) and on
    library shelves

  4. Research Questions

  5. Research Questions
    ▪ Could a systematic computer-assisted analysis of the 4m
    articles in BP supplement a close-reading analysis of a
    subsection?
    ▪ Could it improve the selection of that subsection?
    ▪ Could it clarify or help discover new questions?
    1. Cross-examination as a popularized term, as well as evidence of public attitudes
    toward it
    2. Developing respect for juries; and the tension between law/facts paralleling that
    between judge/jury
    3. Generalization and discussions of the term jury and its relationship to citizenship and
    the public versus the people argument
    4. Judges, lawyers, and other agents of the court as celebrity figures. As a subsection,
    celebrity figures as a phenomenon. The wit and witticisms of lawyers, and their public
    reputations

  6. Institutional Context

  7. Motivating Question
    Can we use blocks-based programming (via
    NetsBlox) to aid both in these research questions
    and students’ access to text analytical concepts?

  8. Brief Intro to NetsBlox
    ▪ NetsBlox is an extension of Snap! which provides many
    new features such as:
    ▪ Networking Capabilities
    ▪ Undo Capabilities
    ▪ Collaborative Editing
    ▪ Shared Projects
    ▪ Sharing libraries
    ▪ One of the new networking concepts is Remote
    Procedure Calls (RPCs), which enable users to invoke code
    implemented remotely. Examples include:
    ▪ Google Maps
    ▪ Cloud Variables

  9. Text Analysis in NetsBlox
    ▪ We explored a number of different questions within
    NetsBlox pertaining to learning text analysis concepts:
    ▪ Can we enable the students to interactively probe machine
    learning models to learn about their strengths and
    weaknesses? Can we hypothesize about the causes based on
    this interaction?
    ▪ Sentimental Writer Example
    ▪ Can we introduce students to word embeddings?
    ▪ Word Embeddings Example
    ▪ Can we enable students to train their own word embeddings?
    ▪ Training Word Embeddings on Middlemarch

  10. Thank you!

  11. Appendix
    Additional information about presented topics can
    be found in this section.

  12. FAQ
    ▪ I can’t find the “TextAnalysis” category in my
    services?
    ▪ These are private, auxiliary services and must be explicitly
    enabled for individual users or groups
    ▪ For more information, check out Services Overview

  13. Text Analysis Exercises
    This subsection contains more details about the
    motivation, goals, and discussion topics for each of
    the presented projects.

  14. Probing Existing Models
    ▪ Motivation: Interaction with machine learning models
    for text analysis could facilitate a better understanding of
    their strengths and weaknesses.
    ▪ Question: Can we enable the students to interactively
    probe machine learning models to form hypotheses
    about their shortcomings?
    ▪ Approach: Using the ParallelDots Service, create a
    typewriter which color-codes the text based on the
    sentiment. Then explore the predictions made by
    ParallelDots by writing with the typewriter!
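    The color-coding step can be sketched in Python as follows. This is a
    minimal illustration, not the NetsBlox project itself: the score format
    (a dict of "positive", "negative", and "neutral" probabilities), the
    function names, and the `toy_classifier` stand-in are all assumptions made
    for the example, and none of them is the ParallelDots API.

```python
# Hypothetical sketch: map sentiment scores to a display color.
# The score format below is an assumption, not the ParallelDots response format.

def sentiment_color(scores):
    """Return an (r, g, b) tuple for a dict of sentiment probabilities."""
    negative = scores.get("negative", 0.0)
    positive = scores.get("positive", 0.0)
    # Red tracks negative sentiment and green tracks positive sentiment;
    # neutral text stays dark because both channels are low.
    return (round(255 * negative), round(255 * positive), 0)


def toy_classifier(sentence):
    """Stand-in for a remote sentiment service: counts a few cue words."""
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    pos = sum(w in {"good", "great", "love"} for w in words)
    neg = sum(w in {"bad", "awful", "hate"} for w in words)
    total = pos + neg
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 1.0}
    return {"positive": pos / total, "negative": neg / total, "neutral": 0.0}


def color_code(sentences, classify):
    """Pair each sentence with the color chosen for its sentiment."""
    return [(s, sentiment_color(classify(s))) for s in sentences]


if __name__ == "__main__":
    for sentence, color in color_code(["I love this!", "Not good."], toy_classifier):
        print(sentence, color)
```

    The toy classifier scores "Not good." as fully positive because it only
    counts cue words, which is the kind of failure students can look for when
    they try to fool a real model.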

  15. Probing Existing Models
    ▪ Student Questions:
    ▪ Can I fool the model?
    ▪ What if I use historic text?
    ▪ What if I use long sentences?
    ▪ What if I use unnatural punctuation?
    ▪ Does it care if I use ALL CAPS?
    ▪ Discussion Questions:
    ▪ Why do I think ____ fools the model (or doesn’t)?
    ▪ What is the impact of training data on the resultant
    model? Is this an error with the training itself or the
    data?

  16. Word Embeddings
    ▪ Motivation: Word embeddings can be trained in an
    unsupervised way (i.e., we don’t need to label the
    training data by hand), so they are a good candidate for
    exploring the 4+ million documents of interest.
    ▪ Question: Can we introduce students to word
    embeddings so they can understand how embeddings might
    be used to support or reject hypotheses?
    ▪ Approach: Use the WordEmbeddings service (only
    available to members of the class) to enable students to
    retrieve pre-trained word embeddings.
    ▪ Important Concepts: Vectors, Vector Spaces, Cosine
    Similarity, Euclidean Distance
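    The two measures named above can be sketched in a few lines of Python.
    The three-dimensional vectors here are made-up toy values chosen for
    illustration (real pre-trained embeddings typically have tens to hundreds
    of dimensions), and the `nearest` helper is a hypothetical name, not part
    of the WordEmbeddings service.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors invented for this example; they are not real embeddings.
embeddings = {
    "judge": [0.9, 0.1, 0.3],
    "jury": [0.8, 0.2, 0.4],
    "garden": [0.1, 0.9, 0.2],
}


def nearest(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


if __name__ == "__main__":
    print(nearest("judge", embeddings))
    print(euclidean_distance(embeddings["judge"], embeddings["garden"]))
```

    Cosine similarity ignores vector length and compares direction only,
    while Euclidean distance is sensitive to both, so the two measures can
    rank neighbors differently.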

  17. Training Word Embeddings
    ▪ Motivation: Training word embeddings could enable
    students to find quantitative evidence for hypotheses
    about the data.
    ▪ Question: Can we enable students to train their own
    word embeddings from scratch within NetsBlox?
    ▪ Approach: Using the Word2Vec and Datasets services,
    students can incrementally build datasets and then train
    word embeddings on the dataset.
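    As a rough illustration of what happens between building a dataset and
    training, the skip-gram variant of Word2Vec learns from (center, context)
    word pairs taken from a sliding window over the text. The sketch below
    shows only that pair-generation step, not the training itself. The
    `Dataset` class, the `skipgram_pairs` function, and the whitespace
    tokenization are assumptions for the example and do not reflect the
    interface of the NetsBlox Word2Vec or Datasets services.

```python
class Dataset:
    """Hypothetical stand-in for a dataset that is built up incrementally."""

    def __init__(self):
        self.sentences = []

    def add(self, text):
        # Lowercase and split on whitespace; a real pipeline would also
        # handle punctuation and other tokenization details.
        self.sentences.append(text.lower().split())


def skipgram_pairs(sentences, window=2):
    """List (center, context) pairs for every word within `window` positions."""
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    data = Dataset()
    data.add("the jury heard the witness")
    data.add("the judge addressed the jury")
    for pair in skipgram_pairs(data.sentences, window=1):
        print(pair)
```

    A larger window produces more pairs per word and tends to capture broader
    topical association, while a smaller window emphasizes immediate
    neighbors.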

  18. Training Word Embeddings
    ▪ Student Questions:
    ▪ Can I train a language model using the Middlemarch
    text?
    ▪ Should I trust the model?
    ▪ Discussion Questions:
    ▪ Can I trust the results of the model?
    ▪ Did the model have enough data? How can I check
    that the results are not due to the random
    initialization of the model?
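    One possible way to approach the random-initialization question is to
    train the model several times with different seeds and compare each
    word's nearest neighbors across runs. Raw coordinates are not comparable
    between runs, since each run can settle into a differently oriented
    space, but neighbor sets are. The sketch below assumes the runs have
    already been trained; the vectors are toy values invented for the
    example, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_neighbors(word, vectors, k=2):
    """Set of the k words most similar to `word` within one training run."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])


def neighbor_overlap(word, run_a, run_b, k=2):
    """Jaccard overlap of the word's top-k neighbors in two runs (1.0 = identical)."""
    a = top_neighbors(word, run_a, k)
    b = top_neighbors(word, run_b, k)
    return len(a & b) / len(a | b)


# Toy "runs" invented for this example. run_b is run_a with its axes swapped:
# the coordinates differ but the neighbor structure is the same.
run_a = {"jury": [1.0, 0.0], "judge": [0.9, 0.1], "court": [0.8, 0.3], "garden": [0.0, 1.0]}
run_b = {"jury": [0.0, 1.0], "judge": [0.1, 0.9], "court": [0.3, 0.8], "garden": [1.0, 0.0]}
# run_c places "garden" next to "jury", as an unstable run might.
run_c = {"jury": [0.0, 1.0], "judge": [1.0, 0.0], "court": [0.3, 0.8], "garden": [0.1, 0.9]}

if __name__ == "__main__":
    print(neighbor_overlap("jury", run_a, run_b))
    print(neighbor_overlap("jury", run_a, run_c))
```

    Consistently low overlap across seeds would suggest that the neighbors
    reflect initialization noise or too little data rather than a pattern in
    the text.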