Literati Proposal: Literature Recommendation and Difficulty Analysis

Kalan MacRow

March 05, 2013

Transcript

  1. Motivation
     • The amount of text available on the web is massive and ever increasing
     • Journals, conference publications
     • The web is no longer just for research
       ◦ Buy and sell books
       ◦ Classic literature projects
       ◦ E-books getting popular
     • How do we find what we want?
  2. The Evolution of Search
     • Searching by exact author or title used to be sufficient
       ◦ Now all search engines must at least have fuzzy full text search
       ◦ Data mining and recommendations
     • But we are still limited in the ways we can search
       ◦ How can we find more authors like Margaret Atwood?
       ◦ Friends often make better book recommendations than Amazon does
  3. Style is Important!
     • There is more to literature than just its author and subject matter
       ◦ A favorite author might write on a variety of subjects
       ◦ Not all crime dramas are created equal
     • It's about the way an author writes
       ◦ A fan of Romeo and Juliet won't necessarily enjoy Twilight
       ◦ Poets write very differently from journalists
       ◦ Some research papers are easier to read than others
       ◦ But even a given author's style may vary
  4. A New Kind of Search
     • We propose a tool that recommends documents based on stylistic features
       ◦ Break free from author and keyword searching
       ◦ Or can be combined with existing search methods for even better results
       ◦ User data no longer required!
     • Identify and detect stylistic features
       ◦ Word counts, sentence length, POS tagging
     • Other techniques to be investigated
       ◦ Further investigation of what defines a style
  5. Other Useful Applications
     • Readers have varying skill levels and comprehension
       ◦ Graduate students are used to research papers, but first years generally aren't
       ◦ Children are still learning how to read
     • Style can also tell us about the level of reading ability required
       ◦ Estimate appropriate age group for youth literature
       ◦ Find research papers accessible to high school students
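One way to make "level of reading ability required" concrete is a standard readability formula. The slides do not name a difficulty measure, so the sketch below swaps in the well-known Flesch-Kincaid grade level as a stand-in; the function names `count_syllables` and `grade_level` and the vowel-run syllable heuristic are assumptions for illustration only.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def grade_level(text):
    """Flesch-Kincaid grade level: higher values suggest harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text contains no words")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Short sentences built from one-syllable words score low (possibly below zero), while long sentences with polysyllabic vocabulary score high, so the score could serve as one coarse signal for matching texts to an age group.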
  6. Our Proposed Approach
     • Determine a set of style-defining features
       ◦ Average word, sentence, paragraph lengths
       ◦ Function word frequencies
       ◦ Content word frequencies
       ◦ Habitual words
       ◦ POS features (TBD)
     • Acquire the full text (or substantial excerpts) of a large number of books and articles
     • Compute features for each and index them
       ◦ After this computation, searching is quick!
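The length and function-word features listed above can be sketched as follows. This is a minimal illustration, not the proposal's implementation: the name `style_features`, the naive punctuation-based sentence splitter, and the ten-word `FUNCTION_WORDS` list are all assumptions, and content-word, habitual-word and POS features are left out.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative function-word list; a real one would be much longer.
FUNCTION_WORDS = ("the", "of", "and", "a", "in", "to", "is", "that", "it", "but")

def style_features(text):
    """Return a dict of simple stylistic features for one document."""
    # Paragraphs are separated by blank lines; sentences by ., ! or ? (naive).
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    features = {
        "avg_word_len": sum(len(w) for w in words) / len(words),       # characters per word
        "avg_sentence_len": len(words) / len(sentences),               # words per sentence
        "avg_paragraph_len": len(sentences) / len(paragraphs),         # sentences per paragraph
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / len(words)
    return features
```

Computing this dictionary once per text and storing it keyed by title gives the kind of precomputed index the slide describes, so that no text has to be re-read at query time.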
  7. Our Proposed Approach (2)
     • Build a "style" search engine using our index of features → texts
       1. You enter a book title and author
       2. We look up the stylistic features of that book
       3. Our "magic" search algorithm finds texts with similar feature sets in the index
       4. You peruse the results
     • Diversity features: difficulty, complexity
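The lookup-and-match steps above could be sketched as a nearest-neighbour search over feature vectors. The slides leave the "magic" algorithm unspecified, so z-score standardization plus Euclidean distance is an assumed stand-in; the names `standardize`, `similar_texts` and `toy_index`, and all values in the toy index, are made up for illustration.

```python
import math

def standardize(index):
    """Z-score each feature across the index so no single feature dominates."""
    names = sorted(next(iter(index.values())))
    stats = {}
    for n in names:
        vals = [feats[n] for feats in index.values()]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[n] = (mean, std or 1.0)  # avoid dividing by zero for constant features
    return {
        title: [(feats[n] - stats[n][0]) / stats[n][1] for n in names]
        for title, feats in index.items()
    }

def similar_texts(index, query_title, k=3):
    """Return the k titles whose standardized features are closest to the query's."""
    vectors = standardize(index)
    query = vectors[query_title]
    scored = [
        (math.dist(query, vec), title)
        for title, vec in vectors.items()
        if title != query_title
    ]
    return [title for _, title in sorted(scored)[:k]]

# Toy index with made-up feature values (hypothetical titles).
toy_index = {
    "Book A": {"avg_word_len": 4.0, "avg_sentence_len": 12.0},
    "Book B": {"avg_word_len": 4.1, "avg_sentence_len": 13.0},
    "Book C": {"avg_word_len": 5.5, "avg_sentence_len": 30.0},
    "Book D": {"avg_word_len": 5.4, "avg_sentence_len": 28.0},
}
```

With this toy index, `similar_texts(toy_index, "Book A", k=2)` ranks "Book B" first because its features are nearly identical to those of "Book A". A linear scan like this is only reasonable for a small collection; a larger one would need a proper index.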
  8. Limitations of our Work
     • Will not have time for a proper evaluation
     • Ideally: a formal user study to evaluate satisfaction, productivity, performance, etc.
     • Would be nice to compare all of the above with commercial recommenders (Amazon)
     • Instead: we are shooting for a working system that returns useful results!
  9. Problems with Existing Work
     • Directly comparable systems are highly proprietary (Amazon, Indigo, Barnes & Noble, Chapters)
     • Based on massive quantities of customer activity data
     • Biased by complex commercial (sales, advertising) agendas
     • Academic work mostly focused on authorship recovery/attribution
  10. Related Work
      • Bagavandas and Manimannan (2004) [1] develop some useful notions of stylometric features (function words, content words, habitual words)
      • Schler et al. (2005) [4] identify grammatical differences between male and female bloggers of various ages (females use more pronouns, males more articles!)
      • Zheng et al. (2006) [5] develop lexical, syntactic, structural and context-specific writing-style features
      • Can and Patton (2004) [2] investigate changing style across metrics such as vocabulary, word length and word frequencies
  11. References
      1. Bagavandas, B. and Manimannan, G. Quantification of Stylistic Traits: A Statistical Approach. JADT 2004.
      2. Can, F. and Patton, J. Change of Writing Style With Time. Computers and the Humanities 38: 61-82, 2004.
      3. Burstein, J. and Wolska, M. Towards Evaluation of Writing Style: Finding Overly Repetitive Word Use in Student Essays. Princeton University Technical Report.
      4. Schler, J., Koppel, M., Argamon, S., Pennebaker, J. Effects of Age and Gender on Blogging. AAAI, 2005.
      5. Zheng, R., Li, J., Chen, H., Huang, Z. A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. JASIST 57(3): 378-393, 2006.
      6. Holmes, D., Kardos, J. Who Was the Author? An Introduction to Stylometry. CHANCE 16(2): 5-8, 2003.
      7. Mendenhall, T. The Characteristic Curves of Composition. Science, 11:237-249, 1887.
      8. Apache Lucene. http://lucene.apache.org/core/