Slide 1

Slide 1 text

Literati
Literature Recommendation and Difficulty Analysis
Sylvie Foss and Kalan MacRow

Slide 2

Slide 2 text

Motivation
● The amount of text available on the web is massive and ever increasing
● Journals, conference publications
● The web is no longer just for research
  ○ Buy and sell books
  ○ Classic literature projects
  ○ E-books getting popular
● How do we find what we want?

Slide 3

Slide 3 text

The Evolution of Search
● Searching by exact author or title used to be sufficient
  ○ Now all search engines must at least have fuzzy full text search
  ○ Data mining and recommendations
● But we are still limited in the ways we can search
  ○ How can we find more authors like Margaret Atwood?
  ○ Friends often make better book recommendations than Amazon does

Slide 4

Slide 4 text

Style is Important!
● There is more to literature than just its author and subject matter
  ○ A favorite author might write on a variety of subjects
  ○ Not all crime dramas are created equal
● It's about the way an author writes
  ○ A fan of Romeo and Juliet won't necessarily enjoy Twilight
  ○ Poets write very differently from journalists
  ○ Some research papers are easier to read than others
  ○ But even a given author's style may vary

Slide 5

Slide 5 text

A New Kind of Search
● We propose a tool that recommends documents based on stylistic features
  ○ Break free from author and keyword searching
  ○ Or can be combined with existing search methods for even better results
  ○ User data no longer required!
● Identify and detect stylistic features
  ○ Word counts, sentence length, POS tagging
● Other techniques to be investigated
  ○ Further investigation of what defines a style

Slide 6

Slide 6 text

Other Useful Applications
● Readers have varying skill levels and comprehension
  ○ Graduate students are used to research papers, but first-year students generally aren't
  ○ Children are still learning how to read
● Style can also tell us about the level of reading ability required
  ○ Estimate the appropriate age group for youth literature
  ○ Find research papers accessible to high school students

Slide 7

Slide 7 text

Our Proposed Approach
● Determine a set of style-defining features
  ○ Average word, sentence, paragraph lengths
  ○ Function word frequencies
  ○ Content word frequencies
  ○ Habitual words
  ○ POS features (TBD)
● Acquire the full text (or substantial excerpts) of a large number of books and articles
● Compute features for each and index them
  ○ After this computation, searching is quick!
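The length and function-word features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `style_features`, the ten-word `FUNCTION_WORDS` list, the regex-based tokenization, and the choice to measure paragraph length in sentences are all ours and are not taken from the slides. Content words, habitual words, and POS features are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; a real system would use a much longer list).
FUNCTION_WORDS = ("the", "a", "an", "of", "and", "to", "in", "that", "it", "is")


def style_features(text):
    """Return a dict of simple style features for one text."""
    # Paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")

    counts = Counter(words)
    features = {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": len(words) / len(sentences),       # words per sentence
        "avg_paragraph_len": len(sentences) / len(paragraphs),  # sentences per paragraph
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / len(words)
    return features
```

Calling `style_features` once per text and storing the resulting dicts by title gives a simple in-memory stand-in for the feature index; the expensive pass over the full text happens only at indexing time.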

Slide 8

Slide 8 text

Our Proposed Approach (2)
● Build a "style" search engine using our index of features → texts
  1. You enter a book title and author
  2. We look up the stylistic features of that book
  3. Our "magic" search algorithm finds texts with similar feature sets in the index
  4. You peruse the results
● Diversity features: difficulty, complexity
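The slides do not say what the "magic" search algorithm is. One plausible stand-in, shown here purely as an assumption, is to rank indexed texts by cosine similarity between feature vectors. The names `cosine_similarity` and `most_similar`, and the dict-of-dicts layout of `index` (title → feature dict), are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_title, index, top_n=5):
    """Return up to top_n (title, score) pairs, most similar first.

    `index` maps a title to its precomputed feature dict (assumed layout).
    """
    query = index[query_title]  # step 2: look up the query's features
    scored = [
        (cosine_similarity(query, feats), title)
        for title, feats in index.items()
        if title != query_title  # do not recommend the query itself
    ]
    scored.sort(reverse=True)    # step 3: highest similarity first
    return [(title, score) for score, title in scored[:top_n]]
```

Because features such as average sentence length and function-word frequency live on very different scales, a practical version would probably normalize each feature (for example to zero mean and unit variance across the index) before comparing; that step is omitted here for brevity.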

Slide 9

Slide 9 text

Limitations of our Work
● Will not have time for a proper evaluation
● Ideally: a formal user study to evaluate satisfaction, productivity, performance, etc.
● Would be nice to compare all of the above with commercial recommenders (Amazon)
● Instead: we are shooting for a working system that returns useful results!

Slide 10

Slide 10 text

Problems with Existing Work
● Directly comparable systems are highly proprietary (Amazon, Indigo, Barnes & Noble, Chapters)
● Based on massive quantities of customer activity data
● Biased by complex commercial (sales, advertising) agendas
● Academic work mostly focused on authorship recovery/attribution

Slide 11

Slide 11 text

Related Work
● Bagavandas and Manimannan (2004) [1] develop some useful notions of stylometric features (function words, content words, habitual words)
● Schler et al. (2005) [4] identify grammatical differences between male and female bloggers of various ages (females use more pronouns, males more articles!)
● Zheng et al. (2006) [5] develop lexical, syntactic, structural and context-specific writing-style features
● Can and Patton (2004) [2] investigate changing style across metrics such as vocabulary, word length and word frequencies

Slide 12

Slide 12 text

References
1. Bagavandas, B. and Manimannan, G. Quantification of Stylistic Traits: A Statistical Approach. JADT, 2004.
2. Can, F. and Patton, J. Change of Writing Style With Time. Computers and the Humanities 38: 61-82, 2004.
3. Burstein, J. and Wolska, M. Towards Evaluation of Writing Style: Finding Overly Repetitive Word Use in Student Essays. Princeton University Technical Report.
4. Schler, J., Koppel, M., Argamon, S., Pennebaker, J. Effects of Age and Gender on Blogging. AAAI, 2005.
5. Zheng, R., Li, J., Chen, H., Huang, Z. A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. JASIST 57(3): 378-393, 2006.
6. Holmes, D., Kardos, J. Who Was the Author? An Introduction to Stylometry. Chance 16(2): 5-8, 2003.
7. Mendenhall, T. The Characteristic Curves of Composition. Science, 11: 237-249, 1887.
8. Apache Lucene. http://lucene.apache.org/core/