Motivation
massive and ever increasing
• Journals, conference publications
• The web is no longer just for research
  ◦ Buy and sell books
  ◦ Classic literature projects
  ◦ E-books getting popular
• How do we find what we want?
The Evolution of Search
sufficient
  ◦ Now all search engines must at least have fuzzy full-text search
  ◦ Data mining and recommendations
• But we are still limited in the ways we can search
  ◦ How can we find more authors like Margaret Atwood?
  ◦ Friends often make better book recommendations than Amazon does
Style is Important!
and subject matter
  ◦ A favorite author might write on a variety of subjects
  ◦ Not all crime dramas are created equal
• It's about the way an author writes
  ◦ A fan of Romeo and Juliet won't necessarily enjoy Twilight
  ◦ Poets write very differently from journalists
  ◦ Some research papers are easier to read than others
  ◦ But even a given author's style may vary
A New Kind of Search
stylistic features
  ◦ Break free from author and keyword searching
  ◦ Or can be combined with existing search methods for even better results
  ◦ User data no longer required!
• Identify and detect stylistic features
  ◦ Word counts, sentence length, POS tagging
• Other techniques to be investigated
  ◦ Further investigation of what defines a style
Other Useful Applications
students are used to research papers, but first years generally aren't
  ◦ Children are still learning how to read
• Style can also tell us about the level of reading ability required
  ◦ Estimate appropriate age group for youth literature
  ◦ Find research papers accessible to high school students
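One rough way to estimate the reading level of a text is a classical readability formula. The sketch below uses the Flesch-Kincaid grade level, which is a stand-in chosen for illustration and is not named in the slides. The function names and the vowel-group syllable heuristic are assumptions; a real system would likely need a better syllable counter.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Approximate U.S. school grade needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Standard Flesch-Kincaid grade-level coefficients.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    easy = "The cat sat. The dog ran."
    hard = "Stylometric investigation necessitates comprehensive computational methodology."
    print(flesch_kincaid_grade(easy), flesch_kincaid_grade(hard))
```

Short sentences with short words score low, and long sentences with polysyllabic words score high, so the score could serve as one coarse signal for matching texts to an age group.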
Our Proposed Approach
sentence, paragraph lengths
  ◦ Function word frequencies
  ◦ Content word frequencies
  ◦ Habitual words
  ◦ POS features (TBD)
• Acquire the full text (or substantial excerpts) of a large number of books and articles
• Compute features for each and index them
  ◦ After this computation, searching is quick!
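The feature computation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the naive sentence splitter, and the very short `FUNCTION_WORDS` list are all assumptions, and content words, habitual words, and POS features are left out.

```python
import re
from collections import Counter

# Assumed, illustrative function-word list; a real system would use a much longer one.
FUNCTION_WORDS = ("the", "a", "an", "of", "and", "to", "in", "that", "it", "but")


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def tokenize(text):
    """Lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(text):
    """Compute a small dictionary of length and function-word features."""
    sentences = split_sentences(text)
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    features = {
        "avg_word_len": sum(len(w) for w in words) / total,
        "avg_sentence_len": len(words) / (len(sentences) or 1),
        "avg_paragraph_len": len(words) / (len(paragraphs) or 1),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / total
    return features


if __name__ == "__main__":
    print(style_features("The cat sat. It was a cat!"))
```

Because each text is reduced to a fixed set of numbers once, the expensive pass over the full text happens only at indexing time.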
Our Proposed Approach (2)
features → texts
1. You enter a book title and author
2. We look up the stylistic features of that book
3. Our "magic" search algorithm finds texts with similar feature sets in the index
4. You peruse the results
• Diversity features: difficulty, complexity
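The slides do not say how the "magic" search algorithm works. One simple possibility, shown here purely as an assumption, is a nearest-neighbor lookup: standardize each feature across the index, then rank texts by Euclidean distance to the query text. The index layout (title mapped to a feature dictionary), the function names, and the sample titles and values are all hypothetical.

```python
import math


def standardize(index):
    """Rescale every feature to zero mean and unit variance across the index."""
    keys = sorted(next(iter(index.values())))
    stats = {}
    for k in keys:
        vals = [feats[k] for feats in index.values()]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
        stats[k] = (mean, std)
    return {
        title: [(feats[k] - stats[k][0]) / stats[k][1] for k in keys]
        for title, feats in index.items()
    }


def most_similar(index, query_title, top_n=3):
    """Return the top_n titles whose feature vectors are closest to the query's."""
    vectors = standardize(index)
    query = vectors[query_title]
    scored = [
        (math.dist(query, vec), title)
        for title, vec in vectors.items()
        if title != query_title
    ]
    return [title for _, title in sorted(scored)[:top_n]]


if __name__ == "__main__":
    # Hypothetical index entries with made-up feature values.
    index = {
        "Book A": {"avg_sentence_len": 12.0, "fw_the": 0.060},
        "Book B": {"avg_sentence_len": 12.5, "fw_the": 0.058},
        "Book C": {"avg_sentence_len": 30.0, "fw_the": 0.030},
    }
    print(most_similar(index, "Book A", top_n=2))
```

Standardizing first keeps features on large scales, such as sentence length, from swamping small ones, such as word frequencies. A linear scan like this is only practical for small collections; a larger index would need a dedicated search structure.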
a proper evaluation
• Ideally: a formal user study to evaluate satisfaction, productivity, performance, etc.
• Would be nice to compare all of the above with commercial recommenders (Amazon)
• Instead: we are shooting for a working system that returns useful results!
Problems with Existing Work
& Noble, Chapters)
• Based on massive quantities of customer activity data
• Biased by complex commercial (sales, advertising) agendas
• Academic work mostly focused on authorship recovery/attribution
Related Work
of stylometric features (function words, content words, habitual words)
• Schler et al. (2005) identify grammatical differences between male and female bloggers of various ages (females use more pronouns, males more articles!)
• Zheng et al. (2006) develop lexical, syntactic, structural and context-specific writing-style features.
• Fazlican and Patton (2004) investigate changing style across metrics such as vocabulary, word length and word frequencies.
References
A Statistical Approach. JADT 2004.
2. Fazlican and Patton, J. Change of Writing Style With Time. Computers and the Humanities 38: 61-82, 2004.
3. Burstein, J. and Wolska, M. Towards Evaluation of Writing Style: Finding Overly Repetitive Word Use in Student Essays. Princeton University Technical Report.
4. Schler, J., Koppel, M., Argamon, S., Pennebaker, J. Effects of Age and Gender on Blogging. AAAI, 2005.
5. Zheng, R., Li, J., Chen, H., Huang, Z. A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. JASIST 57(3): 378-393, 2006.
6. Holmes, D., Kardos, J. Who Was the Author? An Introduction to Stylometry. CHANCE 16(2): 5-8, 2003.
7. Mendenhall, T. The Characteristic Curves of Composition. Science, 11:237-249, 1887.
8. Apache Lucene. http://lucene.apache.org/core/