Ivan Bilan - Automated Author Profiling

This talk introduces the concepts of Author Identification, Author Obfuscation, and Author Profiling. The main focus is on Author Profiling, namely cross-genre Author Profiling of textual data gathered from social media. Tweets, blogs, and reviews you leave online can reveal your age, gender, personality traits and a lot more about you. Author Profiling creates a socio-demographic profile of an anonymous person by analyzing the texts they write using Natural Language Processing and Machine Learning. It is used in forensic suspect profiling, customer-base analysis, and targeted advertising. This talk gives an overview of this research field from the perspective of a computational linguist.



May 04, 2017



  1. Automated Author Profiling Ivan Bilan, LMU

  2. Ivan Bilan

    Currently studying: M.Sc. Computational Linguistics; Honors Degree in Technology Management
    Current position: Data Engineer, TrustYou
    Research interests: Author Profiling, Cross-Lingual Lexical Substitution, Neural Machine Translation
  3. Author Identification

    Attribute a given text to one of the known authors.
  4. Author Identification

    • Are all the works we attribute to Shakespeare written by him? Did some of them belong to Fletcher or Marlowe?
    • Is Joe or Jane the whistleblower?
    • Is ‘The Silmarillion’ fully written by J. R. R. Tolkien, or did his son Christopher add new chapters during editing?
  5. Stylometry

    The application of the study of linguistic style to author characterization, attribution and similarity detection.
  6. (no text on this slide)

  7. Goal of Author Profiling

    Attributing the author of a text to a certain socio-demographic class:
    • Age
    • Gender
    • Personality Traits
    • Psychological Stability
    • Native Language / Dialect
  8. Real World Applications

    • Suspect Profiling in Forensics
    • Customer-base Analysis
    • Targeted Advertising
    • Age Restriction Systems
    • Sexual Predator Identification
  9. PAN: Shared Tasks on Digital Text Forensics (http://pan.webis.de)

    Shared tasks: Plagiarism Detection, Author Masking, Author Attribution, Author Profiling
    Author Profiling: PAN13-15 single genre (Tweets, Hotel Reviews, Blogs); PAN16 cross-genre
  10. CAPS: A Cross-genre Author Profiling System

    In collaboration with Dr. Desislava Zhekova, Center for Information and Language Processing, LMU Munich
  11. PAN16 Cross-genre Author Profiling

    Profile: Gender (Male, Female); Age (18-24 | 25-34 | 35-49 | 50-64 | 65-xx)
    Cross-genre author profiling:
    • train the model on one genre and evaluate on another
    • adaptable to any unseen genre
    • merge all existing genres into one training set to overcome data scarcity
  12. PAN16 Cross-genre Author Profiling

    [Bar charts: PAN16 Training Set]
    Text samples per language: English ~200000, Spanish ~128000, Dutch ~67000
    Authors per language: English 432, Spanish 249, Dutch 379
    Tweets as training set; Blogs and Hotel Reviews as test set
  13. Workflow

    • Preprocess: HTML and Bulletin Board Code removal; normalization of all links to [URL] and of @usernames to [USER]
    • Extract Features: Part-of-speech Tagging, Lemmatized Representation, Cleaned Text
    • Author Profiling Model
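
The preprocessing step on the Workflow slide can be sketched with a few regular expressions. This is a minimal illustration, not the CAPS implementation: the patterns and the `preprocess` function name are assumptions chosen for the example, and the simple BBCode pattern will also strip any other short bracketed tag.

```python
import html
import re

# Illustrative patterns (assumptions), not the exact rules used in CAPS.
HTML_TAG = re.compile(r"<[^>]+>")
BBCODE_TAG = re.compile(r"\[/?[A-Za-z*]+(?:=[^\]]*)?\]")
URL = re.compile(r"(?:https?://|www\.)\S+")
USER = re.compile(r"(?<!\w)@\w+")


def preprocess(text: str) -> str:
    """Strip markup and replace links and user mentions with placeholders."""
    # Remove markup first so the [URL] and [USER] placeholders added below
    # are not mistaken for BBCode tags.
    text = HTML_TAG.sub(" ", text)
    text = BBCODE_TAG.sub(" ", text)
    text = html.unescape(text)  # e.g. "&amp;" becomes "&"
    text = URL.sub("[URL]", text)
    text = USER.sub("[USER]", text)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("<b>Hi</b> @bob see [i]this[/i] http://t.co/abc &amp; more"))
```

The lookbehind in `USER` keeps e-mail addresses such as `a@b.com` from being rewritten as mentions.
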
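  14. TF-IDF - The Term Frequency-Inverse Document Frequency

    $\text{tf-idf}(t, s) = \frac{f_{t,s}}{\max_{w \in s} f_{w,s}} \cdot \log \frac{N}{n_t}$

    • $t$ is the current token, $s$ is the current string
    • $f_{t,s}$ is the number of times token $t$ appears in string $s$
    • $N$ is the number of strings
    • $n_t$ is the number of strings in the corpus that include the token $t$

    Smoothed variant of the second factor: $\cdot \left( \log \frac{N + 1}{n_t + 1} + 1 \right)$

    In simple words: emphasize important words (frequent in a text, infrequent in the corpus).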
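
A small standard-library sketch may help make the TF-IDF weighting concrete: the count of a token in a string is divided by the count of the most frequent token in that string, then multiplied by the log of the number of strings over the number of strings containing the token. The `smooth` flag switches to the variant that adds one to numerator, denominator and result, which is assumed here to be what the slide's second expression shows. The function name and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(strings, smooth=True):
    """Return one {token: weight} dict per tokenized string."""
    n_strings = len(strings)
    # n_t: number of strings that contain token t at least once
    doc_freq = Counter(token for s in strings for token in set(s))
    weights = []
    for s in strings:
        counts = Counter(s)
        max_count = max(counts.values(), default=1)
        row = {}
        for token, count in counts.items():
            tf = count / max_count  # max-normalized term frequency
            if smooth:
                idf = math.log((n_strings + 1) / (doc_freq[token] + 1)) + 1
            else:
                idf = math.log(n_strings / doc_freq[token])
            row[token] = tf * idf
        weights.append(row)
    return weights


# Toy corpus (hypothetical data).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
for row in tf_idf(corpus, smooth=False):
    print(row)
```

In the unsmoothed form, a token that occurs in every string (here "the") gets weight zero, which is the "infrequent in the corpus" half of the intuition on the slide.
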
  15. TF-IDF - The Term Frequency-Inverse Document Frequency

    Most important word for the male class:
    Most important word for the female class:
    (image labels on the slide: mole, woman)
    DISCLAIMER: Don’t take it too seriously, the PAN16 dataset is too small.
  16. Topic Modelling with LDA

    Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP)
    • Generative statistical models that allow automated grouping of observed words into topics
    • LDA requires a predefined number of topics
    • HDP calculates the number of topics automatically
    • Do not confuse with Linear Discriminant Analysis (also known as LDA)
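  17. Lexical F-Measure

    This feature calculates how implicit or explicit the text is (Heylighen et al., 2002):

    $F = 0.5 \cdot [(\text{noun freq.} + \text{adjective freq.} + \text{preposition freq.} + \text{article freq.}) - (\text{pronoun freq.} + \text{verb freq.} + \text{adverb freq.} + \text{interjection freq.}) + 100]$

    Each frequency is the percentage of words in the text that belong to that category.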
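
As a rough sketch of the Lexical F-Measure of Heylighen et al. (2002), the function below takes one coarse part-of-speech tag per word and returns half of (noun, adjective, preposition and article percentages, minus pronoun, verb, adverb and interjection percentages, plus 100). The tag names are assumptions made for this example; a real tagger's tagset would need to be mapped onto them first.

```python
from collections import Counter

# Coarse tag names are assumptions; map your tagger's tagset onto them first.
FORMAL = ("NOUN", "ADJ", "PREP", "ART")
DEICTIC = ("PRON", "VERB", "ADV", "INTJ")


def f_measure(pos_tags):
    """Compute the F-measure from one coarse POS tag per word."""
    if not pos_tags:
        raise ValueError("need at least one tagged word")
    counts = Counter(pos_tags)
    total = len(pos_tags)

    def pct(tags):
        # Percentage of all words that carry one of the given tags.
        return 100.0 * sum(counts[t] for t in tags) / total

    return 0.5 * (pct(FORMAL) - pct(DEICTIC) + 100.0)


# "The cat sat on the mat", tagged by hand for illustration.
print(f_measure(["ART", "NOUN", "VERB", "PREP", "ART", "NOUN"]))
```

The score ranges from 0 (only pronouns, verbs, adverbs and interjections) to 100 (only nouns, adjectives, prepositions and articles).
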
  18. Readability Index

    Formulas: Automated Readability Index, SMOG Readability Formula, New Dale-Chall etc.
    Flesch Reading Ease (Flesch, 1948) assesses the difficulty of a reading passage:

    $206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}$

    Score ranges: 0-29 very confusing; 60-69 standard; 90-100 very easy
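
The Flesch Reading Ease score can be sketched in a few lines: 206.835 minus 1.015 times the average sentence length in words, minus 84.6 times the average number of syllables per word. The syllable counter below is a crude English-only heuristic assumed for this example (vowel groups, minus a silent final "e"), so scores will differ somewhat from tools that use a pronunciation dictionary.

```python
import re


def count_syllables(word):
    """Rough English heuristic: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


print(flesch_reading_ease("The cat sat on the mat."))
```

Very short, monosyllabic sentences like the one above can score above 100; the bands on the slide describe typical prose.
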
  19. Dictionary-based Features

    Check how often these words are used. Feature cluster examples per language:

    | Dictionary-based Feature Name | English | Spanish | Dutch |
    | --- | --- | --- | --- |
    | Connective Words | furthermore, firstly … | pues, como … | zoals, mits … |
    | Emotion Words | sad, bored, angry … | espanto, carino, calma … | boos, moe, zielig … |
    | Contractions | I’d, let’s, I’ll … | al, del, desto … | m’n, ’t, zo’n … |
    | Familial Words | wife, husband, gf … | esposa, esposo … | vriendin, man … |
    | Collocations | dodgy, awesome, troll … | no manches, chido … | buffelen, geil … |
    | Abbreviations and Acronyms | a.m., Inc., asap … | art., arch. … | gesch., geb. … |
    | Stop Words | did, we, ours … | de, en, que … | van, dat, die … |
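
A dictionary-based feature of this kind can be sketched as the share of tokens that appear in each word list. The lists below contain only the English examples shown on the slide, and normalizing by the number of tokens is an assumption of this sketch; the slide says only that the system checks how often the words are used.

```python
import re

# Tiny illustrative lists built from the slide's English examples;
# real word lists would be far larger.
DICTIONARIES = {
    "connective": {"furthermore", "firstly"},
    "emotion": {"sad", "bored", "angry"},
    "familial": {"wife", "husband", "gf"},
}


def dictionary_features(text, dictionaries=DICTIONARIES):
    """Return, per dictionary, the share of tokens found in its word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(tok in words for tok in tokens) / total
        for name, words in dictionaries.items()
    }


print(dictionary_features("Furthermore, my wife was sad and angry."))
```

Multi-word entries such as the Spanish "no manches" would need phrase matching rather than the single-token lookup used here.
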
  20. Other Features

    Text structure features:
    • Type/token ratio
    • Average word length
    • Usage of punctuation marks
    Stylistic features (occurrence of adjectival endings):
    • English: -ly, -able, -ic, -il, -less, -ous etc.
    • Spanish: -ito, -ada, -anza, -acho, -acha etc.
    • Dutch: -jes, -iek, -eren etc.
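
The text structure and stylistic features listed on this slide are simple counts. The sketch below shows one plausible way to compute them; the tokenization, the normalization of punctuation by text length, and the function names are assumptions of this example rather than details taken from CAPS.

```python
import re
import string


def structure_features(text):
    """Type/token ratio, average word length and punctuation usage."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    punctuation = sum(ch in string.punctuation for ch in text)
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "avg_word_length": sum(map(len, tokens)) / len(tokens),
        "punctuation_per_char": punctuation / len(text),
    }


def suffix_rate(tokens, suffixes=("ly", "able", "ic", "il", "less", "ous")):
    """Share of tokens ending in one of the given (English) endings."""
    if not tokens:
        return 0.0
    return sum(tok.endswith(suffixes) for tok in tokens) / len(tokens)


print(structure_features("Wow, the cat saw the dog!"))
print(suffix_rate(["quickly", "readable", "cat", "famous"]))
```

A plain suffix match also counts non-adjectives such as "music", so it is only a cheap proxy for the stylistic signal.
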
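  21. Results

    PAN16 results, accuracy (cross-genre, all represented languages):

    | PAN16 | English Gender | English Age | English Both | Spanish Gender | Spanish Age | Spanish Both | Dutch Gender |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Best Score | 75.64% | 58.97% | 39.74% | 73.21% | 51.79% | 42.87% | 61.80% |
    | CAPS | 74.36% | 44.87% | 33.33% | 62.50% | 46.43% | 37.50% | 55.00% |

    PAN14 and PAN15 results, accuracy (single genre, English):

    | PAN14-15 | Twitter (PAN15) Gender | Twitter (PAN15) Age | Blogs (PAN14) Gender | Blogs (PAN14) Age | Hotel Reviews (PAN14) Gender | Hotel Reviews (PAN14) Age |
    | --- | --- | --- | --- | --- | --- | --- |
    | Best Score | 85.92% | 83.80% | 67.95% | 46.15% | 72.59% | 35.02% |
    | CAPS | 81.69% | 73.24% | 66.67% | 35.90% | 71.32% | 34.77% |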
  22. CAPS Tools: Python, scikit-learn, gensim, TreeTagger

  23. Author Obfuscation

    Given two documents by the same author, paraphrase the designated one so that the author cannot be verified anymore.
    Slide adapted from Potthast et al., 2016
  24. Author Obfuscation

    [Diagram labels: Alice; is known to have written; automatically obfuscates text; automatically verifies authorship; is subject to analysis; circumvents; obstructs]
    Slide adapted from Potthast et al., 2016
  25. Author Obfuscation Approaches

    • Paraphrasing
    • Replacing words based on similarity
    • Error insertion
    • Changing punctuation
    • Roundtrip translation: English > German > French > English
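
The simplest of these approaches, replacing words with similar ones, can be sketched as a lookup-and-substitute pass. The `SYNONYMS` table here is a hypothetical hand-written stand-in; an actual obfuscator would draw candidates from a lexical resource or a similarity model and would also have to check that the result still reads well.

```python
import re

# Hypothetical hand-written substitution table, for illustration only.
SYNONYMS = {"fear": "am afraid", "much": "a great deal of", "big": "large"}


def obfuscate(text, synonyms=SYNONYMS):
    """Replace every word that has an entry in the substitution table."""

    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word
        # Preserve sentence-initial capitalization.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


print(obfuscate("I fear that I may not have much time left."))
```

Because the substitution ignores context and grammar, it can easily produce text that is hard to read.
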
  26. Author Obfuscation

    Obfuscators flip on average 35% of the true positive decisions of state-of-the-art author verification approaches.
    Should you be excited?
    • The texts become rather unreadable
    • Problems with genre-specific terminology
  27. Author Obfuscation

    Obfuscation example from Mansoorizadeh et al., 2016:
    “… run-time called the JRE. This approach has some advantages and disadvantages and it is worth comparing these three options in order to appreciate the implications for the developer …”
    Labels on the slide: system, organization, Java, coffee
    Slide adapted from Potthast et al., 2016
  28. Author Obfuscation

    Obfuscation example from Mihaylova et al., 2016:
    Original: “As of myself I fear that I may not have much time left …”
    Obfuscated: “As of myself, I am afraid this myself mai not have a great deal off time left …”
    Slide adapted from Potthast et al., 2016
  29. Author Obfuscation

    Obfuscation example from Keswani et al., 2016:
    Original: “I’ve told this story no place till this night, and it’s foolish I was here to be talking free …”
    Obfuscated: “I’ve history has not yet the night, and _ I was leichtgläubig it’s here free to speak …”
    Slide adapted from Potthast et al., 2016
  30. Author Obfuscation

    Martin Potthast: Author obfuscation and author identification are locked in an instance of the “Potter-Voldemort Conundrum”: “Neither can live while the other survives”
    Slide adapted from Potthast et al., 2016
  31. Automated Author Profiling

    Ivan Bilan, LMU
    Relevant Publications, Presentations and Overviews made available by PAN: http://bit.ly/2oWdRZt
  32. References

    • Heylighen, F., Dewaele, J.: Variation in the Contextuality of Language: An Empirical Measure. Foundations of Science 7(3), 293–340 (2002)
    • Flesch, R.: A new readability yardstick. Journal of Applied Psychology 32, 221–233 (1948). doi:10.1037/h0057532
    • Potthast, M., Hagen, M., Stein, B.: Author Obfuscation: Attacking the State of the Art in Authorship Verification. In: Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, September 2016. CEUR-WS.org. ISSN 1613-0073. http://www.uni-weimar.de/medien/webis/publications/slides/stein_2016k.pdf
    • Mansoorizadeh, M., Rahgooy, T., Aminiyan, M., Eskandari, M.: Author Obfuscation using WordNet and Language Models - Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop, Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
    • Mihaylova, T., Karadjov, G., Nakov, P., Kiprov, Y., Georgiev, G., Koychev, I.: SU@PAN'2016: Author Obfuscation - Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop, Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
    • Keswani, Y., Trivedi, H., Mehta, P., Majumder, P.: Author Masking through Translation - Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop, Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
    • Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of International Conference on New Methods in Language Processing, Manchester, UK, pp. 44–49 (1994). http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
    • Bilan, I., Zhekova, D.: CAPS: A Cross-genre Author Profiling System - Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop, Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
  33. Credits

    PowerPoint template: Maximilian Körner, CDTM
    Images:
    • Woman: Malenirvana, http://malenirvana.com/wp-content/uploads/2013/05/confused-women1-759x569.png
    • Mole: Simply Beautiful, http://www.cosmeticlasersydney.com.au/cosmetic-laser-treatment-sydney/laser-mole-blemish-removal
    • Criminal Minds Wallpaper: Martin Driver, http://wallpapersafari.com/w/OuBRGb/
    Icons:
    • CLEF: http://www.clef-initiative.eu
    • LMU: https://www.uni-muenchen.de/index.html
    • TrustYou: http://www.trustyou.com
    • CDTM: http://www.cdtm.de
    • Python: https://www.python.org
    • Scikit-Learn: http://scikit-learn.org
    • gensim: https://radimrehurek.com/gensim/
    • Customer-base: Leremy, https://www.dreamstime.com/leremy_info
    • Fingerprint: https://iconsmind.com
    • Anonymous man: https://www.tes.com/lessons/gTiKdCYz1N54uA/just-once-vocabulary