Ivan Bilan - Automated Author Profiling

Automated Author Profiling Ivan Bilan, LMU

Ivan Bilan
• Currently studying: M.Sc. Computational Linguistics; Honors Degree in Technology Management
• Current position: Data Engineer, TrustYou
• Research interests: Author Profiling, Cross-Lingual Lexical Substitution, Neural Machine Translation

Author Identification
Attribute a given text to one of the known authors.

Author Identification
• Are all the works we attribute to Shakespeare written by him? Were some of them written by Fletcher or Marlowe?
• Is Joe or Jane the whistleblower?
• Is ‘The Silmarillion’ fully written by J. R. R. Tolkien, or did his son Christopher add new chapters during editing?

Stylometry
The application of the study of linguistic style to author characterization, attribution and similarity detection.

Goal of Author Profiling
Attributing the author of a text to a certain socio-demographic class:
• Age
• Gender
• Personality Traits
• Psychological Stability
• Native Language / Dialect

Real World Applications
• Suspect Profiling in Forensics
• Customer-base Analysis
• Targeted Advertising
• Age Restriction Systems
• Sexual Predator Identification

PAN: Shared Tasks on Digital Text Forensics (http://pan.webis.de)
Shared Tasks: Plagiarism Detection, Author Masking, Author Attribution, Author Profiling
Author Profiling:
• PAN13-15: Single Genre (Tweets, Hotel Reviews, Blogs)
• PAN16: Cross-genre

CAPS: A Cross-genre Author Profiling System
In collaboration with Dr. Desislava Zhekova, Center for Information and Language Processing, LMU Munich

PAN16 Cross-genre Author Profiling
Profile:
• Gender (Male, Female)
• Age (18-24 | 25-34 | 35-49 | 50-64 | 65-xx)
Cross-genre author profiling:
• train the model on one genre and evaluate on another
• adaptable to any unseen genre
• merge all existing genres into one training set to overcome data scarcity

PAN16 Cross-genre Author Profiling
PAN16 Training Set (bar charts: text samples and authors per language)
Language   Text samples   Authors
English    ~200000        432
Spanish    ~128000        249
Dutch      ~67000         379
Tweets as training set; Blogs and Hotel Reviews as test set

Workflow
• Preprocess: HTML and Bulletin Board Code removal; normalization of all links to [URL] and of @usernames to [USER]
• Extract Features: Part-of-speech Tagging, Lemmatized Representation, Cleaned Text
• Author Profiling Model
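The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the CAPS implementation: the regular expressions and the function name `preprocess` are assumptions, since the slides only name the steps (markup removal, links to [URL], @usernames to [USER]).

```python
import html
import re

# Assumed patterns; the slides do not give the exact rules used in CAPS.
HTML_TAG = re.compile(r"<[^>]+>")
BBCODE_TAG = re.compile(r"\[/?[A-Za-z*]+(?:=[^\]]*)?\]")
URL = re.compile(r"(?:https?://|www\.)\S+")
USER = re.compile(r"(?<!\w)@\w+")


def preprocess(text):
    """Strip HTML and BBCode markup, then normalize links and usernames."""
    text = HTML_TAG.sub(" ", text)       # drop HTML tags
    text = html.unescape(text)           # &amp; -> &, etc.
    text = BBCODE_TAG.sub("", text)      # drop [b], [/b], [url=...] and similar
    # Placeholders are inserted only after markup removal, so the BBCode
    # pattern cannot strip the [URL] and [USER] tokens again.
    text = URL.sub("[URL]", text)
    text = USER.sub("[USER]", text)
    return " ".join(text.split())        # collapse whitespace


sample = "<p>Hi @anna, see [b]this[/b]: http://t.co/abc &amp; more</p>"
cleaned = preprocess(sample)
```

The order of the substitutions matters here: markup is removed first and the placeholders are added last.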

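TF-IDF: Term Frequency-Inverse Document Frequency
$$\mathrm{tfidf}(t, s) = \frac{f_{t,s}}{\max_{w \in s} f_{w,s}} \cdot \log\frac{N}{n_t}$$
• $t$ is the current token, $s$ is the current string
• $f_{t,s}$ is the number of times token $t$ appears in string $s$
• $N$ is the number of strings
• $n_t$ is the number of strings in the corpus that include the token $t$
Smoothed variant of the second factor: $\cdot \left(\log\frac{N + 1}{n_t + 1} + 1\right)$
In simple words: emphasize important words (frequent in a text, infrequent in the corpus).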
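A small worked sketch of this weighting in plain Python may help. It assumes the slide's formula is the term frequency normalized by the count of the most frequent token in the string, multiplied by log(N / n_t), with a smoothed variant log((N + 1) / (n_t + 1)) + 1. The function name `tf_idf` and the toy corpus are made up for illustration; this is not the code used in CAPS.

```python
import math
from collections import Counter


def tf_idf(corpus, smooth=False):
    """corpus: list of token lists, one per string.

    Returns one dict per string, mapping each token to its TF-IDF weight.
    """
    n_strings = len(corpus)
    # n_t: number of strings that contain token t at least once
    doc_freq = Counter(tok for tokens in corpus for tok in set(tokens))
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        max_count = max(counts.values(), default=1)
        row = {}
        for tok, freq in counts.items():
            tf = freq / max_count  # max-normalized term frequency
            if smooth:
                idf = math.log((1 + n_strings) / (1 + doc_freq[tok])) + 1
            else:
                idf = math.log(n_strings / doc_freq[tok])
            row[tok] = tf * idf
        weights.append(row)
    return weights


corpus = [["the", "mole", "mole"], ["the", "woman"]]
plain = tf_idf(corpus)
smoothed = tf_idf(corpus, smooth=True)
```

In the unsmoothed variant a token that occurs in every string ("the") gets weight 0, while the smoothed variant keeps a small positive weight for it.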

TF-IDF: Term Frequency-Inverse Document Frequency
Most important word for the male class / most important word for the female class (pictured on the slide: mole, woman)
DISCLAIMER: Don’t take it too seriously, the PAN16 dataset is too small.

Topic Modelling with LDA
Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP)
• Generative statistical models that allow automated grouping of observed words into topics
• LDA requires a predefined number of topics
• HDP calculates the number of topics automatically
• Do not confuse with Linear Discriminant Analysis (also known as LDA)
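To make the "predefined number of topics" point concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is only a sketch for intuition, not what gensim does internally; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four toy documents are assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]  # tokens of doc d in topic k
    topic_word = [{} for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                # tokens assigned to topic k
    assign = []                                   # topic of every token

    def update(d, k, w, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] = topic_word[k].get(w, 0) + delta
        topic_total[k] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, k, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                update(d, assign[d][i], w, -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                assign[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, assign[d][i], w, +1)
    return doc_topic, topic_word


docs = [
    "hotel room bed breakfast".split(),
    "room bed hotel staff".split(),
    "match goal team player".split(),
    "team goal player coach".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
```

The caller has to pass `num_topics` up front, which is the constraint the slide contrasts with HDP.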

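Lexical F-Measure
This feature calculates how implicit or explicit the text is.
$$F = 0.5 \cdot \left[(\text{noun freq.} + \text{adjective freq.} + \text{preposition freq.} + \text{article freq.}) - (\text{pronoun freq.} + \text{verb freq.} + \text{adverb freq.} + \text{interjection freq.}) + 100\right]$$
Each frequency is the percentage of words in the text that belong to the category.
Heylighen et al. (2002)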
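As a rough sketch, the F-measure can be computed from a sequence of part-of-speech tags. The coarse tag names below ("noun", "verb", ...) are hypothetical; a real tagger such as TreeTagger uses its own tagset, which would first have to be mapped onto these eight categories.

```python
from collections import Counter

# Hypothetical coarse tag names; map your tagger's tagset onto these.
EXPLICIT = ("noun", "adjective", "preposition", "article")
IMPLICIT = ("pronoun", "verb", "adverb", "interjection")


def lexical_f_measure(tags):
    """tags: one coarse POS tag per word of the text."""
    total = len(tags)
    if total == 0:
        raise ValueError("empty text")
    counts = Counter(tags)

    def pct(category):
        # Frequency of a category as a percentage of all words
        return 100.0 * counts[category] / total

    explicit = sum(pct(c) for c in EXPLICIT)
    implicit = sum(pct(c) for c in IMPLICIT)
    return 0.5 * (explicit - implicit + 100)


score = lexical_f_measure(
    ["article", "noun", "verb", "article", "adjective", "noun"]
)
```

The score ranges from 0 (only pronouns, verbs, adverbs, interjections) to 100 (only nouns, adjectives, prepositions, articles).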

Readability Index Formulas
Automated Readability Index, SMOG Readability Formula, New Dale-Chall etc.
Flesch Reading Ease: assesses the difficulty of a reading passage
$$206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}$$
• 0-29: very confusing
• 60-69: standard
• 90-100: very easy
Flesch (1948)
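A minimal sketch of the Flesch Reading Ease computation follows. The sentence splitter and especially the syllable counter are crude heuristics chosen for illustration (counting vowel groups, dropping a silent final "e"); they are assumptions, not part of the formula and not the method used in CAPS.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: vowel groups, minus a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text has no sentences or words")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * len(words) / len(sentences)
        - 84.6 * syllables / len(words)
    )


easy = flesch_reading_ease("The cat sat. The dog ran.")
```

Very short sentences made of one-syllable words can score above 100, as in this toy example.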

Dictionary-based Features
Check how often these words are used.
Feature Cluster Examples per Language
Feature Name                 English                   Spanish                    Dutch
Connective Words             furthermore, firstly …    pues, como …               zoals, mits …
Emotion Words                sad, bored, angry …       espanto, carino, calma …   boos, moe, zielig …
Contractions                 I’d, let’s, I’ll …        al, del, desto …           m’n, ’t, zo’n …
Familial Words               wife, husband, gf …       esposa, esposo …           vriendin, man …
Collocations                 dodgy, awesome, troll …   no manches, chido …        buffelen, geil …
Abbreviations and Acronyms   a.m., Inc., asap …        art., arch. …              gesch., geb. …
Stop Words                   did, we, ours …           de, en, que …              van, dat, die …
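Dictionary-based features of this kind could be computed along these lines. The tiny lexicons reuse only the English example words from the table; the relative-frequency normalization, the tokenizer, and the names `LEXICONS` and `dictionary_features` are assumptions for illustration.

```python
import re

# Tiny illustrative lexicons built from the example words in the table.
LEXICONS = {
    "connective": {"furthermore", "firstly"},
    "emotion": {"sad", "bored", "angry"},
    "contraction": {"i'd", "let's", "i'll"},
    "familial": {"wife", "husband", "gf"},
    "stop": {"did", "we", "ours"},
}


def dictionary_features(text):
    """Share of tokens that fall into each lexicon (0.0 to 1.0)."""
    normalized = text.lower().replace("’", "'")  # unify apostrophes
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", normalized)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(tok in words for tok in tokens) / total
        for name, words in LEXICONS.items()
    }


features = dictionary_features("Firstly, I’d say we did feel sad.")
```

Each lexicon yields one feature value per text, so adding a new dictionary adds one column to the feature vector.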

Other Features
Text Structure Features:
• Type/Token ratio
• Average word length
• Usage of punctuation marks
Stylistic features (occurrence of adjectival endings):
• English: -ly, -able, -ic, -il, -less, -ous etc.
• Spanish: -ito, -ada, -anza, -acho, -acha etc.
• Dutch: -jes, -iek, -eren etc.
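The text structure and suffix features listed above can be approximated with a few lines of Python. This is a sketch under assumptions: the tokenizer, the per-character punctuation rate, and the names `structure_features` and `SUFFIXES` are illustrative choices, and only the English endings from the slide are included.

```python
import re
import string

# English endings listed on the slide
SUFFIXES = ("ly", "able", "ic", "il", "less", "ous")


def structure_features(text):
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no words")
    punctuation = sum(ch in string.punctuation for ch in text)
    return {
        # distinct words divided by all words
        "type_token_ratio": len(set(tokens)) / n,
        "avg_word_length": sum(len(t) for t in tokens) / n,
        # punctuation marks per character of the raw text
        "punctuation_rate": punctuation / len(text),
        # share of words ending in one of the listed suffixes
        "suffix_ratio": sum(t.endswith(SUFFIXES) for t in tokens) / n,
    }


feats = structure_features("Truly famous, truly readable!")
```

Note that a plain suffix match also counts non-adjectives that happen to share an ending, so a part-of-speech filter would be a natural refinement.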

Results
PAN16 Results, Accuracy (Cross-genre, all represented languages)
             English                    Spanish                    Dutch
             Gender   Age      Both     Gender   Age      Both     Gender
Best Score   75.64%   58.97%   39.74%   73.21%   51.79%   42.87%   61.80%
CAPS         74.36%   44.87%   33.33%   62.50%   46.43%   37.50%   55.00%
PAN14 and PAN15 Results, Accuracy (Single genre, English)
             Twitter (PAN15)   Blogs (PAN14)     Hotel Reviews (PAN14)
             Gender   Age      Gender   Age      Gender   Age
Best Score   85.92%   83.80%   67.95%   46.15%   72.59%   35.02%
CAPS         81.69%   73.24%   66.67%   35.90%   71.32%   34.77%

CAPS Tools
Python, scikit-learn, gensim, TreeTagger

Author Obfuscation
Given two documents by the same author, paraphrase the designated one so that the author cannot be verified anymore.
Slide adapted from Potthast et al., 2016

Author Obfuscation
[Diagram: Alice; relation labels: is known to have written, automatically obfuscates text, automatically verifies authorship, is subject to analysis, circumvents, obstructs]
Slide adapted from Potthast et al., 2016

Author Obfuscation Approaches
• Paraphrasing
• Replacing words based on similarity
• Error insertion
• Changing punctuation
• Roundtrip translation: English > German > French > English

Author Obfuscation
Obfuscators flip on average 35% of true positive decisions of the state-of-the-art author verification approaches.
Should you be excited?
• The texts become rather unreadable
• Problems with genre-specific terminology

Author Obfuscation
Obfuscation example from Mansoorizadeh et al., 2016:
… run-time called the JRE. This approach has some advantages and disadvantages and it is worth comparing these three options in order to appreciate the implications for the developer …
Slide labels: system, organization, Java, coffee
Slide adapted from Potthast et al., 2016

Author Obfuscation
Obfuscation example from Mihaylova et al., 2016:
Original: As of myself I fear that I may not have much time left …
Obfuscated: As of myself, I am afraid this myself mai not have a great deal off time left …
Slide adapted from Potthast et al., 2016

Author Obfuscation
Obfuscation example from Keswani et al., 2016:
Original: I’ve told this story no place till this night, and it’s foolish I was here to be talking free …
Obfuscated: I’ve history has not yet the night, and _ I was leichtgläubig it’s here free to speak …
Slide adapted from Potthast et al., 2016

Author Obfuscation
Martin Potthast: Author obfuscation and author identification are locked in an instance of the “Potter-Voldemort Conundrum”: “Neither can live while the other survives”
Slide adapted from Potthast et al., 2016

Automated Author Profiling
Ivan Bilan, LMU
Relevant Publications, Presentations and Overviews made available by PAN: http://bit.ly/2oWdRZt

References
• Heylighen, F., Dewaele, J.: Variation in the Contextuality of Language: An Empirical Measure. Foundations of Science 7(3), 293–340 (2002)
• Flesch, R.: A new readability yardstick. Journal of Applied Psychology 32, 221–233 (1948). doi:10.1037/h0057532
• Martin Potthast, Matthias Hagen, and Benno Stein. Author Obfuscation: Attacking the State of the Art in Authorship Verification. In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, September 2016. CLEF and CEUR-WS.org. ISSN 1613-0073. http://www.uni-weimar.de/medien/webis/publications/slides/stein_2016k.pdf
• Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminiyan, and Mahdy Eskandari. Author Obfuscation using WordNet and Language Models. Notebook for PAN at CLEF 2016. In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors, CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
• Tsvetomila Mihaylova, Georgi Karadjov, Preslav Nakov, Yasen Kiprov, Georgi Georgiev, and Ivan Koychev. SU@PAN'2016: Author Obfuscation. Notebook for PAN at CLEF 2016. In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors, CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
• Yashwant Keswani, Harsh Trivedi, Parth Mehta, and Prasenjit Majumder. Author Masking through Translation. Notebook for PAN at CLEF 2016. In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors, CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073
• Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of International Conference on New Methods in Language Processing, Manchester, UK, pp. 44–49 (1994). http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
• Ivan Bilan and Desislava Zhekova. CAPS: A Cross-genre Author Profiling System. Notebook for PAN at CLEF 2016. In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors, CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073

Credits
PowerPoint Template: Maximilian Körner, CDTM
Images:
• Woman: Malenirvana, http://malenirvana.com/wp-content/uploads/2013/05/confused-women1-759x569.png
• Mole: Simply Beautiful, http://www.cosmeticlasersydney.com.au/cosmetic-laser-treatment-sydney/laser-mole-blemish-removal
• Criminal Minds Wallpaper: Martin Driver, http://wallpapersafari.com/w/OuBRGb/
Icons:
• CLEF: http://www.clef-initiative.eu
• LMU: https://www.uni-muenchen.de/index.html
• TrustYou: http://www.trustyou.com
• CDTM: http://www.cdtm.de
• Python: https://www.python.org
• Scikit-Learn: http://scikit-learn.org
• gensim: https://radimrehurek.com/gensim/
• Customer-base: Leremy, https://www.dreamstime.com/leremy_info
• Fingerprint: https://iconsmind.com
• Anonymous man: https://www.tes.com/lessons/gTiKdCYz1N54uA/just-once-vocabulary


MunichDataGeeks
