Slide 1

Slide 1 text

Playing hide and seek on the genomic playground: Unveiling biological function from literature. Sofie Van Landeghem. Public PhD defense, Zwijnaarde, April 27th, 2012. Promotor: Prof. Dr. Yves Van de Peer. Co-promotors: Prof. Dr. Bernard De Baets, Dr. Yvan Saeys.

Slide 2

Slide 2 text

Outline • Introduction • Part 1: Theoretical text mining algorithms • Protein-protein interactions • Event extraction • Non-causal relations • Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects 2 Sofie Van Landeghem, public PhD defense

Slide 3

Slide 3 text

Genes and DNA • DNA: carries genetic information ➔ 4 nucleobases: A, C, G, T • Gene: basic unit of heredity ➔ Codes for a specific function ➔ All genes present in all cells

Slide 4

Slide 4 text

4 Sofie Van Landeghem, public PhD defense Gene prediction agaaaatctgggagctatgtatgtacagctgttggaggaggactgtaaaggggctggtatagacatctacagaaggttgatggttct gtgtctacttcctctgtgaatgtgcagacaagcatctcatagggagcctggggctgctattggtccatagagctagtagttgaggagg agacctgggcatggaatagggagagacagggcaaactggaaaccaccagcacctttgcatctgtctctgactgtctccaaccaaca gtaaccttagaacaaaatgactactgctcactgccacctcccaaatattattcttttggccaagtgtagctgggatccattcagggaag gtgattctgcaaaacatagttcccagaataatcaaggtgaaaatagaataatcttgcacacaggtctttgataagactaggaatatat aatacataacagctaggaaaaaatatataaattttccccaagtgcttataatgaacaaacatatttgggaaccatatccacctagcct agtcaactaaattaaaagctggagtcatcttagatgctttc... (DISC1 gene: 421,458 bp) DNA mRNA MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLF RFPGGVSGEESHHSESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLS WPCGPGSAGWQQEFAAMDSSETLDASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPP TPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHCQSPQEMGAKAASLDGPHEDPRCLSRP FSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGSSGSGDAHSWDTLLR KWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSRQ PALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQ LQKEIEALQARMFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASA GQIPFHAEPPETIRRYC (547 aa) Protein Transcription Translation Gene expression

Slide 5

Slide 5 text

5 Sofie Van Landeghem, public PhD defense Function of Disc1? “ ... we have cloned and sequenced the breakpoints of a (1;11)(q42.1;q14.3) translocation linked to schizophrenia ...” Millar et al. Human Molecular Genetics, 2000 “ ... results of karyotypic, clinical, and ERP investigations ...” Blackwood et al. American Journal of Human Genetics, 2001 “... microarray data analysis ... ontological profiling ...” Glatt et al. PNAS, 2005 “... using a combination of recombinant and neuronal cell models ...” Wang et al. Molecular Psychiatry, 2011 MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLFRFPGGVSGEESHH SESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLD ASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHC QSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGS SGSGDAHSWDTLLRKWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSR QPALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQLQKEIEALQAR MFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASAGQIPFHAEPPETIRRYC

Slide 6

Slide 6 text

“ ... DISC1 and DISC2 should be considered formal candidate genes for susceptibility to psychiatric illness ...” Millar et al. Human Molecular Genetics, 2000 “ ... the recently described genes DISC1 and DISC2, ..., may have a role in the development of a disease phenotype that includes schizophrenia as well as unipolar and bipolar affective disorders ...” Blackwood et al. American Journal of Human Genetics, 2001 “... TNIK mRNA expression was increased in the dorsolateral prefrontal cortex of schizophrenia subjects ... ” Glatt et al. PNAS, 2005 “... DISC1 and TNIK interact to regulate synapse composition and function ...” Wang et al. Molecular Psychiatry, 2011 6 Sofie Van Landeghem, public PhD defense Function revealed!

Slide 7

Slide 7 text

“ ... 砫粍 榃 斪昮朐 玾珆玸 橍殧澞 泏狔狑 賌輈鄍 趍 椻楒 嬽 騩鰒 鰔 狅妵妶 漀 鷵鷕 橚橍 膣 濇燖燏 ...” 齞齝囃 狅妵妶 跠跬 滘 “ ... 稢綌 幋暕楋 糲蘥蠩 慖 酳銪 鶷鷇鶾 蒠蓔蜳 濷瓂癚 蛶 蛃袚觙 麷劻穋 糋罶羬 墏 檎檦 澂 釢髟偛 翍脝艴 麷劻穋 臷菨 輐銛 ...” 譾躒鑅 嬏嶟 劁, 蝑蝞 “... 潧潣瑽 溗煂獂 酳 笢笣雗雘雝 齈龘墻 馻噈嫶 斖蘱 蔏 礛嬨嶵 徲 倱哻圁 毚丮厹 鬄鵊 蜸 躆 ... ” 轕 媝寔嵒 鋧鋓頠 茺苶 “... 轖嫀嵥嵧 枅杺枙 慔 浶洯 轖轕 刲匊呥 鶷鷇鶾 鉌 蜭 鋄銶 邆錉 霋 裍裚詷 軿鉯頏 藙藨蠈 ...” 犤繵 沀皯竻 歅 筩筡 譒蹸 烺焆琀 7 Sofie Van Landeghem, public PhD defense Function revealed?

Slide 8

Slide 8 text

8 Sofie Van Landeghem, public PhD defense Playing hide and seek Human vs. computer: 1-0

Slide 9

Slide 9 text

BioNLP ● Natural Language Processing for Biomedical texts ● Exponential growth of available literature ● Goal: formal summarization, hypothesis generation ● Challenge 1: Highly ambiguous gene/protein symbols ○ Synonymy: ESR1 = NR3A1 ○ Lexical variants: Esr-1, ESR1, Era, ESRα ○ Abbreviations ○ Ambiguity: Wasp, diabetes, CAT ● Challenge 2: Complexity of natural language ○ Complex grammatical structures ○ Speculation, negation 9 Sofie Van Landeghem, public PhD defense

Slide 10

Slide 10 text

10 Sofie Van Landeghem, public PhD defense Natural language “Time flies like an arrow” “Fruit flies like a banana” “I once shot an elephant in my pajamas. How he got in my pajamas, I’ll never know” - Groucho Marx “The government plans to raise taxes were defeated” “You can always count on the Americans to do the right thing – after they have tried everything else” - Winston Churchill

Slide 11

Slide 11 text

“ ... DISC1 and DISC2 should be considered formal candidate genes for susceptibility to psychiatric illness ...” Millar et al. Human Molecular Genetics, 2000 “ ... the recently described genes DISC1 and DISC2, ..., may have a role in the development of a disease phenotype that includes schizophrenia as well as unipolar and bipolar affective disorders ...” Blackwood et al. American Journal of Human Genetics, 2001 “... TNIK mRNA expression was increased in the dorsolateral prefrontal cortex of schizophrenia subjects ... ” Glatt et al. PNAS, 2005 “... DISC1 and TNIK interact to regulate synapse composition and function ...” Wang et al. Molecular Psychiatry, 2011 11 Sofie Van Landeghem, public PhD defense BioNLP target

Slide 12

Slide 12 text

Toolbox: machine learning • Learning complex properties from input data • Supervised learning • Training data with known class labels • Hidden test data with unknown class labels • Goal: Automatically predict these unknown labels • Example: classification of horses • Class labels: positive (horse) or negative (not a horse)

Slide 13

Slide 13 text

TRAINING positive examples negative examples 13 Sofie Van Landeghem, public PhD defense Machine learning: example features: brown, hooves, mane, 4 legs brown, paws, mane, 4 legs black, hooves, mane, 4 legs black, paws, no mane, 4 legs white, hooves, mane, 4 legs white, hooves, no mane, 4 legs ... ...

Slide 14

Slide 14 text

TESTING unlabeled instances features: white, hooves, mane, 4 legs white, paws, mane, 4 legs 14 Sofie Van Landeghem, public PhD defense Machine learning: example HORSE NOT A HORSE
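
The horse example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assignment of the six training rows to the positive and negative class is a guess (the slide layout does not survive in the text), and a tiny naive Bayes classifier stands in for the SVMs used in the thesis work.

```python
from collections import Counter, defaultdict

# Toy training set from the slide. Which rows are positive is an ASSUMPTION.
TRAIN = [
    (("brown", "hooves", "mane", "4 legs"), "horse"),
    (("brown", "paws", "mane", "4 legs"), "not a horse"),
    (("black", "hooves", "mane", "4 legs"), "horse"),
    (("black", "paws", "no mane", "4 legs"), "not a horse"),
    (("white", "hooves", "mane", "4 legs"), "horse"),
    (("white", "hooves", "no mane", "4 legs"), "not a horse"),
]

def train(examples):
    """Count class frequencies and per-class feature values."""
    class_counts = Counter(label for _, label in examples)
    # value_counts[label][position][value] = number of training examples
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen at each feature position
    for features, label in examples:
        for pos, value in enumerate(features):
            value_counts[label][pos][value] += 1
            vocab[pos].add(value)
    return class_counts, value_counts, vocab

def predict(model, features):
    """Return the class with the highest naive Bayes score (add-one smoothing)."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior
        for pos, value in enumerate(features):
            seen = value_counts[label][pos][value]
            score *= (seen + 1) / (n + len(vocab[pos]))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict(model, ("white", "hooves", "mane", "4 legs")))  # horse
print(predict(model, ("white", "paws", "mane", "4 legs")))    # not a horse
```

With these assumed labels, "hooves" and "mane" pull an instance towards the positive class, while "paws" pulls it away, which matches the two predictions shown on the testing slide.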

Slide 15

Slide 15 text

Outline • Introduction • Part 1: Theoretical text mining algorithms • Protein-protein interactions • Event extraction • Non-causal relations • Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects 15 Sofie Van Landeghem, public PhD defense

Slide 16

Slide 16 text

NLP framework: unstructured text → Information retrieval → Named entity recognition (gene/protein mentions such as CDC42, PAK4 and KTN1 tagged in the text) → Relation extraction → Structured text: interaction(CDC42, PAK4), interaction(CDC42, KTN1), … → Network construction ...

Slide 17

Slide 17 text

Relation extraction • Feature generation • Lexical, syntactic, grammatical features • Stanford parser: constituency tree and dependency parse • Feature selection • Identifying subset of automatically generated features • Reducing model complexity • Avoiding overfitting of the classifier • Classification: WEKA implementation of LibSVM • Evaluation • Precision: percentage of predictions that are correct • Recall: percentage of true statements that are identified • F-score: harmonic mean of precision and recall
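
The three evaluation measures can be written out as a short Python sketch. The function names and the example counts are illustrative assumptions, not taken from the original evaluation scripts.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) as percentages from raw counts.

    tp: correct predictions, fp: spurious predictions, fn: missed statements.
    """
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)

# Hypothetical counts: 40 correct predictions, 10 spurious, 40 missed.
p, r, f = precision_recall_f(tp=40, fp=10, fn=40)
print(round(p, 2), round(r, 2), round(f, 2))  # 80.0 50.0 61.54
```

As a check against the talk itself, plugging in the precision (51.55) and recall (33.41) reported for all events in the Shared Task results gives an F-score of 40.54.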

Slide 18

Slide 18 text

Constituency tree • Tree represents the full syntax of a sentence • Phrase chunking: constituents • Part-of-speech (POS) tags • Example: "The tyrosine phosphorylation of STAT1 was enhanced significantly."

Slide 19

Slide 19 text

Dependency parsing • Grammatical relations represented as a graph • More compact • Robust to syntactic variation • Example: "The tyrosine phosphorylation of STAT1 was enhanced significantly."

Slide 20

Slide 20 text

PPI: feature generation • Example sentence: "We show here that c-Rel binds to kappa B sites as heterodimers with p50" (both proteins blinded to PROTX; root of the connecting path: "binds") • Bag-of-words from the sentence: “we”, “show”, “that”, "binds", ... • Vertex walks from connecting path: PROTX nsubj “binds” / “binds” prep_as “heterodimer” / “heterodimer” prep_with PROTX • Edge walks from connecting path: nsubj “binds” prep_as / prep_as “heterodimer” prep_with Van Landeghem et al. 2008. SMBM
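
A minimal Python sketch of how vertex walks and edge walks can be read off the connecting path. The list representation (vertices at even positions, dependency labels at odd positions) is an assumption made for illustration, not the original implementation.

```python
def vertex_walks(path):
    """Vertex-edge-vertex triples; vertices sit at even indices of the path."""
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]

def edge_walks(path):
    """Edge-vertex-edge triples; edges sit at odd indices of the path."""
    return [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]

# Connecting path between the two blinded proteins in the example sentence.
path = ["PROTX", "nsubj", "binds", "prep_as", "heterodimer", "prep_with", "PROTX"]

for walk in vertex_walks(path):
    print("vertex walk:", " ".join(walk))
for walk in edge_walks(path):
    print("edge walk:", " ".join(walk))
```

Each walk becomes one feature for the classifier, so the same local pattern can be recognised in sentences that differ elsewhere.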

Slide 21

Slide 21 text

Protein-protein interactions • PPI datasets: performance discrepancy (20-27 pp.) • Different initial selection of corpus scope and size • Some only annotate entities involved in relations • Some only include sentences with interactions

F-score               AIMed  HPRD50  IEPA  LLL
Co-occurrence            30      55    58   66
Inst. CV                 62      71    71   82
Abstr. CV                46      55    67   73
Cross-dataset (test)     38      57    41   40

→ PPI corpora and evaluations are highly biased
Van Landeghem et al. 2008. SMBM; Van Landeghem et al. 2008. BeneLearn

Slide 22

Slide 22 text

Shared Task on Event Extraction • BioNLP Shared Task corpus • 800 training, 150 development and 260 test abstracts • 9 event types (“task 1”) • 6 different "physical" event types involving genes/proteins: gene expression, localization, transcription, binding, protein catabolism, phosphorylation • 3 regulation events, which may also take events as arguments: positive regulation, negative regulation, regulation • Additional information (“task 2”) • E.g. phosphorylation site, localization information, ... • Negation and speculation (“task 3”) • Participation: 24 international teams

Slide 23

Slide 23 text

23 Sofie Van Landeghem, public PhD defense Event characterization • Trigger word: refers to event type • "phosphorylated", "interaction", "mediates", ... • Theme (affectee) and cause (affector) arguments

Slide 24

Slide 24 text

Extended framework • Separate SVM pipeline for each event type • Trigger detection • Instance creation: trigger + co-occurring genes/proteins • Feature generation (cf. next slide) • Classification using LibSVM • Post-processing • Parallelization • 6 physical event types • 3 regulatory event types • Task 3 • Negation & speculation • Rule-based system Van Landeghem et al. 2009. BioNLP

Slide 25

Slide 25 text

Feature generation • Example sentence: "We show here that c-Rel binds to kappa B sites as heterodimers with p50" (proteins blinded to PROTX; trigger "binds", a verb, blinded to TRIGGERX) • Bag-of-words from the sub-graph: “binds”, “heterodimer”, ... • Trigrams from the sub-sentence: “PROTX binds to”, “binds to kappa”, “to kappa B”, ... • Vertex walks from the sub-graph: PROTX nsubj TRIGGERX / TRIGGERX prep_as “heterodimer” / “heterodimer” prep_with PROTX Van Landeghem et al. 2009. BioNLP
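
The trigram features can be sketched as follows. The blinding helper and the exact boundaries of the sub-sentence (from the first to the last protein) are assumptions for illustration.

```python
def blind(tokens, proteins, placeholder="PROTX"):
    """Replace protein names by a placeholder so features generalise across names."""
    return [placeholder if tok in proteins else tok for tok in tokens]

def ngrams(tokens, n=3):
    """All contiguous n-grams of the token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sub_sentence = "c-Rel binds to kappa B sites as heterodimers with p50".split()
tokens = blind(sub_sentence, proteins={"c-Rel", "p50"})
trigrams = ngrams(tokens)
print(trigrams[:3])  # ['PROTX binds to', 'binds to kappa', 'to kappa B']
```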

Slide 26

Slide 26 text

Event extraction: results

F-score per team (Task1, Task2, Task3 where reported)
UTurku     51.95
JULIELab   46.66
ConcordU   44.62  42.52
UT+DBCLS   44.35  43.12
VIBGhent   40.54  37.80
UTokyo     36.88

                   recall  precision  F-score
Physical events     50.75      67.24    57.85
Regulatory events   17.36      31.61    22.41
All events          33.41      51.55    40.54
Task 3              30.55      49.57    37.80

Van Landeghem et al. 2009. BioNLP

Slide 27

Slide 27 text

Feature selection • High dimensional datasets • Between 1,800 and 30,000 features • Automatically generated • High complexity for classification algorithm • State-of-the-art performance is still limited (~65% F) • We need to understand what is going on • Supervised machine learning (ML) systems dominate the top ranked systems • Black box behaviour • Difficult to understand the nature of the predictions Van Landeghem, Abeel et al. 2010. Bioinformatics

Slide 28

Slide 28 text

Ensemble feature selection • Aggregation of multiple weak feature selectors • Baseline: 65.02 F-score (physical events) • 100 runs: small but consistent improvement

Feature space  Min. F  Max. F  Avg. F
75%             64.85   65.33   65.26
50%             65.60   66.43   65.88
30%             64.94   66.60   65.86
25%             65.51   66.82   66.14
20%             65.08   66.56   65.85
10%             61.75   64.90   63.59

Van Landeghem, Abeel et al. 2010. Bioinformatics
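
One way to aggregate several weak feature selectors is to combine their rankings and keep only the top fraction of the feature space. The sketch below uses summed rank positions; this aggregation rule, the feature names and the rankings are all assumptions for illustration, since the slide does not state which rule was used.

```python
import math

def aggregate_rankings(rankings):
    """Combine best-first rankings by total rank position (lower is better).

    Assumes every ranking covers the same set of features.
    """
    totals = {}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] = totals.get(feature, 0) + position
    # Sort by total position; break ties alphabetically for determinism.
    return sorted(totals, key=lambda f: (totals[f], f))

def select_top(rankings, fraction):
    """Keep the given fraction of the feature space from the consensus ranking."""
    consensus = aggregate_rankings(rankings)
    keep = max(1, math.ceil(fraction * len(consensus)))
    return consensus[:keep]

# Hypothetical rankings produced by three weak selectors (e.g. on bootstrap samples).
rankings = [
    ["walk:nsubj", "bow:binds", "tri:binds_to", "bow:we"],
    ["bow:binds", "walk:nsubj", "bow:we", "tri:binds_to"],
    ["walk:nsubj", "tri:binds_to", "bow:binds", "bow:we"],
]
print(select_top(rankings, 0.5))  # ['walk:nsubj', 'bow:binds']
```

A feature that only one selector ranks highly ends up low in the consensus, which is why aggregating many runs tends to give a more stable selection than a single run.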

Slide 29

Slide 29 text

Feature clouds • Visualisation of most discriminative features • Example shown: binding vertex walks Van Landeghem, Abeel et al. 2010. Bioinformatics

Slide 30

Slide 30 text

Non-causal relations • Relations between named entities (genes/gene products or "GGPs") and general domain terms • Termed "non-causal", "static" or "entity" relations • Provide a more detailed view on the information in the sentence, on top of event extraction 30 Sofie Van Landeghem, public PhD defense

Slide 31

Slide 31 text

Non-causal relations: results • BioNLP'11 Shared Task: 4 participants (for this task) • Analysed the 16 pp. discrepancy between the top two systems (UTurku and Ghent) • Hybrid system • Turku term detection + Ghent relation detection • Conclusion: rule-based term detection module of Ghent system is responsible for performance gap

             recall  precision  F-score
UTurku        50.10      68.04    57.71
Ghent         47.48      37.04    41.62
ConcordiaU    24.35      46.85    32.04
UoS           15.69      23.26    18.74

Van Landeghem et al. 2011. BioNLP; Van Landeghem, Björne et al. 2012. BMC Bioinformatics

Slide 32

Slide 32 text

Sofie Van Landeghem, public PhD defense If you can’t beat them, join them 32

Slide 33

Slide 33 text

Outline •Introduction •Part 1: Theoretical text mining algorithms • Protein-protein interactions • Event extraction • Non-causal relations •Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects 33 Sofie Van Landeghem, public PhD defense

Slide 34

Slide 34 text

NLP framework: unstructured text → Information retrieval → Named entity recognition (gene/protein mentions such as CDC42, PAK4 and KTN1 tagged in the text) → Relation extraction → Structured text: interaction(CDC42, PAK4), interaction(CDC42, KTN1), … → Network construction ...

Slide 35

Slide 35 text

Björne et al. 2010. BioNLP Björne, Van Landeghem et al. 2012. Bioinformatics (under review) Bibliome-scale event extraction • Information retrieval • 21M PM abstracts and 400K PMC Open-Access full-texts • Named entity recognition • BANNER (Leaman and Gonzalez, 2008) • Relation extraction • Turku Event Extraction System (TEES) • First in the ST'09, further improved for the ST'11 • Automatically assigned confidences • Analysis: full-text vs. abstracts • 13.5M events from full text, 20.8M events from abstracts • Full text more difficult: 50.72% F vs. 54.37% F • Only 37% of the full-text events are also found in abstracts 35 Sofie Van Landeghem, public PhD defense

Slide 36

Slide 36 text

Van Landeghem et al. 2011. BioNLP • Original data distributed as millions of flat files • Not trivial to search through • No notion of “equality” of events across articles • Distribution of MySQL database: "EVEX" • Normalization of text symbols: From textual strings to unique identities • Aggregation of equivalent event structures: Identifying equal events within and across articles • Integration with gene families: Enables information retrieval of homologs • Future work: release of API 36 Sofie Van Landeghem, public PhD defense EVEX

Slide 37

Slide 37 text

Refining event structures • "Ang II induces a rapid increase in MAPK activity." • E1: Positive-Regulation(T:MAPK) • E2: Positive-Regulation(C:Ang II, T:E1) • Final structure: Pos-Reg(C:Ang II, T:Pos-Reg(T:MAPK)) • Refined to: Pos-Reg(C:Ang II, T:MAPK) → Helps to determine equivalence with similar events: number of distinct event structures reduced by 60% → Original structures are preserved Van Landeghem et al. 2012. Advances in Bioinformatics
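
The refinement in this example can be sketched in Python. The `Event` class and the single collapsing rule (a positive regulation whose theme is a cause-less positive regulation) are assumptions derived only from the example above; the full rule set used for EVEX is not given on the slide.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Event:
    type: str
    theme: Union[str, "Event"]                 # gene symbol or nested event
    cause: Optional[Union[str, "Event"]] = None

POS_REG = "Positive-Regulation"

def refine(node):
    """Return a refined copy; the input structure itself is left untouched."""
    if not isinstance(node, Event):
        return node  # gene symbol or None
    theme = refine(node.theme)
    cause = refine(node.cause)
    # Collapse Pos-Reg(C:x, T:Pos-Reg(T:y)) into Pos-Reg(C:x, T:y).
    if (node.type == POS_REG and isinstance(theme, Event)
            and theme.type == POS_REG and theme.cause is None):
        theme = theme.theme
    return Event(node.type, theme, cause)

e1 = Event(POS_REG, theme="MAPK")
e2 = Event(POS_REG, theme=e1, cause="Ang II")
print(refine(e2))
```

Because `refine` builds new objects, the original nested structure stays available, in line with the point that original structures are preserved.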

Slide 38

Slide 38 text

Van Landeghem et al. 2012. Advances in Bioinformatics 38 Sofie Van Landeghem, public PhD defense Pairwise relations from events "Thrombin augmented EGF-stimulated Akt phosphorylation" E1: Phosphorylation(T:Akt) E2: Positive-Regulation(C:EGF, T:E1) E3: Positive-Regulation(C:Thrombin, T:E2) Final structure: Pos-Reg(C:Thrombin,T:Pos-Reg(C:EGF,Pho(T:Akt))) Pairwise relations • EGF → Akt • Thrombin → Akt Indirect relations • Co-regulators: Thrombin and EGF
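
A small Python sketch of how pairwise relations and co-regulators could be derived from the nested structure above. The `Event` class and the traversal rule (link every cause to the gene at the bottom of the theme chain) are assumptions based on this single example, and causes are limited to gene symbols for simplicity.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Event:
    type: str
    theme: Union[str, "Event"]   # gene symbol or nested event
    cause: Optional[str] = None  # simplification: causes are gene symbols only

def leaf_theme(node):
    """Follow the theme chain down to the gene symbol at the bottom."""
    while isinstance(node, Event):
        node = node.theme
    return node

def pairwise_relations(event):
    """(cause, target) pairs for every cause found along the theme chain."""
    pairs = []
    node = event
    while isinstance(node, Event):
        if node.cause is not None:
            pairs.append((node.cause, leaf_theme(node)))
        node = node.theme
    return pairs

def co_regulators(event):
    """Pairs of causes that act on the same target within one event structure."""
    causes = [cause for cause, _ in pairwise_relations(event)]
    return [(a, b) for i, a in enumerate(causes) for b in causes[i + 1:]]

e1 = Event("Phosphorylation", theme="Akt")
e2 = Event("Positive-Regulation", theme=e1, cause="EGF")
e3 = Event("Positive-Regulation", theme=e2, cause="Thrombin")

print(pairwise_relations(e3))  # [('Thrombin', 'Akt'), ('EGF', 'Akt')]
print(co_regulators(e3))       # [('Thrombin', 'EGF')]
```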

Slide 39

Slide 39 text

Van Landeghem et al. 2011. BioNLP 39 Sofie Van Landeghem, public PhD defense Canonicalization Automatically tagged gene/protein symbols • Often whole noun phrase, e.g. “human Esr-1 gene” Need to identify common affixes for removal • Dictionary of affixes, listed by occurrence count • Recognition of organism names (Linnaeus + SimString) “human anti-inflammatory IL-10 gene” ● -ORG- anti-inflammatory IL-10 gene ● anti-inflammatory IL-10 gene ● anti-inflammatory IL-10 ● final canonical form = “il10”
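
The canonicalization steps can be sketched as follows. The organism and affix lists here are tiny hypothetical stand-ins: the slide describes an occurrence-ranked affix dictionary and organism recognition with Linnaeus and SimString, neither of which is reproduced in this sketch.

```python
import re

ORGANISMS = {"human", "mouse", "rat"}               # hypothetical stand-in list
AFFIXES = {"gene", "protein", "anti-inflammatory"}  # hypothetical affix dictionary

def canonicalize(mention):
    """Reduce a tagged gene/protein mention to a canonical form."""
    tokens = [t for t in mention.split() if t.lower() not in ORGANISMS]
    # Strip known affixes from the end and the start, keeping at least one token.
    while len(tokens) > 1 and tokens[-1].lower() in AFFIXES:
        tokens.pop()
    while len(tokens) > 1 and tokens[0].lower() in AFFIXES:
        tokens.pop(0)
    # Lowercase and drop everything except letters and digits.
    return re.sub(r"[^a-z0-9]", "", " ".join(tokens).lower())

print(canonicalize("human anti-inflammatory IL-10 gene"))  # il10
print(canonicalize("human Esr-1 gene"))                    # esr1
```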

Slide 40

Slide 40 text

Canonicalization results 40 Sofie Van Landeghem, public PhD defense • Originally 67.3M gene/protein mentions • 3235K canonical forms • 2-5% can be linked to gene families (cf. next slides), accounting for 52-60% of all occurrences • Long tail of infrequent gene symbols! • Evaluation on the ST'09 gene/protein mentions • Aims at identifying mentions likely to match databases • Original BANNER matches: 50.2% F-score • After canonicalization: 61.1% F-score Van Landeghem et al. 2011. BioNLP

Slide 41

Slide 41 text

Björne, Van Landeghem et al. 2012. Bioinformatics (under review) Gene normalization • Map textual, ambiguous gene symbols to unique IDs • E.g. “Esr-1” → “AT1G12980” • Context: Arabidopsis thaliana article • Highly ambiguous gene/protein symbols • GenNorm system, developed by Wei et al. 2011 • Ranked first on several criteria in the BioCreative 3 task • 28.6M (43%) gene/protein symbols could be normalized • 120 thousand unique genes • More than 4800 species (bacteria, plants, animals, ...) 41 Sofie Van Landeghem, public PhD defense

Slide 42

Slide 42 text

Gene family assignment • Retrieve gene families from HomoloGene / Ensembl • Assign a family according to normalized gene mention • Rely on canonical form when there is still ambiguity • Inter-species: resolve to the same gene family • Intra-species: resolve to the most commonly used family • Resolve "esr1" to a default family? • Reliable synonym for "estrogen receptor" • Not a reliable synonym for "enhancer of shoot regeneration" • Manual evaluation: 72% of a set of correct event occurrences had both of their arguments assigned to the correct family 42 Sofie Van Landeghem, public PhD defense Van Landeghem et al. 2011. BioNLP

Slide 43

Slide 43 text

Event generalizations • Aggregate multiple occurrences of the same event • Define equivalence of events: same type, same structure, equivalent arguments • Different definitions of "equivalence" of arguments:

                            Events  Occurrences   Occ %
Canonical form               2953K        34.3M  100.0%
Entrez Gene normalization     748K        15.8M   46.2%
HomoloGene                   1006K        21.8M   63.5%
Ensembl                      1042K        23.5M   68.5%
Ensembl Genomes              1001K        21.4M   62.5%

Van Landeghem et al. 2011. BioNLP
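
The effect of choosing a different notion of argument equivalence can be sketched with a simple grouping. The occurrences, the family identifiers and the flat `(type, arguments)` representation are hypothetical and much simpler than the actual EVEX event structures.

```python
from collections import Counter

# Hypothetical mapping from canonical forms to gene family identifiers.
FAMILY = {"esr1": "FAM_ER", "era": "FAM_ER", "disc1": "FAM_DISC1", "tnik": "FAM_TNIK"}

# Hypothetical event occurrences: (event type, tuple of argument symbols).
occurrences = [
    ("Binding", ("disc1", "tnik")),
    ("Binding", ("disc1", "tnik")),
    ("Phosphorylation", ("esr1",)),
    ("Phosphorylation", ("era",)),
]

def generalize(occurrences, key=lambda symbol: symbol):
    """Count occurrences per generalized event, mapping each argument through key."""
    counts = Counter()
    for event_type, args in occurrences:
        counts[(event_type, tuple(key(a) for a in args))] += 1
    return counts

by_canonical = generalize(occurrences)
by_family = generalize(occurrences, key=lambda s: FAMILY.get(s, s))
print(len(by_canonical), "distinct events by canonical form")
print(len(by_family), "distinct events by gene family")
```

Coarser equivalence merges more occurrences into one generalized event, at the cost of leaving out occurrences whose arguments cannot be mapped at all.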

Slide 44

Slide 44 text

Outline •Introduction •Part 1: Theoretical text mining algorithms • Protein-protein interactions • Event extraction • Non-causal relations •Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects 44 Sofie Van Landeghem, public PhD defense

Slide 45

Slide 45 text

http://www.evexdb.org Van Landeghem et al. 2012. Advances in Bioinformatics 45 Sofie Van Landeghem, public PhD defense EVEX website

Slide 46

Slide 46 text

46 Sofie Van Landeghem, public PhD defense EVEX website http://www.evexdb.org Van Landeghem et al. 2012. Advances in Bioinformatics

Slide 47

Slide 47 text

Pathway reconstruction • For each pair in the pathway, retrieve EVEX events • Visualisation of highest ranked pair for each interaction Björne, Van Landeghem et al. 2012. Bioinformatics (under review)

Slide 48

Slide 48 text

Hypothesis generation • Use case on NADP(H)-metabolism in E. coli • Integration of EVEX with microarray expression data • Automated text analysis: high recall, fast retrieval • Manual expert validation: ensure high precision 48 Kaewphan et al. 2012. LREC Sofie Van Landeghem, public PhD defense

Slide 49

Slide 49 text

Conclusions (1) • Developed a novel NLP framework for extraction of... • Protein-protein interactions • Various other biomolecular events • Non-causal relations • Thorough evaluations on publicly available corpora • Ensemble feature selection techniques in combination with a text mining framework • Producing more cost-effective classifiers • Clues for enhanced feature generation modules • Offering insight into the black-box behaviour of the SVM 49 Sofie Van Landeghem, public PhD defense

Slide 50

Slide 50 text

Conclusions (2) • EVEX: a large-scale text mining resource • 34.3M events • Gene normalization • Integration with gene families • Applications and future work • Data integration: locating inconsistencies, aggregating confidences, ... • Network construction • Pathway curation • Hypothesis generation • Automated analysis + manual evaluation! 50 Sofie Van Landeghem, public PhD defense Playing hide and seek Human vs. computer: 1-1

Slide 51

Slide 51 text

Acknowledgments Promotor: Yves Van de Peer Co-promotors: Bernard De Baets, Yvan Saeys Ensemble FS: Thomas Abeel Manual evaluation: Zuzanna Drebert EVEX DB: Filip Ginter, Jari Björne, Sampo Pyysalo, Chih-Hsuan Wei EVEX website & API: Kai Hakala E. coli use-case: Suwisa Kaewphan, Sanna Kreula Technical support: Marijn Vandevoorde, Michiel Van Bel, IT Funding: BOF, FWO Bioinformatics group @ UGent 51 Sofie Van Landeghem, public PhD defense

Slide 53

Slide 53 text

Additional slides

Slide 54

Slide 54 text

PPI corpora • AIMed • 225 abstracts from DIP: protein-protein interactions • Around 1000 annotated interactions • HPRD50 • 50 abstracts from HPRD, with 92 distinct relations • IEPA • 303 abstracts with protein-protein interactions • LLL • 164 sentences with protein-gene interactions • Training + testing: 164 + 106 interactions

Slide 55

Slide 55 text

PPI corpora properties • Gene-protein or protein-protein interactions • Symmetrical interactions or agent/target roles • Including homodimers or not • Negative instances specified or Closed World Assumption • Explicit test set available or only CV possible • Complete abstracts included or merely a collection of sentences • Different data formats

Slide 56

Slide 56 text

Van Landeghem et al. 2011. Computational Intelligence 56 Sofie Van Landeghem, public PhD defense Event extraction: Precision-recall

Slide 57

Slide 57 text

Van Landeghem et al. 2011. Computational Intelligence 57 Sofie Van Landeghem, public PhD defense Event extraction: learning curve

Slide 58

Slide 58 text

Van Landeghem et al. 2010. BioNLP Non-causal relations and events • Provide clues to extract events: • Improvement of event extraction with entity relations: • Overall only marginal increases of performance • Behaviour dependent on specific event types • Phosphorylation and localization are most relevant 58 Sofie Van Landeghem, public PhD defense

Slide 59

Slide 59 text

Non-causal relations: classification • Based on the PPI extraction framework • Named entities (genes, proteins) annotated • Non-named entities • Dictionary approach • Rule-based recognition • Semantic clustering using LSA and MCL • Types • Protein-Component • Subunit-Complex • Equivalence • Member-Collection Van Landeghem et al. 2011. BioNLP

Slide 60

Slide 60 text

PLEV evaluation • 1176 Arabidopsis articles: 1792 of 7691 events evaluated • Judge correctness of extracted statements: precision • Recall not as easy: requires annotation of full documents • Classifier perfectly capable of generalizing results • Confidence scores can be used for ranking 60 Sofie Van Landeghem, public PhD defense