Slide 1

Slide 1 text

Automated Extraction of Chemical Information from the Scientific Literature
Matt Swain
Cavendish Laboratory, University of Cambridge
Vernalis (R&D) Ltd

Slide 2

Slide 2 text

Cambridge Cheminformatics Meeting 07/12/2016 Matt Swain
Overview
• ChemDataExtractor: Open source toolkit
• Natural language processing
• Holistic information extraction
• Text mining at Vernalis

Slide 3

Slide 3 text

The Goal
Source text:
Figure 2: UV-vis absorption spectra of 3a in acetonitrile.
UV-vis spectra were recorded using an Agilent8453 diode array spectrophotometer.
The dye 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid (3a) was added…
Table: λmax/nm (ε/M−1 cm−1) 3b 448 (29,000) Dye 3a 415 (48,000)
Extracted record:
name: 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid
label: 3a
uv-vis:
- solvent: acetonitrile
- wavelength: 448, units: nm
- extinction: 29000, units: M-1cm-1
- apparatus: Agilent8453 diode array spectrophotometer

Slide 4

Slide 4 text

System Overview
[Diagram components: inputs (PDF, HTML, XML); Document Readers; document parts (Abstract, Full Text, Captions, Tables); Natural Language Processor; Table Processor; Interdependency Resolver; Database]

Slide 5

Slide 5 text

Document Model
• Process all input formats into a consistent internal document model
• Publisher-specific (ACS, RSC, USPTO) and generic format readers
• Result: Simplified linear stream of document elements
[Diagram: PDF, HTML, XML → Document Readers → Document (Title, Abstract, Heading, Paragraph, Table, …)]
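The idea of flattening an input format into a linear stream of elements can be illustrated with the standard library alone. This is a minimal sketch, not ChemDataExtractor's reader code: `HtmlReader`, `read_html`, and the `TAG_TO_ELEMENT` mapping are hypothetical names, and only a handful of HTML tags are handled.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to document element types.
TAG_TO_ELEMENT = {"title": "Title", "h1": "Heading", "h2": "Heading", "p": "Paragraph"}


class HtmlReader(HTMLParser):
    """Flatten an HTML page into a linear list of (element_type, text) pairs."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # element type currently being collected, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_ELEMENT:
            self._current, self._text = TAG_TO_ELEMENT[tag], []

    def handle_data(self, data):
        # Inline markup such as <b> is ignored; its text is kept.
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current and TAG_TO_ELEMENT.get(tag) == self._current:
            text = " ".join("".join(self._text).split())  # normalise whitespace
            self.elements.append((self._current, text))
            self._current = None


def read_html(html):
    reader = HtmlReader()
    reader.feed(html)
    reader.close()
    return reader.elements
```

A publisher-specific reader would differ mainly in the tag mapping, while everything downstream sees the same list of elements.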

Slide 6

Slide 6 text

Natural Language Processing
• All text-based elements make use of the same natural language processing pipeline:
  Sentence Splitting → Word Tokenization → Part-of-speech Tagging → Entity Recognition → Phrase Parsing
• Variety of techniques: Rules, Dictionaries, Machine Learning

Slide 7

Slide 7 text

Tokenization
• Split sentences using Punkt method
• Custom rule-based word tokenizer
• Normalization performed on the content of each token
Examples (abbreviations containing periods): Fig. 3; m.p. 185 °C; M. Swain et al.; equiv.
Examples (word tokenization):
~1.8mg → ~ | 1.8 | mg
(+/−)-chiraline → (+/−)-chiraline
buffer (pH7.4) → buffer | ( | pH | 7.4 | )
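The word tokenization examples on this slide can be approximated with a small regular-expression tokenizer. This is a minimal sketch under stated assumptions, not ChemDataExtractor's tokenizer: `TOKEN_RE` and `tokenize` are hypothetical names, and the rules cover only the cases shown on the slide.

```python
import re

# Hypothetical rules, tried in order at each position in the text.
TOKEN_RE = re.compile(
    r"""
    \(\+/[\u2212-]\)-[A-Za-z]+   # keep a stereo prefix and its name as one token
  | \d+(?:\.\d+)?                # numbers such as 1.8 or 7.4
  | [A-Za-z]+                    # runs of letters such as mg, pH, buffer
  | \S                           # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens using the ordered rules above."""
    return TOKEN_RE.findall(text)
```

Because the alternatives are tried in order, the special case for the stereo prefix has to come before the generic single-character rule.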

Slide 8

Slide 8 text

Part-Of-Speech Tagging
• Assign a category to each token - 47 tags in total
• CRF trained on newspaper articles (WSJ) and biomedical abstracts (GENIA)
• Used as a feature in machine learning entity recognition
• Used to construct parsing rules
Example tags: NN Noun, NNS Plural Noun, JJ Adjective, CD Cardinal Number, CC Conjunction, DT Determiner, VB Verb, FW Foreign Word
>>> from chemdataextractor.nlp.pos import ChemCrfPosTagger
>>> cpt = ChemCrfPosTagger()
>>> cpt.tag(['NMR', 'spectra', 'were', 'recorded', 'on', 'a', '300', 'MHz', 'BRUKER', 'spectrometer'])
[('NMR', 'NN'), ('spectra', 'NNS'), ('were', 'VBD'), ('recorded', 'VBN'), ('on', 'IN'), ('a', 'DT'), ('300', 'CD'), ('MHz', 'NNP'), ('BRUKER', 'NNP'), ('spectrometer', 'NN')]

Slide 9

Slide 9 text

Entity Recognition
• Tagging chemical names using “IOB” scheme:
    with  dimethyl  sulfoxide  as  solvent
    O     B-CM      I-CM       O   O
• Combined dictionary, machine learning and regular expression methods
• Lowe & Sayle (2015) J. Cheminf. - Dictionary creation techniques
• Abbreviation Detection
    Tetrahydrofuran  (  THF   )
    B-CM             O  B-CM  O
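To show how IOB tags turn back into chemical mentions, the following minimal sketch groups a B-CM token and any I-CM tokens that follow it into one string. The function name `iob_to_spans` is hypothetical, and joining tokens with a single space is a simplifying assumption.

```python
def iob_to_spans(tokens, tags, entity="CM"):
    """Group tokens tagged B-<entity>/I-<entity> into entity strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + entity:
            # A new entity starts; close any entity still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-" + entity and current:
            current.append(token)
        else:
            # Outside tag (or a stray I- tag): close the open entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

Applied to the two examples on the slide, this yields one mention for the first sentence and two separate mentions (the name and its abbreviation) for the second.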

Slide 10

Slide 10 text

Word Clusters
• Brown clustering: Hierarchical clustering of words
• Bitstring prefixes used as features in CRF to improve POS tagging and entity recognition performance
Example clusters:
000000000: ethanol, methanol, acetone, glycerol, brine
0000000010: THF, DMSO, CH2Cl2, toluene, DMF
0000000011: H2O, MeOH, EtOAc, hexane, TFA
1111000001110: substituted, functionalized, terminated
1111000001111: conjugated, interacting, cross-linked, fused
10011011100: NPs, NCs, NRs, NWs, NTs, NSs, NBs
10011011101: nanoparticles, nanocrystals, nanowires
https://github.com/percyliang/brown-cluster
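A minimal sketch of how bitstring prefixes could become features, using a few of the cluster assignments from the slide. The `CLUSTERS` lookup, the function name, the feature names, and the prefix lengths (4, 6, 10) are all assumptions made for illustration.

```python
# A few word-to-bitstring assignments taken from the slide.
CLUSTERS = {
    "THF": "0000000010",
    "DMSO": "0000000010",
    "MeOH": "0000000011",
    "NPs": "10011011100",
    "nanoparticles": "10011011101",
}


def cluster_prefix_features(word, prefix_lengths=(4, 6, 10)):
    """Return one feature per prefix length of the word's cluster bitstring."""
    bits = CLUSTERS.get(word)
    if bits is None:
        return {}  # word not seen during clustering
    return {f"brown_{n}": bits[:n] for n in prefix_lengths if n <= len(bits)}
```

Short prefixes place THF and MeOH in the same coarse group of solvents, while the full-length prefix still tells their clusters apart. Likewise "NPs" and "nanoparticles" share every prefix up to length 10.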

Slide 11

Slide 11 text

Rule-Based Parsing
• Parsers transform tagged tokens into a tree structure
• Rules defined in Python code from building blocks:
  • W: Match word text
  • T: Match POS tag or entity tag
  • R: Match regular expression
  • And, Or, ZeroOrMore, OneOrMore, Optional, …
• Built in parsers: Chemical Names, Melting Points, Quantum Yield, Fluorescence lifetimes, NMR, UV-vis, IR spectra, etc.
name = T('B-CM') + ZeroOrMore(T('I-CM'))
label = R('[1-9][a-z]?') | R('[IVX]+')
cem = name + W('(') + label + W(')')
[Parse tree: cem → name (1,2-dihydroxybenzene), label (3a)]
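To make the building blocks concrete, here is a self-contained toy reimplementation of W, T, R, And, Or, and ZeroOrMore over (word, tag) pairs. It is a sketch of the general parser-combinator idea, not ChemDataExtractor's parsing code: the class internals, the `parse_cem` helper, and the choice to match regular expressions against the whole word are assumptions.

```python
import re


class Rule:
    """A rule's match(tokens, i) returns (matched_words, next_i) or None."""

    def __add__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)


class Token(Rule):
    """Match a single (word, tag) token using a predicate."""

    def __init__(self, predicate):
        self.predicate = predicate

    def match(self, tokens, i):
        if i < len(tokens) and self.predicate(*tokens[i]):
            return [tokens[i][0]], i + 1
        return None


def W(text):  # match word text
    return Token(lambda word, tag: word == text)


def T(expected):  # match POS tag or entity tag
    return Token(lambda word, tag: tag == expected)


def R(pattern):  # match regular expression against the whole word
    regex = re.compile(pattern)
    return Token(lambda word, tag: regex.fullmatch(word) is not None)


class And(Rule):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def match(self, tokens, i):
        left = self.first.match(tokens, i)
        if left is None:
            return None
        right = self.second.match(tokens, left[1])
        if right is None:
            return None
        return left[0] + right[0], right[1]


class Or(Rule):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def match(self, tokens, i):
        return self.first.match(tokens, i) or self.second.match(tokens, i)


class ZeroOrMore(Rule):
    def __init__(self, rule):
        self.rule = rule

    def match(self, tokens, i):
        words = []
        while True:
            result = self.rule.match(tokens, i)
            if result is None or result[1] == i:
                return words, i
            words, i = words + result[0], result[1]


# The rules from the slide, expressed with the toy building blocks.
name = T('B-CM') + ZeroOrMore(T('I-CM'))
label = R('[1-9][a-z]?') | R('[IVX]+')


def parse_cem(tokens):
    """Parse: name '(' label ')' at the start of tokens; return a dict or None."""
    named = name.match(tokens, 0)
    if named is None:
        return None
    rest = (W('(') + label + W(')')).match(tokens, named[1])
    if rest is None:
        return None
    return {"name": " ".join(named[0]), "label": rest[0][1]}
```

For the tagged tokens of "1,2-dihydroxybenzene ( 3a )", `parse_cem` returns the name and the label as separate fields, mirroring the two branches of the parse tree on the slide.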

Slide 12

Slide 12 text

Table Processing
• Process each table cell with a specialised natural language processing pipeline
• Parse headings to determine column type and extract contextual information
• Parse cells in each row to extract values
• Merge contextual information from caption, footnotes, and headings
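The heading-then-cell approach can be sketched with a couple of regular expressions. This is only an illustration under stated assumptions: the heading patterns, the column type names, and the functions `parse_heading` and `parse_table` are hypothetical and far simpler than a real table parser.

```python
import re

# Hypothetical heading rules: (column type, pattern with an optional "units" group).
HEADING_RULES = [
    ("label", re.compile(r"^\s*(?:dye|compound|entry)\s*$", re.I)),
    ("melting_point", re.compile(r"(?:m\.?p\.?|melting point)\s*[/(]\s*(?P<units>°C|K)", re.I)),
    ("wavelength", re.compile(r"λ\s*max\s*[/(]\s*(?P<units>nm)", re.I)),
]
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_heading(heading):
    """Determine the column type and pull units out of the heading text."""
    for column_type, pattern in HEADING_RULES:
        m = pattern.search(heading)
        if m:
            return {"type": column_type, "units": m.groupdict().get("units")}
    return {"type": "unknown", "units": None}


def parse_table(headings, rows):
    """Turn each row into a record, merging units from the heading into each value."""
    columns = [parse_heading(h) for h in headings]
    records = []
    for row in rows:
        record = {}
        for column, cell in zip(columns, row):
            if column["type"] == "label":
                record["label"] = cell.strip()
            elif column["type"] != "unknown":
                value = NUMBER_RE.search(cell)
                if value:
                    record[column["type"]] = {"value": value.group(), "units": column["units"]}
        records.append(record)
    return records
```

For example, `parse_table(["Compound", "m.p. / °C"], [["3a", "280.7"]])` produces a record that carries the label from the first column and the melting point value together with the units found in the heading.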

Slide 13

Slide 13 text

Data Schema
Compound: name, label, role, melting points, uvvis spectra
Melting Point: value, units
UV-vis Spectrum: peaks, solvent, temperature, apparatus
Peak: wavelength, extinction, shape
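The schema on this slide maps naturally onto nested record types. The sketch below uses standard-library dataclasses with the field names from the slide. Storing every value as an optional string and the class names themselves are assumptions, and this is not the toolkit's own model code.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MeltingPoint:
    value: Optional[str] = None
    units: Optional[str] = None


@dataclass
class Peak:
    wavelength: Optional[str] = None
    extinction: Optional[str] = None
    shape: Optional[str] = None


@dataclass
class UvvisSpectrum:
    peaks: List[Peak] = field(default_factory=list)
    solvent: Optional[str] = None
    temperature: Optional[str] = None
    apparatus: Optional[str] = None


@dataclass
class Compound:
    name: Optional[str] = None
    label: Optional[str] = None
    role: Optional[str] = None
    melting_points: List[MeltingPoint] = field(default_factory=list)
    uvvis_spectra: List[UvvisSpectrum] = field(default_factory=list)


# The record from "The Goal" slide, expressed with these types.
dye = Compound(
    name="2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid",
    label="3a",
    uvvis_spectra=[
        UvvisSpectrum(
            solvent="acetonitrile",
            peaks=[Peak(wavelength="448", extinction="29000")],
        )
    ],
)
record = asdict(dye)  # nested dicts and lists, ready to serialise as JSON
```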

Slide 14

Slide 14 text

Interdependency Resolution
• Assign orphan properties to relevant compound
• Merge equivalent chemical identifiers to a single record
• Assign global contextual information
{"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"}
{"melting_points": [{"value": "280.7", "units": "°C"}]}
{"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"}
{"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]}
{"uv_vis": [{"solvent": "acetonitrile", "apparatus": "Agilent 8453 spectrometer"}]}
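The three steps above can be sketched over plain dictionaries shaped like the records on this slide. This is a simplified illustration, not the toolkit's resolver: the function name `resolve`, the rule that an orphan property belongs to the most recently seen compound, and passing global context as a separate argument are all assumptions.

```python
def resolve(records, context=None):
    """Merge records sharing a name or label, attach orphans, then add global context."""
    compounds = []
    for record in records:
        target = None
        if "name" in record or "label" in record:
            # Merge into an existing compound that shares an identifier.
            for compound in compounds:
                if any(key in record and compound.get(key) == record[key]
                       for key in ("name", "label")):
                    target = compound
                    break
            if target is None:
                target = {}
                compounds.append(target)
        elif compounds:
            # Orphan property: assume it belongs to the most recent compound.
            target = compounds[-1]
        if target is None:
            continue  # orphan seen before any compound; dropped in this sketch
        for key, value in record.items():
            if isinstance(value, list):
                merged = target.setdefault(key, [])
                for item in value:
                    if item not in merged:  # avoid duplicating equal properties
                        merged.append(item)
            else:
                target.setdefault(key, value)
    # Global contextual information applies to every compound that lacks it.
    for compound in compounds:
        for key, value in (context or {}).items():
            compound.setdefault(key, list(value))
    return compounds


records = [
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"},
    {"melting_points": [{"value": "280.7", "units": "°C"}]},
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"},
    {"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]},
]
context = {"uv_vis": [{"solvent": "acetonitrile", "apparatus": "Agilent 8453 spectrometer"}]}
resolved = resolve(records, context)
```

With the slide's records, the four partial records collapse into a single compound that carries the name, role, label, one melting point, and the shared UV-vis context.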

Slide 15

Slide 15 text

chemdataextractor.org

Slide 16

Slide 16 text

chemdataextractor.org

Slide 17

Slide 17 text

chemdataextractor.org

Slide 18

Slide 18 text

chemdataextractor.org

Slide 19

Slide 19 text

• Expertise
  • Fragments and structure-based drug discovery (Protein Science, Structural Biology, Chemistry)
• Therapeutic areas
  • Oncology, CNS, infectious diseases
• Location
  • Based in Granta Park, just outside Cambridge

Slide 20

Slide 20 text

Text Mining at Vernalis
• Patents: USPTO, WIPO and EPO
• NextMove Software: spelling correction, translation, reaction extraction
• PatSearch: In-house patent searching and alerting platform
• PatExplore: Patent chemistry explorer

Slide 21

Slide 21 text

PatSearch
• Python Flask web app with Celery task queues for downloading, processing, indexing patents
• ElasticSearch index with ReactJS frontend
• Extract entities using LeadMine and store in database
• From publication to a scientist’s inbox in under 1 hour

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

ChemDataExtractor Info
• chemdataextractor.org - Try it out!
• Python package:
    $ pip install chemdataextractor
    $ conda install -c chemdataextractor chemdataextractor
• JCIM Article:
    ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature
    Matthew C. Swain and Jacqueline M. Cole*
    Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
    ABSTRACT: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate
    pubs.acs.org/jcim