Automated Extraction of Chemical Information from the Scientific Literature

Matt Swain
Cavendish Laboratory, University of Cambridge
Vernalis (R&D) Ltd

Cambridge Cheminformatics Meeting 07/12/2016 Matt Swain

Overview
• ChemDataExtractor: Open source toolkit
• Natural language processing
• Holistic information extraction
• Text mining at Vernalis

The Goal
Input (text, figure caption, table):
  The dye 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid (3a) was added…
  Figure 2: UV-vis absorption spectra of 3a in acetonitrile. UV-vis spectra were recorded using an Agilent8453 diode array spectrophotometer.
  Dye   λmax/nm (ε/M−1 cm−1)
  3a    448 (29,000)
  3b    415 (48,000)
Output (extracted record):
  name: 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid
  label: 3a
  uv-vis:
    - solvent: acetonitrile
    - wavelength: 448, units: nm
    - extinction: 29000, units: M-1cm-1
    - apparatus: Agilent8453 diode array spectrophotometer

System Overview
[Diagram: PDF, HTML, XML → Document Readers → Abstract, Full Text, Captions, Tables → Natural Language Processor and Table Processor → Interdependency Resolver → Database]

Document Model
• Process all input formats into a consistent internal document model
• Publisher-specific (ACS, RSC, USPTO) and generic format readers
• Result: Simplified linear stream of document elements
[Diagram: PDF, HTML, XML → Document Readers → Document (Title, Abstract, Heading, Paragraph, Table, …)]
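The idea of reading an input format into a linear stream of document elements can be sketched with the standard library alone. This is a toy illustration, not ChemDataExtractor's reader code: the `Element` class, the `ToyHtmlReader` class, the `read_html` function, and the tag-to-element mapping are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Element:
    """One item in the linear document stream (hypothetical model)."""
    kind: str  # e.g. "Title", "Heading", "Paragraph"
    text: str


class ToyHtmlReader(HTMLParser):
    """Toy reader: maps a few HTML tags onto document element kinds."""

    # Assumed mapping, chosen only for this example.
    TAG_KINDS = {"title": "Title", "h1": "Heading", "h2": "Heading", "p": "Paragraph"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._kind = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_KINDS:
            self._kind = self.TAG_KINDS[tag]
            self._buffer = []

    def handle_data(self, data):
        # Collect text only while inside a recognised element.
        if self._kind is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._kind is not None and self.TAG_KINDS.get(tag) == self._kind:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.elements.append(Element(self._kind, text))
            self._kind = None


def read_html(html):
    """Return the document as a flat list of Element objects."""
    reader = ToyHtmlReader()
    reader.feed(html)
    reader.close()
    return reader.elements


html = "<html><title>Azo dyes</title><body><h1>Results</h1><p>The dye <b>3a</b> was added.</p></body></html>"
for element in read_html(html):
    print(element.kind, "|", element.text)
```

A reader for another format would emit the same `Element` stream, so everything downstream only has to deal with one model.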

Natural Language Processing
• All text-based elements make use of the same natural language processing pipeline:
  Sentence Splitting → Word Tokenization → Part-of-speech Tagging → Entity Recognition → Phrase Parsing
• Variety of techniques: Rules, Dictionaries, Machine Learning

Tokenization
• Split sentences using Punkt method
• Custom rule-based word tokenizer
• Normalization performed on the content of each token
Sentence splitting examples: Fig. 3 | m.p. 185 °C | M. Swain et al. | equiv.
Word tokenization examples:
  ~1.8mg           →  ~ | 1.8 | mg
  (+/−)-chiraline  →  (+/−)-chiraline
  buffer (pH7.4)   →  buffer | ( | pH | 7.4 | )
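A rule-based word tokenizer that reproduces the three examples on this slide might look like the sketch below. It is not the ChemDataExtractor tokenizer: the rules, the short `UNITS` list, and the function names are assumptions made for illustration, and sentence splitting and normalization are left out.

```python
import re

# Assumed, deliberately short list of units that may be glued to a number.
UNITS = ("mg", "kg", "g", "mL", "L", "nm", "MHz", "Hz", "mM", "M", "K", "%")
UNIT_PATTERN = "|".join(re.escape(unit) for unit in UNITS)

# "~1.8mg" -> "~", "1.8", "mg"; the "~" and the unit are both optional.
QUANTITY = re.compile(rf"^(~)?(\d+(?:\.\d+)?)({UNIT_PATTERN})?$")
# "pH7.4" -> "pH", "7.4"
PH_VALUE = re.compile(r"^(pH)(\d+(?:\.\d+)?)$")


def split_core(core):
    """Split a bracket-free chunk into tokens, or return it unchanged."""
    for pattern in (QUANTITY, PH_VALUE):
        match = pattern.match(core)
        if match:
            return [group for group in match.groups() if group]
    return [core]


def tokenize(text):
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        # Peel trailing commas and semicolons into their own tokens.
        while len(chunk) > 1 and chunk[-1] in ",;":
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        opens, closes = chunk.count("("), chunk.count(")")
        # A bracket is split off only if it wraps the whole chunk or has no
        # partner inside it, so "(+/−)-chiraline" stays a single token.
        wrapped = opens == closes == 1 and chunk[0] == "(" and chunk[-1] == ")"
        if len(chunk) > 1 and chunk[0] == "(" and (wrapped or closes == 0):
            leading.append("(")
            chunk = chunk[1:]
        if len(chunk) > 1 and chunk[-1] == ")" and (wrapped or opens == 0):
            trailing.insert(0, ")")
            chunk = chunk[:-1]
        tokens.extend(leading + split_core(chunk) + trailing)
    return tokens


for example in ("~1.8mg", "(+/−)-chiraline", "buffer (pH7.4)"):
    print(example, "->", tokenize(example))
```

Keeping brackets attached when they are balanced inside a chunk is the design choice that protects chemical names, at the cost of needing extra rules for other punctuation.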

Part-Of-Speech Tagging
• Assign a category to each token - 47 tags in total
• CRF trained on newspaper articles (WSJ) and biomedical abstracts (GENIA)
• Used as a feature in machine learning entity recognition
• Used to construct parsing rules

NN   Noun
NNS  Plural Noun
JJ   Adjective
CD   Cardinal Number
CC   Conjunction
DT   Determiner
VB   Verb
FW   Foreign Word

>>> from chemdataextractor.nlp.pos import ChemCrfPosTagger
>>> cpt = ChemCrfPosTagger()
>>> cpt.tag(['NMR', 'spectra', 'were', 'recorded', 'on', 'a', '300', 'MHz', 'BRUKER', 'spectrometer'])
[('NMR', 'NN'), ('spectra', 'NNS'), ('were', 'VBD'), ('recorded', 'VBN'), ('on', 'IN'), ('a', 'DT'), ('300', 'CD'), ('MHz', 'NNP'), ('BRUKER', 'NNP'), ('spectrometer', 'NN')]

Entity Recognition
• Tagging chemical names using “IOB” scheme:
    with  dimethyl  sulfoxide  as  solvent
    O     B-CM      I-CM       O   O
• Combined dictionary, machine learning and regular expression methods
• Lowe & Sayle (2015) J. Cheminf. - Dictionary creation techniques
• Abbreviation Detection
    Tetrahydrofuran  (  THF   )
    B-CM             O  B-CM  O
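To make the IOB scheme concrete, the sketch below turns a tag sequence back into chemical mention strings. It only decodes tags that some tagger has already produced; the function name `iob_to_spans` is made up for this example and is not part of ChemDataExtractor.

```python
def iob_to_spans(tokens, tags, entity="CM"):
    """Collect the token runs marked B-<entity> followed by I-<entity> tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + entity:
            # A new mention begins; close any mention still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-" + entity and current:
            # Continuation of the mention that is currently open.
            current.append(token)
        else:
            # "O" (outside) or a stray I- tag ends the open mention.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


print(iob_to_spans(["with", "dimethyl", "sulfoxide", "as", "solvent"],
                   ["O", "B-CM", "I-CM", "O", "O"]))
print(iob_to_spans(["Tetrahydrofuran", "(", "THF", ")"],
                   ["B-CM", "O", "B-CM", "O"]))
```

The B- prefix is what lets two adjacent mentions be told apart, which a plain inside/outside labelling could not do.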

Word Clusters
• Brown clustering: Hierarchical clustering of words
• Bitstring prefixes used as features in CRF to improve POS tagging and entity recognition performance

000000000: ethanol, methanol, acetone, glycerol, brine
0000000010: THF, DMSO, CH2Cl2, toluene, DMF
0000000011: H2O, MeOH, EtOAc, hexane, TFA
1111000001110: substituted, functionalized, terminated
1111000001111: conjugated, interacting, cross-linked, fused
10011011100: NPs, NCs, NRs, NWs, NTs, NSs, NBs
10011011101: nanoparticles, nanocrystals, nanowires

https://github.com/percyliang/brown-cluster
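How bitstring prefixes become features can be sketched as follows. The cluster assignments are copied from the slide; the prefix lengths (4, 6, 10) and the feature names are assumptions for illustration, since the slide does not state which lengths were used.

```python
# Word -> Brown cluster bitstring, taken from the examples on the slide.
CLUSTERS = {
    "THF": "0000000010",
    "DMSO": "0000000010",
    "MeOH": "0000000011",
    "NPs": "10011011100",
    "nanoparticles": "10011011101",
}


def cluster_features(word, prefix_lengths=(4, 6, 10)):
    """Return one feature per bitstring prefix (prefix lengths are assumed)."""
    bits = CLUSTERS.get(word)
    if bits is None:
        return {}  # unknown word: no cluster features
    return {f"brown_{n}": bits[:n] for n in prefix_lengths if n <= len(bits)}


print(cluster_features("THF"))
print(cluster_features("MeOH"))
```

Short prefixes group words coarsely and long prefixes finely, so "THF" and "MeOH" share the 4- and 6-bit features but differ at 10 bits, while "NPs" and "nanoparticles" still agree at 10 bits.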

Rule-Based Parsing
• Parsers transform tagged tokens into a tree structure
• Rules defined in Python code from building blocks:
  • W: Match word text
  • T: Match POS tag or entity tag
  • R: Match regular expression
  • And, Or, ZeroOrMore, OneOrMore, Optional, …
• Built in parsers: Chemical Names, Melting Points, Quantum Yield, Fluorescence lifetimes, NMR, UV-vis, IR spectra, etc.

name = T('B-CM') + ZeroOrMore(T('I-CM'))
label = R('[1-9][a-z]?') | R('[IVX]+')
cem = name + W('(') + label + W(')')

[Parse tree: cem → name (1,2-dihydroxybenzene), label (3a)]
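The way such building blocks compose with `+` and `|` can be illustrated with a toy re-implementation. The classes below borrow the names W, T, R and ZeroOrMore from the slide but are not the library's code: they are a greedy matcher without backtracking that only reports where a match ends, rather than building a tree.

```python
import re


class Element:
    """Base class: match(tokens, i) returns the index after the match, or None."""

    def __add__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)


class W(Element):
    """Match a token by its exact text."""

    def __init__(self, text):
        self.text = text

    def match(self, tokens, i):
        return i + 1 if i < len(tokens) and tokens[i][0] == self.text else None


class T(Element):
    """Match a token by its tag."""

    def __init__(self, tag):
        self.tag = tag

    def match(self, tokens, i):
        return i + 1 if i < len(tokens) and tokens[i][1] == self.tag else None


class R(Element):
    """Match a token whose whole text fits a regular expression."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def match(self, tokens, i):
        return i + 1 if i < len(tokens) and self.regex.fullmatch(tokens[i][0]) else None


class And(Element):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, tokens, i):
        mid = self.left.match(tokens, i)
        return None if mid is None else self.right.match(tokens, mid)


class Or(Element):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, tokens, i):
        end = self.left.match(tokens, i)
        return end if end is not None else self.right.match(tokens, i)


class ZeroOrMore(Element):
    def __init__(self, element):
        self.element = element

    def match(self, tokens, i):
        while True:
            end = self.element.match(tokens, i)
            if end is None or end == i:
                return i
            i = end


# The rules from the slide, written against the toy classes above.
name = T('B-CM') + ZeroOrMore(T('I-CM'))
label = R('[1-9][a-z]?') | R('[IVX]+')
cem = name + W('(') + label + W(')')

tagged = [("1,2-dihydroxybenzene", "B-CM"), ("(", "O"), ("3a", "O"), (")", "O")]
print(cem.match(tagged, 0))  # 4: all four tokens were consumed
```

Operator overloading is what lets a rule read like a grammar while remaining ordinary Python objects that can be reused inside larger rules.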

Table Processing
• Process each table cell with a specialised natural language processing pipeline
• Parse headings to determine column type and extract contextual information
• Parse cells in each row to extract values
• Merge contextual information from caption, footnotes, and headings
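As a rough sketch of heading and cell parsing, the code below handles one specific layout: a heading of the form "quantity/units (quantity/units)" and a cell of the form "value (value)", as in the UV-vis table on the "The Goal" slide. The regular expressions, the `NAMES` lookup, and the function names are assumptions for this example and do not reflect how ChemDataExtractor's table parsers are written.

```python
import re

# "quantity / units", e.g. "λmax/nm" or "ε/M−1 cm−1"
PART = r"\s*([^/()]+?)\s*/\s*([^()]+?)\s*"
HEADING = re.compile(rf"^{PART}\({PART}\)\s*$")

# A number with optional thousands separators, e.g. "448" or "29,000"
NUMBER = r"(\d[\d,]*(?:\.\d+)?)"
CELL = re.compile(rf"^\s*{NUMBER}\s*\(\s*{NUMBER}\s*\)\s*$")

# Assumed mapping from heading symbols to property names.
NAMES = {"λmax": "wavelength", "ε": "extinction"}


def parse_heading(heading):
    """Return [(property, units), (property, units)], or None if no match."""
    match = HEADING.match(heading)
    if match is None:
        return None
    q1, u1, q2, u2 = match.groups()
    columns = []
    for symbol, units in ((q1, u1), (q2, u2)):
        symbol = symbol.replace(" ", "")  # "λ max" -> "λmax"
        columns.append((NAMES.get(symbol, symbol), units))
    return columns


def parse_cell(cell, columns):
    """Pair each value in the cell with the property and units of its column."""
    match = CELL.match(cell)
    if match is None:
        return None
    values = [float(value.replace(",", "")) for value in match.groups()]
    return {name: {"value": value, "units": units}
            for (name, units), value in zip(columns, values)}


columns = parse_heading("λmax/nm (ε/M−1 cm−1)")
print(columns)
print(parse_cell("448 (29,000)", columns))
```

The heading is parsed once per column and then supplies the units for every cell beneath it, which is why the cells themselves can be as terse as "448 (29,000)".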

Data Schema
Compound: name, label, role, melting points, uvvis spectra
Melting Point: value, units
UV-vis Spectrum: peaks, solvent, temperature, apparatus
Peak: wavelength, extinction, shape
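The schema on this slide maps naturally onto nested record types. The dataclasses below are a minimal sketch of that shape using the field names from the slide; they are hypothetical stand-ins, not ChemDataExtractor's own model classes, and storing values as strings is an assumption.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Peak:
    wavelength: Optional[str] = None
    extinction: Optional[str] = None
    shape: Optional[str] = None


@dataclass
class UvvisSpectrum:
    peaks: List[Peak] = field(default_factory=list)
    solvent: Optional[str] = None
    temperature: Optional[str] = None
    apparatus: Optional[str] = None


@dataclass
class MeltingPoint:
    value: Optional[str] = None
    units: Optional[str] = None


@dataclass
class Compound:
    name: Optional[str] = None
    label: Optional[str] = None
    role: Optional[str] = None
    melting_points: List[MeltingPoint] = field(default_factory=list)
    uvvis_spectra: List[UvvisSpectrum] = field(default_factory=list)


# The dye from the "The Goal" slide, expressed in this schema.
dye = Compound(
    name="2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid",
    label="3a",
    uvvis_spectra=[
        UvvisSpectrum(
            peaks=[Peak(wavelength="448", extinction="29000")],
            solvent="acetonitrile",
            apparatus="Agilent8453 diode array spectrophotometer",
        )
    ],
)
print(asdict(dye))
```

Making every field optional reflects that a single sentence or table cell usually fills in only part of a record.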

Interdependency Resolution
• Assign orphan properties to relevant compound
• Merge equivalent chemical identifiers to a single record
• Assign global contextual information

{"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"}
{"melting_points": [{"value": "280.7", "units": "°C"}]}
{"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"}
{"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]}
{"uv_vis": [{"solvent": "acetonitrile", "apparatus": "Agilent 8453 spectrometer"}]}
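A simplified version of the first two steps can be sketched over the example records from this slide. The heuristics here are assumptions: an orphan property is attached to the most recently mentioned compound, and two records are merged when they share a name or a label. Global contextual information is not handled, and this is not ChemDataExtractor's actual resolution algorithm.

```python
def merge_into(target, record):
    """Copy fields from record into target without duplicating list items."""
    for key, value in record.items():
        if isinstance(value, list):
            existing = target.setdefault(key, [])
            existing.extend(item for item in value if item not in existing)
        else:
            target.setdefault(key, value)


def resolve(records):
    compounds = []  # merged compound records, in order of first mention
    last = None     # most recently mentioned compound
    for record in records:
        identifiers = {key: record[key] for key in ("name", "label") if key in record}
        if identifiers:
            # Reuse an existing compound that shares a name or a label.
            target = next(
                (c for c in compounds
                 if any(c.get(key) == value for key, value in identifiers.items())),
                None,
            )
            if target is None:
                target = {}
                compounds.append(target)
            last = target
        elif last is not None:
            target = last  # orphan property: attach to the latest compound
        else:
            continue       # orphan property with no compound seen yet
        merge_into(target, record)
    return compounds


records = [
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"},
    {"melting_points": [{"value": "280.7", "units": "°C"}]},
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"},
    {"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]},
]
print(resolve(records))
```

With these four inputs the sketch yields a single compound record carrying the name, role, label, and one melting point. A fuller version would also need to merge transitively when a later record links two compounds that were previously kept separate.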

chemdataextractor.org

• Expertise
  • Fragments and structure-based drug discovery (Protein Science, Structural Biology, Chemistry)
• Therapeutic areas
  • Oncology, CNS, infectious diseases
• Location
  • Based in Granta Park, just outside Cambridge

Text Mining at Vernalis
• Patents: USPTO, WIPO and EPO
• NextMove Software: spelling correction, translation, reaction extraction
• PatSearch: In-house patent searching and alerting platform
• PatExplore: Patent chemistry explorer

PatSearch
• Python Flask web app with Celery task queues for downloading, processing, indexing patents
• ElasticSearch index with ReactJS frontend
• Extract entities using LeadMine and store in database
• From publication to a scientist’s inbox in under 1 hour

ChemDataExtractor Info
• chemdataextractor.org - Try it out!
• Python package:
    $ pip install chemdataextractor
    $ conda install -c chemdataextractor chemdataextractor
• JCIM Article:
    ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature
    Matthew C. Swain and Jacqueline M. Cole*
    Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
    ABSTRACT: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate…
    pubs.acs.org/jcim
