
Automated Extraction of Chemical Information from the Scientific Literature

Matt Swain
December 07, 2016

ChemDataExtractor and other tools for chemical text mining.

Transcript

  1. Automated Extraction of Chemical Information from the Scientific Literature

    Matt Swain
    Cavendish Laboratory, University of Cambridge
    Vernalis (R&D) Ltd
  2. Overview (Cambridge Cheminformatics Meeting, 07/12/2016, Matt Swain)

    • ChemDataExtractor: Open source toolkit
    • Natural language processing
    • Holistic information extraction
    • Text mining at Vernalis
  3. The Goal

    Input fragments:
      Text: The dye 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid (3a) was added…
      Text: UV-vis spectra were recorded using an Agilent8453 diode array spectrophotometer.
      Caption: Figure 2: UV-vis absorption spectra of 3a in acetonitrile.
      Table: λmax/nm (ε/M−1 cm−1) 3b 448 (29,000) Dye 3a 415 (48,000)
    Extracted record:
      name: 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid
      label: 3a
      uv-vis:
        - solvent: acetonitrile
        - wavelength: 448, units: nm
        - extinction: 29000, units: M-1cm-1
        - apparatus: Agilent8453 diode array spectrophotometer
  4. System Overview

    Diagram components: PDF, HTML, XML; Document Readers; Abstract, Full Text, Captions, Tables; Natural Language Processor; Table Processor; Interdependency Resolver; Database
  5. Document Model

    • Process all input formats into a consistent internal document model
    • Publisher-specific (ACS, RSC, USPTO) and generic format readers
    • Result: Simplified linear stream of document elements
    Diagram: PDF, HTML, XML → Document Readers → Document (Title, Abstract, Heading, Paragraph, Table, …)
  6. Natural Language Processing

    • All text-based elements make use of the same natural language processing pipeline: Sentence Splitting, Word Tokenization, Part-of-speech Tagging, Entity Recognition, Phrase Parsing
    • Variety of techniques: Rules, Dictionaries, Machine Learning
  7. Tokenization

    • Split sentences using Punkt method
    • Custom rule-based word tokenizer
    • Normalization performed on the content of each token
    Examples: Fig. 3 · m.p. 185 °C · M. Swain et al. · equiv.
      ~1.8mg → ~ 1.8 mg
      (+/−)-chiraline → (+/−)-chiraline
      buffer (pH7.4) → buffer ( pH 7.4 )
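
A rule-based word tokenizer of the kind this slide describes can be sketched in a few lines. This is an illustrative stand-in, not ChemDataExtractor's tokenizer: the function names, the small unit list, and the rule for keeping bracketed prefixes attached to a chemical name are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small unit list; a real tokenizer needs far more.
NUMBER_UNIT = re.compile(r'^(\d+(?:\.\d+)?)(mg|g|mL|nm|MHz|°C)$')
PH_VALUE = re.compile(r'^(pH)(\d+(?:\.\d+)?)$')
# A bracketed prefix joined to a word by a hyphen, e.g. (+/−)-chiraline.
BRACKET_PREFIX_NAME = re.compile(r'^\([^()\s]+\)-\S+$')


def split_token(tok):
    """Split one whitespace-delimited chunk into word tokens."""
    if BRACKET_PREFIX_NAME.match(tok):
        return [tok]  # brackets belong to the name, so keep it whole
    if len(tok) > 1 and tok[0] in '~(':
        return [tok[0]] + split_token(tok[1:])
    if len(tok) > 1 and tok[-1] in '),;':
        return split_token(tok[:-1]) + [tok[-1]]
    for pattern in (NUMBER_UNIT, PH_VALUE):
        match = pattern.match(tok)
        if match:
            return [match.group(1), match.group(2)]
    return [tok]


def tokenize(text):
    """Tokenize a sentence by applying the rules to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_token(chunk))
    return tokens


print(tokenize('~1.8mg'))           # ['~', '1.8', 'mg']
print(tokenize('(+/−)-chiraline'))  # ['(+/−)-chiraline']
print(tokenize('buffer (pH7.4)'))   # ['buffer', '(', 'pH', '7.4', ')']
```

The sketch leaves trailing full stops attached, since abbreviations such as "Fig." and "et al." make that decision depend on sentence splitting.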
  8. Part-Of-Speech Tagging

    • Assign a category to each token - 47 tags in total
    • CRF trained on newspaper articles (WSJ) and biomedical abstracts (GENIA)
    • Used as a feature in machine learning entity recognition
    • Used to construct parsing rules
    Example tags: NN Noun, NNS Plural Noun, JJ Adjective, CD Cardinal Number, CC Conjunction, DT Determiner, VB Verb, FW Foreign Word
    >>> from chemdataextractor.nlp.pos import ChemCrfPosTagger
    >>> cpt = ChemCrfPosTagger()
    >>> cpt.tag(['NMR', 'spectra', 'were', 'recorded', 'on', 'a', '300', 'MHz', 'BRUKER', 'spectrometer'])
    [('NMR', 'NN'), ('spectra', 'NNS'), ('were', 'VBD'), ('recorded', 'VBN'), ('on', 'IN'), ('a', 'DT'), ('300', 'CD'), ('MHz', 'NNP'), ('BRUKER', 'NNP'), ('spectrometer', 'NN')]
  9. Entity Recognition

    • Tagging chemical names using “IOB” scheme
    • Combined dictionary, machine learning and regular expression methods
    • Lowe & Sayle (2015) J. Cheminf. - Dictionary creation techniques
    • Abbreviation Detection
    Examples (token/tag):
      with/O dimethyl/B-CM sulfoxide/I-CM as/O solvent/O
      Tetrahydrofuran/B-CM (/O THF/B-CM )/O
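
To make the IOB scheme concrete, the following minimal sketch turns a tagged token sequence into chemical mention strings. The function name `iob_to_mentions` is hypothetical; the tokens and tags are the two examples from this slide.

```python
def iob_to_mentions(tokens, tags):
    """Collect chemical mentions from parallel token and IOB tag lists.

    'B-CM' begins a mention, 'I-CM' continues it, and 'O' is outside
    any mention.
    """
    mentions = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == 'B-CM':
            if current:
                mentions.append(' '.join(current))
            current = [token]
        elif tag == 'I-CM' and current:
            current.append(token)
        else:
            if current:
                mentions.append(' '.join(current))
            current = []
    if current:
        mentions.append(' '.join(current))
    return mentions


print(iob_to_mentions(
    ['with', 'dimethyl', 'sulfoxide', 'as', 'solvent'],
    ['O', 'B-CM', 'I-CM', 'O', 'O'],
))  # ['dimethyl sulfoxide']

print(iob_to_mentions(
    ['Tetrahydrofuran', '(', 'THF', ')'],
    ['B-CM', 'O', 'B-CM', 'O'],
))  # ['Tetrahydrofuran', 'THF']
```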
  10. Word Clusters

    • Brown clustering: Hierarchical clustering of words
    • Bitstring prefixes used as features in CRF to improve POS tagging and entity recognition performance
    Example clusters:
      000000000: ethanol, methanol, acetone, glycerol, brine
      0000000010: THF, DMSO, CH2Cl2, toluene, DMF
      0000000011: H2O, MeOH, EtOAc, hexane, TFA
      1111000001110: substituted, functionalized, terminated
      1111000001111: conjugated, interacting, cross-linked, fused
      10011011100: NPs, NCs, NRs, NWs, NTs, NSs, NBs
      10011011101: nanoparticles, nanocrystals, nanowires
    https://github.com/percyliang/brown-cluster
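
As a rough illustration of how bitstring prefixes become features, the sketch below looks up a word's cluster and emits prefixes of a few lengths. The cluster bitstrings are copied from this slide; the prefix lengths and the feature names are assumptions chosen for the example, not the values used in ChemDataExtractor.

```python
# Word -> Brown cluster bitstring, taken from the examples on this slide.
CLUSTERS = {
    'ethanol': '000000000',
    'THF': '0000000010',
    'H2O': '0000000011',
    'NPs': '10011011100',
    'nanoparticles': '10011011101',
}


def brown_prefix_features(word, clusters=CLUSTERS, lengths=(4, 6, 10)):
    """Return bitstring-prefix features for a word (empty if unclustered).

    Shorter prefixes group a word with more distant relatives in the
    cluster hierarchy; longer prefixes keep only close relatives.
    """
    bits = clusters.get(word)
    if bits is None:
        return {}
    return {'brown_%d' % n: bits[:n] for n in lengths}


thf = brown_prefix_features('THF')
water = brown_prefix_features('H2O')
print(thf)
# THF and H2O sit in sibling clusters, so only their full bitstrings differ.
print(thf['brown_6'] == water['brown_6'])    # True
print(thf['brown_10'] == water['brown_10'])  # False
```

Because a rare word shares its prefixes with common words of the same kind, a CRF can generalize to it even with few labeled examples.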
  11. Rule-Based Parsing

    • Parsers transform tagged tokens into a tree structure
    • Rules defined in Python code from building blocks:
      • W: Match word text
      • T: Match POS tag or entity tag
      • R: Match regular expression
      • And, Or, ZeroOrMore, OneOrMore, Optional, …
    • Built in parsers: Chemical Names, Melting Points, Quantum Yield, Fluorescence lifetimes, NMR, UV-vis, IR spectra, etc.
    name = T('B-CM') + ZeroOrMore(T('I-CM'))
    label = R('[1-9][a-z]?') | R('[IVX]+')
    cem = name + W('(') + label + W(')')
    Example tree: cem → name (1,2-dihydroxybenzene), label (3a)
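
The building blocks listed on this slide can be imitated with a few small classes. The sketch below is a toy re-implementation written for illustration, not ChemDataExtractor's parser: the `Group` wrapper, the `(text, tag)` token format, and the `-LRB-`, `-RRB-` and `CD` tags in the example are assumptions.

```python
import re


class Element:
    """Base class. parse() returns (results, next_index) or None."""

    def __add__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)


class TokenMatcher(Element):
    """Match a single (text, tag) token using a predicate."""

    def matches(self, token):
        raise NotImplementedError

    def parse(self, tokens, i):
        if i < len(tokens) and self.matches(tokens[i]):
            return [tokens[i][0]], i + 1
        return None


class W(TokenMatcher):
    def __init__(self, text):
        self.text = text

    def matches(self, token):
        return token[0] == self.text


class T(TokenMatcher):
    def __init__(self, tag):
        self.tag = tag

    def matches(self, token):
        return token[1] == self.tag


class R(TokenMatcher):
    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def matches(self, token):
        return self.regex.fullmatch(token[0]) is not None


class And(Element):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def parse(self, tokens, i):
        first = self.left.parse(tokens, i)
        if first is None:
            return None
        second = self.right.parse(tokens, first[1])
        if second is None:
            return None
        return first[0] + second[0], second[1]


class Or(Element):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def parse(self, tokens, i):
        result = self.left.parse(tokens, i)
        return result if result is not None else self.right.parse(tokens, i)


class ZeroOrMore(Element):
    def __init__(self, element):
        self.element = element

    def parse(self, tokens, i):
        results = []
        while True:
            step = self.element.parse(tokens, i)
            if step is None or step[1] == i:
                return results, i
            results, i = results + step[0], step[1]


class Group(Element):
    """Hypothetical helper: wrap a match in a named tree node."""

    def __init__(self, name, element):
        self.name, self.element = name, element

    def parse(self, tokens, i):
        result = self.element.parse(tokens, i)
        if result is None:
            return None
        return [(self.name, result[0])], result[1]


name = Group('name', T('B-CM') + ZeroOrMore(T('I-CM')))
label = Group('label', R('[1-9][a-z]?') | R('[IVX]+'))
cem = Group('cem', name + W('(') + label + W(')'))

tokens = [('1,2-dihydroxybenzene', 'B-CM'), ('(', '-LRB-'),
          ('3a', 'CD'), (')', '-RRB-')]
tree, end = cem.parse(tokens, 0)
print(tree)
# [('cem', [('name', ['1,2-dihydroxybenzene']), '(', ('label', ['3a']), ')'])]
```

Overloading `+` and `|` is what lets rules read like a grammar while remaining ordinary Python objects that can be composed and reused.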
  12. Table Processing

    • Process each table cell with a specialised natural language processing pipeline
    • Parse headings to determine column type and extract contextual information
    • Parse cells in each row to extract values
    • Merge contextual information from caption, footnotes, and headings
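
The heading-then-cell idea can be sketched with two regular expressions, using the UV-vis heading and a cell value from the example on "The Goal" slide. This is an assumed, single-pattern illustration rather than the toolkit's table parser; the function names and the `FIELD_NAMES` mapping are hypothetical.

```python
import re

# Heading of the form "quantity/units (quantity/units)".
HEADING = re.compile(
    r'^(?P<q1>[^/()]+)/(?P<u1>[^()]+?)\s*\((?P<q2>[^/()]+)/(?P<u2>[^()]+)\)$'
)
# Cell of the form "value (value)".
CELL = re.compile(r'^(?P<v1>[\d.,]+)\s*\((?P<v2>[\d.,]+)\)$')

# Hypothetical mapping from heading symbols to record field names.
FIELD_NAMES = {'λmax': 'wavelength', 'ε': 'extinction'}


def parse_heading(heading):
    """Return [(field, units), (field, units)] or None if unrecognized."""
    match = HEADING.match(heading.strip())
    if match is None:
        return None
    columns = []
    for q, u in (('q1', 'u1'), ('q2', 'u2')):
        quantity = match.group(q).strip()
        columns.append((FIELD_NAMES.get(quantity, quantity), match.group(u).strip()))
    return columns


def parse_cell(cell, columns):
    """Combine a cell's values with the units found in the heading."""
    match = CELL.match(cell.strip())
    if match is None or columns is None:
        return None
    record = {}
    for (field, units), raw in zip(columns, (match.group('v1'), match.group('v2'))):
        record[field] = raw.replace(',', '')  # drop thousands separators
        record[field + '_units'] = units
    return record


columns = parse_heading('λmax/nm (ε/M−1 cm−1)')
peak = parse_cell('448 (29,000)', columns)
print(peak['wavelength'], peak['wavelength_units'])  # 448 nm
print(peak['extinction'])                            # 29000
```

Parsing the heading once and reusing it for every row is what lets a bare cell such as "448 (29,000)" pick up its units and meaning.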
  13. Data Schema

    Compound: name, label, role, melting points, uvvis spectra
    Melting Point: value, units
    UV-vis Spectrum: peaks, solvent, temperature, apparatus
    Peak: wavelength, extinction, shape
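
One way to picture this schema is as nested data classes. The sketch below mirrors the field names on the slide using the standard library only; the class names, the string-valued fields, and the optional defaults are assumptions, and it is not the toolkit's own model code.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MeltingPoint:
    value: str
    units: str


@dataclass
class Peak:
    wavelength: str
    extinction: Optional[str] = None
    shape: Optional[str] = None


@dataclass
class UvvisSpectrum:
    peaks: List[Peak] = field(default_factory=list)
    solvent: Optional[str] = None
    temperature: Optional[str] = None
    apparatus: Optional[str] = None


@dataclass
class Compound:
    name: Optional[str] = None
    label: Optional[str] = None
    role: Optional[str] = None
    melting_points: List[MeltingPoint] = field(default_factory=list)
    uvvis_spectra: List[UvvisSpectrum] = field(default_factory=list)


# Values taken from the worked example on "The Goal" slide.
compound = Compound(
    name='2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid',
    label='3a',
    uvvis_spectra=[
        UvvisSpectrum(
            peaks=[Peak(wavelength='448', extinction='29000')],
            solvent='acetonitrile',
            apparatus='Agilent8453 diode array spectrophotometer',
        )
    ],
)
record = asdict(compound)  # nested plain dicts, ready to serialize as JSON
print(record['uvvis_spectra'][0]['solvent'])  # acetonitrile
```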
  14. Interdependency Resolution

    • Assign orphan properties to relevant compound
    • Merge equivalent chemical identifiers to a single record
    • Assign global contextual information
    Example records:
      {"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"}
      {"melting_points": [{"value": "280.7", "units": "°C"}]}
      {"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"}
      {"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]}
      {"uv_vis": [{"solvent": "acetonitrile", "apparatus": "Agilent 8453 spectrometer"}]}
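
The first two bullets can be sketched as a single merging pass over the extracted records. This is a simplified illustration under stated assumptions, not the toolkit's resolver: the function name `resolve` is hypothetical, orphan records are attached to the most recently seen compound, and global context (the third bullet) is left out.

```python
def resolve(records):
    """Merge partial records into one record per compound.

    Records sharing a name or label are merged. A record with neither
    (an "orphan") is attached to the most recently seen compound, which
    is an assumed heuristic for this sketch.
    """
    compounds = []
    last = None
    for record in records:
        target = None
        if 'name' in record or 'label' in record:
            for compound in compounds:
                if any(key in record and record[key] == compound.get(key)
                       for key in ('name', 'label')):
                    target = compound
                    break
            if target is None:
                target = {}
                compounds.append(target)
        elif last is not None:
            target = last
        else:
            continue  # orphan with no compound seen yet: skipped here
        last = target
        for key, value in record.items():
            if isinstance(value, list):
                merged = target.setdefault(key, [])
                for item in value:
                    if item not in merged:  # avoid duplicate properties
                        merged.append(item)
            else:
                target.setdefault(key, value)
    return compounds


melting_point = {'value': '280.7', 'units': '°C'}
records = [
    {'name': '4-(10-phenylanthracene-9-yl)pyridine', 'role': 'product'},
    {'melting_points': [dict(melting_point)]},
    {'name': '4-(10-phenylanthracene-9-yl)pyridine', 'label': '3a'},
    {'label': '3a', 'melting_points': [dict(melting_point)]},
]
resolved = resolve(records)
print(len(resolved))          # 1
print(resolved[0]['label'])   # 3a
```

All four partial records collapse into one compound carrying the name, role, label, and a single melting point.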
  15. • Expertise: Fragments and structure-based drug discovery (Protein Science, Structural Biology, Chemistry)

    • Therapeutic areas: Oncology, CNS, infectious diseases
    • Location: Based in Granta Park, just outside Cambridge
  16. Text Mining at Vernalis

    • Patents: USPTO, WIPO and EPO
    • NextMove Software: spelling correction, translation, reaction extraction
    • PatSearch: In-house patent searching and alerting platform
    • PatExplore: Patent chemistry explorer
  17. PatSearch

    • Python Flask web app with Celery task queues for downloading, processing, indexing patents
    • ElasticSearch index with ReactJS frontend
    • Extract entities using LeadMine and store in database
    • From publication to a scientist’s inbox in under 1 hour
  18. ChemDataExtractor Info

    • chemdataextractor.org - Try it out!
    • Python package:
      $ pip install chemdataextractor
      $ conda install -c chemdataextractor chemdataextractor
    • JCIM Article (pubs.acs.org/jcim): ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. Matthew C. Swain and Jacqueline M. Cole*, Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
    ABSTRACT: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate …