Automated Extraction of Chemical Information from the Scientific Literature

Matt Swain
December 07, 2016

ChemDataExtractor and other tools for chemical text mining.

Transcript

  1. Automated Extraction of Chemical Information from the Scientific Literature
    Matt Swain
    Cavendish Laboratory, University of Cambridge
    Vernalis (R&D) Ltd

  2. Cambridge Cheminformatics Meeting 07/12/2016
    Matt Swain
    Overview
    • ChemDataExtractor: Open source toolkit
    • Natural language processing
    • Holistic information extraction
    • Text mining at Vernalis

  3. The Goal
    Source document fragments:
    The dye 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid (3a) was added…
    Figure 2: UV-vis absorption spectra of 3a in acetonitrile.
    UV-vis spectra were recorded using an Agilent8453 diode array spectrophotometer.
    Dye   λmax/nm (ε/M−1 cm−1)
    3a    448 (29,000)
    3b    415 (48,000)
    Extracted record:
    name 2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid
    label 3a
    uv-vis
    - solvent acetonitrile
    - wavelength 448 units nm
    - extinction 29000 units M-1cm-1
    - apparatus Agilent8453 diode array spectrophotometer

  4. System Overview
    [Diagram] PDF, HTML, XML → Document Readers → Tables, Abstract, Full Text, Captions → Table Processor, Natural Language Processor → Interdependency Resolver → Database

  5. Document Model
    • Process all input formats into a consistent internal document model
    • Publisher-specific (ACS, RSC, USPTO) and generic format readers
    • Result: Simplified linear stream of document elements
    [Diagram] PDF, HTML, XML → Document Readers → Document (Title, Abstract, Heading, Paragraph, Table)
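
A minimal sketch of the "linear stream of document elements" idea, using only the Python standard library. The `Element` and `SimpleHtmlReader` names and the tag-to-element mapping are hypothetical, chosen for illustration; they are not ChemDataExtractor's own reader classes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Element:
    kind: str  # "title", "heading" or "paragraph"
    text: str


class SimpleHtmlReader(HTMLParser):
    """Toy generic reader: flattens HTML into an ordered list of elements."""

    # Assumed mapping from HTML tags to document element kinds.
    TAG_KINDS = {"h1": "title", "h2": "heading", "p": "paragraph"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._kind = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_KINDS:
            self._kind = self.TAG_KINDS[tag]
            self._buffer = []

    def handle_data(self, data):
        # Inline markup such as <b> is ignored; only its text is kept.
        if self._kind is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._kind is not None and self.TAG_KINDS.get(tag) == self._kind:
            self.elements.append(Element(self._kind, "".join(self._buffer).strip()))
            self._kind = None


def read_html(html):
    reader = SimpleHtmlReader()
    reader.feed(html)
    reader.close()
    return reader.elements


stream = read_html("<h1>Azo dyes</h1><h2>Results</h2><p>The dye <b>3a</b> was added.</p>")
```

A publisher-specific reader would differ only in how it maps markup to element kinds; everything downstream sees the same flat list.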

  6. Natural Language Processing
    • All text-based elements make use of the same natural language processing pipeline:
      Sentence Splitting → Word Tokenization → Part-of-speech Tagging → Entity Recognition → Phrase Parsing
    • Variety of techniques: Rules, Dictionaries, Machine Learning
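
The stage order can be illustrated with a toy pipeline in which each stage is a plain function. Every stage here is a deliberately naive stand-in (a regex splitter, a two-tag tagger, a three-word dictionary), and the function names and `SOLVENTS` set are assumptions for illustration only. The phrase parsing stage is omitted.

```python
import re

# Hypothetical dictionary standing in for the real entity recogniser.
SOLVENTS = {"acetonitrile", "toluene", "ethanol"}


def split_sentences(text):
    # Naive rule: split after ., ! or ? when followed by whitespace and a capital.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def tokenize(sentence):
    # Words (optionally hyphenated) and single punctuation marks.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def tag_pos(tokens):
    # Placeholder tagger: digits are CD, everything else NN.
    return [(tok, "CD" if tok.isdigit() else "NN") for tok in tokens]


def recognise_entities(tagged):
    # Dictionary lookup only; adds an IOB-style entity tag to each token.
    return [(tok, pos, "B-CM" if tok.lower() in SOLVENTS else "O") for tok, pos in tagged]


def process(text):
    """Run every sentence through the same sequence of stages."""
    return [recognise_entities(tag_pos(tokenize(s))) for s in split_sentences(text)]


result = process("The dye was dissolved in acetonitrile. Spectra were recorded at 298 K.")
```

Because each stage consumes the previous stage's output, a rule-based, dictionary-based or machine-learned implementation can be swapped in at any step without touching the others.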

  7. Tokenization
    • Split sentences using Punkt method
      Examples: Fig. 3, m.p. 185 °C, M. Swain et al., equiv.
    • Custom rule-based word tokenizer
      ~1.8mg → ~ | 1.8 | mg
      (+/−)-chiraline → (+/−)-chiraline
      buffer (pH7.4) → buffer | ( | pH | 7.4 | )
    • Normalization performed on the content of each token
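
A small sketch of a rule-based word tokenizer that reproduces the three word-level examples on this slide. The two rules (split quantities, peel brackets that wrap a whole chunk) and the `QUANTITY` pattern are assumptions made for illustration, not the rules ChemDataExtractor actually ships.

```python
import re

# Hypothetical rule: optional "~" or "pH" prefix, a number, then an optional unit.
QUANTITY = re.compile(r"^(~)?(pH)?(\d+(?:\.\d+)?)([A-Za-zµ°%]*)$")


def split_quantity(chunk):
    """Split '~1.8mg' or 'pH7.4' into parts; return other chunks unchanged."""
    found = QUANTITY.match(chunk)
    if found is None:
        return [chunk]
    return [part for part in found.groups() if part]


def tokenize(text):
    tokens = []
    for chunk in text.split():
        # Peel brackets only when they wrap the whole chunk, so that
        # names such as "(+/−)-chiraline" survive as a single token.
        if len(chunk) > 2 and chunk.startswith("(") and chunk.endswith(")"):
            tokens.extend(["(", *split_quantity(chunk[1:-1]), ")"])
        else:
            tokens.extend(split_quantity(chunk))
    return tokens
```

Keeping the rules explicit makes it easy to protect chemical names, where brackets, hyphens and digits are part of the word rather than separators.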

  8. Part-Of-Speech Tagging
    • Assign a category to each token - 47 tags in total
    • CRF trained on newspaper articles (WSJ) and biomedical abstracts (GENIA)
    • Used as a feature in machine learning entity recognition
    • Used to construct parsing rules
    NN Noun
    NNS Plural Noun
    JJ Adjective
    CD Cardinal Number
    CC Conjunction
    DT Determiner
    VB Verb
    FW Foreign Word
    >>> from chemdataextractor.nlp.pos import ChemCrfPosTagger
    >>> cpt = ChemCrfPosTagger()
    >>> cpt.tag(['NMR', 'spectra', 'were', 'recorded', 'on', 'a', '300', 'MHz', 'BRUKER', 'spectrometer'])
    [('NMR', 'NN'), ('spectra', 'NNS'), ('were', 'VBD'), ('recorded', 'VBN'), ('on', 'IN'),
    ('a', 'DT'), ('300', 'CD'), ('MHz', 'NNP'), ('BRUKER', 'NNP'), ('spectrometer', 'NN')]

  9. Entity Recognition
    • Tagging chemical names using “IOB” scheme:
      with  dimethyl  sulfoxide  as  solvent
      O     B-CM      I-CM       O   O
    • Combined dictionary, machine learning and regular expression methods
    • Lowe & Sayle (2015) J. Cheminf. - Dictionary creation techniques
    • Abbreviation Detection
      Tetrahydrofuran  (  THF   )
      B-CM             O  B-CM  O
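
To make the IOB scheme concrete, here is a short sketch that collapses a tag sequence back into chemical mentions. `B-CM` opens a mention, `I-CM` continues it, and `O` closes it. The function name is hypothetical.

```python
def iob_to_mentions(tokens, tags):
    """Group tokens tagged B-CM / I-CM into whole chemical mentions."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CM":
            # A new mention starts; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-CM" and current:
            current.append(token)
        else:
            # "O" (or a stray I-CM with nothing open) ends the current mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


solvent = iob_to_mentions(
    ["with", "dimethyl", "sulfoxide", "as", "solvent"],
    ["O", "B-CM", "I-CM", "O", "O"],
)
abbreviation = iob_to_mentions(
    ["Tetrahydrofuran", "(", "THF", ")"],
    ["B-CM", "O", "B-CM", "O"],
)
```

In the second example the full name and its abbreviation come out as two separate mentions, which is what lets an abbreviation detector pair them up afterwards.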

  10. Word Clusters
    • Brown clustering: Hierarchical clustering of words
    • Bitstring prefixes used as features in CRF to improve POS tagging and entity recognition performance
    000000000: ethanol, methanol, acetone, glycerol, brine
    0000000010: THF, DMSO, CH2Cl2, toluene, DMF
    0000000011: H2O, MeOH, EtOAc, hexane, TFA
    1111000001110: …, …, terminated
    1111000001111: conjugated, …, cross-linked, fused
    10011011100: NPs, NCs, NRs, NWs, NTs, NSs, NBs
    10011011101: …, nanocrystals, nanowires
    https://github.com/percyliang/brown-cluster
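
A sketch of how bitstring prefixes can become features. Words that share a long prefix sit close together in the cluster hierarchy, so prefixes of several lengths give a tagger both coarse and fine word classes. The cluster assignments are copied from this slide; the feature names and the prefix lengths (4, 6, 10) are assumptions for illustration.

```python
# Word -> cluster bitstring, taken from the examples on this slide.
CLUSTERS = {
    "THF": "0000000010",
    "DMSO": "0000000010",
    "MeOH": "0000000011",
    "hexane": "0000000011",
    "NPs": "10011011100",
    "nanowires": "10011011101",
}


def cluster_prefix_features(word, prefix_lengths=(4, 6, 10)):
    """Return one feature per prefix length, or no features for unknown words."""
    bits = CLUSTERS.get(word)
    if bits is None:
        return {}
    return {f"brown_{n}": bits[:n] for n in prefix_lengths}


thf = cluster_prefix_features("THF")
meoh = cluster_prefix_features("MeOH")
```

Here `THF` and `MeOH` agree on the 4- and 6-bit features and differ only at 10 bits, so a model can generalise from one solvent to the other while still telling the two clusters apart.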

  11. Rule-Based Parsing
    • Parsers transform tagged tokens into a tree structure
    • Rules defined in Python code from building blocks:
      • W: Match word text
      • T: Match POS tag or entity tag
      • R: Match regular expression
      • And, Or, ZeroOrMore, OneOrMore, Optional, …
    • Built in parsers: Chemical Names, Melting Points, Quantum Yield, Fluorescence lifetimes, NMR, UV-vis, IR spectra, etc.
    name = T('B-CM') + ZeroOrMore(T('I-CM'))
    label = R('[1-9][a-z]?') | R('[IVX]+')
    cem = name + W('(') + label + W(')')
    cem
    ├─ name: 1,2-dihydroxybenzene
    └─ label: 3a
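
To show how building blocks like these compose, here is a self-contained toy reimplementation. It borrows the names `W`, `T`, `R` and `ZeroOrMore` from this slide, but the classes below are an illustrative sketch, not ChemDataExtractor's implementation: they only report where a match ends, do not build the result tree, and match greedily without backtracking.

```python
import re


class Element:
    """Base class: match(tokens, i) returns the next position, or None on failure."""

    def __add__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)


class W(Element):
    """Match the exact text of one token."""

    def __init__(self, text):
        self.text = text

    def match(self, tokens, i):
        return i + 1 if i < len(tokens) and tokens[i][0] == self.text else None


class T(Element):
    """Match the tag of one token."""

    def __init__(self, tag):
        self.tag = tag

    def match(self, tokens, i):
        return i + 1 if i < len(tokens) and tokens[i][1] == self.tag else None


class R(Element):
    """Match one token's text against a regular expression (whole token)."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def match(self, tokens, i):
        return i + 1 if i < len(tokens) and self.regex.fullmatch(tokens[i][0]) else None


class And(Element):
    """Match both parts in sequence."""

    def __init__(self, left, right):
        self.parts = (left, right)

    def match(self, tokens, i):
        for part in self.parts:
            i = part.match(tokens, i)
            if i is None:
                return None
        return i


class Or(Element):
    """Match the first alternative that succeeds."""

    def __init__(self, left, right):
        self.options = (left, right)

    def match(self, tokens, i):
        for option in self.options:
            j = option.match(tokens, i)
            if j is not None:
                return j
        return None


class ZeroOrMore(Element):
    """Match the inner element as many times as possible, including zero."""

    def __init__(self, inner):
        self.inner = inner

    def match(self, tokens, i):
        while True:
            j = self.inner.match(tokens, i)
            if j is None or j == i:
                return i
            i = j


# The rules from this slide, applied to (text, entity tag) pairs.
name = T('B-CM') + ZeroOrMore(T('I-CM'))
label = R('[1-9][a-z]?') | R('[IVX]+')
cem = name + W('(') + label + W(')')

tagged = [("1,2-dihydroxybenzene", "B-CM"), ("(", "O"), ("3a", "O"), (")", "O")]
end = cem.match(tagged, 0)
```

Overloading `+` and `|` is what lets a rule read like a grammar while remaining ordinary Python, so rules can be named, reused and combined like any other object.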

  12. Table Processing
    • Process each table cell with a specialised natural language processing pipeline
    • Parse headings to determine column type and extract contextual information
    • Parse cells in each row to extract values
    • Merge contextual information from caption, footnotes, and headings
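
These steps can be sketched for a single column type, using the UV-vis heading and the 3a values from "The Goal" slide. The regular expressions, the function name and the assumption that the first column holds the compound label are all hypothetical simplifications.

```python
import re

# Hypothetical patterns for one column type: "λmax/nm (ε/M−1 cm−1)".
HEADER = re.compile(
    r"λ\s*max\s*/\s*(?P<wavelength_units>\w+)\s*\(\s*ε\s*/\s*(?P<extinction_units>[^)]+)\)"
)
CELL = re.compile(r"(?P<wavelength>\d+(?:\.\d+)?)\s*\(\s*(?P<extinction>[\d,]+)\s*\)")


def parse_uvvis_table(headers, rows, context):
    """Return one record per row; the first column is assumed to hold the label."""
    column, units = None, None
    # Step 1: the heading determines the column type and supplies the units.
    for index, heading in enumerate(headers):
        found = HEADER.search(heading)
        if found:
            column, units = index, found.groupdict()
    if column is None:
        return []
    records = []
    # Step 2: each cell in that column supplies the values.
    for row in rows:
        cell = CELL.fullmatch(row[column].strip())
        if cell is None:
            continue
        peak = {
            "wavelength": cell["wavelength"],
            "extinction": cell["extinction"].replace(",", ""),
            **units,
        }
        # Step 3: merge context (for example from the caption) into the record.
        records.append({"label": row[0], "uvvis": [{**context, "peaks": [peak]}]})
    return records


records = parse_uvvis_table(
    headers=["Dye", "λmax/nm (ε/M−1 cm−1)"],
    rows=[["3a", "448 (29,000)"]],
    context={"solvent": "acetonitrile"},
)
```

The units live in the heading and the solvent in the caption, so a cell such as `448 (29,000)` only becomes a complete measurement once all three sources are merged.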

  13. Data Schema
    Compound: name, label, role, melting points, uvvis spectra
    Melting Point: value, units
    UV-vis Spectrum: peaks, solvent, temperature, apparatus
    Peak: wavelength, extinction, shape
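
The schema on this slide maps naturally onto nested record types. The sketch below uses standard-library dataclasses with the field names from the slide; the class names and the choice of optional string fields are assumptions, and ChemDataExtractor's own model classes may be defined differently.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MeltingPoint:
    value: Optional[str] = None
    units: Optional[str] = None


@dataclass
class Peak:
    wavelength: Optional[str] = None
    extinction: Optional[str] = None
    shape: Optional[str] = None


@dataclass
class UvvisSpectrum:
    peaks: List[Peak] = field(default_factory=list)
    solvent: Optional[str] = None
    temperature: Optional[str] = None
    apparatus: Optional[str] = None


@dataclass
class Compound:
    name: Optional[str] = None
    label: Optional[str] = None
    role: Optional[str] = None
    melting_points: List[MeltingPoint] = field(default_factory=list)
    uvvis_spectra: List[UvvisSpectrum] = field(default_factory=list)


# Values taken from the example record on "The Goal" slide.
compound = Compound(
    name="2-[2-[4-(dimethylamino)phenyl]diazenyl]-benzoic acid",
    label="3a",
    uvvis_spectra=[
        UvvisSpectrum(
            peaks=[Peak(wavelength="448", extinction="29000")],
            solvent="acetonitrile",
        )
    ],
)
record = asdict(compound)
```

Every field is optional because a single sentence or table cell usually yields only a fragment of a record, and the fragments are combined later.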

  14. Interdependency Resolution
    • Assign orphan properties to relevant compound
    • Merge equivalent chemical identifiers to a single record
    • Assign global contextual information
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"}
    {"melting_points": [{"value": "280.7", "units": "°C"}]}
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"}
    {"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]}
    {"uv_vis": [{"solvent": "acetonitrile", "apparatus": "Agilent 8453 spectrometer"}]}
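
A minimal sketch of the first two steps, applied to the first four records on this slide. It uses two simple heuristics that are assumptions, not the toolkit's actual logic: records sharing a name or label are the same compound, and a record with no identifier belongs to the most recently seen compound. Assigning global context, such as the apparatus record, is left out.

```python
import copy


def _merge_into(target, record):
    """Copy fields into target; list fields are extended without duplicates."""
    for key, value in record.items():
        if isinstance(value, list):
            existing = target.setdefault(key, [])
            existing.extend(copy.deepcopy(v) for v in value if v not in existing)
        else:
            target.setdefault(key, value)


def resolve(records):
    compounds = []
    last = None
    for record in records:
        ids = {record.get("name"), record.get("label")} - {None}
        if ids:
            # Merge with an existing compound that shares a name or label.
            match = next(
                (c for c in compounds if ids & {c.get("name"), c.get("label")}), None
            )
            if match is None:
                match = {}
                compounds.append(match)
            _merge_into(match, record)
            last = match
        elif last is not None:
            # Orphan property: attach it to the most recent compound.
            _merge_into(last, record)
        # Orphans seen before any compound are dropped in this sketch.
    return compounds


resolved = resolve([
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "role": "product"},
    {"melting_points": [{"value": "280.7", "units": "°C"}]},
    {"name": "4-(10-phenylanthracene-9-yl)pyridine", "label": "3a"},
    {"label": "3a", "melting_points": [{"value": "280.7", "units": "°C"}]},
])
```

Four partial records collapse into one compound carrying the name, label, role and a single melting point, since the duplicate measurement is recognised and not added twice.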

  15-18. chemdataextractor.org

  19.
    • Expertise
      • Fragments and structure-based drug discovery (Protein Science, Structural Biology, Chemistry)
    • Therapeutic areas
      • Oncology, CNS, infectious diseases
    • Location
      • Based in Granta Park, just outside Cambridge

  20. Text Mining at Vernalis
    • Patents: USPTO, WIPO and EPO
    • NextMove Software: spelling correction, translation, reaction extraction
    • PatSearch: In-house patent searching and alerting platform
    • PatExplore: Patent chemistry explorer

  21. PatSearch
    • Python Flask web app with Celery task queues for downloading, processing, indexing patents
    • ElasticSearch index with ReactJS frontend
    • Extract entities using LeadMine and store in database
    • From publication to a scientist’s inbox in under 1 hour

  25. ChemDataExtractor Info
    • chemdataextractor.org - Try it out!
    • Python package:
      $ pip install chemdataextractor
      $ conda install -c chemdataextractor chemdataextractor
    • JCIM Article:
      ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature
      Matthew C. Swain and Jacqueline M. Cole*
      Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
      ABSTRACT: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate …
      pubs.acs.org/jcim