all input formats into a consistent internal document model
• Publisher-specific (ACS, RSC, USPTO) and generic format readers
• Result: simplified linear stream of document elements
[Diagram: PDF, HTML, and XML inputs pass through Document Readers to give a Document made up of Title, Abstract, Heading, Paragraph, Table, … elements]
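A "linear stream of document elements" can be pictured as an ordered list of typed text blocks. The sketch below is only an illustration of that idea: the class names mirror the element types shown on the slide, while `read_plain_text` and its heading heuristic are hypothetical and are not taken from the toolkit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """One item in the linear stream (hypothetical base class)."""
    text: str


class Title(Element):
    """Document title."""


class Heading(Element):
    """Section heading."""


class Paragraph(Element):
    """Body text."""


@dataclass
class Document:
    """A document is an ordered list of elements, with no nesting."""
    elements: List[Element] = field(default_factory=list)


def read_plain_text(raw: str) -> Document:
    """Hypothetical generic reader for blank-line separated text.

    Assumed heuristic: the first block is the title, short single-line
    blocks without a trailing full stop are headings, the rest are paragraphs.
    """
    blocks = [b.strip() for b in raw.split("\n\n") if b.strip()]
    doc = Document()
    for i, block in enumerate(blocks):
        is_short_line = "\n" not in block and len(block.split()) <= 6
        if i == 0:
            doc.elements.append(Title(block))
        elif is_short_line and not block.endswith("."):
            doc.elements.append(Heading(block))
        else:
            doc.elements.append(Paragraph(block))
    return doc
```

A publisher-specific reader would differ only in how it finds the blocks (for example from HTML or XML tags); the output type stays the same, which is what lets later stages ignore the input format.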
• All text-based elements make use of the same natural language processing pipeline:
  Sentence Splitting → Word Tokenization → Part-of-speech Tagging → Entity Recognition → Phrase Parsing
• Variety of techniques: Rules, Dictionaries, Machine Learning
sentences using Punkt method
  Examples (abbreviations containing full stops): Fig. 3 · m.p. 185 °C · M. Swain et al. · equiv.
• Custom rule-based word tokenizer
  ~1.8mg → ~ | 1.8 | mg
  (+/−)-chiraline → (+/−)-chiraline
  buffer (pH7.4) → buffer | ( | pH | 7.4 | )
• Normalization performed on the content of each token
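The tokenization examples on this slide can be reproduced with a handful of rules. The following is a minimal sketch, not the toolkit's tokenizer: the three rules (peel a wrapping bracket pair, peel a leading "~", split a number from an attached unit or from "pH") are assumptions chosen to cover only the slide's examples.

```python
import re

# Number directly followed by a unit, e.g. "1.8mg" -> "1.8", "mg"
NUMBER_UNIT = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-zµ°%]+)$")
# "pH" directly followed by a number, e.g. "pH7.4" -> "pH", "7.4"
PH_VALUE = re.compile(r"^(pH)(\d+(?:\.\d+)?)$")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    if not chunk:
        return []
    inner = chunk[1:-1]
    # Peel off a bracket pair that wraps the whole chunk, e.g. "(pH7.4)".
    # "(+/−)-chiraline" is left whole because it does not end with ")".
    if (len(chunk) > 2 and chunk[0] == "(" and chunk[-1] == ")"
            and "(" not in inner and ")" not in inner):
        return ["("] + split_chunk(inner) + [")"]
    # Peel off a leading approximation sign, e.g. "~1.8mg".
    if chunk[0] == "~" and len(chunk) > 1:
        return ["~"] + split_chunk(chunk[1:])
    # Separate a value from the unit or label glued to it.
    for pattern in (NUMBER_UNIT, PH_VALUE):
        match = pattern.match(chunk)
        if match:
            return list(match.groups())
    return [chunk]


def tokenize(sentence):
    """Whitespace split followed by the rule-based refinements above."""
    tokens = []
    for chunk in sentence.split():
        tokens.extend(split_chunk(chunk))
    return tokens
```

The point of the bracket rule is that chemical names may legitimately contain brackets, so a tokenizer for this domain cannot simply split on every "(" and ")".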
Tagging chemical names using “IOB” scheme:
• Combined dictionary, machine learning and regular expression methods
• Lowe & Sayle (2015) J. Cheminf. - Dictionary creation techniques
• Abbreviation Detection

  with   dimethyl   sulfoxide   as   solvent
  O      B-CM       I-CM        O    O

  Tetrahydrofuran   (   THF    )
  B-CM              O   B-CM   O
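In the IOB scheme, B-CM marks the first token of a chemical mention, I-CM marks a continuation, and O marks everything else. A short sketch of how such tags can be turned back into mention strings follows; the function name `iob_to_spans` is hypothetical and the code only illustrates the tagging scheme, not how the tags are predicted.

```python
def iob_to_spans(tokens, tags, entity="CM"):
    """Group B-/I- tagged tokens into entity mention strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{entity}":
            # A new mention starts; close any mention still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{entity}" and current:
            # Continuation of the open mention.
            current.append(token)
        else:
            # Outside any mention (or a stray I- tag): close the open one.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# The two tagged examples from the slide.
solvent = iob_to_spans(
    ["with", "dimethyl", "sulfoxide", "as", "solvent"],
    ["O", "B-CM", "I-CM", "O", "O"],
)
abbreviation = iob_to_spans(
    ["Tetrahydrofuran", "(", "THF", ")"],
    ["B-CM", "O", "B-CM", "O"],
)
```

Here `solvent` is `["dimethyl sulfoxide"]` and `abbreviation` is `["Tetrahydrofuran", "THF"]`: the full name and its abbreviation are tagged as two separate mentions, which is what an abbreviation-detection step can then link.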
Parsers transform tagged tokens into a tree structure
• Rules defined in Python code from building blocks:
  • W: Match word text
  • T: Match POS tag or entity tag
  • R: Match regular expression
  • And, Or, ZeroOrMore, OneOrMore, Optional, …
• Built-in parsers: Chemical Names, Melting Points, Quantum Yield, Fluorescence lifetimes, NMR, UV-vis, IR spectra, etc.

  name = T('B-CM') + ZeroOrMore(T('I-CM'))
  label = R('[1-9][a-z]?') | R('[IVX]+')
  cem = name + W('(') + label + W(')')

  Example tree: cem → name (1,2-dihydroxybenzene), label (3a)
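To make the rule syntax concrete, the building blocks can be imitated in a few dozen lines of plain Python. This is a self-contained sketch and not ChemDataExtractor's implementation: tokens are assumed to be `(text, tag)` pairs, `R` is assumed to match the whole token text, and the `named` helper that creates tree nodes is hypothetical.

```python
import re


class Parser:
    """Subclasses implement match(tokens, i) -> (items, next_i) or None."""

    def __add__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)

    def named(self, name):
        # Hypothetical helper: wrap whatever this parser matches in a tree node.
        return Group(self, name)


class _TokenParser(Parser):
    """Matches exactly one token if test(token) is true."""

    def match(self, tokens, i):
        if i < len(tokens) and self.test(tokens[i]):
            return [tokens[i][0]], i + 1
        return None


class W(_TokenParser):
    def __init__(self, text):
        self.test = lambda tok: tok[0] == text  # match word text


class T(_TokenParser):
    def __init__(self, tag):
        self.test = lambda tok: tok[1] == tag  # match POS or entity tag


class R(_TokenParser):
    def __init__(self, pattern):
        regex = re.compile(pattern)
        self.test = lambda tok: regex.fullmatch(tok[0]) is not None


class And(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, tokens, i):
        first = self.left.match(tokens, i)
        if first is None:
            return None
        second = self.right.match(tokens, first[1])
        if second is None:
            return None
        return first[0] + second[0], second[1]


class Or(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, tokens, i):
        return self.left.match(tokens, i) or self.right.match(tokens, i)


class ZeroOrMore(Parser):
    def __init__(self, inner):
        self.inner = inner

    def match(self, tokens, i):
        items = []
        while True:
            result = self.inner.match(tokens, i)
            if result is None or result[1] == i:
                return items, i
            items, i = items + result[0], result[1]


class Group(Parser):
    def __init__(self, inner, name):
        self.inner, self.name = inner, name

    def match(self, tokens, i):
        result = self.inner.match(tokens, i)
        return None if result is None else ([(self.name, result[0])], result[1])


# The rules from the slide, with names attached to build the tree.
name = (T('B-CM') + ZeroOrMore(T('I-CM'))).named('name')
label = (R('[1-9][a-z]?') | R('[IVX]+')).named('label')
cem = (name + W('(') + label + W(')')).named('cem')

tokens = [('1,2-dihydroxybenzene', 'B-CM'), ('(', 'O'), ('3a', 'O'), (')', 'O')]
tree, end = cem.match(tokens, 0)
```

For the slide's example, `tree` is a single `cem` node containing a `name` node (1,2-dihydroxybenzene), the two bracket tokens, and a `label` node (3a). Overloading `+` and `|` is what lets grammar rules read almost like the notation on the slide.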
each table cell with a specialised natural language processing pipeline
• Parse headings to determine column type and extract contextual information
• Parse cells in each row to extract values
• Merge contextual information from caption, footnotes, and headings
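The three steps above (classify headings, parse cells, merge context) can be sketched as follows. Everything here is an assumption made for illustration: the heading rules, the unit pattern, the `parse_heading` and `parse_table` names, and the example rows are invented and cover only a name column and a melting point column.

```python
import re

# Hypothetical heading rules: pattern -> column type
HEADING_RULES = [
    (re.compile(r"^(compound|name)$", re.I), "name"),
    (re.compile(r"^(m\.?p\.?|melting point)", re.I), "melting_point"),
]
# Units given in the heading, e.g. "m.p. (°C)" or "Melting point / K"
UNITS = re.compile(r"[(/\[]\s*(°C|K)\s*[)\]]?")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_heading(heading):
    """Return (column_type, units) for a heading; either may be None."""
    heading = heading.strip()
    units = UNITS.search(heading)
    for pattern, column_type in HEADING_RULES:
        if pattern.search(heading):
            return column_type, units.group(1) if units else None
    return None, None


def parse_table(headings, rows, caption_context=None):
    """Turn table rows into records, merging heading units and caption context."""
    columns = [parse_heading(h) for h in headings]
    records = []
    for row in rows:
        # Context from the caption or footnotes applies to every row.
        record = dict(caption_context or {})
        for (column_type, units), cell in zip(columns, row):
            if column_type == "name":
                record["name"] = cell.strip()
            elif column_type == "melting_point":
                number = NUMBER.search(cell)
                if number:
                    record["melting_point"] = {
                        "value": float(number.group()),
                        "units": units,  # units come from the heading
                    }
        records.append(record)
    return records


# Made-up example data, for illustration only.
records = parse_table(
    ["Compound", "m.p. (°C)"],
    [["3a", "185"], ["3b", "n.d."]],
    caption_context={"apparatus": "capillary"},
)
```

The cell "185" carries no units on its own; they are recovered from the heading, which is why headings are parsed before cells.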
[Diagram: structure of an extracted record]
• Top-level record: name, label, role, melting points, uvvis spectra
• Melting Point: value, units
• UV-vis Spectrum: peaks, solvent, temperature, apparatus
• Peak: wavelength, extinction, shape
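The field names on this slide suggest a nested record: a compound holds lists of melting points and UV-vis spectra, and each spectrum holds a list of peaks. A minimal sketch with dataclasses is given below. The class names, the choice of string-valued fields, and the assumption that the top-level record is a compound are illustrative guesses, and the example values are placeholders rather than real measurements.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Peak:
    wavelength: Optional[str] = None
    extinction: Optional[str] = None
    shape: Optional[str] = None


@dataclass
class UvvisSpectrum:
    peaks: List[Peak] = field(default_factory=list)
    solvent: Optional[str] = None
    temperature: Optional[str] = None
    apparatus: Optional[str] = None


@dataclass
class MeltingPoint:
    value: Optional[str] = None
    units: Optional[str] = None


@dataclass
class Compound:
    name: Optional[str] = None
    label: Optional[str] = None
    role: Optional[str] = None
    melting_points: List[MeltingPoint] = field(default_factory=list)
    uvvis_spectra: List[UvvisSpectrum] = field(default_factory=list)


# Placeholder values for illustration only.
record = Compound(
    name="example compound",
    label="3a",
    melting_points=[MeltingPoint(value="185", units="°C")],
)
as_dict = asdict(record)  # nested dicts and lists, ready for JSON output
```

Keeping every field optional reflects how extraction works in practice: a sentence or table row usually supplies only part of a record, and the parts are merged later.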
and structure-based drug discovery (Protein Science, Structural Biology, Chemistry)
• Therapeutic areas
  • Oncology, CNS, infectious diseases
• Location
  • Based in Granta Park, just outside Cambridge
web app with Celery task queues for downloading, processing, indexing patents
• Elasticsearch index with ReactJS frontend
• Extract entities using LeadMine and store in database
• From publication to a scientist’s inbox in under 1 hour
package:
  $ pip install chemdataextractor
  $ conda install -c chemdataextractor chemdataextractor
• JCIM Article (pubs.acs.org/jcim):
  ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature
  Matthew C. Swain and Jacqueline M. Cole*
  Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
  ABSTRACT: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate