
Using Text Analytics for Health to Extract Insights from COVID Papers Dataset

In this presentation, I show how different Azure resources can be used to extract insights from a dataset of COVID scientific papers. The process includes the following:
- Extracting entities using Text Analytics for Health on Azure Batch
- Storing semi-structured data in Cosmos DB and running SQL queries
- Using SQL queries in Cosmos DB Notebooks to gain further insights into the data
- Adding a Power BI dashboard for no-code exploration of the data

Dmitri Soshnikov

March 26, 2021

Transcript

  1. Using Text Analytics for Health to Extract Insights from CORD COVID Papers Dataset

    Dmitry Soshnikov, Cloud Developer Advocate, Microsoft, http://soshnikov.com
    Victoria Soshnikova, Student, Phystech Lyceum, [email protected]
  2. CORD Papers Dataset

    CORD-19 Dataset: https://allenai.org/data/cord-19
    Contains over 400,000 scholarly articles about COVID-19 and the coronavirus family of viruses for use by the global research community
    200,000 articles with full text
    https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
  3. Main Idea: Use NLP tools to extract semi-structured data from papers, to enable semantically rich queries over the paper corpus.

    Diagram labels: CORD Corpus; Text Analytics for Health (NER, Relations); Extracted JSON; Cosmos DB Database; Power BI Dashboard; SQL Queries; Azure Semantic Search
  4. Text Analytics for Health (Preview)

    • Currently in Preview
    • Gated service, need to apply for usage (apply at https://aka.ms/csgate)
    • Should not be implemented or deployed in any production use
    • Can be used through Web API or Container Service
    • Supports:
        • Named Entity Recognition (NER)
        • Relation Extraction
        • Entity Linking (Ontology Mapping)
        • Negation Detection
  5. Using Text Analytics for Health

    Pip install the Azure Text Analytics SDK:

        pip install azure.ai.textanalytics==5.1.0b5

    Create the client:

        from azure.core.credentials import AzureKeyCredential
        from azure.ai.textanalytics import TextAnalyticsClient
        client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key), api_version="v3.1-preview.3")

    Do the call:

        documents = ["I have not been administered any aspirin, just 300 mg or favipiravir daily."]
        poller = client.begin_analyze_healthcare_entities(documents)
        result = poller.result()
  6. Analysis Result

    Input: I have not been administered any aspirin, just 300 mg or favipiravir daily.

    Raw entity:

        HealthcareEntity(text=300 mg, category=Dosage, subcategory=None, length=6, offset=47, confidence_score=1.0, data_sources=None,
          related_entities={HealthcareEntity(text=favipiravir, category=MedicationName, subcategory=None, length=11, offset=57, confidence_score=1.0,
            data_sources=[HealthcareEntityDataSource(entity_id=C1138226, name=UMLS),
              HealthcareEntityDataSource(entity_id=J05AX27, name=ATC),
              HealthcareEntityDataSource(entity_id=DB12466, name=DRUGBANK),
              HealthcareEntityDataSource(entity_id=398131, name=MEDCIN),
              HealthcareEntityDataSource(entity_id=C462182, name=MSH),
              HealthcareEntityDataSource(entity_id=C81605, name=NCI),
              HealthcareEntityDataSource(entity_id=EW5GL2X7E0, name=NCI_FDA)],
            related_entities={}): 'DosageOfMedication'})

    Summary:

        aspirin (C0004057) [MedicationName]
        300 mg [Dosage] --DosageOfMedication--> favipiravir (C1138226) [MedicationName]
        favipiravir (C1138226) [MedicationName]
        daily [Frequency] --FrequencyOfMedication--> favipiravir (C1138226) [MedicationName]
  7. Analyzing CORD Abstracts

    • All abstracts contained in CSV metadata file
    • Split 400k papers into chunks of 500
        • Id, Title, Journal, Authors, Publication Date
    • Shuffle by date in order to get representative sample in each chunk
    • Enrich each JSON file with text analytics data
        • Entities, Relations
    • Parallel processing using VM pool / Azure Batch
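The shuffle-and-split step on this slide could be sketched roughly as follows. This is only an illustration, not the authors' code: the function name `split_into_chunks`, the fixed seed, and the sample records are all assumptions, and the shuffle is taken to be a plain random shuffle so that publication dates end up mixed across chunks.

```python
import random

def split_into_chunks(papers, chunk_size=500, seed=0):
    """Shuffle the papers, then cut the list into chunks of at most chunk_size."""
    shuffled = list(papers)                 # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded only to make the chunks reproducible
    return [shuffled[i:i + chunk_size] for i in range(0, len(shuffled), chunk_size)]

# Hypothetical stand-in for rows read from the CSV metadata file.
papers = [{"id": f"paper{i}", "title": f"Title {i}"} for i in range(1200)]

chunks = split_into_chunks(papers)
print(len(chunks), [len(c) for c in chunks])   # 3 [500, 500, 200]
```

Each chunk can then be written to its own JSON file and handed to a separate worker, which is what makes the VM pool / Azure Batch step straightforward to parallelize.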
  8. Results of Text Analytics Processing

        {
          "gh690dai": {
            "id": "gh690dai",
            "title": "Beef and Pork Marketing Margins and Price Spreads during COVID-19",
            "authors": "Lusk, Jayson L.; Tonsor, Glynn T.; Schulz, Lee L.",
            "journal": "Appl Econ Perspect Policy",
            "abstract": "...",
            "publish_time": "2020-10-02",
            "entities": [
              { "offset": 0, "length": 16, "text": "COVID-19-related", "category": "Diagnosis", "confidenceScore": 0.79, "isNegated": false },
              ..
            ],
            "relations": [
              { "relationType": "TimeOfTreatment", "bidirectional": false,
                "source": { "uri": "#/documents/0/entities/15", "text": "previous year", "category": "Time", "isNegated": false, "offset": 704 },
                "target": { "uri": "#/documents/0/entities/13", "text": "beef", "category": "TreatmentName", "isNegated": false, "offset": 642 } }
            ]
          },
          …
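To make the structure of the enriched JSON concrete, here is a minimal sketch that walks the `relations` array and prints each relation in the arrow notation used on the Analysis Result slide. The helper name `relation_lines` is an assumption, and the sample is a trimmed copy of the record above containing only the fields the sketch reads.

```python
import json

# Trimmed copy of the record shown on the slide (only the fields used below).
enriched = json.loads("""
{"gh690dai": {
  "id": "gh690dai",
  "title": "Beef and Pork Marketing Margins and Price Spreads during COVID-19",
  "relations": [
    {"relationType": "TimeOfTreatment", "bidirectional": false,
     "source": {"text": "previous year", "category": "Time"},
     "target": {"text": "beef", "category": "TreatmentName"}}
  ]
}}
""")

def relation_lines(papers):
    """Yield one 'source --relation--> target' line per extracted relation."""
    for paper in papers.values():
        for rel in paper.get("relations", []):   # papers without relations are skipped
            src, tgt = rel["source"], rel["target"]
            yield (f'{src["text"]} [{src["category"]}] --{rel["relationType"]}--> '
                   f'{tgt["text"]} [{tgt["category"]}]')

lines = list(relation_lines(enriched))
for line in lines:
    print(line)   # previous year [Time] --TimeOfTreatment--> beef [TreatmentName]
```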
  9. Storing Semi-Structured Data into Cosmos DB

    • Cosmos DB – NoSQL universal solution
    • Querying semi-structured data with SQL-like language

    Diagram labels: Collection; Paper; Entity; Relation
  10. Cosmos DB SQL Queries

    Get mentioned dosages of a particular medication and the papers they are mentioned in:

        SELECT p.title, r.source.text
        FROM papers p JOIN r IN p.relations
        WHERE r.relationType='DosageOfMedication'
          AND CONTAINS(r.target.text,'hydro')
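For readers less familiar with the intra-document `JOIN` in Cosmos DB SQL, the query above behaves roughly like a nested loop over each paper and its own `relations` array. The sketch below mirrors that logic in plain Python over in-memory documents; it does not call Cosmos DB, and the function name `dosages_of` and the sample papers are made up for illustration.

```python
def dosages_of(papers, name_fragment):
    """Plain-Python equivalent of the dosage query over a list of paper documents."""
    rows = []
    for p in papers:                                   # FROM papers p
        for r in p.get("relations", []):               # JOIN r IN p.relations
            if (r["relationType"] == "DosageOfMedication"
                    and name_fragment in r["target"]["text"]):  # CONTAINS, case-sensitive
                # SELECT p.title, r.source.text
                rows.append({"title": p["title"], "text": r["source"]["text"]})
    return rows

# Made-up sample documents in the same shape as the enriched JSON.
papers = [
    {"title": "Paper A", "relations": [
        {"relationType": "DosageOfMedication",
         "source": {"text": "400 mg"}, "target": {"text": "hydroxychloroquine"}},
        {"relationType": "FrequencyOfMedication",
         "source": {"text": "daily"}, "target": {"text": "hydroxychloroquine"}},
    ]},
    {"title": "Paper B", "relations": [
        {"relationType": "DosageOfMedication",
         "source": {"text": "300 mg"}, "target": {"text": "favipiravir"}},
    ]},
]

result = dosages_of(papers, "hydro")
print(result)   # [{'title': 'Paper A', 'text': '400 mg'}]
```

Each matching relation produces its own output row, so a paper that mentions several dosages appears several times in the result.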
  11. Further Exploration: Jupyter in Cosmos DB

    • SQL in Cosmos DB is somewhat limited
    • Good strategy: make the query in Cosmos DB, export to a Pandas DataFrame, do the final exploration in Python
    • Jupyter support is built into Cosmos DB
    • Makes exporting query results to a DataFrame easy!

        %%sql --database CORD --container Papers --output meds
        SELECT e.text, e.isNegated, p.title, p.publish_time,
               ARRAY (SELECT VALUE l.id FROM l IN e.links WHERE l.dataSource='UMLS')[0] AS umls_id
        FROM papers p JOIN e IN p.entities
        WHERE e.category = 'MedicationName'
  12. Unique Medications by UMLS ID

    Top 15 by count:

        {
          'C0020336': 'hydroxychloroquine',
          'C0008269': 'chloroquine',
          'C1609165': 'Tocilizumab',
          'C4726677': 'remdesivir',
          'C0052796': 'azithromycin',
          'C0674432': 'lopinavir',
          'C0292818': 'ritonavir',
          'C0042866': 'vitamin D',
          'C0011777': 'dexamethasone',
          'C0019134': 'Heparin',
          'C1138226': 'favipiravir',
          'C0021641': 'insulin',
          'C0030106': 'ozone',
          'C0025815': 'methylprednisolone',
          'C0023810': 'lipopolysaccharide'
        }
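One way such a table could be derived from the medication rows returned by the notebook query is sketched below, using only the standard library rather than Pandas. The function name `top_medications`, the choice of the most frequent spelling as the display name, and the sample rows are all assumptions for illustration; the slides do not show how the original table was computed.

```python
from collections import Counter, defaultdict

def top_medications(rows, n=15):
    """Map the n most-mentioned UMLS IDs to their most frequent surface form."""
    mentions = Counter()             # UMLS ID -> number of mentions
    names = defaultdict(Counter)     # UMLS ID -> counts of each spelling seen
    for row in rows:
        umls_id = row.get("umls_id")
        if umls_id is None:          # entity had no UMLS link; skip it
            continue
        mentions[umls_id] += 1
        names[umls_id][row["text"]] += 1
    return {uid: names[uid].most_common(1)[0][0]
            for uid, _ in mentions.most_common(n)}

# Made-up rows in the shape produced by the medication query.
rows = [
    {"text": "hydroxychloroquine", "umls_id": "C0020336"},
    {"text": "hydroxychloroquine", "umls_id": "C0020336"},
    {"text": "Hydroxychloroquine", "umls_id": "C0020336"},
    {"text": "chloroquine", "umls_id": "C0008269"},
    {"text": "chloroquine", "umls_id": "C0008269"},
    {"text": "remdesivir", "umls_id": "C4726677"},
    {"text": "unlinked drug"},       # no UMLS ID available
]

top2 = top_medications(rows, n=2)
print(top2)   # {'C0020336': 'hydroxychloroquine', 'C0008269': 'chloroquine'}
```

Grouping by UMLS ID rather than by raw text is what lets differently spelled or capitalized mentions of the same drug count as one medication.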
  13. Conclusions

    ❶ Text mining of medical texts can be a very valuable resource for gaining insights into a large text corpus.
    ❷ A range of Microsoft technologies can be used to effectively make this a reality:
        • Text Analytics for Health to do NER and ontology mapping
        • Azure Batch / VM Pool to perform large-scale processing
        • Cosmos DB to store and query semi-structured data
        • Power BI to explore the data interactively to gain insights
        • Cosmos DB Jupyter Notebooks to do a deep dive into the data with Python