Prediction
During holidays, I like to ______ with my dog. It is so cute.
• Masked word: Play 0.85, Sleep 0.05, Fight 0.09
• Next sentence: YES 0.80, NO 0.20
BERT-large contains about 340 million parameters, so it is very difficult to train from scratch. In most cases it makes sense to use a pre-trained language model.
papers, to enable semantically rich queries over the paper corpus.
[Architecture diagram. Components: CORD Corpus; Text Analytics for Health (NER, Relations); Extracted JSON; Cosmos DB Database; SQL Queries; Power BI Dashboard; Azure Semantic Search]
Kaggle Medical NER:
• ~40 papers
• ~300 entities
Generic BC5CDR Dataset
• 1500 papers
• 5000 entities
• Disease / Chemical
Generic BERT Model
Pre-training BERT on Medical texts
PubMedBERT pre-trained model by Microsoft Research
Hugging Face Transformers library: https://huggingface.co/
infant.
6794356|a|A newborn with massive tricuspid regurgitation, atrial flutter, congestive heart failure, and a high serum lithium level is described. This is the first patient to initially manifest tricuspid regurgitation and atrial flutter, and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy. Sixty-three percent of these infants had tricuspid valve involvement. Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy. It also causes neurologic depression, cyanosis, and cardiac arrhythmia when consumed prior to delivery.
6794356 0 29 Tricuspid valve regurgitation Disease D014262
6794356 34 51 lithium carbonate Chemical D016651
6794356 52 60 toxicity Disease D064420
6794356 105 128 tricuspid regurgitation Disease D014262
6794356 130 144 atrial flutter Disease D001282
6794356 146 170 congestive heart failure Disease D006333
6794356 189 196 lithium Chemical D008094
6794356 265 288 tricuspid regurgitation Disease D014262
6794356 293 307 atrial flutter Disease D001282
6794356 345 360 cardiac disease Disease D006331
6794356 386 393 lithium Chemical D008094
6794356 511 528 Lithium carbonate Chemical D016651
6794356 576 600 congenital heart disease Disease D006331
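The record above follows the PubTator layout used by BC5CDR: a `|t|` title line, an `|a|` abstract line, then one annotation per line with document ID, start offset, end offset, mention, category and concept ID. The sketch below shows one way such records could be parsed and turned into BIO tags for NER training. It assumes tab-separated annotation fields and offsets counted over the title, one separator character, and the abstract (which is what the offsets in the sample imply). The names `Entity`, `Document`, `parse_pubtator`, `to_bio` and the toy `SAMPLE` record are illustrative, not part of the dataset or of any library.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    start: int
    end: int
    text: str
    category: str
    concept_id: str


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    entities: List[Entity] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Assumption: offsets run over title + one separator char + abstract.
        return f"{self.title} {self.abstract}"


def parse_pubtator(raw: str) -> Dict[str, Document]:
    """Parse PubTator-style text into documents keyed by document ID."""
    docs: Dict[str, Document] = {}
    for line in raw.splitlines():
        if not line.strip():
            continue
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            doc = docs.setdefault(head[0], Document(head[0]))
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 6:  # entity annotation; shorter lines are skipped
            pmid, start, end, mention, category, concept_id = fields[:6]
            doc = docs.setdefault(pmid, Document(pmid))
            doc.entities.append(
                Entity(int(start), int(end), mention, category, concept_id)
            )
    return docs


def to_bio(doc: Document) -> List[Tuple[str, str]]:
    """Whitespace-tokenize the text and assign B-/I-/O tags by span overlap."""
    tagged = []
    for m in re.finditer(r"\S+", doc.text):
        label = "O"
        for e in doc.entities:
            if m.start() < e.end and m.end() > e.start:
                prefix = "B" if m.start() <= e.start else "I"
                label = f"{prefix}-{e.category}"
                break
        tagged.append((m.group(), label))
    return tagged


# Hypothetical toy record (not taken from BC5CDR), using concept IDs
# that appear in the sample above.
SAMPLE = (
    "1|t|Lithium toxicity in a newborn.\n"
    "1|a|Lithium carbonate was given.\n"
    "1\t0\t7\tLithium\tChemical\tD008094\n"
    "1\t8\t16\ttoxicity\tDisease\tD064420\n"
    "1\t31\t48\tLithium carbonate\tChemical\tD016651\n"
)

if __name__ == "__main__":
    for token, tag in to_bio(parse_pubtator(SAMPLE)["1"]):
        print(token, tag)
```

Whitespace tokenization is used here only to keep the sketch short; a BERT-style model would need the tags re-aligned to its own subword tokens.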
• Some other categories would be helpful (pharmacokinetics, biologic fluids, etc.)
• Common entities are also needed (quantity, temperature, etc.)
Get trained model:
$ az ml job download -n $ID --outputs
SDK:
pip install azure-ai-textanalytics
Create the client:
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
# endpoint and key come from your Text Analytics resource
client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key), api_version="v3.1")
Do the call:
documents = ["I have not been administered any aspirin, just 300 mg of favipiravir daily."]
# Long-running operation: start it, then wait for the result
poller = client.begin_analyze_healthcare_entities(documents)
result = poller.result()
file
• Split 400k papers into chunks of 500
• Id, Title, Journal, Authors, Publication Date
• Shuffle by date in order to get a representative sample in each chunk
• Enrich each JSON file with text analytics data
• Entities, Relations
• Parallel processing using Azure ML
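The chunking step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the paper records are already loaded into a list, and it uses a seeded random shuffle as one simple way to mix publication dates across chunks. The function name `shuffled_chunks` and its parameters are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_chunks(
    items: Sequence[T], chunk_size: int = 500, seed: int = 0
) -> Iterator[List[T]]:
    """Shuffle a copy of the items, then yield consecutive fixed-size chunks.

    Shuffling first means each chunk mixes papers from different dates,
    so any single chunk is a roughly representative sample.
    """
    order = list(items)
    random.Random(seed).shuffle(order)  # seeded for reproducible chunks
    for i in range(0, len(order), chunk_size):
        yield order[i:i + chunk_size]


if __name__ == "__main__":
    paper_ids = list(range(1200))  # stand-in for real paper records
    sizes = [len(chunk) for chunk in shuffled_chunks(paper_ids)]
    print(sizes)
```

With 400k papers and chunks of 500 this gives 800 chunks, each of which can be handed to a separate parallel worker.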
fast read and write latencies globally, and throughput and consistency all backed by SLAs
• Multi-region writes and data distribution to any Azure region with the click of a button.
• Independently and elastically scale storage and throughput across any Azure region – even during unpredictable traffic bursts – for unlimited scale worldwide.
is somewhat limited
Good strategy: run the query in Cosmos DB, export to a Pandas DataFrame, do the final exploration in Python
Jupyter support is built into Cosmos DB
Makes exporting query results to a DataFrame easy!
%%sql --database CORD --container Papers --output meds
SELECT e.text, e.isNegated, p.title, p.publish_time,
    ARRAY (SELECT VALUE l.id FROM l IN e.links WHERE l.dataSource='UMLS') AS umls_id
FROM papers p
JOIN e IN p.entities
WHERE e.category = 'MedicationName'
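As an illustration of the "final exploration in Python" step, the sketch below counts non-negated medication mentions per month. It assumes each result row is a dict with the columns selected in the query (`text`, `isNegated`, `title`, `publish_time`, `umls_id`) and that `publish_time` is a `YYYY-MM-DD` string. The sample rows are made up, and the function `mentions_by_month` is hypothetical. In the notebook itself a DataFrame group-by would be the natural equivalent; the standard library is used here only to keep the example self-contained.

```python
from collections import Counter
from typing import Dict, Iterable


def mentions_by_month(rows: Iterable[Dict]) -> Counter:
    """Count non-negated medication mentions keyed by (YYYY-MM, medication)."""
    counts: Counter = Counter()
    for row in rows:
        if row.get("isNegated"):
            continue  # skip mentions such as "no aspirin was given"
        month = str(row.get("publish_time", ""))[:7]  # assumes "YYYY-MM-DD"
        counts[(month, row["text"].lower())] += 1
    return counts


# Hypothetical rows shaped like the query output (not real results).
SAMPLE_ROWS = [
    {"text": "aspirin", "isNegated": True, "title": "Paper A",
     "publish_time": "2020-03-01", "umls_id": []},
    {"text": "Favipiravir", "isNegated": False, "title": "Paper A",
     "publish_time": "2020-03-01", "umls_id": []},
    {"text": "favipiravir", "isNegated": False, "title": "Paper B",
     "publish_time": "2020-03-15", "umls_id": []},
    {"text": "aspirin", "isNegated": False, "title": "Paper C",
     "publish_time": "2020-04-02", "umls_id": []},
]

if __name__ == "__main__":
    for (month, med), n in sorted(mentions_by_month(SAMPLE_ROWS).items()):
        print(month, med, n)
```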
• Connect to data, including multiple data sources.
• Shape the data with queries that build insightful, compelling data models.
• Use the data models to create visualizations and reports.
• Share your report files for others to leverage, build upon, and share.
resource for gaining insights into a large text corpus.
❷ A range of Microsoft technologies can be used to make this a reality:
• Azure ML for custom NER training / parallel sweep jobs
• Text Analytics for Health to do NER and ontology mapping
• Cosmos DB to store and query semi-structured data
• Power BI to explore the data interactively to gain insights
• Cosmos DB Jupyter Notebooks to do a deep dive into the data with Python