JournalismAI | Context Cards | Dec 2022

Ritvvij Parrikh

December 07, 2022

Transcript

  1. Context Cards
     ML to create and distribute short-form context for long-lasting news cycles alongside articles.
     Ritvvij Parrikh | Times of India | Director, Product
     Karn Bhushan | Times of India | Data Scientist
     Amanda Strydom | Code for Africa | Senior Programme Manager
     Find us at https://context-cards.com
  2. Why
     Problem
     • Misinformation spreads faster than the debunks 🚀
     • There is an under-used/forgotten database of already-existing mythbusters and explainers ☝
     • Without formal institutional memory of how issues have evolved over the years, stories can lack historical context.
     Opportunity
     • News packaged with context and/or data can increase trust in the newsroom as a source of reliable information 🤝
     • Proactive debunking rather than reacting to mis/disinformation 🚫
     • Context Cards offers internal timelines and context that has been approved by an editor, speeding up editorial processes ✅
  3. Solution
     Extract (and serve to the audience) context … from the topic or news cycle … that the article is talking about.
     Detailed case study on the discovery of this solution at https://context-cards.com
  4. What is context?
     In a previous iteration of JournalismAI, Clwstwr, Deutsche Welle, Il Sole 24 Ore, and Maharat Foundation identified SIXTY user-need questions that audiences have of journalism. Find them at ModularJournalism.com
     Find us at https://context-cards.com
  5. What…: Introduction
     Q-1017: Can you tell me what happened in very few words?
     • Headline of the topic
     • 3-line description of the topic
     • Follow button
     Find us at https://context-cards.com
  6. What…: Timeline
     Q-1010: What has got us here?
     • List of related stories in descending order
     Find us at https://context-cards.com
  7. What…: Expert Speak
     Q-1008: What do key people say?
     Q-1029: How many points of view are there on this topic?
     • Pull quotes from articles
     • Viewpoints of different parties
     • List of opinion articles
     Find us at https://context-cards.com
  8. What…: Data
     Q-1010: What has got us here?
     • Data snippets
     • Charts from related articles
     • Articles tagged as data dives
     Find us at https://context-cards.com
  9. What…: Mentions
     Q-1004: Who is it about?
     Q-1005: Where did it happen?
     Q-1028: Who is involved?
     Find us at https://context-cards.com
     (The card sections from slides 5-9 are summarized in the sketch below.)
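Taken together, slides 5-9 describe the sections of a context card. As a reading aid, here is a minimal sketch of that structure as a Python dataclass; the class and field names are our own illustration, not the production Context Cards schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextCard:
    """Hypothetical card schema assembled from slides 5-9."""
    headline: str                                            # Introduction: headline of the topic
    description: str                                         # Introduction: 3-line description
    timeline: List[str] = field(default_factory=list)        # related stories, newest first
    expert_quotes: List[str] = field(default_factory=list)   # pull quotes and viewpoints
    data_snippets: List[str] = field(default_factory=list)   # data points and charts
    mentions: List[str] = field(default_factory=list)        # people, places, parties involved
```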
  10. How
      Step 1: Find relevant articles from archive (Topic Modeling)
      Step 2: Generate text for context cards (Option #1: Task-specific models; Option #2: GPT-3; Option #3: Google's T5 and FLAN-T5)
      Step 3: Prepare to publish to audiences (Newscards)
      Find us at https://context-cards.com
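For orientation, a structural sketch of the three steps, assuming the hypothetical ContextCard schema above; every name here is an illustrative stand-in, with the concrete pieces sketched under the slides that follow.

```python
# Structural sketch only; all function names are illustrative stand-ins.

def find_related_articles(article, archive):
    """Step 1: topic modeling groups the article with its news cycle."""
    raise NotImplementedError  # Top2Vec/BERTopic sketch under slide 11

def generate_card_text(related_articles):
    """Step 2: task-specific models, GPT-3, or T5/FLAN-T5 draft the card."""
    raise NotImplementedError  # spaCy and GPT-3 sketches under slides 13-14

def prepare_to_publish(draft_card):
    """Step 3: queue the draft for editorial review before it reaches audiences."""
    raise NotImplementedError  # dashboard workflow under slide 12

def build_context_card(article, archive):
    related = find_related_articles(article, archive)
    draft = generate_card_text(related)
    return prepare_to_publish(draft)
```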
  11. How: Step 1: Topic Modeling
      Data for training: We trained on 70,730 articles from TOI.
      Data for testing: Historically, editors tagged related articles manually in our CMS.
      Algorithms we tried: Top2Vec and BERTopic. Out of the box, BERTopic gave higher accuracy (52.97%) than Top2Vec (50%).
      Refining accuracy:
      • We tried new embedding methods for BERTopic and Top2Vec; Doc2Vec embeddings with Top2Vec gave us higher accuracy.
      • Most of the failed cases belonged to one sub-product (ETimes).
      • Articles in the Entertainment section usually can't be clustered into specific news cycles or meaningful topics.
      • Implementing the Top2Vec algorithm with joint embeddings of words and documents from a Doc2Vec model, and excluding ETimes stories, achieved an accuracy of 73.55%.
      Detailed case study on topic modeling at https://context-cards.com
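A minimal sketch of the comparison described above, assuming a hypothetical `load_articles` helper in place of the real TOI archive access; `embedding_model="doc2vec"` is the Top2Vec mode that learns joint word and document embeddings, matching the configuration that reached 73.55% once ETimes stories were excluded.

```python
from bertopic import BERTopic
from top2vec import Top2Vec

# Hypothetical loader standing in for the TOI archive; ETimes stories
# are excluded, per the finding above that they don't cluster well.
docs = load_articles(exclude_sections=["etimes"])

# Out-of-the-box BERTopic baseline (transformer sentence embeddings).
bertopic_model = BERTopic()
topics, probs = bertopic_model.fit_transform(docs)

# Top2Vec with joint word/document embeddings learned by Doc2Vec on
# the corpus itself (embedding_model="doc2vec").
top2vec_model = Top2Vec(documents=docs, embedding_model="doc2vec", speed="learn")
print(top2vec_model.get_num_topics())
```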
  12. How: Step 1: Topic Modeling…
      Dashboard.
      • We've set up an editorial team to review the output of the algorithm.
      • Editors can mark a story as a False Positive, i.e., a story that the algorithm said is part of a topic but isn't.
      • Editors can also fix False Negatives, i.e., stories that the editor feels are part of the topic but the algorithm did not catch. They do this by adding the story's ID. (A sketch of this override logic follows below.)
      Detailed case study on topic modeling at https://context-cards.com
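A minimal sketch of what the False Positive / False Negative overrides amount to, assuming topics and corrections are plain sets of story IDs; the function name is ours, not the dashboard's code.

```python
def apply_editor_overrides(algo_story_ids, false_positives, false_negatives):
    """Return the editor-approved set of story IDs for a topic."""
    approved = set(algo_story_ids)
    approved -= set(false_positives)   # drop stories wrongly clustered in
    approved |= set(false_negatives)   # add story IDs the algorithm missed
    return approved

# Example: algorithm found {101, 102, 103}; editor removes 103, adds 204.
print(apply_editor_overrides([101, 102, 103], [103], [204]))  # {101, 102, 204}
```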
  13. How: Step 2: Task-specific small language models
      Mentions: Use spaCy for Named Entity Recognition, and Wikipedia to reduce noise. (See the sketch after this slide.)
      FAQs: Questgen.ai is a ready-made API for this. However, it accepts only 1,000 words per input. Hence, we'll build our own.
      Data: Train spaCy to pull out data snippets. List all charts used within stories in the topic.
      Timeline: List stories in the topic in descending order, OR run topic modeling to find events and summarize them.
      Expert Speak: In a previous iteration of JournalismAI, the Guardian wrote a model to extract quotes from articles.
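A minimal sketch of the Mentions step, using spaCy for NER; the Wikipedia check is our hypothetical stand-in for the noise-reduction step (keeping only entities that resolve to a Wikipedia page), not necessarily how the team implemented it.

```python
import spacy
import wikipedia  # third-party `wikipedia` package, used here as a noise filter

nlp = spacy.load("en_core_web_sm")

def extract_mentions(text):
    """Entities for the Mentions card: who, where, and which parties."""
    doc = nlp(text)
    entities = {ent.text for ent in doc.ents if ent.label_ in ("PERSON", "ORG", "GPE")}
    # Hypothetical noise reduction: keep only entities with a Wikipedia hit.
    return sorted(e for e in entities if wikipedia.search(e))
```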
  14. How: Step 2: GPT-3
      Mentions ✅ It can extract entities and write bios for them. However, in its attempt to be contextual, the bios are parochial.
      Expert Speak ✅ It can extract quotes and the viewpoints of the various parties involved.
      FAQs 👍 It is able to generate good FAQs and answer them.
      Data 🤔 It can extract data snippets. Requires sufficient fact-checking.
      Timeline 🤔 While it generates a timeline, it tends to get dates wrong. Requires sufficient fact-checking.
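For reference, a minimal sketch of the FAQs task against the 2022-era GPT-3 completions API; the prompt wording, model choice, and parameters are our assumptions, not the team's configuration.

```python
import openai  # 2022-era openai client (pre-1.0 Completion API)

openai.api_key = "YOUR_API_KEY"

def generate_faqs(article_text: str) -> str:
    """Draft FAQs for a context card; output still needs editorial review."""
    prompt = (
        "Read the following news coverage and write three FAQs with short, "
        "factual answers:\n\n" + article_text + "\n\nFAQs:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=400,
        temperature=0.2,  # low temperature to limit invented details
    )
    return response["choices"][0]["text"].strip()
```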
  15. How: Step 2: GPT-3
      Conclusion: GPT-3 is a good option for getting a proof of concept (POC) of Context Cards up. We'll need to build in editorial review for fact-checking and refinement before it gets published to audiences!
      Detailed evaluation of GPT-3 at https://context-cards.com
  16. Thank you
      We've blogged ~10,000 words on our exploration from a technical point of view. You can find all those blogs at https://context-cards.com
      We'll publish our pre-trained models on Hugging Face next week.