JournalismAI | Context Cards | Dec 2022

Ritvvij Parrikh

December 07, 2022

Transcript

  1. Context Cards
     ML to create and distribute short-form context for long-lasting news cycles alongside articles.
     Ritvvij Parrikh | Times of India | Director, Product
     Karn Bhushan | Times of India | Data Scientist
     Amanda Strydom | Code for Africa | Senior Programme Manager
     Find us at https://context-cards.com
  2. Why
     Problem
     • Misinformation spreads faster than the debunks 🚀
     • There is an under-used/forgotten database of already-existing mythbusters and explainers ☝
     • Without formal institutional memory of how issues have evolved over the years, stories can lack historical context.
     Opportunity
     • News packaged with context and/or data can increase trust in the newsroom as a source of reliable information 🤝
     • Proactive debunking rather than reacting to mis/disinformation 🚫
     • Context Cards offers internal timelines and context that has been approved by an editor, speeding up editorial processes ✅
  3. Solution
     Extract (and serve to the audience) context … from the topic or news cycle … that the article is talking about.
     Detailed case study on the discovery of this solution at https://context-cards.com
  4. What is context?
     In a previous iteration of JournalismAI, Clwstwr, Deutsche Welle, Il Sole 24 Ore, and Maharat Foundation identified SIXTY user-need questions that audiences have of journalism. Find them at ModularJournalism.com
     Find us at https://context-cards.com
  5. What…: Introduction
     Q-1017: Can you tell me what happened in very few words?
     • Headline of the topic
     • 3-line description of the topic
     • Follow button
     Find us at https://context-cards.com
  6. What…: Timeline
     Q-1010: What has got us here?
     • List of related stories in descending order
     Find us at https://context-cards.com
  7. What…: Expert Speak
     Q-1008: What do key people say?
     Q-1029: How many points of view are there on this topic?
     • Pull quotes from articles
     • Viewpoints of different parties
     • List of opinion articles
     Find us at https://context-cards.com
  8. What…: Data
     Q-1010: What has got us here?
     • Data snippets
     • Charts from related articles
     • Articles tagged as data dives
     Find us at https://context-cards.com
  9. What…: Mentions
     Q-1004: Who is it about?
     Q-1005: Where did it happen?
     Q-1028: Who is involved?
     Find us at https://context-cards.com
     (The card sections from slides 5-9 are summarized in the sketch below.)
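Taken together, slides 5-9 describe the sections of a context card. As a reading aid, here is a minimal sketch of that structure as a Python dataclass; the class and field names are our own illustration, not the production Context Cards schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextCard:
    """Hypothetical card schema assembled from slides 5-9."""
    headline: str                                            # Introduction: headline of the topic
    description: str                                         # Introduction: 3-line description
    timeline: List[str] = field(default_factory=list)        # related stories, newest first
    expert_quotes: List[str] = field(default_factory=list)   # pull quotes and viewpoints
    data_snippets: List[str] = field(default_factory=list)   # data points and charts
    mentions: List[str] = field(default_factory=list)        # people, places, parties involved
```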
  10. How
      Step 1: Find relevant articles from archive (Topic Modeling)
      Step 2: Generate text for context cards (Option #1: Task-specific models; Option #2: GPT-3; Option #3: Google's T5 and FLAN-T5)
      Step 3: Prepare to publish to audiences (Newscards)
      Find us at https://context-cards.com
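For orientation, a structural sketch of the three steps, assuming the hypothetical ContextCard schema above; every name here is an illustrative stand-in, with the concrete pieces sketched under the slides that follow.

```python
# Structural sketch only; all function names are illustrative stand-ins.

def find_related_articles(article, archive):
    """Step 1: topic modeling groups the article with its news cycle."""
    raise NotImplementedError  # Top2Vec/BERTopic sketch under slide 11

def generate_card_text(related_articles):
    """Step 2: task-specific models, GPT-3, or T5/FLAN-T5 draft the card."""
    raise NotImplementedError  # spaCy and GPT-3 sketches under slides 13-14

def prepare_to_publish(draft_card):
    """Step 3: queue the draft for editorial review before it reaches audiences."""
    raise NotImplementedError  # dashboard workflow under slide 12

def build_context_card(article, archive):
    related = find_related_articles(article, archive)
    draft = generate_card_text(related)
    return prepare_to_publish(draft)
```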
  11. How: Step 1: Topic Modeling
      Data for training: We trained on 70,730 articles from TOI.
      Data for testing: Historically, editors tagged related articles manually in our CMS.
      Algorithms we tried: Top2Vec and BERTopic. Out of the box, BERTopic gave higher accuracy (52.97%) than Top2Vec (50%).
      Refining accuracy:
      • We tried new embedding methods for BERTopic and Top2Vec; Doc2Vec embeddings with Top2Vec gave us higher accuracy.
      • Most of the failed cases belonged to one sub-product (ETimes).
      • Articles in the Entertainment section usually can't be clustered into specific news cycles or meaningful topics.
      • Implementing the Top2Vec algorithm with joint embeddings of words and documents from a Doc2Vec model, and excluding ETimes stories, achieved an accuracy of 73.55%.
      Detailed case study on topic modeling at https://context-cards.com
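A minimal sketch of the comparison described above, assuming a hypothetical `load_articles` helper in place of the real TOI archive access; `embedding_model="doc2vec"` is the Top2Vec mode that learns joint word and document embeddings, matching the configuration that reached 73.55% once ETimes stories were excluded.

```python
from bertopic import BERTopic
from top2vec import Top2Vec

# Hypothetical loader standing in for the TOI archive; ETimes stories
# are excluded, per the finding above that they don't cluster well.
docs = load_articles(exclude_sections=["etimes"])

# Out-of-the-box BERTopic baseline (transformer sentence embeddings).
bertopic_model = BERTopic()
topics, probs = bertopic_model.fit_transform(docs)

# Top2Vec with joint word/document embeddings learned by Doc2Vec on
# the corpus itself (embedding_model="doc2vec").
top2vec_model = Top2Vec(documents=docs, embedding_model="doc2vec", speed="learn")
print(top2vec_model.get_num_topics())
```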
  12. How: Step 1: Topic Modeling…
      Dashboard.
      • We've set up an editorial team to review the output of the algorithm.
      • Editors can mark a story as a False Positive, i.e., a story that the algorithm said is part of a topic but isn't.
      • Editors can also fix False Negatives, i.e., stories that the editor feels are part of the topic but the algorithm did not catch. They do this by adding the story's ID. (A sketch of this override logic follows below.)
      Detailed case study on topic modeling at https://context-cards.com
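A minimal sketch of what the False Positive / False Negative overrides amount to, assuming topics and corrections are plain sets of story IDs; the function name is ours, not the dashboard's code.

```python
def apply_editor_overrides(algo_story_ids, false_positives, false_negatives):
    """Return the editor-approved set of story IDs for a topic."""
    approved = set(algo_story_ids)
    approved -= set(false_positives)   # drop stories wrongly clustered in
    approved |= set(false_negatives)   # add story IDs the algorithm missed
    return approved

# Example: algorithm found {101, 102, 103}; editor removes 103, adds 204.
print(apply_editor_overrides([101, 102, 103], [103], [204]))  # {101, 102, 204}
```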
  13. How: Step 2: Task-specific small language models
      Mentions: Use spaCy for Named Entity Recognition, and Wikipedia to reduce noise. (See the sketch after this slide.)
      FAQs: Questgen.ai is a ready-made API for this. However, it accepts only 1,000 words per input. Hence, we'll build our own.
      Data: Train spaCy to pull out data snippets. List all charts used within stories in the topic.
      Timeline: List stories in the topic in descending order, OR run topic modeling to find events and summarize them.
      Expert Speak: In a previous iteration of JournalismAI, the Guardian wrote a model to extract quotes from articles.
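A minimal sketch of the Mentions step, using spaCy for NER; the Wikipedia check is our hypothetical stand-in for the noise-reduction step (keeping only entities that resolve to a Wikipedia page), not necessarily how the team implemented it.

```python
import spacy
import wikipedia  # third-party `wikipedia` package, used here as a noise filter

nlp = spacy.load("en_core_web_sm")

def extract_mentions(text):
    """Entities for the Mentions card: who, where, and which parties."""
    doc = nlp(text)
    entities = {ent.text for ent in doc.ents if ent.label_ in ("PERSON", "ORG", "GPE")}
    # Hypothetical noise reduction: keep only entities with a Wikipedia hit.
    return sorted(e for e in entities if wikipedia.search(e))
```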
  14. How: Step 2: GPT-3
      Mentions ✅ It can extract entities and write bios for them. However, in its attempt to be contextual, the bios are parochial.
      Expert Speak ✅ It can extract quotes and the viewpoints of the various parties involved.
      FAQs 👍 It is able to generate good FAQs and answer them.
      Data 🤔 It can extract data snippets. Requires sufficient fact-checking.
      Timeline 🤔 While it generates a timeline, it tends to get dates wrong. Requires sufficient fact-checking.
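For reference, a minimal sketch of the FAQs task against the 2022-era GPT-3 completions API; the prompt wording, model choice, and parameters are our assumptions, not the team's configuration.

```python
import openai  # 2022-era openai client (pre-1.0 Completion API)

openai.api_key = "YOUR_API_KEY"

def generate_faqs(article_text: str) -> str:
    """Draft FAQs for a context card; output still needs editorial review."""
    prompt = (
        "Read the following news coverage and write three FAQs with short, "
        "factual answers:\n\n" + article_text + "\n\nFAQs:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=400,
        temperature=0.2,  # low temperature to limit invented details
    )
    return response["choices"][0]["text"].strip()
```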
  15. How: Step 2: GPT-3
      Conclusion: GPT-3 is a good option for getting a proof of concept (POC) of Context Cards up. We'll need to build in editorial review for fact-checking and refinement before it gets published to audiences!
      Detailed evaluation of GPT-3 at https://context-cards.com
  16. Thank you
      We've blogged ~10,000 words on our exploration from a technical point of view. You can find all those blogs at https://context-cards.com
      We'll publish our pre-trained models on Hugging Face next week.