

Towards Full-Duplex Dialogue Quality Assurance for High-Stakes Assessment Agents

This is the slide for TAI talk: https://luma.com/eup1o3yh

This talk frames the challenges unique to such multimodal agents through the lens of DevOps and MLOps and shares practical lessons learned. It also outlines key requirements for high-stakes assessment agents and introduces parts of the research frameworks we use to meet them.


Sadahiro Yoshikawa

February 10, 2026



Transcript

  1. Assessing Language Proficiency with AI Agents at Scale: InteLLA (Intelligent Language Learning Assistant) is being used by K-12 and university students across Japan, including the City of Gifu.

  3. MLOps: the perspective of production [Sculley et al. 2015]. Most of it is engineering.

     D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'15), pages 2503-2511, Cambridge, MA, USA, 2015.
  4. DialOps: the perspective of DevOps/MLOps. DialOps (Dialogue system Operations) is mostly DevOps/MLOps.

     DevOps
     • Utilization of communication tools
     • Introduction of agile development
     • Utilization of progress-management tools
     • Source-code version control
     • Unit testing
     • Pre-approval code testing
     • Environment setup as code
     • Automated build/deployment
     • Application/infrastructure monitoring
     • Alert notification
     • Incident templates
     • etc.

     MLOps
     • Training-data and annotation management
     • Feature-extractor development
     • Source-code and ML-model version control
     • Hyperparameter management
     • Offline and online evaluation
     • Infrastructure as Code (IaC) and GPU provisioning
     • Server health monitoring
     • Automated build and deployment (CI/CD)
     • etc.

     [Yoshikawa et al. 2024] Sadahiro Yoshikawa, Mao Saeki, Hiroaki Takatsu, Fuma Kurata, and Yoichi Matsuyama. 2024. DialOps: Continuous development and operational management framework for large-scale dialogue systems. In JSAI SIG-SLUD (Language and Speech Understanding and Dialogue), Meeting 102. [In Japanese].
  5. ML models: for inference latency, faster is mostly better. Dialogue systems: inferring context-appropriate utterance content and timing is essential.
     • Example 1: Responding too quickly can overlap the current utterance [Raux and Eskenazi, 2009]
     • Example 2: Predicting utterance continuation / speaker turn-taking [Sacks 1974, Skantze 2021, Kurata 2023, Inoue 2024]
     • Example 3: A 0-second response is not always natural [Yoshikawa 2024]
     -> We should understand not only latency measurement but also its timing and the preceding/following context.
     DialOps-specific problems | Real-time dialogue
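The timing point above can be made concrete: what matters is not raw latency but the gap between the end of a user utterance and the start of the agent's reply. A minimal sketch, assuming per-turn start/end timestamps are logged; the `Turn` record and `response_offsets` helper are illustrative, not part of DialOps or InteLLA:

```python
# Sketch: evaluating response *timing*, not just raw latency.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    start: float   # seconds from dialogue start
    end: float

def response_offsets(turns):
    """Gap between the end of a user turn and the start of the
    following agent turn. Negative values mean the agent overlapped
    the user's utterance; a ~0 s offset is not always the natural target."""
    offsets = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker == "user" and nxt.speaker == "agent":
            offsets.append(nxt.start - prev.end)
    return offsets

turns = [
    Turn("user", 0.0, 2.1),
    Turn("agent", 2.4, 4.0),   # +0.3 s gap after the user finished
    Turn("user", 4.2, 6.0),
    Turn("agent", 5.8, 7.5),   # -0.2 s: overlaps the user's utterance
]
print(response_offsets(turns))
```

Logging offsets like these alongside latency gives the preceding/following context the slide calls for.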
  6. ML models: evaluation based on the time series of input/output data. Dialogue systems: evaluation across different time scales.
     • Basic units: syllables, words, clauses, sentences [Jurafsky 2000]
     • Discourse units: turns, adjacency pairs, initiative, topic [Sacks 1974]
     • Non-verbal/paralinguistic: gaze, nodding [Ward 2000, Kawahara 2013, Ishii 2013, Kobayashi 2013]
     • Short-to-long-term dynamics: engagement, rapport, emotion, intimacy [Bickmore 2005, Pecune 2018, Arimoto 2024, Kurata 2024, Jiang 2024]
     -> Many candidate analysis targets for a single improvement
     DialOps-specific problems | Evaluation across various time granularities
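As a toy illustration of why one improvement has many candidate analysis targets, the same transcript can be scored at the word, turn, and dialogue level. The transcript and helper names below are invented for the sketch, not taken from the talk:

```python
# Sketch: one dialogue, three analysis granularities.
from statistics import mean

transcript = [
    ("user", "I went to uh the park yesterday"),
    ("agent", "Nice, what did you do there?"),
    ("user", "I played football with my friends"),
]

# Word level: lexical unit counts per utterance.
words_per_turn = [len(text.split()) for _, text in transcript]

# Turn level: who holds the floor, turn by turn.
user_turns = [t for t in transcript if t[0] == "user"]

# Dialogue level: one summary statistic for the whole session.
mean_user_turn_length = mean(len(text.split()) for _, text in user_turns)

print(words_per_turn, len(user_turns), mean_user_turn_length)
```

Longer-term constructs such as engagement or rapport would aggregate over many such dialogues, one more level up.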
  7. Impact of external factors: data pipeline metrics and monitoring. The pipeline is what matters.
     (Slide diagram: network/communication effects feeding a feature-extraction module, with multi-domain analysis from both the conversational-AI perspective and the language-learner perspective.)
  8. Dialogue in real time: user emulation. Emulating users has a huge impact.
     User Emulator <-> Conversational AI Agent: 10,000 dialogues x 1,000 patterns in a day, versus 600 dialogues x 300 patterns in a month.
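A user emulator reaches that throughput because emulated dialogues can run concurrently rather than one human session at a time. A minimal sketch with asyncio; `agent_reply` and `run_dialogue` are stand-ins invented here, not the real InteLLA agent or emulator API:

```python
# Sketch: driving many emulated dialogues against an agent concurrently.
import asyncio

async def agent_reply(utterance: str) -> str:
    await asyncio.sleep(0)          # stands in for model inference time
    return f"agent heard: {utterance}"

async def run_dialogue(pattern_id: int, n_turns: int = 3) -> int:
    """One emulated dialogue: the emulator speaks, the agent replies."""
    turns = 0
    for i in range(n_turns):
        reply = await agent_reply(f"pattern {pattern_id}, turn {i}")
        assert reply.startswith("agent heard")
        turns += 1
    return turns

async def main(n_patterns: int) -> int:
    # Dialogues run concurrently, which is what makes emulator-driven
    # testing orders of magnitude cheaper than human test sessions.
    results = await asyncio.gather(*(run_dialogue(p) for p in range(n_patterns)))
    return sum(results)

total_turns = asyncio.run(main(100))
print(total_turns)
```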
  9. Dialogue in real time: user emulation. Can disfluency be simulated?
     Pause features are compared between the user emulator and human users talking to the conversational AI agent: end-clause pauses (節外ポーズ) vs. mid-clause pauses (節内ポーズ), and their ratios. Pipeline: fine-tune -> feature extraction -> analysis.

     [Obi 2026] Takao Obi, Sadahiro Yoshikawa, Mao Saeki, Masaki Eguchi, and Yoichi Matsuyama. Reproducing Proficiency-Conditioned Dialogue Features with Full-duplex Spoken Dialogue Models. International Workshop on Spoken Dialogue Systems (IWSDS 2026).
     [Matsuura 2022] Ryuki Matsuura, Shungo Suzuki, Mao Saeki, Tetsuji Ogawa, and Yoichi Matsuyama. Refinement of utterance fluency feature extraction and automated scoring of L2 oral fluency with dialogic features. In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1312-1320. IEEE, 2022.
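Whether an emulator reproduces learner disfluency can be checked with pause-location features in the spirit of the fluency work cited above. The sketch below assumes pauses have already been labeled as mid-clause or end-clause; the data layout and function are invented for illustration, not the cited feature extractor:

```python
# Sketch: mid-clause vs. end-clause pause ratios for an emulated speaker.
def pause_ratios(pauses):
    """pauses: list of (location, duration_s) with location in
    {"mid_clause", "end_clause"}. Returns the share of pauses at
    each location."""
    total = len(pauses)
    if total == 0:
        return {"mid_clause": 0.0, "end_clause": 0.0}
    mid = sum(1 for loc, _ in pauses if loc == "mid_clause")
    return {"mid_clause": mid / total,
            "end_clause": (total - mid) / total}

# Emulated learner: pauses mostly inside clauses, as lower-proficiency
# speakers tend to do.
emulated = [("mid_clause", 0.8), ("mid_clause", 0.5),
            ("end_clause", 0.4), ("mid_clause", 1.1)]
print(pause_ratios(emulated))
```

Comparing these ratios between emulated and human learners is one way to ask whether the simulated disfluency is proficiency-appropriate.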
  10. Evaluation across various time granularities: data pipeline. The data pipeline is the key for short-/middle-/long-term analysis. Multi-dialogue analysis.
  11. Q: What defines high-quality dialogue? Assessment quality is the measurement of our dialogue quality.
      InteLLA system interaction: video data stream.

      [Takatsu et al. 2024] Hiroaki Takatsu, Shungo Suzuki, Masaki Eguchi, Ryuki Matsuura, Mao Saeki, and Yoichi Matsuyama. Gnowsis: Multimodal Multitask Learning for Oral Proficiency Assessments. Computer Speech & Language (in review).
  12. Q: What defines high-quality assessment? The Assessment Use Argument (AUA) defines the assessment quality (and stakes).
  13. Q: What defines “high-stakes” assessment? The EU AI Act (2024) defines our services as “high-risk” AI usage.
      ref. https://www.slideshare.net/slideshow/10-deploys-per-day-dev-and-ops-cooperation-at-flickr/1628368
  14. Q: What “proves” high-stakes assessment? Argumentation is the key.
      • The AUA is defined on the Toulmin model of argumentation.
  15. Q: What “proves” high-stakes assessment? Data analysis is the key.
      • Recently, data analysis can be done by LLM agents.
      ref. https://openai.com/ja-JP/index/inside-our-in-house-data-agent/