Design and Technology of the Conversational AI Robot Romi - MIXI Tech Conference 2023

Slide 1

Slide 1 text

©MIXI
Design and Technology of the Conversational AI Robot Romi
Vantage Studio, Romi Division, Development Group
Engineering Manager: Harumitsu Nobuta (信田春満)

Slide 2

Slide 2 text

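Self-introduction
● Harumitsu Nobuta (信田春満) @halhorn
● University
  ○ Until 2012: Graduate School of Informatics, Kyoto University
    ■ Cognitive developmental robotics
    ■ Twitter conversation bot "にせほるん"
● MIXI
  ○ 2013-2016: SNS mixi
  ○ 2017-: Romi, a new business building a chit-chat dialogue robot
    ■ Romi's first engineer
    ■ Currently Engineering Manager
● Hobbies
  ○ Photography
  ○ Birds
  ○ Climbing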

Slide 3

Slide 3 text

What is Romi?

Slide 4

Slide 4 text


Slide 5

Slide 5 text

Concept
Soothes you like a pet and understands you like family

Slide 6

Slide 6 text

The world's first home communication robot that converses by generating language with deep learning

Slide 7

Slide 7 text

How Romi works

Slide 8

Slide 8 text

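Overall architecture
Diagram: the user says "Good morning, Romi"; Romi passes "Good morning, Romi" to the conversation server and replies "Haru, you're full of energy again today!"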

Slide 9

Slide 9 text

Today's topics
● Romi client
  ○ Lifelikeness and conversational tempo
● Conversation server
  ○ Building conversations with rules and AI
● Conversation AI
  ○ Handles 90% of conversations

Slide 10

Slide 10 text

Romi client: lifelikeness and conversational tempo

Slide 11

Slide 11 text

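Lifelikeness
● Reaction
  ○ Responds quickly when a person engages with it
● Expression
  ○ Changes in facial expression
  ○ Movement
● Design
  ○ Living creature vs. abstract object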

Slide 12

Slide 12 text


Slide 13

Slide 13 text

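Robots that are hard to talk to
● Do not react right away when a person starts speaking
● Do not react right away when a person finishes speaking
● React only once what the robot will say has been decided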

Slide 14

Slide 14 text

Robot response (mobile robot example)
Conventional: Sensor Input → Perception → Modeling → Planning → Task Execution → Motor Control → Actuator Output
(one long, heavy processing chain)

Slide 15

Slide 15 text

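Subsumption architecture
Conventional: Sensor Input → Perception → Modeling → Planning → Task Execution → Motor Control → Actuator Output (one long, heavy processing chain)
Subsumption architecture: between Sensor Input and Actuator Output, three layers run side by side:
● Abstract, high-level processing
● Intermediate processing
● Reflexive processing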

Slide 16

Slide 16 text

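Romi's client architecture
Triggers: human utterance, face detection, petting, being lifted, voice detection, mode change, volume change
Each trigger belongs to a PriorityGroup, and each group drives an Action:
● PriorityGroup: Brain
  ○ Thinks about what to say, then speaks
● PriorityGroup: Spinal
  ○ Eyes light up when Romi is spoken to
● PriorityGroup: System
  ○ The strongest interrupt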

Slide 17

Slide 17 text

Romi's client architecture
Trigger Input feeds three groups that share the same structure:
● Brain: Trigger → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
● Spinal: Trigger → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
● System: Trigger → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
Other labels: Action; conversation API call

Slide 18

Slide 18 text

Romi's client architecture
Trigger Input feeds three groups that share the same structure:
● Brain: Trigger → conversation API call → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
● Spinal: Trigger → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
● System: Trigger → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
Each hook runs an Action

Slide 19

Slide 19 text

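Action
An Action bundles:
● What to say
● Sounds and music to play
● Facial-expression animation
● Body movement
Example utterance: "Ahaha! That's funny!"
An Action can be configured for each combination of emotion x Trigger x timing
Example: laughing x user_utterance x before_response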
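The emotion x Trigger x timing lookup described on this slide can be pictured as a table keyed by a three-part tuple. The following is only a minimal sketch: the `Action` fields, the `ACTIONS` table, the `find_action` helper, and every value other than `laughing`, `user_utterance`, and `before_response` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """What the robot does at one moment (field names are illustrative)."""
    speech: Optional[str] = None   # what to say
    sound: Optional[str] = None    # sound effect or music to play
    face: Optional[str] = None     # facial-expression animation
    motion: Optional[str] = None   # body movement


# Key: (emotion, trigger, timing). Only the first key mirrors the slide's example.
ActionKey = Tuple[str, str, str]

ACTIONS: Dict[ActionKey, Action] = {
    ("laughing", "user_utterance", "before_response"): Action(
        speech="Ahaha! That's funny!", face="laugh", motion="sway"),
    ("neutral", "user_utterance", "after_trigger"): Action(
        sound="chirp", motion="look_up"),
}


def find_action(emotion: str, trigger: str, timing: str) -> Optional[Action]:
    """Return the Action configured for this combination, or None if unset."""
    return ACTIONS.get((emotion, trigger, timing))


action = find_action("laughing", "user_utterance", "before_response")
```

Keeping the key as a plain tuple means a new emotion or timing needs only a new table entry, not new control flow.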

Slide 20

Slide 20 text

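Example: when the user speaks to Romi
● Spinal Trigger: "I've started hearing a voice!"
  ○ Eyes light up to show attention
● Brain Trigger: "Speech recognition complete!"
  ○ Makes a "pyoi" chirp and looks up
    ■ A quick action comes first, to signal "I heard you" and "I'm thinking of a reply"
  ○ Conversation API call
  ○ ServerResponse: speech synthesis and the Action for each emotion
  ○ End-of-speech motion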

Slide 21

Slide 21 text

Conversation server: building conversations with rules and AI

Slide 22

Slide 22 text

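What is conversation?
● Product direction
  ○ Chit-chat
  ○ Encouraging, empathetic, positive
  ○ Fond of its owner
● How to implement conversation
  ○ Rule-based
  ○ Machine learning (AI)
● You can't know until you try!
  ○ Design for trial and error across different mechanisms, rules, and models
    ■ A single conversation engine
    ■ A set of multiple independent, interchangeable conversation engines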

Slide 23

Slide 23 text

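halucas (conversation server) architecture
Human utterance → Brain → Romi's utterance
Brain components:
● Controller
● Converter
  ○ Preprocess: knowledge extraction, etc.
  ○ Postprocess: emotion tagging, utterance conversion, pronunciation conversion, etc.
● Module: WeatherForecast, Shiritori, News, TimeDetector, Tokenizer, etc.
● Bot Selector
  ○ Priority Group: AskAgain
  ○ Priority Group: ScenarioGraph (general-purpose rules)
  ○ Priority Group: Shiritori (five boxes in the figure)
  ○ Priority Group: Euler (AI)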

Slide 24

Slide 24 text


Slide 25

Slide 25 text

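Bot x Selector
Bot Selector: groups in processing order (processing stops as soon as one can answer)
  ○ Priority Group: AskAgain
  ○ Priority Group: ScenarioGraph (general-purpose rules)
  ○ Priority Group: Shiritori (feature-specific; five boxes in the figure)
  ○ Priority Group: Euler (AI) (machine learning)
PrioritySelector
● Asks each Priority Group, from highest priority to lowest, whether it can respond
● If a group can answer, picks the response at random from the bots in that group that answered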
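The PrioritySelector behavior above can be sketched in a few lines. This is an illustration under assumptions, not halucas code: bots are modeled as plain functions that return a reply or `None`, and `select_reply`, `shiritori`, and `euler` are hypothetical names.

```python
import random
from typing import Callable, Optional, Sequence

# A bot is modeled as: utterance -> reply, or None when it cannot answer.
Bot = Callable[[str], Optional[str]]


def select_reply(groups: Sequence[Sequence[Bot]], utterance: str,
                 rng: random.Random) -> Optional[str]:
    """Ask groups from highest to lowest priority; stop at the first that answers."""
    for group in groups:
        replies = [r for r in (bot(utterance) for bot in group) if r is not None]
        if replies:
            # Several bots in the same group answered: pick one at random.
            return rng.choice(replies)
    return None


def shiritori(utterance: str) -> Optional[str]:
    """Feature-specific bot: answers only when the game is requested."""
    return "ringo" if utterance == "shiritori" else None


def euler(utterance: str) -> Optional[str]:
    """Stand-in for the AI bot, which always produces something."""
    return "That sounds nice!"


groups = [[shiritori], [euler]]   # highest priority first
```

Placing the always-answering AI bot in the last group makes it the fallback whenever no rule-based bot claims the utterance.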

Slide 26

Slide 26 text

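Bot
Each bot has a get_reply phase and an update_context phase.
Flow (ScenarioGraph as the example): get_reply(tweet) → the selector chooses the responding bot → update_context(tweet, reply)
● get_reply(tweet)
  ○ Works out a response
  ○ Does not update state
● update_context(tweet, reply)
  ○ Updates state
  ○ Runs only on the bot that was selected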
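The two-phase contract matters because every bot computes a candidate reply, but only one reply is actually spoken. A toy sketch follows; `CounterBot` and its `context` dictionary are hypothetical, and only the method names `get_reply` and `update_context` come from the slide.

```python
from typing import Dict, Optional


class CounterBot:
    """Toy bot whose reply mentions how many times it has been chosen."""

    def __init__(self) -> None:
        self.context: Dict[str, int] = {"count": 0}

    def get_reply(self, tweet: str) -> Optional[str]:
        # Phase 1: compute a candidate reply without mutating self.context.
        return f"reply #{self.context['count'] + 1} to {tweet}"

    def update_context(self, tweet: str, reply: str) -> None:
        # Phase 2: commit state. Called only on the bot the selector chose.
        self.context["count"] += 1


chosen, rejected = CounterBot(), CounterBot()
candidates = {bot: bot.get_reply("hi") for bot in (chosen, rejected)}
chosen.update_context("hi", candidates[chosen])   # the rejected bot is left untouched
```

Because `get_reply` is free of side effects, a bot that was asked but not selected keeps exactly the state it had before.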

Slide 27

Slide 27 text

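halucas (conversation server) architecture
Human utterance → Brain → Romi's utterance
Brain components:
● Controller
● Converter
  ○ Preprocess: knowledge extraction, etc.
  ○ Postprocess: emotion tagging, utterance conversion, pronunciation conversion, etc.
● Module: WeatherForecast, Shiritori, News, TimeDetector, Tokenizer, etc.
● Bot Selector
  ○ Priority Group: AskAgain
  ○ Priority Group: ScenarioGraph (general-purpose rules)
  ○ Priority Group: Shiritori (five boxes in the figure)
  ○ Priority Group: Euler (AI)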

Slide 28

Slide 28 text

ScenarioGraph: a general-purpose conversation rule engine

Slide 29

Slide 29 text

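ScenarioGraph
Tree search over a graph of rules
● Searches until it finds the next "human utterance" or the "goal (of the main scenario)"
  ○ State is managed with a stack
● Other features
  ○ Module calls
    ■ DB and external API calls
    ■ Setting the emotion of an utterance, etc.
  ○ Regular-expression matching on user utterances
  ○ Extraction of utterance content
  ○ Transition conditions
  ○ Saving and loading variables
  ○ Scenarios
    ■ Hierarchical rules
    ■ Splitting rules into scenarios
    ■ Calling other scenarios
Example rule graph:
Start → [human] "Give me a quiz" → [Romi] "What is the highest mountain in the world?"
  → priority: 1, [human] Everest → [Romi] "Correct! {emotion type=laughing}" → Goal
  → priority: 0, [human] (.+) → [Romi] "Too bad! It's not $1" (set: count = [count] + 1)
      → condition: [count] >= 5 → [Romi] "The answer was Everest" → Goal
      → condition: [count] < 5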

Slide 30

Slide 30 text

ScenarioGraph (same bullets and rule graph as Slide 29)
Legend: each node in the graph is either a human utterance or a Romi utterance.
"Give me a quiz", Everest, and (.+) are human utterances; "What is the highest mountain in the world?", "Correct!", "Too bad! It's not $1", and "The answer was Everest" are Romi utterances.

Slide 31

Slide 31 text

ScenarioGraph (same bullets and rule graph as Slide 29)
Highlighted: search starts from the highest priority.
After "What is the highest mountain in the world?", the priority: 1 edge (Everest) is tried before the priority: 0 edge ((.+)).

Slide 32

Slide 32 text

ScenarioGraph (same bullets and rule graph as Slide 29)
Highlighted: module call.
The tag {emotion type=laughing} in the Romi utterance "Correct! {emotion type=laughing}" invokes a module (excerpt as shown on the slide):

    class Emotion(Modulebase):
        def call(cls, context: Context, arg_dict: Dict[str, str]) -> str:
            emotion_type = arg_dict['type']
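One way a tag such as `{emotion type=laughing}` could be turned into a module name plus an `arg_dict` is sketched below. The tag grammar (`name key=value ...`), the `MODULES` registry, and the `<emotion:...>` output are assumptions made for illustration; the talk does not show how ScenarioGraph actually parses or dispatches tags.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed tag grammar: {name key=value key=value ...}
MODULE_CALL = re.compile(r"\{(\w+)((?:\s+\w+=\w+)*)\}")

# Hypothetical registry: module name -> handler that receives the parsed arguments.
MODULES: Dict[str, Callable[[Dict[str, str]], str]] = {
    "emotion": lambda args: f"<emotion:{args['type']}>",
}


def parse_args(raw_args: str) -> Dict[str, str]:
    """Turn ' type=laughing' into {'type': 'laughing'}."""
    return dict(pair.split("=", 1) for pair in raw_args.split())


def parse_calls(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Extract (module_name, arg_dict) pairs from every tag in the text."""
    return [(name, parse_args(raw)) for name, raw in MODULE_CALL.findall(text)]


def expand(text: str) -> str:
    """Replace each tag with the output of its registered handler."""
    return MODULE_CALL.sub(
        lambda m: MODULES[m.group(1)](parse_args(m.group(2))), text)
```

For example, `parse_calls("Correct! {emotion type=laughing}")` yields the module name `emotion` with the argument `type` set to `laughing`, which is the shape the `arg_dict['type']` lookup on the slide expects.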

Slide 33

Slide 33 text

ScenarioGraph (same bullets and rule graph as Slide 29)
Highlighted: extraction regex.
The priority: 0 edge (.+) captures the user's utterance, and $1 in "Too bad! It's not $1" is replaced with the captured text.
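Priority-ordered matching and `$1` extraction can be combined in one small function. This is a sketch under assumptions: the `EDGES` list and `match_reply` are hypothetical, and whole-utterance matching with `re.fullmatch` is a choice made here, since the slides do not say how ScenarioGraph anchors its patterns.

```python
import re
from typing import List, Optional, Tuple

# (priority, user-utterance pattern, reply template); higher priority is tried first.
EDGES: List[Tuple[int, str, str]] = [
    (0, r"(.+)", "Too bad! It's not $1"),
    (1, r"Everest", "Correct!"),
]


def match_reply(utterance: str) -> Optional[str]:
    """Try edges from high to low priority and fill $1, $2, ... from capture groups."""
    for _, pattern, template in sorted(EDGES, key=lambda edge: -edge[0]):
        m = re.fullmatch(pattern, utterance)   # assumption: the whole utterance must match
        if m:
            reply = template
            for i, group in enumerate(m.groups(), start=1):
                reply = reply.replace(f"${i}", group)
            return reply
    return None
```

With these edges, "Everest" is claimed by the priority 1 edge, while any other non-empty utterance falls through to the catch-all `(.+)` and is echoed back through `$1`.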

Slide 34

Slide 34 text

ScenarioGraph (same bullets and rule graph as Slide 29)
Highlighted: variable assignment.
The Romi utterance "Too bad! It's not $1" carries set: count = [count] + 1.

Slide 35

Slide 35 text

ScenarioGraph (same bullets and rule graph as Slide 29)
Highlighted: transition conditions.
After "Too bad! It's not $1" (set: count = [count] + 1), the next edge is chosen by condition: [count] >= 5 leads to "The answer was Everest"; the other edge requires [count] < 5.

Slide 36

Slide 36 text

Slide 37

Slide 37 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top):
  ○ [Romi] "What is the highest mountain in the world?"

Slide 38

Slide 38 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top):
  ○ [Romi] "What is the highest mountain in the world?"
  ○ priority: 0 (.+) (match: Mt. Fuji)

Slide 39

Slide 39 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top):
  ○ [Romi] "What is the highest mountain in the world?"
  ○ priority: 0 (.+) (match: Mt. Fuji)
  ○ [Romi] "Too bad! It's not $1" (match: Mt. Fuji, var_cache: count=5)
Additional label in the figure: var_cache: count=3

Slide 40

Slide 40 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top):
  ○ [Romi] "What is the highest mountain in the world?"
  ○ priority: 0 (.+) (match: Mt. Fuji)
  ○ [Romi] "Too bad! It's not $1" (match: Mt. Fuji, var_cache: count=5)
  ○ condition: [count] < 5 (match: Mt. Fuji, var_cache: count=5)
Additional label in the figure: var_cache: count=3

Slide 41

Slide 41 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top), after the condition: [count] < 5 frame has been popped:
  ○ [Romi] "What is the highest mountain in the world?"
  ○ priority: 0 (.+) (match: Mt. Fuji)
  ○ [Romi] "Too bad! It's not $1" (match: Mt. Fuji, var_cache: count=5)
Additional label in the figure: var_cache: count=3

Slide 42

Slide 42 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top):
  ○ [Romi] "What is the highest mountain in the world?"
  ○ priority: 0 (.+) (match: Mt. Fuji)
  ○ [Romi] "Too bad! It's not $1" (match: Mt. Fuji, var_cache: count=5)
  ○ condition: [count] >= 5, [Romi] "The answer was Everest" (match: Mt. Fuji, var_cache: count=5)
Additional label in the figure: var_cache: count=3

Slide 43

Slide 43 text

ScenarioGraph: depth-first search => stack
(rule graph as on Slide 29)
DB: count=4. User utterance: "Mt. Fuji"
Stack (bottom to top):
  ○ [Romi] "What is the highest mountain in the world?"
  ○ priority: 0 (.+) (match: Mt. Fuji)
  ○ [Romi] "Too bad! It's not $1" (match: Mt. Fuji, var_cache: count=5)
  ○ condition: [count] >= 5, [Romi] "The answer was Everest" (match: Mt. Fuji, var_cache: count=5)
  ○ Goal (match: Mt. Fuji, var_cache: count=5)
Additional label in the figure: var_cache: count=3
The cache is written to the DB in the update_context phase.
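The walk-through on these slides (depth-first search with an explicit stack, variable writes held in a cache, and the DB touched only at update_context) can be reproduced with a toy graph. Everything here is a sketch under assumptions: `Node`, `GRAPH`, `order`, and `search` are hypothetical, conditions are modeled as separate "cond" nodes, and the `[count] < 5` branch is assumed to loop back to the question, which the extracted slide text does not confirm.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Vars = Dict[str, int]


@dataclass
class Node:
    kind: str                                    # "user", "romi", "cond", or "goal"
    text: str = ""                               # regex for "user", utterance for "romi"
    priority: int = 0
    condition: Callable[[Vars], bool] = lambda v: True
    effect: Callable[[Vars], Vars] = lambda v: {}
    children: List[str] = field(default_factory=list)


GRAPH: Dict[str, Node] = {
    "ask": Node("romi", "What is the highest mountain in the world?",
                children=["right", "wrong"]),
    "right": Node("user", r"Everest", priority=1, children=["correct"]),
    "wrong": Node("user", r"(.+)", priority=0, children=["too_bad"]),
    "correct": Node("romi", "Correct!", children=["goal"]),
    "too_bad": Node("romi", "Too bad! It's not $1",
                    effect=lambda v: {"count": v.get("count", 0) + 1},
                    children=["retry", "reveal"]),
    "retry": Node("cond", condition=lambda v: v.get("count", 0) < 5, children=["ask"]),
    "reveal": Node("romi", "The answer was Everest",
                   condition=lambda v: v.get("count", 0) >= 5, children=["goal"]),
    "goal": Node("goal"),
}


def order(ids: List[str]) -> List[str]:
    """Arrange ids so that stack.pop() yields the highest priority first."""
    return sorted(reversed(ids), key=lambda i: GRAPH[i].priority)


def search(start: List[str], utterance: str, db: Vars) -> Optional[Tuple[List[str], Vars]]:
    """Depth-first search; returns (Romi replies, var_cache) without writing to db."""
    # Each frame carries its own replies, cache, and capture, so backtracking
    # is just popping the next frame.
    stack = [(nid, [], {}, None) for nid in order(start)]
    while stack:
        nid, replies, cache, capture = stack.pop()
        node = GRAPH[nid]
        merged = {**db, **cache}                 # cached writes shadow DB values
        if node.kind == "goal":
            return replies, cache
        if node.kind == "user":
            if capture is not None:              # reached the next human utterance: stop
                return replies, cache
            m = re.fullmatch(node.text, utterance)
            if not m:
                continue                         # pattern did not match: backtrack
            capture = m.group(1) if m.groups() else ""
        else:
            if not node.condition(merged):
                continue                         # condition failed: backtrack
            cache = {**cache, **node.effect(merged)}
            if node.text:
                replies = replies + [node.text.replace("$1", capture or "")]
        stack.extend((c, replies, cache, capture) for c in order(node.children))
    return None


db = {"count": 4}
result = search(GRAPH["ask"].children, "Mt. Fuji", db)   # get_reply phase: db untouched
```

Applying `db.update(result[1])` afterwards plays the role of the update_context phase: the cached `count` reaches the DB only if this bot's reply is the one that gets used.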

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Slide 46

Slide 46 text

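Euler
● Based on GPT-2/3
  ○ Pretrained on several hundred million sentences
  ○ Fine-tuned on in-house conversation data that captures Romi's "ideal" conversations
  ○ Fine-grained improvements to the model itself
    ■ Introduction of ALiBi (https://arxiv.org/abs/2108.12409)
      ● A positional encoding that copes with long inputs
    ■ Gateway, an application of LoRA (https://arxiv.org/pdf/2106.09685.pdf)
      ● Allows small adjustments to model behavior, such as Kansai dialect
    ■ etc.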
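For readers unfamiliar with ALiBi, the idea is to add a head-specific linear penalty to each attention score in place of position embeddings. The sketch below follows the published ALiBi recipe for a power-of-two number of heads; the function names are hypothetical, the `-inf` entries are the usual causal mask added here for completeness, and nothing in it is taken from Euler's implementation.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Head-specific slopes 2^(-8k/n) for k = 1..n (n assumed to be a power of two)."""
    return [2.0 ** (-8.0 * k / num_heads) for k in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """bias[i][j] = -slope * (i - j) for keys j <= i; future keys are masked out."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)           # 1/2, 1/4, ..., 1/256
bias = alibi_bias(slopes[0], 3)    # added to the attention scores before softmax
```

Because the penalty depends only on the distance `i - j`, the same rule applies to sequences longer than those seen in training, which is the property the slide refers to.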

Slide 47

Slide 47 text

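with Gateway
● Purpose
  ○ Fine-tune the conversation model to achieve the following
    ■ Romi starting the conversation (keywords in the prompt)
    ■ Kansai dialect
● LoRA (https://arxiv.org/pdf/2106.09685.pdf)
  ○ Adds a separate path alongside the Dense layers of Attention
    ■ An hourglass-shaped network
    ■ Only this added path is fine-tuned
● Gateway
  ○ Switches among multiple LoRAs with a switch (s)
Figure labels (LoRA and Gateway diagrams): Pretrained Weights W ∈ R^(d×d); A = N(0, σ^2); x; s; h; s×r; d; W ∈ R^(d×r); W ∈ R^(d×s×r); W ∈ R^(r×d); W ∈ R^(r×d); switch values 0 1 0 0; r
Figure caption: the switch selects which of the multiple LoRAs to use
Figure quoted from: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, arXiv:2106.09685 (2021), p. 1, Figure 1. https://arxiv.org/abs/2106.09685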
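One plausible reading of the Gateway is that the frozen weight is shared and a one-hot switch picks which low-rank pair contributes to the output, that is, h = W0 x + B_k (A_k x) for the selected k. The sketch below encodes that reading with plain lists; `gateway_forward`, the per-adapter `(A, B)` pairs, and all numbers are assumptions, and the actual tensor layout on the slide (for example W ∈ R^(d×s×r)) may differ.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gateway_forward(w0: Matrix, adapters: Sequence[Tuple[Matrix, Matrix]],
                    switch: Sequence[int], x: Vector) -> Vector:
    """h = W0 x + sum_k switch[k] * B_k (A_k x); switch is one-hot or all zeros."""
    h = matvec(w0, x)
    for gate, (a, b) in zip(switch, adapters):
        if gate:
            delta = matvec(b, matvec(a, x))      # low-rank path: d -> r -> d
            h = [hi + gate * di for hi, di in zip(h, delta)]
    return h


# d = 2, r = 1, s = 2 adapters. All values are made up for illustration.
W0 = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS = [
    ([[1.0, 1.0]], [[0.5], [0.0]]),     # (A_0 in R^(r x d), B_0 in R^(d x r))
    ([[1.0, -1.0]], [[0.0], [2.0]]),    # (A_1, B_1)
]
x = [3.0, 1.0]
```

An all-zero switch leaves the pretrained behavior untouched, so each adapter only needs to learn how its own style deviates from the base model.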

Slide 48

Slide 48 text

with Memory (under research)
● Purpose
  ○ Extract and memorize knowledge from conversations
  ○ Converse based on that knowledge
● BlenderBot 3 (https://arxiv.org/pdf/2208.03188.pdf)
  ○ Long-term Memory
    ■ Memory Decision: decides whether to use memory
    ■ Access LT Memory: retrieves the memory to use
    ■ Generate Dialogue Response: generates the response
    ■ Generate a LT Memory: generates a memory from the conversation
  ○ All of the above are realized with a single network plus prompts
● In Romi
  ○ The Long-term Memory part is being implemented and tested with Euler
Figure quoted from: Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston, BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage, arXiv:2208.03188 (2022), p. 5, Figure 2. https://arxiv.org/abs/2208.03188

Slide 49

Slide 49 text

with Instruction (under research)
● Purpose
  ○ Improve accuracy
  ○ Train with negative examples
● InstructGPT (https://arxiv.org/pdf/2203.02155.pdf)
  ○ The training method used for ChatGPT
  ○ Combines reinforcement learning with ordinary training for efficiency
    i. Fine-tune GPT-3 on supervised data
    ii. Have humans rank candidate model outputs and build a reward model
    iii. Train (i) with reinforcement learning (PPO) so that (ii) gives higher scores
● In Romi
  ○ Verifying its effectiveness for Long-term Memory generation and conversation generation
Figure quoted from: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Training language models to follow instructions with human feedback, arXiv:2203.02155 (2022), p. 3, Figure 2. https://arxiv.org/abs/2203.02155
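Step (ii) trains the reward model from human rankings, and the InstructGPT paper does this with a pairwise loss: for every pair in a ranking, the preferred output should receive the higher score. A minimal sketch of that loss follows; the function name and the input convention (scores listed best-ranked first) are assumptions made here, and this is not Romi's training code.

```python
import math
from itertools import combinations
from typing import Sequence


def pairwise_ranking_loss(rewards_best_first: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_w - r_l)) over all pairs with y_w ranked above y_l.

    rewards_best_first: reward-model scores for K >= 2 candidate outputs,
    ordered from the output humans ranked best to the one ranked worst.
    """
    pairs = list(combinations(rewards_best_first, 2))   # (higher-ranked, lower-ranked)
    total = sum(-math.log(1.0 / (1.0 + math.exp(-(r_w - r_l)))) for r_w, r_l in pairs)
    return total / len(pairs)


agrees = pairwise_ranking_loss([2.0, 0.0])      # scores agree with the human ranking
disagrees = pairwise_ranking_loss([0.0, 2.0])   # scores contradict the human ranking
```

The loss is small when the scores reproduce the human ordering and large when they invert it, which is what lets step (iii) use the reward model as a stand-in for human judgment.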

Slide 50

Slide 50 text

Romi System
● Client
  ○ Trigger Input feeds three groups: Brain, Spinal, System
  ○ Each group: Trigger → ServerResponse, with hooks before_trigger, after_trigger, before_response, after_response
● halucas Server
  ○ Human utterance → Brain → Romi's utterance
  ○ Brain components
    ■ Controller
    ■ Converter: Preprocess (knowledge extraction, etc.); Postprocess (emotion tagging, utterance conversion, pronunciation conversion, etc.)
    ■ Module: WeatherForecast, Shiritori, News, TimeDetector, Tokenizer, etc.
    ■ Bot Selector with Priority Groups: AskAgain; ScenarioGraph (general-purpose rules); Shiritori (five boxes in the figure); Euler (AI)
● AI Server
  ○ Euler, EmotionDetector, etc.