Integration of Multiple Sound Source Localization Results for Speaker Identification in Multi-party Dialogue System

Slide 1

Slide 1 text

Integration of Multiple Sound Source Localization Results for Speaker Identification in Multi-party Dialogue System
Taichi Nakashima, Kazunori Komatani, Satoshi Sato
Graduate School of Engineering, Nagoya University

Slide 2

Slide 2 text

Goal
“Implementing a multi-party dialogue system” that interacts with more than two users
❶ Use only sensors mounted on a robot
❷ Use two robots
Multiple users sit around a table → this simplifies the problem of deciding the positions of users

Slide 3

Slide 3 text

Speaker Identification
Heading toward the user to answer his/her questions
(Figure labels: head, "Hello", "Hello")
This behavior enables users [Mutlu, 2009] [Bennewitz, 2005]
→ to understand the role of addressee
→ to feel involved in the conversation
We use sound source localization results to identify a speaker

Slide 4

Slide 4 text

Construction of Demo System
The demo system identifies a speaker

Slide 5

Slide 5 text

Related Work: Speaker Identification
Using visual information
- Detecting lip movements [Faish, 2012]
- Recognizing gestures [Bohus, 2009]
→ It is difficult to identify speakers when they are outside the field of view of the system's camera
In our situation
- It is difficult to keep track of users in the field of view of the robot's camera (the angle of the robot's camera is narrow)
- The robot cannot always look around (the robot is a participant in the conversation)
Using sound source localization
→ Using localization results enables us to identify speakers who are outside the field of view of the robot's camera

Slide 6

Slide 6 text

Problems of Sound Source Localization
(Figure: Robot with User A, User B, and User C; label "Easy")
1. Some positions of users are difficult to localize
2. Environmental noise may cause incorrect localization
→ Localization results do not always indicate the direction of the speaker

Slide 7

Slide 7 text

Problems of Sound Source Localization
(Figure: Robot with User A, User B, and User C; label "Difficult")
1. Some positions of users are difficult to localize
2. Environmental noise may cause incorrect localization
→ Localization results do not always indicate the direction of the speaker

Slide 8

Slide 8 text

Solutions Overview
(Figure: Robot A and Robot B with User A and User B; multiple users sit around a table)
1. Placing the robots on a table opposite each other so that they complement each other's capabilities
2. Integrating sound source localization results from the robots

Slide 9

Slide 9 text

Solutions Overview
(Figure: Robot A and Robot B with User A and User B; multiple users sit around a table; a user says "Hi". Labels: "Difficult: angular difference is small", "Easy: angular difference is large")
1. Placing the robots on a table opposite each other so that they complement each other's capabilities
2. Integrating sound source localization results from the robots

Slide 10

Slide 10 text

Solutions Overview
(Figure: Robot A and Robot B with User A and User B; multiple users sit around a table; environmental noise. Labels: "Noise?", "Utterance?")
Power = a confidence measure of localization results
1. Placing the robots on a table opposite each other so that they complement each other's capabilities
2. Integrating sound source localization results from the robots

Slide 11

Slide 11 text

Inputs and Outputs of Our Method: Settings
Microphones: four microphones in the robot's head
Sound source localization: robot audition software developed at Kyoto Univ., based on the MUltiple SIgnal Classification (MUSIC) method
Outputs, every frame (= 0.01 second):
- Localization result: θ [deg]
- Power: p [dB]
Angular resolution = 10 [deg]
Impulse responses for calculating the transfer function → recorded at 36 points, at intervals of 10 [deg]

Slide 12

Slide 12 text

Inputs and Outputs of Our Method
Inputs (Robot A): localization result θA, power pA
Inputs (Robot B): localization result θB, power pB
Outputs (Integrated): localization result θmix, power pmix

Slide 13

Slide 13 text

Integration of Multiple Localization Results
1. Define a probability density function from θr
Assumption: the ambiguity of localization results follows a normal distribution
$$f_r(\theta) = \frac{1}{\sqrt{2\pi\sigma_r^2}} \exp\left(-\frac{(\theta - \theta_r)^2}{2\sigma_r^2}\right)$$
2. Define the maximum probability to be proportional to pr
Assumption: the power of localization results caused by noise is low
$$f_r(\theta_r) = \frac{1}{\sqrt{2\pi\sigma_r^2}} = \frac{1}{C}\, p_r$$
C is a constant value and is determined empirically
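The second definition fixes the spread of each distribution once the power is known. As a worked step, under the slide's two assumptions, solving the peak condition for the standard deviation gives the expression below. The numbers are only an illustration: C = 800 is the value reported with the integration results later in the deck, and p_r = 40 is a hypothetical power value, not a measurement.

```latex
\frac{1}{\sqrt{2\pi\sigma_r^2}} = \frac{p_r}{C}
\quad\Longrightarrow\quad
\sigma_r = \frac{C}{p_r\sqrt{2\pi}}

% Illustration with C = 800 and a hypothetical p_r = 40:
\sigma_r = \frac{800}{40\sqrt{2\pi}} \approx 7.98\ \text{[deg]}
```

A higher power therefore yields a narrower, taller distribution, and a lower power yields a wider, flatter one.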

Slide 14

Slide 14 text

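Integration of Multiple Localization Results
(Figure: probability versus localization result for Robot A and Robot B. Curves f_A(θ) and f_B(θ) peak at θA and θB with heights pA/C and pB/C; the sum f_A(θ) + f_B(θ) peaks at θmix with height pmix/C; a horizontal line marks the threshold)
$$\theta_{mix} = \arg\max_{\theta} \left( f_A(\theta) + f_B(\theta) \right)$$
Threshold: reduces the number of incorrect localization results due to noise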
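The integration step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes both robots' angles are already expressed in one shared coordinate frame, searches a 1-degree grid of candidate angles, ignores angle wrap-around, takes pmix as C times the peak of the summed distribution, and rejects a frame when that peak falls below the threshold. The function names and the grid are hypothetical; C = 800 and thresh = 25.5/800 are the values reported with the integration results later in the deck.

```python
import math

C = 800.0              # constant reported with the integration results
THRESH = 25.5 / 800.0  # threshold reported with the integration results


def pdf(theta, theta_r, p_r, c=C):
    """Gaussian centred on theta_r whose peak height equals p_r / c."""
    sigma = c / (p_r * math.sqrt(2.0 * math.pi))
    peak = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return peak * math.exp(-((theta - theta_r) ** 2) / (2.0 * sigma ** 2))


def integrate(theta_a, p_a, theta_b, p_b,
              candidates=range(-180, 180), c=C, thresh=THRESH):
    """Return (theta_mix, p_mix), or None when the summed peak is below thresh.

    Assumption: theta_a and theta_b are in the same coordinate frame and
    both powers are positive.
    """
    def total(theta):
        return pdf(theta, theta_a, p_a, c) + pdf(theta, theta_b, p_b, c)

    theta_mix = max(candidates, key=total)  # arg max over the candidate grid
    peak = total(theta_mix)
    if peak < thresh:
        return None                         # treated as noise
    return theta_mix, c * peak              # p_mix read off the peak height


if __name__ == "__main__":
    print(integrate(30.0, 20.0, 30.0, 20.0))   # two moderate results that agree
    print(integrate(30.0, 10.0, 120.0, 10.0))  # two weak results that disagree
```

With these assumptions, two moderate-power results that point the same way reinforce each other and clear the threshold, while weak results that disagree stay below it and are discarded.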

Slide 15

Slide 15 text

Evaluation Experiments: Settings
Evaluated whether using two robots improved speaker identification
(Figure: layout of Robot A, Robot B, and loudspeaker positions A to E. Dimension labels: 150 [cm], 30 [cm], 75 [cm]. Angle labels: 20°, 25°, 33°, 46°, 72°, 107°. Caption: "The range of loudspeaker from Robot B")

Slide 16

Slide 16 text

Evaluation Experiments: Settings
Evaluated whether using two robots improved speaker identification
(Figure: layout of Robot A, Robot B, and loudspeaker positions A to E. Dimension labels: 150 [cm], 30 [cm], 75 [cm]. Angle labels: 20°, 25°, 33°, 46°, 72°, 107°. Region labels: "Correct", "Incorrect". Caption: "The range of loudspeaker from Robot B")

Slide 17

Slide 17 text

Evaluation Experiments: Settings
Data
5 utterances × 5 points × 4 speakers = 100 data
One audio file includes one utterance whose duration is 1.0 second
Evaluation Measure
$$\text{precision} = \frac{\text{Number of frames when localization result was correct}}{\text{Number of all detected frames}}$$
$$\text{recall} = \frac{\text{Number of frames when localization result was correct}}{\text{Number of speech frames}}$$
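The frame-level measures can be computed as in the short Python sketch below. It assumes that the "F" column of the result tables is the usual harmonic mean of precision and recall, which the slides do not state explicitly; the function name and the example counts are hypothetical. With 0.01-second frames, a 1.0-second utterance corresponds to 100 speech frames.

```python
def frame_scores(n_correct, n_detected, n_speech):
    """Return (precision, recall, f) from frame counts.

    n_correct:  frames whose localization result was correct
    n_detected: all frames for which a localization result was output
    n_speech:   frames that actually contain speech
    """
    precision = n_correct / n_detected if n_detected else 0.0
    recall = n_correct / n_speech if n_speech else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Assumption: F is the harmonic mean of precision and recall.
    f = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # Hypothetical counts for one 1.0-second utterance (100 speech frames).
    print(frame_scores(n_correct=60, n_detected=80, n_speech=100))
```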

Slide 18

Slide 18 text

Results of identifying loudspeakers: Using only one robot
Evaluated whether using two robots improved speaker identification
(Figure: layout of Robot A, Robot B, and loudspeaker positions A to E)
(Tables for Robot A and Robot B: columns SPK, precision, recall, F; rows A, B, C, D, E, ALL)

Slide 19

Slide 19 text

Results of identifying loudspeakers: Using only one robot
Evaluated whether using two robots improved speaker identification
(Figure: layout of Robot A, Robot B, and loudspeaker positions A to E)
(Tables for Robot A and Robot B: columns SPK, precision, recall, F; rows A, B, C, D, E, ALL)
It is difficult to identify loudspeakers that were far from the robots.

Slide 20

Slide 20 text

Results of identifying loudspeakers: Using only one robot
Evaluated whether using two robots improved speaker identification
(Figure: layout of Robot A, Robot B, and loudspeaker positions A to E)
(Tables for Robot A and Robot B: columns SPK, precision, recall, F; rows A, B, C, D, E, ALL)
It is difficult to identify loudspeakers that were far from the robots.
The performance differed between the two robots.

Slide 21

Slide 21 text

Results of identifying loudspeakers: Integration
Evaluated whether using two robots improved speaker identification
(Tables for Robot A, Robot B, and Integration: columns SPK, precision, recall, F; rows A, B, C, D, E, ALL)
C = 800, thresh = 25.5 / 800

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Results of identifying loudspeakers: Integration
Evaluated whether using two robots improved speaker identification
(Tables for Robot A, Robot B, and Integration: columns SPK, precision, recall, F; rows A, B, C, D, E, ALL)
C = 800, thresh = 25.5 / 800
In particular, the loudspeaker at C is correctly identified, which neither robot could do alone.
The system can identify areas that a single robot cannot.

Slide 24

Slide 24 text

Localization Results by Power
Evaluated whether integrated power was valid as a confidence measure
(Figure: axes ErrorRate, AverageError, Power)

Slide 25

Slide 25 text

Localization Results by Power
Evaluated whether integrated power was valid as a confidence measure
(Figure: axes ErrorRate, AverageError, Power)
When power is high, average errors are small

Slide 26

Slide 26 text

Localization Results by Power
Evaluated whether integrated power was valid as a confidence measure
(Figure: axes ErrorRate, AverageError, Power)
When power is high, average errors are small
When power is low, error rates are high and average errors are large

Slide 27

Slide 27 text

Localization Results by Power
Evaluated whether integrated power was valid as a confidence measure
(Figure: axes ErrorRate, AverageError, Power)
When power is high, average errors are small
When power is low, error rates are high and average errors are large
Power can distinguish correct localization results from incorrect ones

Slide 28

Slide 28 text

Localization Results by Power
Evaluated whether integrated power was valid as a confidence measure
How to use power
When power is low, the robot checks whether a speaker exists by executing face detection

Slide 29

Slide 29 text

Conclusion
Integrate multiple sound source localization results
→ improve performance compared with using only one robot
Implement demo system
→ identifying a speaker and heading toward the speaker to answer
→ executing face detection to check whether a speaker exists on the basis of power
Future Works
Use other evidence of the speaker's existence, e.g. image processing
→ improve performance of speaker identification