Integration of Multiple Sound Source Localization Results for Speaker Identification in Multi-party Dialogue System

taichi nakashima

December 05, 2012

Transcript

  1. Integration of Multiple Sound Source Localization Results for Speaker Identification in Multiparty Dialogue System

    Graduate School of Engineering, Nagoya University. Taichi Nakashima, Kazunori Komatani, Satoshi Sato
  2. Goal

    Implement a multi-party dialogue system that interacts with more than two users, ❶ using only sensors mounted on a robot and ❷ using two robots.
    Multiple users sit around a table → this simplifies the problem of deciding the positions of the users.
  3. Speaker Identification

    The robot turns its head toward the user to answer his/her questions. This behavior enables users to understand the role of addressee and to feel involved in the conversation [Mutlu, 2009] [Bennewitz, 2005].
    We use sound source localization results to identify the speaker.
  4. Related Work: Speaker Identification

    Using visual information: detecting lip movements [Faish, 2012], recognizing gestures [Bohus, 2009].
    → It is difficult to identify speakers when they are outside the field of view of the system's camera.
    In our situation:
    - It is difficult to keep track of users in the field of view of the robot's camera (the angle of the robot's camera is narrow).
    - The robot cannot always look around (the robot is a participant in the conversation).
    Using sound source localization:
    → Localization results enable us to identify speakers who are outside the field of view of the robot's camera.
  5. Problems of Sound Source Localization

    [Figure: a robot with User A, User B, and User C; label: Easy]
    1. Some user positions are difficult to localize.
    2. Environmental noise may cause incorrect localization.
    → Localization results do not always indicate the direction of the speaker.
  6. Problems of Sound Source Localization

    [Figure: a robot with User A, User B, and User C; label: Difficult]
    1. Some user positions are difficult to localize.
    2. Environmental noise may cause incorrect localization.
    → Localization results do not always indicate the direction of the speaker.
  7. Solutions Overview

    [Figure: multiple users (User A, User B) sit around a table with Robot A and Robot B]
    1. Placing the robots on the table opposite each other so that they compensate for each other's capabilities.
    2. Integrating the sound source localization results from the robots.
  8. Solutions Overview

    [Figure: multiple users (User A, User B) sit around a table with Robot A and Robot B; a user says "Hi". Difficult: angular difference is small. Easy: angular difference is large.]
    1. Placing the robots on the table opposite each other so that they compensate for each other's capabilities.
    2. Integrating the sound source localization results from the robots.
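The "angular difference" argument on this slide can be illustrated with a little geometry. The sketch below is not from the deck: the coordinates are hypothetical, and the helper names `bearing_deg` and `angular_difference_deg` are made up for the illustration. It only shows that two users who appear close together in angle from one robot can appear far apart from a robot placed opposite it.

```python
import math

def bearing_deg(robot_xy, target_xy):
    """Direction of target as seen from robot, in degrees (atan2 convention)."""
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    return math.degrees(math.atan2(dy, dx))

def angular_difference_deg(robot_xy, user_a_xy, user_b_xy):
    """Smallest angle between the two users as seen from the robot."""
    diff = abs(bearing_deg(robot_xy, user_a_xy) - bearing_deg(robot_xy, user_b_xy)) % 360.0
    return min(diff, 360.0 - diff)

# Hypothetical layout in centimeters: two robots on a table, two users at one end.
robot_a = (0.0, 0.0)
robot_b = (100.0, 0.0)
user_a = (120.0, 30.0)
user_b = (120.0, -30.0)

diff_from_a = angular_difference_deg(robot_a, user_a, user_b)  # far robot: small angle
diff_from_b = angular_difference_deg(robot_b, user_a, user_b)  # near robot: large angle
print(f"Robot A: {diff_from_a:.1f} deg, Robot B: {diff_from_b:.1f} deg")
```

In this made-up layout the far robot sees the two users roughly 28 degrees apart, while the near robot sees them more than 110 degrees apart, which is the "difficult" versus "easy" contrast the figure labels describe.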
  9. Solutions Overview

    [Figure: multiple users (User A, User B) sit around a table with Robot A and Robot B; environmental noise is present. Noise? Utterance?]
    Power = a confidence measure of localization results.
    1. Placing the robots on the table opposite each other so that they compensate for each other's capabilities.
    2. Integrating the sound source localization results from the robots.
  10. Inputs and Outputs of Our Method: Settings

    Sound source localization: robot audition software developed at Kyoto Univ., based on the MUltiple SIgnal Classification (MUSIC) method.
    Microphones: four microphones in the robot's head.
    Outputs: localization result θ [deg] and power p [dB], every frame (= 0.01 second).
    Angular resolution = 10 [deg]: the impulse responses for calculating the transfer function were recorded at 36 points, at intervals of 10 [deg].
  11. Inputs and Outputs of Our Method

    Inputs from Robot A: localization result θA, power pA.
    Inputs from Robot B: localization result θB, power pB.
    Outputs (integrated): localization result θmix, power pmix.
  12. Integration of Multiple Localization Results

    1. Define a probability density function from $\theta_r$. Assumption: the ambiguity of localization results follows a normal distribution.
       $$f_r(\theta) = \frac{1}{\sqrt{2\pi\sigma_r^2}} \exp\left(-\frac{(\theta - \theta_r)^2}{2\sigma_r^2}\right)$$
    2. Define the maximum probability to be proportional to $p_r$. Assumption: the power of localization results caused by noise is low.
       $$f_r(\theta_r) = \frac{1}{\sqrt{2\pi\sigma_r^2}} = \frac{1}{C}\, p_r$$
    C is a constant determined empirically.
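The second definition fixes the width of each distribution, because the peak height of a normal density depends only on its standard deviation. The rearrangement below is not on the slide; it simply solves that definition for $\sigma_r$, assuming $p_r > 0$:

```latex
\frac{1}{\sqrt{2\pi\sigma_r^2}} = \frac{p_r}{C}
\quad\Longrightarrow\quad
\sigma_r = \frac{C}{\sqrt{2\pi}\, p_r}
```

A localization result with higher power therefore gets a narrower, taller distribution around $\theta_r$, and a low-power result gets a wide, flat one.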
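  13. Integration of Multiple Localization Results

    [Figure: probability plotted against localization result [deg]. Curves: $f_A(\theta)$ for Robot A (peak $p_A/C$ at $\theta_A$), $f_B(\theta)$ for Robot B (peak $p_B/C$ at $\theta_B$), and their sum $f_A(\theta) + f_B(\theta)$ (peak $p_{mix}/C$ at $\theta_{mix}$), with a threshold line.]
    $$\theta_{mix} = \arg\max_{\theta} \left( f_A(\theta) + f_B(\theta) \right)$$
    Threshold: reduces the number of incorrect localization results due to noise.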
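A minimal Python sketch of this integration step, under several assumptions that the slides do not state: both angles are already expressed in one shared coordinate frame, the search runs over a 1-degree grid from -180 to 180 without wrap-around handling, and the threshold is applied to the peak of the summed density. The function names are illustrative only. The defaults C = 800 and a threshold of 25.5/800 follow the values shown with the integration results, read here as a fraction.

```python
import math

def sigma_from_power(power, c):
    """Solve 1 / sqrt(2*pi*sigma^2) = power / c for sigma (power must be > 0)."""
    return c / (math.sqrt(2.0 * math.pi) * power)

def density(theta, theta_r, power, c):
    """Normal density centered on theta_r whose peak height equals power / c."""
    sigma = sigma_from_power(power, c)
    z = (theta - theta_r) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def integrate(theta_a, p_a, theta_b, p_b, c=800.0, threshold=25.5 / 800.0):
    """Return (theta_mix, p_mix), or None when the summed peak is below the threshold."""
    best_theta, best_value = None, -1.0
    for theta in range(-180, 181):  # 1-degree grid: an assumption of this sketch
        value = density(theta, theta_a, p_a, c) + density(theta, theta_b, p_b, c)
        if value > best_value:
            best_theta, best_value = theta, value
    if best_value < threshold:
        return None  # treated as noise
    return best_theta, c * best_value  # p_mix = C * (peak height of the sum)

# Both robots agree on 30 degrees: the peaks add up.
print(integrate(30, 30.0, 30, 28.0))
# Both results are weak and disagree: rejected as noise.
print(integrate(30, 20.0, -90, 18.0))
```

When the two robots agree, the summed peak is higher than either single peak, so the integrated power rises; when they disagree and both are weak, nothing clears the threshold.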
  14. Evaluation Experiments: Settings

    Evaluated whether using two robots improved speaker identification.
    [Figure: Robot A and Robot B with loudspeaker positions A to E. Labeled distances: 150 [cm], 30 [cm], 75 [cm]. Labeled angles: 20°, 25°, 33°, 107°. The range of the loudspeaker from Robot B: 46°, 72°.]
  15. Evaluation Experiments: Settings

    Evaluated whether using two robots improved speaker identification.
    [Figure: Robot A and Robot B with loudspeaker positions A to E. Labeled distances: 150 [cm], 30 [cm], 75 [cm]. Labeled angles: 20°, 25°, 33°, 107°. The range of the loudspeaker from Robot B: 46°, 72°. For position A, regions are marked Correct and Incorrect.]
  16. Evaluation Experiments: Settings

    Data: 5 utterances × 5 points × 4 speakers = 100 data. One audio file includes one utterance whose duration is 1.0 second.
    Evaluation measures:
    $$\text{precision} = \frac{\text{number of frames when localization result was correct}}{\text{number of all detected frames}}$$
    $$\text{recall} = \frac{\text{number of frames when localization result was correct}}{\text{number of speech frames}}$$
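The frame-based measures can be sketched in a few lines of Python. The function name `frame_scores` and the example counts are hypothetical. The F value is computed here as the harmonic mean of precision and recall, which is the usual definition but is an assumption, since the deck does not spell it out.

```python
def frame_scores(n_correct, n_detected, n_speech):
    """Frame-level precision, recall, and F from three frame counts.

    n_correct:  frames whose localization result was correct
    n_detected: all frames in which something was detected
    n_speech:   frames that actually contain speech
    """
    precision = n_correct / n_detected if n_detected else 0.0
    recall = n_correct / n_speech if n_speech else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical counts: a 1.0-second utterance is 100 frames at 0.01 second per frame.
precision, recall, f_measure = frame_scores(n_correct=60, n_detected=80, n_speech=100)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```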
  17. Results of Identifying Loudspeakers: Using Only One Robot

    Evaluated whether using two robots improved speaker identification.
    [Figure: loudspeaker positions A to E around Robot A and Robot B]
    [Tables: Robot A and Robot B; columns SPK, precision, recall, F; rows A, B, C, D, E, ALL; values not recoverable]
  18. Results of Identifying Loudspeakers: Using Only One Robot

    Evaluated whether using two robots improved speaker identification.
    [Figure: loudspeaker positions A to E around Robot A and Robot B]
    [Tables: Robot A and Robot B; columns SPK, precision, recall, F; rows A, B, C, D, E, ALL; values not recoverable]
    It is difficult to identify loudspeakers that are far from the robots.
  19. Results of Identifying Loudspeakers: Using Only One Robot

    Evaluated whether using two robots improved speaker identification.
    [Figure: loudspeaker positions A to E around Robot A and Robot B]
    [Tables: Robot A and Robot B; columns SPK, precision, recall, F; rows A, B, C, D, E, ALL; values not recoverable]
    It is difficult to identify loudspeakers that are far from the robots.
    The performance differed between the two robots.
  20. Results of Identifying Loudspeakers: Integration

    Evaluated whether using two robots improved speaker identification.
    [Tables: Robot A, Robot B, and Integration; columns SPK, precision, recall, F; rows A, B, C, D, E, ALL; values not recoverable]
    C = 800, thresh = 25.5/800
  21. Results of Identifying Loudspeakers: Integration

    Evaluated whether using two robots improved speaker identification.
    [Tables: Robot A, Robot B, and Integration; columns SPK, precision, recall, F; rows A, B, C, D, E, ALL; values not recoverable]
    C = 800, thresh = 25.5/800
    The system can identify loudspeakers in areas that a single robot cannot.
  22. Results of Identifying Loudspeakers: Integration

    Evaluated whether using two robots improved speaker identification.
    [Tables: Robot A, Robot B, and Integration; columns SPK, precision, recall, F; rows A, B, C, D, E, ALL; values not recoverable]
    C = 800, thresh = 25.5/800
    The system can identify loudspeakers in areas that a single robot cannot.
    In particular, the loudspeaker at C was correctly identified, which neither robot could do on its own.
  23. Localization Results by Power

    Evaluated whether integrated power was valid as a confidence measure.
    [Figure: ErrorRate and AverageError [deg] plotted against Power [dB]]
  24. Localization Results by Power

    Evaluated whether integrated power was valid as a confidence measure.
    [Figure: ErrorRate and AverageError [deg] plotted against Power [dB]]
    When power is high, average errors are small.
  25. Localization Results by Power

    Evaluated whether integrated power was valid as a confidence measure.
    [Figure: ErrorRate and AverageError [deg] plotted against Power [dB]]
    When power is high, average errors are small.
    When power is low, error rates are high and average errors are large.
  26. Localization Results by Power

    Evaluated whether integrated power was valid as a confidence measure.
    [Figure: ErrorRate and AverageError [deg] plotted against Power [dB]]
    When power is high, average errors are small.
    When power is low, error rates are high and average errors are large.
    → Power can distinguish correct localization results from incorrect ones.
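An analysis of this kind can be sketched by grouping frames into power bins. The deck does not define "error rate" or "average error" precisely, so the sketch assumes that a frame counts as an error when its absolute angular error exceeds a tolerance, and that the average error is the mean absolute angular error in degrees. The function name, the 1 dB bin width, the 10-degree tolerance, and the sample frames are all hypothetical, and angle wrap-around is ignored.

```python
import math
from collections import defaultdict

def stats_by_power(frames, bin_width=1.0, tolerance_deg=10.0):
    """Group (power_db, estimated_deg, true_deg) frames by power bin.

    Returns {bin_start: {"error_rate": ..., "average_error_deg": ...}}.
    """
    bins = defaultdict(list)
    for power, estimated, true in frames:
        bin_start = math.floor(power / bin_width) * bin_width
        bins[bin_start].append(abs(estimated - true))
    result = {}
    for bin_start in sorted(bins):
        errors = bins[bin_start]
        result[bin_start] = {
            "error_rate": sum(e > tolerance_deg for e in errors) / len(errors),
            "average_error_deg": sum(errors) / len(errors),
        }
    return result

# Made-up frames: low-power frames are less reliable than high-power ones.
frames = [(24.2, 90, 30), (24.8, 30, 30), (30.1, 30, 30), (30.9, 40, 30)]
by_power = stats_by_power(frames)
for bin_start, stats in by_power.items():
    print(bin_start, stats)
```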
  27. Localization Results by Power

    Evaluated whether integrated power was valid as a confidence measure.
    How to use power: when power is low, the robot checks whether a speaker exists by executing face detection.
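This use of power can be sketched as a small decision rule. Everything concrete here is an assumption: the function name `choose_action`, the `face_found` callback standing in for a face detector, the action labels, and the reuse of 25.5 as the power threshold.

```python
def choose_action(theta_mix, p_mix, face_found, power_threshold=25.5):
    """Decide what the robot does with one integrated localization result.

    face_found: callable taking a direction in degrees and returning True when a
    face is detected there (hypothetical stand-in for a real face detector).
    """
    if p_mix >= power_threshold:
        return ("turn", theta_mix)  # confident result: face the speaker
    if face_found(theta_mix):
        return ("turn", theta_mix)  # low power, but a face confirms a speaker
    return ("ignore", None)         # low power and no face: likely noise

print(choose_action(40, 30.0, lambda direction: False))
print(choose_action(40, 20.0, lambda direction: False))
```

Running face detection only for low-power results keeps the robot from looking around on every frame, which matches the constraint that it is itself a participant in the conversation.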
  28. Conclusion and Future Works

    Conclusion: integrate multiple sound source localization results
    → improves performance compared with using only one robot.
    Future works:
    - Implement a demo system
      → identifying a speaker and heading toward him/her to answer
      → executing face detection to check whether a speaker exists, on the basis of power.
    - Use other evidence of the speaker's existence, e.g. image processing
      → improves the performance of speaker identification.