Audio-visual speaker detection and localization



Joint work with many collaborators.


Keywords: EM algorithm, Model selection, Weighted-data, Audio-visual fusion.

Natural human–robot interaction (HRI) in complex and unpredictable environments is an important research goal with many potential applications. While vision-based HRI has been thoroughly investigated, robot hearing and audio-based HRI are emerging research topics in robotics. In typical real-world scenarios, humans are at some distance from the robot and, hence, the sensory (microphone) data are strongly impaired by background noise, reverberation and competing auditory sources. In this context, the detection and localization of speakers plays a key role and enables several tasks, such as improving the signal-to-noise ratio for speech recognition, speaker recognition and speaker tracking. In this series of works, we address the problem of detecting and localizing people who are both seen and heard. We introduce a hybrid deterministic/probabilistic model. The deterministic component allows us to map 3D visual data onto a 1D auditory space. The probabilistic component of the model enables the visual features to guide the grouping of the auditory features in order to form audiovisual (AV) objects. The proposed model and the associated algorithms are implemented in real time (17 FPS) using a stereoscopic camera pair and two microphones embedded in the head of the humanoid robot NAO. We perform experiments with (i) synthetic data, (ii) publicly available data gathered with an audiovisual robotic head, and (iii) data acquired with the NAO robot. The results validate the approach and encourage further investigation of how vision and hearing can be combined for robust HRI. The main associated publications are [1, 2, 3]. Other associated publications are [4, 5, 6].
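
To make the deterministic component concrete, here is a minimal sketch of such a 3D-to-1D mapping, assuming the auditory space is parameterized by the time difference of arrival (TDOA) between the two microphones; the microphone coordinates and the function name are illustrative placeholders, not values from [1].

    import numpy as np

    # Speed of sound in air (m/s) at room temperature.
    SPEED_OF_SOUND = 343.0

    # Illustrative microphone positions in the robot head frame (metres);
    # the true geometry depends on the robot.
    MIC_LEFT = np.array([-0.05, 0.0, 0.0])
    MIC_RIGHT = np.array([0.05, 0.0, 0.0])

    def visual_to_auditory(x):
        """Map a 3D point x (e.g. a face position obtained from stereo
        vision) onto the 1D auditory space of time differences of
        arrival between the two microphones."""
        d_left = np.linalg.norm(x - MIC_LEFT)
        d_right = np.linalg.norm(x - MIC_RIGHT)
        return (d_left - d_right) / SPEED_OF_SOUND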

Videos

This video illustrates the basic methodology developed in [3], which associates visual and auditory events using a mixture model. In practice, the method dynamically detects the number of events and localizes them.
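
As a rough illustration of this grouping-with-model-selection idea, the sketch below fits Gaussian mixtures with a varying number of components to fused audiovisual features and keeps the model with the lowest BIC; the papers use dedicated EM algorithms and selection criteria, so this generic stand-in only conveys the principle.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def detect_av_events(features, max_events=5):
        """Fit GMMs with 1..max_events components to audiovisual feature
        vectors and keep the model with the lowest BIC, so the number
        of events need not be known in advance."""
        best_gmm, best_bic = None, np.inf
        for k in range(1, max_events + 1):
            gmm = GaussianMixture(n_components=k, covariance_type="full",
                                  n_init=3, random_state=0).fit(features)
            bic = gmm.bic(features)
            if bic < best_bic:
                best_gmm, best_bic = gmm, bic
        # Cluster centres act as event locations; labels assign observations.
        return best_gmm.means_, best_gmm.predict(features)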

This video shows the audiovisual speaker localization principle based on 3D face localization with a stereoscopic camera pair and the time difference of arrival (TDOA) between two microphones [4]. Left: the scene viewed by the robot; the black circle indicates the detected speaking face. Right: top view of the scene, where the circles correspond to the head positions. Both the video and the audio track were recorded with the robot’s cameras and microphones.
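
Selecting the speaking face can then be sketched as a nearest-neighbour test in TDOA space, reusing visual_to_auditory from the sketch above; the tolerance is an arbitrary placeholder, not the decision rule of [4].

    import numpy as np

    def select_speaking_face(face_positions, observed_tdoa, tol=1e-4):
        """Pick the face whose predicted TDOA best explains the observed
        TDOA, provided the error stays within a tolerance (seconds)."""
        errors = [abs(visual_to_auditory(p) - observed_tdoa)
                  for p in face_positions]
        best = int(np.argmin(errors))
        # Return None when the sound matches no visible face.
        return best if errors[best] < tol else None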

This video shows an active and interactive behavior of the robot based on face detection and recognition, sound localization, and word recognition [5]. The robot selects one active speaker and synthesizes an appropriate behavior.

These videos show the speaker detection methodology with individual per-sample weights incorporated into the GMM probabilistic model; see [6].
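
Below is a simplified sketch of one EM iteration for such a weighted-data GMM: each sample n carries a fixed weight w[n] that scales the precision of its Gaussian, N(mu[k], Sigma[k] / w[n]), so reliable observations pull the parameters harder. The full model in [2] additionally treats the weights as random variables with their own updates; this fixed-weight version only conveys the idea.

    import numpy as np
    from scipy.stats import multivariate_normal

    def wd_em_step(X, w, pi, mu, Sigma):
        """One EM iteration for a GMM in which sample n is modelled as
        N(mu[k], Sigma[k] / w[n]): larger weights mean more trust."""
        N, K = X.shape[0], len(pi)
        # E-step: responsibilities under weight-scaled covariances.
        r = np.zeros((N, K))
        for n in range(N):
            for k in range(K):
                r[n, k] = pi[k] * multivariate_normal.pdf(
                    X[n], mu[k], Sigma[k] / w[n])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: means and covariances weighted by both r and w.
        for k in range(K):
            rk, rwk = r[:, k], r[:, k] * w
            mu[k] = (rwk[:, None] * X).sum(axis=0) / rwk.sum()
            diff = X - mu[k]
            Sigma[k] = (rwk[:, None, None] *
                        np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / rk.sum()
            pi[k] = rk.mean()
        return pi, mu, Sigma, r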

Publications

  1. X. Alameda-Pineda and R. Horaud, “Vision-Guided Robot Hearing,” International Journal of Robotics Research, vol. 34, iss. 4-5, pp. 437-456, 2015. [ bib pdf code arxiv ]
    @article{Alameda-IJRR-2014,
      author  = {Xavier Alameda-Pineda and Radu Horaud},
      title   = {Vision-Guided Robot Hearing},
      journal = {International Journal of Robotics Research},
      volume  = {34},
      number  = {4-5},
      pages   = {437--456},
      year    = {2015},
      arxiv   = {http://arxiv.org/abs/1311.2460},
      soft    = {https://code.humavips.eu},
      pdf     = {http://xavirema.eu/wp-content/papercite-data/pdf/Alameda-IJRR-2014.pdf}
    }
  2. I. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud, “EM algorithms for weighted-data clustering with application to audio-visual scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, iss. 12, pp. 2402-2415, 2016. [ bib pdf code data arxiv ]
    @article{Gebru-TPAMI-2016,
      author  = {Israel-Dejene Gebru and Xavier Alameda-Pineda and Florence Forbes and Radu Horaud},
      title   = {{EM} algorithms for weighted-data clustering with application to audio-visual scene analysis},
      journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
      volume  = {38},
      number  = {12},
      pages   = {2402--2415},
      year    = {2016},
      doi     = {10.1109/TPAMI.2016.2522425},
      arxiv   = {http://arxiv.org/abs/1509.01509},
      soft    = {http://perception.inrialpes.fr/people/Gebru/code/WD-EM.zip},
      data    = {https://team.inria.fr/perception/avtrack1/},
      pdf     = {http://xavirema.eu/wp-content/papercite-data/pdf/Gebru-TPAMI-2016.pdf}
    }
  3. X. Alameda-Pineda, V. Khalidov, R. Horaud, and F. Forbes, “Finding Audio-Visual Events in Informal Social Gatherings,” in IEEE/ACM International Conference on Multimodal Interfaces, Alicante, Spain, 2011, pp. 247-254. [ bib pdf code ] Outstanding Paper Award
    @inproceedings{Alameda-ICMI-2011,
      author    = {Alameda-Pineda, Xavier and Khalidov, Vasil and Horaud, Radu and Forbes, Florence},
      title     = {Finding Audio-Visual Events in Informal Social Gatherings},
      booktitle = {IEEE/ACM International Conference on Multimodal Interfaces},
      year      = {2011},
      pages     = {247--254},
      address   = {Alicante, Spain},
      award     = {Outstanding Paper Award},
      soft      = {https://code.humavips.eu},
      pdf       = {http://xavirema.eu/wp-content/papercite-data/pdf/Alameda-ICMI-2011.pdf}
    }
  4. J. Sanchez-Riera, X. Alameda-Pineda, J. Wienke, A. Deleforge, S. Arias, J. Cech, S. Wrede, and R. Horaud, “Online Multimodal Speaker Detection for Humanoid Robots,” in IEEE-RAS International Conference on Humanoid Robots, Osaka, Japan, 2012, pp. 126-133. [ bib pdf code ]
    @inproceedings{Sanchez-Humanoids-2012,
      author    = {Sanchez-Riera, Jordi and Alameda-Pineda, Xavier and Wienke, Johannes and Deleforge, Antoine and Arias, Soraya and Cech, Jan and Wrede, Sebastian and Horaud, Radu},
      title     = {Online Multimodal Speaker Detection for Humanoid Robots},
      booktitle = {IEEE-RAS International Conference on Humanoid Robots},
      year      = {2012},
      pages     = {126--133},
      address   = {Osaka, Japan},
      soft      = {http://code.humavips.eu},
      pdf       = {http://xavirema.eu/wp-content/papercite-data/pdf/Sanchez-Humanoids-2012.pdf}
    }
  5. J. Cech, R. Mittal, A. Deleforge, J. Sanchez-Riera, X. Alameda-Pineda, and R. Horaud, “Active-Speaker Detection and Localization with Microphones and Cameras Embedded into a Robotic Head,” in IEEE-RAS International Conference on Humanoid Robots, Atlanta, USA, 2013, pp. 203-210. [ bib pdf ]
    @inproceedings{Cech-Humanoids-2013,
      author    = {Cech, Jan and Mittal, Ravi and Deleforge, Antoine and Sanchez-Riera, Jordi and Alameda-Pineda, Xavier and Horaud, Radu},
      title     = {Active-Speaker Detection and Localization with Microphones and Cameras Embedded into a Robotic Head},
      booktitle = {IEEE-RAS International Conference on Humanoid Robots},
      year      = {2013},
      pages     = {203--210},
      address   = {Atlanta, USA},
      pdf       = {http://xavirema.eu/wp-content/papercite-data/pdf/Cech-Humanoids-2013.pdf}
    }
  6. I. Gebru, X. Alameda-Pineda, R. Horaud, and F. Forbes, “Audio-Visual Speaker Localization via Weighted Clustering,” in IEEE Workshop on Machine Learning for Signal Processing, Reims, France, 2014, pp. 1-6. [ bib pdf ]
    @inproceedings{Gebru-MLSP-2014,
      author    = {Gebru, Israel-Dejene and Alameda-Pineda, Xavier and Horaud, Radu and Forbes, Florence},
      title     = {Audio-Visual Speaker Localization via Weighted Clustering},
      booktitle = {IEEE Workshop on Machine Learning for Signal Processing},
      year      = {2014},
      pages     = {1--6},
      address   = {Reims, France},
      pdf       = {http://xavirema.eu/wp-content/papercite-data/pdf/Gebru-MLSP-2014.pdf}
    }

