Audio-visual speaker detection and localization

Joint work with a lot of people

Keywords: EM algorithm, Model selection, Weighted-data, Audio-visual fusion.

Natural human–robot interaction (HRI) in complex and unpredictable environments is an important goal with many potential applications. While vision-based HRI has been thoroughly investigated, robot hearing and audio-based HRI are emerging research topics in robotics. In typical real-world scenarios, humans are at some distance from the robot, and hence the sensory (microphone) data are strongly impaired by background noise, reverberation and competing auditory sources. In this context, the detection and localization of speakers play a key role, enabling several tasks such as improving the signal-to-noise ratio for speech recognition, speaker recognition, speaker tracking, etc. In this series of works, we address the problem of detecting and localizing people that are both seen and heard. We introduce a hybrid deterministic/probabilistic model. The deterministic component allows us to map 3D visual data onto a 1D auditory space. The probabilistic component enables the visual features to guide the grouping of the auditory features in order to form audiovisual (AV) objects. The proposed model and the associated algorithms are implemented in real time (17 FPS) using a stereoscopic camera pair and two microphones embedded in the head of the humanoid robot NAO. We perform experiments with (i) synthetic data, (ii) publicly available data gathered with an audiovisual robotic head, and (iii) data acquired using the NAO robot. The results validate the approach and are an encouragement to investigate how vision and hearing could be further combined for robust HRI. The main associated publications are [?]. Other associated publications are [?].
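The deterministic 3D-to-1D mapping can be illustrated as follows: a 3D point (e.g., a face localized by the stereoscopic camera pair) maps to the expected time difference of arrival (TDOA) between the two microphones. This is only a minimal sketch; the microphone geometry and function names below are illustrative assumptions, not the exact model used in the publications.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 degrees Celsius

def expected_tdoa(p, mic_left, mic_right, c=SPEED_OF_SOUND):
    """Map a 3D point p (e.g. a face localized by stereo vision) to the
    expected time difference of arrival between two microphones.
    With this convention, tau < 0 means the source is closer to the
    left microphone (sound arrives there first)."""
    p = np.asarray(p, dtype=float)
    return (np.linalg.norm(p - mic_left) - np.linalg.norm(p - mic_right)) / c

# Illustrative geometry (assumption): two microphones 20 cm apart on the head
mic_l = np.array([-0.10, 0.0, 0.0])
mic_r = np.array([+0.10, 0.0, 0.0])

# A face to the left of the robot yields a negative TDOA
tau = expected_tdoa([-1.0, 0.0, 2.0], mic_l, mic_r)
```

Since this mapping is many-to-one (a whole surface of 3D points shares the same TDOA), it projects the rich visual space onto the single dimension that two microphones can observe, which is what makes the visual features useful for grouping the auditory ones.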


This video illustrates the basic methodology developed in [?], which associates visual and auditory events using a mixture model. In practice, the method is able to dynamically detect the number of events and to localize them.

This video shows the audiovisual speaker localization principle based on 3D face localization using a stereoscopic camera pair and the time difference of arrival (TDOA) between two microphones [?]. Left: the scene viewed by the robot; the black circle indicates the detected speaking face. Right: a top view of the scene, where the circles correspond to the head positions. Both the video and the audio track were recorded with the robot's cameras and microphones.
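The TDOA between the two microphone signals can be estimated, for instance, with the generalized cross-correlation with phase transform (GCC-PHAT). The sketch below is a standard textbook estimator, not necessarily the one used in the cited work; the function name and the synthetic signals are illustrative.

```python
import numpy as np

def gcc_phat_tdoa(sig_l, sig_r, fs):
    """Estimate the TDOA between two microphone signals using
    GCC-PHAT: cross-power spectrum whitened by its magnitude,
    peak picked in the time domain. Returns seconds; tau < 0
    means the right channel lags the left one."""
    n = len(sig_l) + len(sig_r)
    L = np.fft.rfft(sig_l, n=n)
    R = np.fft.rfft(sig_r, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic check: the right channel lags the left by 5 samples,
# as if the source were closer to the left microphone
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(2048)
left = s
right = np.concatenate((np.zeros(5), s[:-5]))
tau = gcc_phat_tdoa(left, right, fs)
```

The PHAT weighting discards magnitude information and keeps only phase, which is what makes this estimator comparatively robust to the reverberation mentioned above.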

This video shows an active and interactive behavior of the robot based on face detection and recognition, sound localization, and word recognition [?]. The robot selects one active speaker and synthesizes an appropriate behavior.

These videos show the speaker detection methodology with individual per-sample weights incorporated into the GMM probabilistic model; see [?].
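The idea of per-sample weights in a GMM can be sketched as follows, here in a simplified form where each observation's weight scales its contribution to the EM sufficient statistics. This is only an illustration under that assumption: in the cited work the weights enter the observation model itself, and all names below are hypothetical.

```python
import numpy as np

def weighted_gmm_em(x, w, k=2, n_iter=100):
    """Fit a 1D Gaussian mixture by EM, with an individual weight w[n]
    per sample scaling its contribution to the M-step statistics.
    Simplified sketch of weighted-data clustering, not the exact
    algorithm of the cited publication."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread-out init
    var = np.full(k, np.var(x))
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities, computed in the log domain
        d = x[:, None] - mu[None, :]
        log_p = -0.5 * (d**2 / var + np.log(2 * np.pi * var)) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the weights w[n] scale each sample's responsibility
        wr = w[:, None] * r
        nk = wr.sum(axis=0)
        mu = (wr * x[:, None]).sum(axis=0) / nk
        var = (wr * (x[:, None] - mu)**2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return pi, mu, var

# Two well-separated "speakers" along a 1D auditory axis, equal weights
rng = np.random.default_rng(0)
x = np.concatenate((rng.normal(0.0, 0.5, 300), rng.normal(5.0, 0.5, 300)))
w = np.ones_like(x)
pi, mu, var = weighted_gmm_em(x, w)
```

Lowering the weight of unreliable observations (e.g., auditory features recorded during reverberant or noisy frames) reduces their pull on the component means, which is the intuition behind weighting the data rather than treating all samples equally.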

