Presented at the 1999 IEEE Workshop on Nonlinear Signal and Image Processing

Speaker Localization for Far-field and Near-field Wideband Sources Using Neural Networks

Guner Arslan (1), F. Ayhan Sakarya (2)(3), and Brian L. Evans (1)

(1) Department of Electrical and Computer Engineering, Engineering Science Building, The University of Texas at Austin, Austin, TX 78712-1084 USA
arslan@ece.utexas.edu - bevans@ece.utexas.edu

(2) Department of Electronics and Telecommunication Engineering, Yildiz Technical University, 80750 Istanbul, Turkey
sakarya@ana.cc.yildiz.edu.tr

(3) Wireless Technology Laboratory, Lucent Technologies, Holmdel, NJ 07733-3030 USA
sakarya@lucent.com

Paper - Talk

Abstract

Many applications, such as hands-free videoconferencing, speech processing in large rooms, and acoustic echo cancellation, use microphone arrays to track speaker locations in real time. A speaker is a wideband source that may be in the near field or far field of the array. Current source localization approaches based on neural networks can meet real-time constraints but assume far-field narrowband sources. In this paper, we (1) apply neural networks to determine the direction of arrival for near-field and far-field wideband speaker localization, and (2) compute the instantaneous cross-power spectra between adjacent pairs of sensors to form the feature vector. We optimized the overall speaker localization system off-line to yield an absolute error of less than 6 degrees at an SNR of 10 dB and a sampling rate of 8000 Hz at each sensor. When performing speaker localization in real time, the system would require 1 MFLOP/s.
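As a rough illustration of the feature described in (2), the following Python sketch computes cross-power spectra between adjacent sensors for one frame of samples and stacks them into a feature vector. The function name `cross_power_features`, the choice of FFT bins, and the layout of the vector (real parts followed by imaginary parts) are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def cross_power_features(frame, bins):
    """Sketch of a cross-power-spectrum feature vector (hypothetical helper).

    frame: array of shape (num_sensors, num_samples), one row per microphone.
    bins:  FFT bin indices to keep (assumed to be chosen by the user).
    """
    # One-sided spectrum of each sensor signal for this frame.
    spectra = np.fft.rfft(frame, axis=1)
    # Cross-power spectrum between sensor m and sensor m + 1 at the chosen bins.
    cross = spectra[:-1, bins] * np.conj(spectra[1:, bins])
    # Assumed layout: all real parts first, then all imaginary parts.
    return np.concatenate([cross.real.ravel(), cross.imag.ravel()])
```

For a frame from 4 sensors and a single bin, this returns a vector of length 6 (3 adjacent pairs, real and imaginary parts). The phase of each cross-power term carries the inter-sensor delay at that frequency, which is the information a network would need to estimate direction of arrival.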

Questions

Can the array determine whether a source is in the far field or near field of the array?
We do not explicitly determine whether the source is in the near field or the far field; the neural network takes care of that. We calculate the cross-correlations between sensors. In the far field, the time delay (or, for the cross-correlations, the phase difference) is constant for all sensors. In the near field, it is not. So there is a unique pattern in the cross-correlation coefficients, and the neural network can decide which approximation it should use. In fact, the far-field case is an approximation of the near-field case, so a technique that works in the near field should also work in the far field.
Will the method work on real data if the real data has additional harmonics?
Our speech model is a sum of several sinusoids that models the strong harmonics in the speech signal, so additional harmonics should not be a problem.
Will the method break down when implemented in a room?
One problem with our model is that it does not account for echo. Our model assumes that speech is coming from only one direction. Most applications in which you would use a speaker localization system would have echoes.
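
The far-field versus near-field distinction in the first answer above can be checked numerically. The sketch below is only an illustration under stated assumptions: a uniform linear array along the x-axis, free-space propagation with no echo, and a sound speed of 343 m/s. The function name `adjacent_delays` and all numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np

def adjacent_delays(source_xy, sensor_x, c=343.0):
    """Delay differences (seconds) between adjacent sensors on the x-axis.

    source_xy: (x, y) position of the source in meters.
    sensor_x:  1-D array of sensor x-coordinates in meters (sensors at y = 0).
    c:         assumed speed of sound in m/s.
    """
    sensors = np.stack([sensor_x, np.zeros_like(sensor_x)], axis=1)
    # Exact propagation distance from the source to each sensor.
    dist = np.linalg.norm(sensors - np.asarray(source_xy, dtype=float), axis=1)
    # Difference in arrival time between sensor m + 1 and sensor m.
    return np.diff(dist) / c

# Assumed geometry: 8 sensors spaced 5 cm apart, source at 60 degrees.
sensor_x = np.arange(8) * 0.05
angle = np.deg2rad(60.0)
direction = np.array([np.cos(angle), np.sin(angle)])

far = adjacent_delays(500.0 * direction, sensor_x)   # source 500 m away
near = adjacent_delays(0.5 * direction, sensor_x)    # source 0.5 m away

# Spread of the adjacent-pair delays: small for far field, large for near field.
print("far-field spread (s): ", far.max() - far.min())
print("near-field spread (s):", near.max() - near.min())
```

With the distant source, the adjacent-pair delays are nearly identical across the array; with the close source, they vary noticeably from pair to pair. This variation is the kind of pattern the answer refers to.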

Last Updated 08/07/99.