13th Speech in Noise Workshop, 20-21 January 2022, Virtual Conference

P57 Assessing the generalization gap of a deep neural network-based binaural speech enhancement system in noisy and reverberant conditions

Philippe Gonzalez, Tommy Sonne Alstrøm, Tobias May
Technical University of Denmark

(a) Presenting

Noisy and reverberant speech signals are influenced by many factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the room acoustics, the signal-to-noise ratio (SNR) and the position of the different sources in the acoustic scene. This large variability of acoustic conditions poses a major challenge for deep neural network (DNN)-based speech enhancement systems, since any mismatch between training and testing can substantially reduce their performance. In addition, the generalization capability of DNN-based systems is typically assessed by testing the system with an arbitrarily chosen speech, noise or binaural room impulse response (BRIR) database that was not seen during training. This is problematic because the difficulty of the speech enhancement task can vary substantially across databases, which strongly influences the results and complicates comparison across studies. The present study systematically investigates the influence of six acoustic scene dimensions on the generalization capability of a binaural DNN-based speech enhancement system, namely the target speaker, the noise type, the room, the SNR, the target direction and the mixture level. We propose a new measure of generalization, referred to as the generalization gap. The generalization gap is expressed as a percentage and is defined as the performance distance to a reference model trained on the test condition itself. To reduce the influence of the test condition on the generalization assessment, the generalization gap is measured using a cross-validation framework over multiple speech, noise and BRIR databases. We find that while a speech mismatch between training and testing affects generalization the most (generalization gap of 49% in terms of the mean squared error (MSE)), other dimensions such as the noise type (36%) and the room (30%) can also induce a substantial generalization gap.
The SNR, direction and level dimensions can also induce significant generalization gaps, but these can be substantially reduced by training on diverse datasets that cover a wide range of SNRs, directions and levels. The generalization gap can be measured for any learning-based system and facilitates comparison across studies.
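
The abstract does not give a formula for the generalization gap, so the following Python sketch shows only one plausible reading: the relative excess error, in percent, of the evaluated model over a reference model trained on the test condition, averaged over cross-validation folds. The function names, the relative-error normalization and the MSE values are all illustrative assumptions rather than the authors' definition or data.

```python
from statistics import mean


def generalization_gap(test_error, reference_error):
    """Relative excess error in percent (assumed form, not from the abstract).

    test_error: error (e.g. MSE) of the evaluated model on a test condition.
    reference_error: error of a reference model trained on that same condition.
    """
    if reference_error <= 0:
        raise ValueError("reference_error must be positive")
    return 100.0 * (test_error - reference_error) / reference_error


def cross_validated_gap(fold_errors):
    """Average the gap over folds given as (test_error, reference_error) pairs."""
    return mean(generalization_gap(t, r) for t, r in fold_errors)


# Made-up MSE values for three hypothetical cross-validation folds.
folds = [(0.15, 0.10), (0.12, 0.08), (0.09, 0.06)]
gap = cross_validated_gap(folds)
print(f"Generalization gap: {gap:.1f}%")
```

Under this reading, a gap of 0% would mean the evaluated model matches the reference model on the unseen condition, and larger values indicate a larger loss from the training-testing mismatch.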

Last modified 2022-01-24 16:11:02