P57 Assessing the generalization gap of a deep neural network-based binaural speech enhancement system in noisy and reverberant conditions
Noisy and reverberant speech signals are influenced by many factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the room acoustics, the signal-to-noise ratio (SNR) and the positions of the different sources in the acoustic scene. This large variability of acoustic conditions poses a major challenge for deep neural network (DNN)-based speech enhancement systems, since any mismatch between training and testing conditions can substantially reduce their performance. Moreover, the generalization capability of DNN-based systems is typically assessed by testing the system with an arbitrarily chosen speech, noise or binaural room impulse response (BRIR) database that was not seen during training. This is problematic, as the difficulty of the speech enhancement task can vary substantially across databases, which strongly influences the results and complicates comparisons across studies. The present study systematically investigates the influence of six acoustic scene dimensions on the generalization capability of a binaural DNN-based speech enhancement system, namely the target speaker, the noise type, the room, the SNR, the target direction and the mixture level. We propose a new measure of generalization, referred to as the generalization gap. The generalization gap is expressed as a percentage and is defined as the performance distance to a reference model trained on each test condition. To reduce the influence of the test condition on the generalization assessment, the generalization gap is measured using a cross-validation framework over multiple speech, noise and BRIR databases. We find that while a speech mismatch between training and testing affects generalization the most (generalization gap of 49% in terms of the mean squared error (MSE)), other dimensions such as the noise type (36%) and the room (30%) can also induce a substantial generalization gap.
The SNR, direction and level dimensions can potentially induce significant generalization gaps, but these can be substantially reduced by training on diverse datasets that present a wide range of SNRs, directions and levels. The generalization gap can be measured for any learning-based system and facilitates a comparison across studies.
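
As an illustration of how such a measure could be computed, the following minimal Python sketch treats the generalization gap as the relative increase in error over a reference model trained on the test condition, averaged over cross-validation folds. The abstract does not give the exact formula, so the normalization by the reference error, the function names and the MSE values below are all assumptions made for illustration and are not taken from the study.

```python
def generalization_gap(model_error, reference_error):
    """Relative error increase (in %) over a reference model.

    Assumption: the "performance distance" is normalized by the error of the
    reference model, which was trained on the test condition itself.
    Lower error is better (e.g. MSE); for a metric where higher is better,
    the sign of the difference would need to be flipped.
    """
    if reference_error <= 0:
        raise ValueError("reference_error must be positive")
    return 100.0 * (model_error - reference_error) / reference_error


def cross_validated_gap(model_errors, reference_errors):
    """Average the per-fold gaps, one fold per held-out database."""
    if len(model_errors) != len(reference_errors):
        raise ValueError("both lists need one entry per fold")
    if not model_errors:
        raise ValueError("at least one fold is required")
    gaps = [
        generalization_gap(m, r)
        for m, r in zip(model_errors, reference_errors)
    ]
    return sum(gaps) / len(gaps)


# Hypothetical per-fold MSE values, not results from the study.
mismatched_mse = [0.030, 0.024, 0.045]  # model trained on other databases
reference_mse = [0.020, 0.016, 0.030]   # model trained on the test database

gap = cross_validated_gap(mismatched_mse, reference_mse)
print(f"generalization gap: {gap:.1f}%")
```

Under this assumed definition, a gap of 0% would mean the mismatched model performs as well as the reference model, and averaging over folds keeps a single easy or difficult test database from dominating the figure.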