P35 Foreground speech robustness for audio scene classification
Identification of the background acoustic scene can significantly benefit speech enhancement in hearables. Existing acoustic scene classification (ASC) techniques show competitive performance when the characterising sound, namely the background signal, is dominant. However, the classification accuracy degrades significantly when foreground speech is present. We present investigations of two classes of techniques to alleviate this degradation. The investigations were carried out on a classical iVector-based system for two reasons: (a) this baseline shows competitive performance, and (b) because its operation is explainable, the benefits of the techniques can be better understood. First, a noise-floor-based iVector system was proposed, in which the Mel-frequency cepstral coefficient (MFCC) features are derived from an estimate of the power spectral density of the background signal. In combination with Multi-Condition Training (MCT), this system improved the ASC performance when the foreground speech was predominant, but at the cost of poorer performance in the absence of foreground speech and in low speech-to-background ratio (SBR) conditions. To improve this trade-off, we consider the integration of a soft Voice Activity Detector (softVAD) into the classical iVector system. From MFCC features extracted from the microphone signals, a frame-level speech-absence probability is calculated using separate universal background models (UBMs) for speech and for background. Based on this probability, weighted Baum-Welch statistics are computed and used in the iVector extraction stage. As a result, the background-dominant frames are emphasised while the speech-dominant frames are disregarded. Experiments show that this system outperforms the noise-floor-based system over a wide range of SBRs. We believe, however, that the information in the speech-dominant frames can still be exploited by using the noise-floor-based features in these conditions, which could further improve performance.
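The softVAD weighting described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the models are taken to be diagonal-covariance GMMs stored as lists of (weight, mean, variance) tuples, the speech-absence probability is assumed to be the posterior of the background model under an equal prior, and all function names and toy values are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def component_log_joint(x, gmm):
    """log(weight_c * N(x | mean_c, var_c)) for every component c."""
    return [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def speech_absence_prob(x, speech_gmm, background_gmm, prior_background=0.5):
    """Assumed form: posterior that frame x is background-dominant."""
    log_bg = math.log(prior_background) + logsumexp(
        component_log_joint(x, background_gmm))
    log_sp = math.log(1.0 - prior_background) + logsumexp(
        component_log_joint(x, speech_gmm))
    return math.exp(log_bg - logsumexp([log_bg, log_sp]))

def weighted_baum_welch(frames, ubm, frame_weights):
    """Zeroth- and first-order statistics with per-frame soft weights."""
    dim = len(frames[0])
    zeroth = [0.0] * len(ubm)
    first = [[0.0] * dim for _ in ubm]
    for x, w in zip(frames, frame_weights):
        log_joint = component_log_joint(x, ubm)
        norm = logsumexp(log_joint)
        for c, lj in enumerate(log_joint):
            gamma = w * math.exp(lj - norm)  # weighted component posterior
            zeroth[c] += gamma
            for d in range(dim):
                first[c][d] += gamma * x[d]
    return zeroth, first

# Toy 2-D "MFCC" frames and hypothetical models (illustrative values only).
speech_gmm = [(1.0, [4.0, 4.0], [1.0, 1.0])]
background_gmm = [(1.0, [0.0, 0.0], [1.0, 1.0])]
scene_ubm = [(0.5, [-1.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 0.0], [1.0, 1.0])]
frames = [[0.1, -0.2], [3.9, 4.1], [-0.3, 0.4]]

weights = [speech_absence_prob(x, speech_gmm, background_gmm) for x in frames]
zeroth, first = weighted_baum_welch(frames, scene_ubm, weights)
```

In this toy setup the second frame lies close to the speech model, so its weight is near zero and it contributes almost nothing to the statistics that would feed the iVector extractor. Using `1 - w` as the frame weight instead would emphasise the speech-dominant frames.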
Therefore, we present a third system in which the score of the noise-floor-based ASC system is combined with the score of the softVAD-based system in a weighted manner. To allow the noise-floor-based ASC system to focus on the information in the speech-dominant frames, it is modified to incorporate the speech presence probability when computing the Baum-Welch statistics. The weights for the score fusion are obtained from the average background probability of each segment. This weighted score fusion system achieves the best overall accuracy across the tested SBRs. These findings indicate that temporal frame attention is important for robust ASC and that it is applicable to DNN-based frameworks as well. Extending the approach to DNN-based systems will be the focus of future work.
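As a rough illustration of the weighted score fusion, the sketch below combines per-scene scores from the two systems using the segment-average background probability as the fusion weight. The convex-combination form, the function name, the scene labels and the score values are all assumptions made for this example. The abstract does not specify the exact fusion rule.

```python
def fuse_scores(softvad_scores, noise_floor_scores, background_probs):
    """Fuse per-scene scores of the two systems for one segment.

    alpha is the average frame-level background (speech-absence)
    probability of the segment. A high alpha favours the softVAD system,
    a low alpha favours the noise-floor system (assumed convex combination).
    """
    alpha = sum(background_probs) / len(background_probs)
    fused = {
        scene: alpha * softvad_scores[scene]
        + (1.0 - alpha) * noise_floor_scores[scene]
        for scene in softvad_scores
    }
    return fused, alpha

# Hypothetical segment dominated by foreground speech.
softvad_scores = {"cafe": 0.2, "street": 0.8}
noise_floor_scores = {"cafe": 0.9, "street": 0.1}
background_probs = [0.1, 0.2, 0.0, 0.1]

fused, alpha = fuse_scores(softvad_scores, noise_floor_scores, background_probs)
predicted_scene = max(fused, key=fused.get)
```

Because the segment is speech-dominated, the fusion weight is small and the decision follows the noise-floor system. For a segment with little foreground speech the weight would approach one and the softVAD system would dominate.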