P39 Investigations on the optimal estimation of speech envelopes in speech enhancement
The speech envelope plays an important role in speech intelligibility. Model-based speech envelope enhancement approaches, when applied to the noisy input, require estimates of both the speech and the noise envelopes. This results in a computationally complex search, since noise envelopes can take a wide variety of shapes. Speech envelopes, on the other hand, exhibit much smaller variability, a consequence of the physiology of speech production. Recent research has attempted to reduce the envelope search space by exploiting this characteristic, typically in a two-stage manner. In the first stage, a preliminary noise suppression is applied, yielding a rough estimate of the underlying speech. In the second stage, instead of estimating the clean envelopes from the noisy spectra, a coarse estimate of the clean speech envelope is recovered from this preliminary denoised signal. Based on this coarse estimate, the search for the true underlying envelope can then be cast either as a regression problem, which predicts the speech envelope directly, or as a classification problem, which selects the best candidate from a set of templates. We present our recent investigations into the achievable benefits of envelope reconstruction, using oracle methods for both classes of approaches. These define the upper bound on achievable performance. Further, we benchmark two practical, real-time systems that use the classification approach against the oracle classification. The first system is based on statistical modelling of the features using the well-known Gaussian mixture models, with temporal context described by hidden Markov models (GMM-HMM). The second is a DNN-based system with roughly the same number of parameters as the GMM-HMM.
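To make the classification view concrete, the following is a minimal sketch of template selection from a coarse envelope estimate. It is not the system described in this abstract: the Euclidean distance, the toy codebook values, and the names `nearest_template`, `codebook` and `coarse` are all assumptions made for illustration.

```python
import numpy as np


def nearest_template(coarse_env, codebook):
    """Pick the codebook row closest to the coarse envelope estimate.

    Assumption: envelopes are fixed-length feature vectors and closeness
    is measured by squared Euclidean distance (hypothetical choice).
    """
    dists = np.sum((codebook - coarse_env) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]


# Hypothetical toy codebook of three envelope templates (rows).
codebook = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0],
    [2.0, -1.0, 0.5],
])

# Hypothetical coarse envelope features from a pre-denoised frame.
coarse = np.array([0.9, 0.6, 0.1])

idx, selected_env = nearest_template(coarse, codebook)
```

A regression approach would instead map `coarse` directly to an envelope vector without restricting the result to the rows of a codebook.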
We also study two different feature sets for representing the envelope, in both the oracle tests and the practical systems: LPC-based features (which implicitly assume an auto-regressive model for speech) and cepstral coefficients. Oracle results demonstrate that direct prediction of envelopes outperforms the classification strategy. Envelope representation using cepstral coefficients appears to be the most robust across a wide range of noise conditions, especially at low SNRs. Results from the practical systems confirm this advantage. The results also demonstrate that deep-learning-based systems match the oracle performance of the classification approach. In future investigations, we would like to explore how envelope enhancement can further benefit from a regression model.
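The two envelope representations compared above can be illustrated with a short sketch. It computes a log-spectral envelope of one frame in two textbook ways: from LPC coefficients (autocorrelation method) and from a low-order liftered real cepstrum. The frame length, model orders, FFT size and the synthetic test signal are assumptions chosen for illustration, not the settings used in this work.

```python
import numpy as np


def lpc_envelope(frame, order=12, n_fft=512):
    """Log-magnitude envelope from an all-pole (auto-regressive) model."""
    x = frame * np.hanning(len(frame))
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Toeplitz normal equations R a = r[1:], lightly regularised.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * r[0] * np.eye(order)
    a = np.linalg.solve(R, r[1:order + 1])
    # Prediction error energy gives the model gain.
    gain = np.sqrt(max(r[0] - a @ r[1:order + 1], 1e-12))
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)
    return a, np.log(gain / np.abs(A))


def cepstral_envelope(frame, n_ceps=20, n_fft=512):
    """Log-magnitude envelope from the first n_ceps cepstral coefficients."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    log_mag = np.log(np.abs(spec) + 1e-10)
    ceps = np.fft.irfft(log_mag, n_fft)  # real cepstrum
    # Keep low quefrencies and their mirrored counterparts.
    lifter = np.zeros(n_fft)
    lifter[:n_ceps] = 1.0
    lifter[n_fft - n_ceps + 1:] = 1.0
    smooth_log = np.fft.rfft(ceps * lifter).real
    return ceps[:n_ceps], smooth_log


# Hypothetical test frame: a stable second-order AR process driven by noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(400)
frame = np.zeros(400)
for n in range(400):
    frame[n] = noise[n]
    if n >= 1:
        frame[n] += 1.3 * frame[n - 1]
    if n >= 2:
        frame[n] -= 0.6 * frame[n - 2]

lpc_coeffs, env_lpc = lpc_envelope(frame)
ceps_coeffs, env_ceps = cepstral_envelope(frame)
```

Both functions return a compact feature vector (`lpc_coeffs` or `ceps_coeffs`) together with the smooth log-spectral envelope it implies, so either vector could serve as the target of a regression model or as an entry in a template codebook.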