T07 Prediction of ASR quality by analysis of the sound environment
Automatic Speech Recognition (ASR) systems convert the speech contained in an audio signal into text. However, the quality of the resulting text depends heavily on many factors, such as the recording context, the speakers and the topic. In our work, we seek to predict the word error rate (WER) of automatic transcriptions before the speech is converted into text by an ASR system.
Among the various factors that degrade the performance of ASR systems and can be analysed a priori, we studied the sound environment in order to make this prediction before decoding. Our approach consists of two steps. The first step extracts signal parameters that are sufficiently correlated with the WER. The second step learns the relationship between these parameters and the WER through supervised regression with a multilayer perceptron.
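As an illustration of the second step, here is a minimal sketch of supervised WER regression with a multilayer perceptron. The feature matrix, network size and synthetic data are placeholders for illustration only; they are not the parameters or corpus used in this work.

```python
# Minimal sketch: regress WER on previously extracted signal parameters.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

# X: one row per recording, columns = signal parameters (noise statistics,
# partial-tracking features, reverberation parameter, ...); y: measured WER.
# Placeholder synthetic data stands in for a real labelled corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = np.clip(20 + 8 * X[:, 0] + rng.normal(scale=3, size=500), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)
print("Mean absolute WER prediction error:",
      mean_absolute_error(y_test, model.predict(X_test)))
```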
To analyse speech in noise, we distinguished three types of signal perturbation due to the sound environment: ambient noise (additive, stationary noise), signal superposition (with speech and music) and reverberation. For ambient noise, we designed a new parameter extraction method that computes statistics on the noise and the speech after separating them with a binary mask. For signal superposition, we designed new parameters that exploit the tracking of partials in the spectrogram. For reverberation, we designed a new parameter, named Excitation Behaviour, which exploits the residual of linear prediction. The efficiency of our parameters was compared with state-of-the-art methods: we obtain a better prediction of the WER, i.e. of the quality of the automatic transcription, before decoding.
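As a rough illustration of the reverberation analysis, the sketch below computes frame-wise linear-prediction residuals and summarises them with their kurtosis. The LPC order, frame length and kurtosis statistic are assumptions chosen for illustration; this is not the actual definition of the Excitation Behaviour parameter.

```python
# Illustrative sketch: statistics on the linear-prediction residual,
# the signal on which a reverberation parameter can be built.
import numpy as np
import librosa
from scipy.signal import lfilter
from scipy.stats import kurtosis

def lp_residual_stat(path, order=16, frame_len=2048, hop=512):
    y, sr = librosa.load(path, sr=16000)
    stats = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * np.hanning(frame_len)
        if np.max(np.abs(frame)) < 1e-6:      # skip near-silent frames
            continue
        a = librosa.lpc(frame, order=order)   # prediction-error filter coefficients
        residual = lfilter(a, [1.0], frame)   # inverse-filter to get the residual
        stats.append(kurtosis(residual))      # peakedness of the excitation
    return float(np.mean(stats))

# Intuition (assumed here, not the thesis's exact measure): clean speech yields
# an impulsive, high-kurtosis residual, whereas reverberation smears the
# excitation over time and lowers this statistic.
```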
The work presented is then tested in an industrial context, on a corpus of real data from the company Authôt. Our a priori prediction achieves excellent results: an average error of 5.26, excluding the case of regional accents (which are not currently handled by our method). Our method is reliable enough to inform users as early as possible about the quality of the automatic transcription of their speech recordings. These results can also be used to estimate the time needed to correct automatic transcriptions.