Parameter optimization of late reverberation suppression algorithm

Boundary values between early reflections and late reverberation that are optimal with respect to speech recognition accuracy and speech quality are found. When the optimal boundary value is chosen, using the logMMSE method for late reverberation suppression increases recognition accuracy from 22...30% to 56...74% and the PESQ speech quality index from 2.281 to 2.33. References: 6, figures: 4.


Introduction
The problem of speech dereverberation in communication and automatic speech recognition (ASR) systems has been actively investigated over the last decade owing to the rapid development of mobile communications [1][2]. Late reverberation was found to be the main detrimental factor, and it acts as a kind of additive noise. The formula for estimating the late reverberation power spectrum contains the parameter $T_l$, the time boundary between early reflections and late reverberation. This boundary is blurred: values of $T_l \approx 30 \ldots 100$ ms are reported in [1][2].
Moreover, these values were obtained experimentally in studies of speech intelligibility and musical clarity, and it is not evident that the same values suit speech recognition and communication systems. The objective of this paper is to find values of $T_l$ that are optimal with respect to speech recognition accuracy and speech quality.

Target setting
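The reverberant signal $y(t)$ results from the convolution of the anechoic speech signal $x(t)$ and the causal time-invariant acoustic impulse response (AIR) $h(t)$:

$$y(t) = \int_0^{\infty} h(\tau)\, x(t-\tau)\, d\tau.$$

When the regions of the AIR $h(t)$ corresponding to early reflections and late reverberation are separated (Fig. 1), the reverberant signal splits into two terms:

$$y(t) = \int_0^{T_l} h(\tau)\, x(t-\tau)\, d\tau + \int_{T_l}^{\infty} h(\tau)\, x(t-\tau)\, d\tau, \qquad (1)$$

where the first term is the component due to early reflections, the second term is the component due to late reverberation, and $T_l$ is the time corresponding to the boundary between early reflections and late reverberation (see Fig. 1).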
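The split of the AIR at $T_l$ described above can be illustrated numerically. The following is a minimal sketch, assuming a discrete-time AIR given as a plain list of samples; the function names `convolve` and `split_air` and the toy signal values are hypothetical and not taken from the paper.

```python
def convolve(x, h):
    """Direct linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for m, hm in enumerate(h):
            y[n + m] += xn * hm
    return y


def split_air(h, fs, t_l):
    """Split an AIR into early and late parts at the boundary t_l (seconds).

    Both parts keep the length of h; samples outside a part are zeroed.
    """
    n_l = min(int(round(t_l * fs)), len(h))  # boundary index, clamped to len(h)
    early = h[:n_l] + [0.0] * (len(h) - n_l)
    late = [0.0] * n_l + h[n_l:]
    return early, late


# Toy data (hypothetical): a short "speech" signal and a 6-tap AIR at 1 kHz.
x = [1.0, 0.5, -0.25]
h = [1.0, 0.0, 0.6, 0.0, 0.3, 0.1]
h_early, h_late = split_air(h, fs=1000, t_l=0.003)

y = convolve(x, h)
y_early = convolve(x, h_early)  # component due to early reflections
y_late = convolve(x, h_late)    # component due to late reverberation
```

Because convolution is linear, `y_early` and `y_late` sum sample by sample to `y`, which is why the late part can be treated as an additive disturbance.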

Fig. 1. Room AIR structure
It is clear from (1) that late reverberation may be interpreted as a kind of noise. Unfortunately, the strong non-stationarity of late reverberation makes traditional techniques for suppressing stationary or slowly varying noise ineffective [1].
It can be assumed that late reverberation can be suppressed by almost the same means that are usually used for noise suppression, with an estimate of the late reverberation spectrum taking the place of the noise spectrum.
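The paper's own estimator of the late reverberation spectrum is not reproduced in this text. As an illustration only, the sketch below assumes the exponential-decay model commonly used in the dereverberation literature, in which the late reverberation power in a frame is a delayed and attenuated copy of the reverberant power, with decay rate $\delta = 3\ln 10 / T_{60}$. The function name, argument layout and numeric values are hypothetical.

```python
import math


def late_reverb_psd(frame_psd, t60, t_l, frame_shift):
    """Estimate late reverberation power per frame and frequency bin.

    Assumed model (not confirmed by the paper):
        late[l][k] = exp(-2 * delta * t_l) * frame_psd[l - n_l][k],
    with delta = 3 * ln(10) / t60 and n_l = round(t_l / frame_shift).

    frame_psd   -- list of frames, each a list of power values per bin
    t60         -- reverberation time in seconds
    t_l         -- boundary between early and late parts in seconds
    frame_shift -- hop between successive frames in seconds
    """
    delta = 3.0 * math.log(10.0) / t60
    n_l = max(1, int(round(t_l / frame_shift)))  # delay in frames
    scale = math.exp(-2.0 * delta * t_l)         # power decay over t_l
    n_bins = len(frame_psd[0]) if frame_psd else 0
    late = []
    for l in range(len(frame_psd)):
        if l < n_l:
            late.append([0.0] * n_bins)  # no history yet: assume no late part
        else:
            late.append([scale * p for p in frame_psd[l - n_l]])
    return late
```

Under this assumed model a larger $T_l$ lowers the estimated late reverberation power, so less is suppressed; this is one way to see why the choice of $T_l$ affects both distortion and residual reverberation.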
Correction in the frequency domain is a popular noise suppression method [3]: for the $l$-th frame of the signal $y(t)$, the spectrum at the $k$-th frequency sample is multiplied by an enhancement filter gain.
This paper considers the logMMSE method [3] and its enhancement filter gain.
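For orientation, the logMMSE (log-spectral amplitude) gain is usually written in the literature as $G = \frac{\xi}{1+\xi}\exp\left(\frac{1}{2}\int_v^{\infty}\frac{e^{-t}}{t}\,dt\right)$ with $v = \frac{\xi\gamma}{1+\xi}$, where $\xi$ and $\gamma$ are the a priori and a posteriori signal-to-noise ratios. The sketch below evaluates this standard form using only the Python standard library; the symbol and function names are assumptions and may differ from the notation of the paper. For dereverberation, $\xi$ and $\gamma$ would be computed against the late reverberation spectrum instead of the noise spectrum.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x, eps=1e-12, max_iter=200):
    """E1(x) = integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, max_iter + 1):
            term *= -x / k          # term = (-x)^k / k!
            delta = -term / k
            total += delta
            if abs(delta) < eps * abs(total):
                break
        return total
    # Continued fraction evaluated with the modified Lentz method (x > 1).
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)


def logmmse_gain(xi, gamma):
    """Standard logMMSE gain for a priori SNR xi and a posteriori SNR gamma."""
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))
```

The gain is applied per frame and per frequency sample to the magnitude spectrum, while the phase of the reverberant signal is normally kept unchanged.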

Experimental organization
There were two groups of experiments: qualitative and quantitative. For the qualitative evaluation of dereverberation performance, a real speech signal was recorded in a room with a volume of 80 m³ and a reverberation time of 1.1 s (sampling frequency 22050 Hz, 16-bit linear quantization). The distance between the speaker and the microphone was much greater than the critical distance [1][2].
For the quantitative evaluation of dereverberation performance, clean speech signals were convolved with the AIRs of three rooms with reverberation times of 0.74 s, 0.89 s and 1.1 s to simulate the effect of reverberation. The sound of a bursting rubber ball was used as the AIR for each of these rooms. Dereverberation performance was estimated by means of ASR accuracy:

$$\mathrm{Acc} = \frac{N - D - S - I}{N} \cdot 100\%,$$

where $N$ is the total number of labels in the reference transcriptions, $D$ is the number of deletion errors, $S$ is the number of substitution errors, and $I$ is the number of insertion errors. The PESQ indicator was used for speech quality assessment [4]. The HTK toolkit [5] was used to simulate the ASR system. The ASR system was trained on 269 samples of 27 words recorded by two female speakers. The test signal was a sound file of discrete speech (with pauses of 0.2…0.5 s) that contained all 27 words used in training. The phoneme vocabulary contained 27 phonemes of the Ukrainian language, and 39 MFCC_0_D_A coefficients were used in the ASR simulation.
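The accuracy measure built from $N$, $D$, $S$ and $I$ follows the usual HTK definition, Acc = (N − D − S − I) / N · 100%. A minimal sketch of the computation is given below; the function name and the error counts in the example are hypothetical and are not results from the paper.

```python
def recognition_accuracy(n_labels, deletions, substitutions, insertions):
    """Return word/label accuracy in percent: (N - D - S - I) / N * 100."""
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    return 100.0 * (n_labels - deletions - substitutions - insertions) / n_labels


# Hypothetical counts for a 27-word test file.
acc = recognition_accuracy(n_labels=27, deletions=3, substitutions=4, insertions=1)
```

Because insertion errors are subtracted as well, this measure can become negative when the recognizer outputs many spurious labels.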
The VoiceBox [6] routine "ssubmmse.m", which is designed for noise reduction, was modified in accordance with the proposals of the previous section. Moreover, it was taken ( ) 0,5 ( )

Experimental results
Spectrograms of the reverberant and enhanced signals for the qualitative experiments are shown in Fig. 2. A slight distortion introduced by the dereverberation procedure is audible ($T_l = 48$ ms was used in the procedure). Increasing $T_l$ to 100 ms led to some improvement in sound quality. This demonstrates that choosing the right value of $T_l$ is a real problem.
Ladoshko O., Prodeus A., 2014
Results of Acc% and PESQ estimation for the enhanced speech signals are shown in Fig. 3. As can be seen, enhancement by method 1 (the "classic" logMMSE method) did not lead to positive results. Meanwhile, enhancement by method 2 (the modified logMMSE method) significantly increased the Acc% value (from 22...30% to 56…74%). Interestingly, the PESQ value did not rise as much (it increased from 2.