Objective quality evaluation of speech band-limited signals

The dependence of objective quality estimates on the bandwidth of band-limited speech signals is obtained experimentally. As part of this task, the speech quality indicators under consideration are compared. It is shown that computationally simple indicators, such as segmental SNR (SSNR) and log-spectral distortion (LSD), may not respond adequately to changes in bandwidth. Computationally more complex perceptual indicators, such as bark spectral distortion (BSD) and perceptual evaluation of speech quality (PESQ), behave much more correctly and, ultimately, help clarify the real needs of the human auditory system in speech perception. References: 14; figures: 5.


Introduction
The standard ITU-T Rec. P.863 (POLQA) [9,11] provides for the use of super-wideband (50 Hz to 14 kHz) signals sampled at 48 kHz in modern commercial communication. The speech signal may also be transmitted in a wide band (50 Hz to 7 kHz) or a narrow band (300 Hz to 3.4 kHz) after appropriate band-pass filtering and downsampling to 16 or 8 kHz, respectively. Evidently, the inclusion of the super-wideband (SWB) mode in modern standards of commercial communication stems from a desire to improve communication quality. Indirect evidence of this is given in [9]: the maximum quality of a narrowband speech signal is estimated at 4.5 points on the MOS scale, whereas the maximum super-wideband quality is 4.75 points. Unfortunately, it is difficult to find in the literature information on how estimates of real (i.e. not maximum) speech quality depend on the signal bandwidth [2,3,7,13]. Meanwhile, this issue is, in our opinion, of undoubted theoretical and practical interest, since it is tied to clarifying the real needs of the human auditory system.
The other side of this issue is the choice of the speech quality index. Subjective assessment methods are very resource-intensive, so researchers concentrate on finding objective (instrumental) indicators of speech quality. Today, the best solution would be to use the standard ITU-T P.863 (POLQA), which most fully takes into account the effect of confounding factors and the features of the human auditory system. However, using this standard for research purposes is practically impossible, since the source code of the corresponding software is not publicly accessible. One therefore has to either use the outdated PESQ index [1,10,14] or look for alternative, computationally simpler indicators, accepting a possible loss of effectiveness. Unfortunately, the literature offers no clear evaluation of the potential of objective speech quality measures for solving particular problems.
The aim of this paper is to fill, at least in part, the gaps mentioned above.

Objective quality measures of speech signals
To obtain the dependence of speech quality estimates on the frequency band occupied by the signal, we use a series of low-pass filters instead of exact models of the band-pass filters used in the narrowband (NB), wideband (WB) and SWB modes. As the cut-off frequency of the filter is successively increased, one would expect the quality of the filtered speech signal to grow. The quality indicators used must, at a minimum, adequately reflect this growth; otherwise, they should be deemed ineffective.
Subjective methods of speech quality evaluation, which involve several speakers and several listeners in the experiments, have the undoubted advantage that the real human auditory system is used in the estimation. Their obvious drawback is the high demand on resources.
Objective (instrumental) methods of speech quality estimation are largely free of this shortcoming. With objective methods there are two approaches to estimation and, consequently, two kinds of speech quality indicators [3]: 1) those that use a reference signal (intrusive indicators); 2) those that do not use a reference signal (non-intrusive indicators).
Bogdanova N.V., Prodeus A.M., 2014
Only intrusive indicators, which provide the closest agreement with the results of subjective evaluation, are used in this paper.
From the set of currently known indicators of this kind [2,3,7,8,13], we consider four: segmental signal-to-noise ratio (SSNR), log-spectral distortion (LSD), bark spectral distortion (BSD) and perceptual evaluation of speech quality (PESQ). To justify this choice, we note that the first two indicators, SSNR and LSD, are attractive because they are easy to compute, while the other two, BSD and PESQ, referred to as "perceptual", have the advantage of taking into account, with varying degrees of accuracy, the features of the human auditory system.
The analytical description of the above-mentioned indicators is as follows:
The analytical description of the very cumbersome PESQ algorithm, which incorporates the features of the human auditory system considerably better than BSD does, is presented in [1].
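To illustrate the two computationally simple indicators, the following Python sketch uses their common textbook forms. These forms are an assumption and may differ in detail from the definitions used in this paper: SSNR is taken as the mean of per-frame SNR values in dB, clamped to a customary range of -10 to 35 dB, and LSD as the per-frame RMS difference of the log power spectra in dB, averaged over frames. The function names, frame sizes and the naive DFT are illustrative only.

```python
import cmath
import math


def frames(x, frame_len, hop):
    """Split a sequence into overlapping frames (an incomplete tail is dropped)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def ssnr_db(ref, test, frame_len=256, hop=128, lo=-10.0, hi=35.0, eps=1e-12):
    """Segmental SNR: mean of per-frame SNR in dB, each clamped to [lo, hi]."""
    vals = []
    for r, t in zip(frames(ref, frame_len, hop), frames(test, frame_len, hop)):
        sig = sum(a * a for a in r)                      # reference energy
        err = sum((a - b) ** 2 for a, b in zip(r, t))    # error energy
        snr = 10.0 * math.log10((sig + eps) / (err + eps))
        vals.append(min(max(snr, lo), hi))
    return sum(vals) / len(vals)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (O(N^2), illustration only)."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        s = sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, v in enumerate(frame))
        out.append(abs(s) ** 2)
    return out


def lsd_db(ref, test, frame_len=64, hop=32, eps=1e-12):
    """Log-spectral distortion: per-frame RMS of the dB spectrum difference."""
    vals = []
    for r, t in zip(frames(ref, frame_len, hop), frames(test, frame_len, hop)):
        pr, pt = power_spectrum(r), power_spectrum(t)
        d2 = [(10.0 * math.log10((a + eps) / (b + eps))) ** 2
              for a, b in zip(pr, pt)]
        vals.append(math.sqrt(sum(d2) / len(d2)))
    return sum(vals) / len(vals)
```

Both functions compare the two signals sample by sample, so the reference and test signals are assumed to be time-aligned and of equal length; the delay introduced by a linear-phase FIR filter would have to be compensated first.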

Some features of BSD and PESQ calculation
BSD and PESQ are the most computationally complex of the objective indicators considered in this work. However, this complexity is compensated by the high quality of the estimates: a relatively high Pearson correlation coefficient between the results of objective and subjective evaluation has been achieved (r = 0.85-0.95) [2,3,7,13]. It is therefore of interest how the difficulties of computing the BSD and PESQ indices can be overcome in Matlab.
Comparing the definitions of "bark spectrum" and "PLP spectrum" (perceptual linear predictive spectrum) given in [3,5], it is easy to conclude that these concepts are identical, since in both cases the following computational steps are assumed: …; − the loudness scale is corrected by taking the cube root of the result of the previous step (phons are converted to sones). This identity makes it possible to compute the bark spectrum with ready-made programs from the rastamat library [4]: rastaplp, powspec, audspec, fft2barkmx, hz2bark, bark2hz, postaud, spec2cep, lifter.
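Two of the steps named above, the frequency warping performed by hz2bark and bark2hz and the cube-root loudness correction, can be sketched in Python as follows. The warping formula is the one that rastamat's hz2bark is assumed to implement; it is not confirmed by this paper. The function compress_loudness is a hypothetical stand-in for the compression step and omits any equal-loudness weighting.

```python
import math


def hz2bark(f_hz):
    """Hz -> Bark (formula assumed to match rastamat's hz2bark)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def bark2hz(z_bark):
    """Bark -> Hz, the inverse of hz2bark."""
    return 600.0 * math.sinh(z_bark / 6.0)


def compress_loudness(band_powers, exponent=1.0 / 3.0):
    """Cube-root compression of band powers (intensity-to-loudness step)."""
    return [p ** exponent for p in band_powers]
```

For example, hz2bark(11025.0) gives the width, in Bark, of the band available at the 22050 Hz sampling rate used in the experiments.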
However, these programs had to be modified before use. Firstly, the modern function spectrogram must be used instead of the obsolete function specgram in the program powspec. Secondly, the frame length (32 ms in this paper) and the frame shift (16 ms) must be specified in the program rastaplp when it calls powspec. When specifying the input data for rastaplp, we must disable the RASTA spectrum calculation and specify a zero model order. The bark spectrum estimate is obtained by executing the command:
[cepstra, spectra] = rastaplp(x, fs, 0, 0)
The cepstrum results are discarded, as they are not used further.
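Once the bark spectra of the reference and the filtered signal are available, BSD can be computed from them. The Python sketch below assumes the commonly used normalised form, in which the squared Euclidean distance between the bark spectra, summed over frames, is divided by the energy of the reference bark spectrum; the exact form used in this paper may differ. Each spectrogram is represented as a list of frames, which would correspond to the columns of the spectra matrix if rastaplp returns one column per frame.

```python
def bsd(ref_bark, test_bark):
    """Bark spectral distortion between two bark spectrograms.

    Each argument is a list of frames; each frame is a list of bark-band
    values. The squared Euclidean distance is summed over all frames and
    normalised by the energy of the reference bark spectrum.
    """
    if len(ref_bark) != len(test_bark):
        raise ValueError("spectrograms must have the same number of frames")
    num = 0.0
    den = 0.0
    for r, t in zip(ref_bark, test_bark):
        num += sum((a - b) ** 2 for a, b in zip(r, t))
        den += sum(a * a for a in r)
    return num / den
```

With this normalisation BSD is zero for identical spectra and grows as the filtered signal departs from the reference, so lower values indicate higher quality.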
Note that PESQ can be calculated in accordance with either the early version of the algorithm (standard ITU-T P.862) or the later version (standard ITU-T P.862.2) [10]. For brevity, we call them PESQ and PESQ-2, respectively. Both versions are used in this paper, which makes it possible to compare their results.
Although the PESQ index is designed only for narrowband telephony, the sampling frequency of the analysed signals may be either 8 or 16 kHz when PESQ is calculated in Matlab [6].
The PESQ-2 indicator is designed for both narrowband and wideband telephony. It can be calculated in Windows using the console application pesq.exe, obtained by compiling the publicly available C source code [14]. Another, more convenient way of calculating PESQ-2 is to control the pesq.exe application from Matlab. The function pesq2_mtlb, presented in [12], was used to implement this method.
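The idea of driving the console application from a script can be sketched in Python as follows. This is not the pesq2_mtlb function from [12]. The command-line switches (+16000 for the sampling rate, +wb for the wideband mode) and the "Prediction ... = value" output line are assumptions about the ITU-T reference implementation and should be checked against the compiled pesq.exe; all function names are hypothetical.

```python
import re
import subprocess


def build_pesq_command(ref_wav, deg_wav, fs=16000, wideband=True, exe="pesq.exe"):
    """Assemble the argument list for the console application."""
    cmd = [exe, "+%d" % fs]
    if wideband:
        cmd.append("+wb")  # assumed switch selecting the P.862.2 wideband mode
    return cmd + [ref_wav, deg_wav]


def parse_pesq_output(text):
    """Return the last number on the last line that mentions 'Prediction'."""
    lines = [ln for ln in text.splitlines() if "Prediction" in ln]
    if not lines:
        raise ValueError("no prediction line found in PESQ output")
    tail = lines[-1].split("=")[-1]            # text after the last '='
    numbers = re.findall(r"-?\d+(?:\.\d+)?", tail)
    if not numbers:
        raise ValueError("no score found on the prediction line")
    return float(numbers[-1])


def run_pesq(ref_wav, deg_wav, **kwargs):
    """Run the executable and return the parsed score (needs pesq.exe on PATH)."""
    result = subprocess.run(build_pesq_command(ref_wav, deg_wav, **kwargs),
                            capture_output=True, text=True, check=True)
    return parse_pesq_output(result.stdout)
```

Keeping the command construction and the output parsing in separate functions allows both to be checked without the executable being present.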

Experimental results
To evaluate speech quality, 1-minute speech signals were recorded for each of 4 female and 4 male speakers reading texts on legal topics. The recordings were made at the Department of Acoustic of the National Technical University of Ukraine "Kyiv Polytechnic Institute", in an anechoic room with a reverberation time of 0.15 s, at a sampling rate of 22050 Hz and a bit depth of 16 bits.
A set of FIR low-pass filters was synthesized by the Remez method in Matlab (fdatool). The filter characteristics are:
− the cutoff frequency varies from 0.5 kHz to 10.5 kHz in steps of 0.5 kHz;
− the width of the transition zone is 5% of the bandwidth;
− the passband ripple is 1 dB;
− the stopband gain is minus 80 dB.
The results of calculating the speech quality at the filter output are shown in Fig. 1-4.
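The grid of filter specifications described above can be enumerated with a short Python sketch. It assumes that the transition zone begins at the cutoff frequency and that its width is 5% of the cutoff, i.e. of the pass bandwidth; the paper does not state where the transition zone is placed. The function name and parameters are hypothetical, and the filter design itself (Remez) is not reproduced.

```python
def filter_specs(fs_hz=22050, first_hz=500, last_hz=10500, step_hz=500,
                 transition_percent=5):
    """List (passband_edge_hz, stopband_edge_hz) pairs for the low-pass set.

    Assumption: the transition zone starts at the cutoff frequency and its
    width is transition_percent of the cutoff.
    """
    specs = []
    for fc in range(first_hz, last_hz + step_hz, step_hz):
        stop = fc + fc * transition_percent // 100   # integer Hz arithmetic
        if 2 * stop > fs_hz:
            raise ValueError("stopband edge %d Hz exceeds Nyquist" % stop)
        specs.append((fc, stop))
    return specs
```

Under this assumption the set contains 21 filters, and the stopband edge of the last one (10.5 kHz cutoff) falls at 11025 Hz, which is exactly half of the 22050 Hz sampling rate; this may be why the series ends at 10.5 kHz.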
Note that the SSNR index is clearly ineffective, since its values are non-monotonic and fluctuate significantly as the bandwidth increases. This conclusion agrees with the findings of [3] on the unsuitability of the SSNR index for assessing distortion caused by filtering.

Fig. 5. PESQ-2 index: female (a), male (b), averaged (c)
The LSD index is much better "on average"; its drawback, however, is local violations of the monotonic dependence on the frequency band. Therefore LSD should also be recognized as an ineffective index.
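The monotonicity criterion applied to SSNR and LSD can be stated as a simple check on a curve of indicator values ordered by increasing cutoff frequency. The Python sketch below is illustrative; the function name and the tolerance parameter are hypothetical. For distortion-type indicators such as LSD and BSD, where lower values mean better quality, the direction is reversed with higher_is_better=False.

```python
def monotonicity_violations(values, higher_is_better=True, tol=0.0):
    """Return indices i where the step values[i-1] -> values[i] worsens quality.

    values must be ordered by increasing cutoff frequency. A step counts as a
    violation when quality drops by more than tol.
    """
    bad = []
    for i in range(1, len(values)):
        step = values[i] - values[i - 1]
        if not higher_is_better:
            step = -step          # for distortions, a decrease is an improvement
        if step < -tol:
            bad.append(i)
    return bad
```

An empty result means the indicator grows (or, for a distortion, falls) consistently with bandwidth; a non-empty result lists the cutoff steps at which the expected trend is broken.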
As can be seen, the monotonic behaviour of the PESQ and BSD indicators speaks in their favour. However, the graphs also show that PESQ is designed for narrowband telephony (even though the PESQ calculations used signals sampled at 16 kHz). PESQ-2 is free of this drawback and allows the quality of speech signals transmitted in both narrow and wide bands to be analysed (see Fig. 5).
However, as follows from Fig. 5, the capabilities of PESQ-2 are not sufficient for a final verdict on the potential of objective indicators to assess the quality of band-limited speech signals, or on the real needs of the human auditory system in speech perception. The POLQA indicator is needed for this purpose. Unfortunately, POLQA measurements are currently available only on a commercial basis.

Conclusions
Among the indicators examined in this paper, BSD and PESQ are the most informative for evaluating the quality of band-limited speech signals. It should be noted that, when the PESQ index is used, the estimation results depend strongly on the version of the estimation algorithm chosen.
Analysis of the dependence of BSD on the speech signal bandwidth showed that the quality of the speech signal stops increasing once the bandwidth reaches 9-10 kHz. It is advisable to verify this result with POLQA in the future.