Predictive Estimation of Speech Intelligibility Masked by Noise Interference Using Analytical Modeling

A detailed description of a speech intelligibility prediction algorithm based on analytical modeling is presented. The performance of the proposed algorithm is tested for four types of noise interference: white, pink, brown, and noise typical of classrooms. The consistency of the results with known similar results indicates the correctness of the proposed components of the analytical algorithm. In addition, speech intelligibility estimates obtained with the "classical" approach are compared with estimates of the STI speech intelligibility index, which confirms the thesis that white noise has a low masking ability at low signal-to-noise ratios.


I. INTRODUCTION
The task of calculating and measuring speech intelligibility is not new: counting from the pioneering work [1], its history spans 90 years. Nevertheless, the range of applications of speech intelligibility assessment is constantly expanding, the technical tools available to engineers are changing and improving, and the list of factors taken into account when assessing speech intelligibility is growing. As a result, the corresponding algorithms and software need constant updating.
Abroad, the most widely used versions of the Formant Method for assessing speech intelligibility are the Articulation Index (AI) [2] and the Speech Intelligibility Index (SII) [3]. By the end of the 1950s, several scientific schools had formed in the USSR, headed by N. B. Pokrovskiy, M. A. Sapozhkov and Yu. S. Bykov, each of which developed its own version of the Formant Method [4]-[6].
In 1973, the Modulation Method appeared, in which the Speech Transmission Index (STI) serves as the measure of speech intelligibility [7]. Since the Modulation Method can take into account the influence of not only noise but also reverberation on speech intelligibility, some authors have even declared the Formant Method "obsolete" [8]. At the same time, a careful comparison of the potential capabilities of the Formant and Modulation Methods indicates that the Formant Method is superior to its competitor in accuracy and speed of calculation under conditions where the effect of noise prevails over that of reverberation [9].

II. STATEMENT OF THE PROBLEM
The "classical" computer simulation algorithm for evaluating the intelligibility of noisy speech by the Formant Method is described in [3], [4].
The structure of this algorithm is shown in Fig. 1. At the first stage of calculations, primary models of the speech signal and noise are formed as arrays of samples of stationary random processes with specified spectral characteristics. Then the variances of these model processes are corrected to provide the required integral signal-to-noise ratio $SNR_0$. After this correction, the partial signal-to-noise ratios $E_k$ are estimated. At the final stage, the speech intelligibility measures are calculated: Formant intelligibility $A$ and verbal intelligibility $W$.

Acoustic Devices and Systems
Thanh Vi Nguyen, A. V. Darchuk, A. M. Prodeus. DOI: 10.20535/2523-4455.2019

Fig. 1. The structure of the computer simulation algorithm

The essence of the Formant Method for assessing speech intelligibility is as follows. The frequency range of the speech signal is divided into $K$ adjacent frequency bands, each with its own center frequency $f_{0k}$ and lower and upper boundary frequencies, within each of which the speech and noise spectra can be considered practically constant [4]. Verbal intelligibility $W$ is calculated from the Formant intelligibility $A$ [10], which is a sum over the frequency bands:

$$A = \sum_{k=1}^{K} p_k P(E'_k),$$

where $p_k$ is the probability that Formants fall within the $k$-th frequency band and $P(E'_k)$ is the perception coefficient. In accordance with the method of N. B. Pokrovskiy [4], $E'_k$ is the effective level of sensation of Formants in the $k$-th frequency band:

$$E'_k = E_k - \Delta B(f_{0k}), \qquad (6)$$

where $E_k$ is the effective level of sensation of the speech signal in the $k$-th frequency band, equal (at sufficiently high noise levels) to the signal-to-noise ratio in this band:

$$E_k = 10 \lg \frac{D_{sk}}{D_{nk}},$$

$D_{sk}$ and $D_{nk}$ are the variances of the signal and noise in the $k$-th frequency band, and $\Delta B(f)$ is the difference between the averaged spectra of speech and Formants. Following the method of M. A. Sapozhkov, the spectrum of Formants is considered to practically coincide with the spectrum of speech, i.e. $\Delta B(f) \approx 0$.

The method of M. A. Sapozhkov is refined by taking into account the dependence of the perception coefficients on the frequency band. In this case, instead of (1), the following relation is used:

$$A = \sum_{k=1}^{K} p_k P_k(E_k), \qquad (9)$$

where the perception coefficients $P_k(E)$ are described by polynomial dependencies (Appendix 1).

¹ GOST R ISO 24504-2015. Ergonomic design. Sound pressure levels of spoken announcements for products and public address systems.
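The band-wise calculation described above can be illustrated with a minimal Python sketch. It assumes that Formant intelligibility is the sum of the band probabilities weighted by a perception coefficient evaluated at the band signal-to-noise ratio. The function names are illustrative, and `toy_perception` is a hypothetical placeholder, not the coefficients of N. B. Pokrovskiy or M. A. Sapozhkov, which must be supplied by the user.

```python
import math

def band_snr_db(d_s, d_n):
    """Partial signal-to-noise ratios E_k = 10*lg(D_sk / D_nk), in dB."""
    return [10.0 * math.log10(s / n) for s, n in zip(d_s, d_n)]

def formant_intelligibility(d_s, d_n, p, perception, delta_b=None):
    """Sum of p_k * P(E_k) over the frequency bands.

    d_s, d_n   : band variances of speech and noise
    p          : probabilities of Formants falling in each band
    perception : callable mapping a level in dB to a value in [0, 1]
    delta_b    : optional per-band speech/Formant spectrum difference, dB
                 (subtracted from E_k; omit it for the Sapozhkov variant)
    """
    e = band_snr_db(d_s, d_n)
    if delta_b is not None:
        e = [ek - db for ek, db in zip(e, delta_b)]
    return sum(pk * perception(ek) for pk, ek in zip(p, e))

def toy_perception(e_db):
    """Hypothetical placeholder: linear ramp clipped to [0, 1]."""
    return min(1.0, max(0.0, (e_db + 15.0) / 30.0))

# Illustrative three-band input (made-up numbers, not data from the paper).
a = formant_intelligibility([1.0, 1.0, 1.0], [1.0, 10.0, 0.1],
                            [0.2, 0.5, 0.3], toy_perception)
print(round(a, 4))
```

Passing the perception coefficient as a callable keeps the sketch independent of any particular set of published coefficients.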
In recent years, there has been a tendency toward partial unification of the Formant and Modulation [7] Methods for assessing speech intelligibility. For example, according to the simplified method for assessing speech intelligibility presented in GOST R ISO 24504-2015¹, speech intelligibility is evaluated using the STI index (10), (11), where $\alpha_k$ are weight coefficients and $\beta_k$ are redundancy coefficients, whose values for octave bands with center frequencies $f_0$ are given in Table 1. It is easy to see the fundamental similarity of relations (2) and (9), on the one hand, and (10), on the other. Moreover, the second term in (10) is a correction that takes into account the correlation of speech signals in adjacent frequency bands, and relation (11) can be interpreted as the result of linearization of the perception coefficient.
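The structure of such an index can be sketched in Python. This is an assumption-laden illustration rather than the formula of the cited standard: it takes the form commonly used for STI (as in IEC 60268-16), in which a weighted sum of band transmission indices is reduced by a redundancy term over adjacent bands, and each transmission index is a linear function of the band signal-to-noise ratio clipped to the range from -15 dB to +15 dB. The weights in the usage example are hypothetical and are not the values of Table 1.

```python
import math

def transmission_index(e_db, shift=15.0, span=30.0):
    """Linearized perception: (E + 15) / 30, clipped to [0, 1] (assumed form)."""
    return min(1.0, max(0.0, (e_db + shift) / span))

def sti_from_band_snr(e_db, alpha, beta):
    """Weighted sum of band indices minus an adjacent-band redundancy term.

    e_db  : band signal-to-noise ratios, dB (K values)
    alpha : weight coefficients (K values)
    beta  : redundancy coefficients (K - 1 values)
    """
    ti = [transmission_index(e) for e in e_db]
    weighted = sum(a * t for a, t in zip(alpha, ti))
    redundancy = sum(b * math.sqrt(ti[k] * ti[k + 1])
                     for k, b in enumerate(beta))
    return weighted - redundancy

# Hypothetical two-band weights, chosen only to make the arithmetic visible.
print(sti_from_band_snr([0.0, 0.0], alpha=[0.6, 0.5], beta=[0.1]))
```

The redundancy term is what distinguishes this form from a plain weighted sum of the kind used in the Formant Method.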
Although computer simulation makes it possible to evaluate the performance and effectiveness of software prototypes of real digital measuring systems, predictive assessment of speech intelligibility is no less important. However, solving the forecasting problem by computer simulation is hardly rational, given the resource consumption of this method. A more economical approach is analytical modeling, in which the speech and noise models are described by deterministic functions in the form of power distributions or power spectral densities.
In essence, the stages of analytical and computer modeling are similar. Moreover, relations (1)-(11) used at the final stage are the same for both types of modeling. However, the analytical description of the initial and intermediate stages is not adequately covered in the literature. One of the goals of this paper is to bridge this gap. Another goal is to compare the results of the assessment using relations (1), (2) and (10). Despite the obvious usefulness of such a comparison, it had not been carried out until recently.

III. NOISY SPEECH PREDICTION ALGORITHM
The structure of the proposed algorithm for predicting the intelligibility of noisy speech is similar to the structure in Fig. 1. Only the way the individual steps are implemented differs: it is based on an analytical description of the spectral properties of the speech signal and noise. Each stage is described in detail below.

Stage 1. Formation of input data:
• specification of the analytical spectral model of the speech signal in the form of the variance distribution $D_{sk}$, where $k$ is the number of the frequency band;
• specification of the analytical spectral model of the noise in the form of the variance distribution $D_{nk}$;
• setting the expected integral signal-to-noise ratio $SNR_0$ for the signal being listened to.

Stage 2. Correction of the distribution of variances of the speech signal (or noise) to provide the given integral signal-to-noise ratio $SNR_0$:
• calculation of the "primary" value of the integral signal-to-noise ratio, $SNR_T = \sum_k D_{sk} / \sum_k D_{nk}$;
• calculation of the correction factor $SNR_0 / SNR_T$;
• adjustment of the variance distribution of the speech signal in accordance with the relation $D_{0sk} = (SNR_0 / SNR_T)\, D_{sk}$;
• or adjustment of the distribution of noise variances in accordance with the relation $D_{0nk} = (SNR_T / SNR_0)\, D_{nk}$.

Stage 3. Calculation of partial signal-to-noise ratios for the given $SNR_0$:
• calculation of the partial signal-to-noise ratios $E_{0k}$ taking into account (14) and (…);
• correction of the values of $E_{0k}$ in accordance with (6), if the Pokrovskiy technique is used.

Stage 4. Formation of the output:
• calculating Formant intelligibility $A$ and verbal intelligibility $W$ in accordance with (1) and (2), or calculating the STI intelligibility index in accordance with (11).
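Stages 2 and 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the integral signal-to-noise ratio is taken as the ratio of the summed band variances, the target value is supplied in decibels and converted to a power ratio, and the function and argument names are hypothetical.

```python
import math

def correct_variances(d_s, d_n, snr0_db, adjust="speech"):
    """Rescale speech or noise band variances to reach a target integral SNR.

    d_s, d_n : band variances of speech and noise before correction
    snr0_db  : required integral signal-to-noise ratio, dB (assumed unit)
    adjust   : "speech" scales the speech variances, "noise" the noise ones
    """
    snr_primary = sum(d_s) / sum(d_n)      # "primary" integral SNR
    snr_target = 10.0 ** (snr0_db / 10.0)  # dB -> power ratio
    factor = snr_target / snr_primary      # correction factor
    if adjust == "speech":
        return [factor * d for d in d_s], list(d_n)
    if adjust == "noise":
        return list(d_s), [d / factor for d in d_n]
    raise ValueError("adjust must be 'speech' or 'noise'")

def partial_snr_db(d_s, d_n):
    """Partial signal-to-noise ratios per band, in dB."""
    return [10.0 * math.log10(s / n) for s, n in zip(d_s, d_n)]

# Made-up three-band example: reach an integral SNR of -10 dB.
speech, noise = correct_variances([1.0, 2.0, 1.0], [1.0, 1.0, 2.0], -10.0)
print(partial_snr_db(speech, noise))
```

Scaling the speech variances or the noise variances yields the same partial signal-to-noise ratios, since only their ratio enters the result.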
Some comments on the input data formation stage are in order.
The power spectral densities of the signal and the interference can also be used as the initial data, given the relationship between the variance $D_k$ of a stationary random process in the $k$-th frequency band $\Delta f_k$ and the average value $P_k$ of the power spectral density within this band:

$$D_k = P_k \, \Delta f_k.$$

Obviously, when calculating speech intelligibility in communication lines, the value of the expected integral signal-to-noise ratio $SNR_0$ […]

When calculating speech intelligibility in open-plan rooms (ISO 3382-3:2013² standard), which include offices, library halls and classrooms, for small distances $R$ (2-4 m) between the speaker and the listener we can assume that the speech signal attenuates in the same way as in free space, i.e. by 6 dB per doubling of the distance. As $R$ increases, however, this pattern is violated, and the calculation of the signal level at the listening point becomes somewhat more complicated.
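Both remarks can be illustrated with a short Python sketch. It assumes a one-sided power spectral density averaged over the band, octave bands with edges at the center frequency divided and multiplied by the square root of two, and the free-field law of 6 dB per doubling of distance. The function names and the numbers in the usage lines are hypothetical.

```python
import math

def octave_edges(f_center):
    """Lower and upper edge of an octave band (assumed band definition)."""
    return f_center / math.sqrt(2.0), f_center * math.sqrt(2.0)

def band_variance(mean_psd, f_low, f_high):
    """Band variance as average PSD times bandwidth."""
    return mean_psd * (f_high - f_low)

def free_field_level_db(level_ref_db, r_ref, r):
    """Speech level at distance r, given the level at r_ref (free-field law)."""
    return level_ref_db - 20.0 * math.log10(r / r_ref)

lo, hi = octave_edges(1000.0)
print(band_variance(1.0e-3, lo, hi))        # variance in the 1000 Hz octave band
print(free_field_level_db(60.0, 1.0, 2.0))  # about 6 dB below the 1 m level
```

The distance law is only valid for the small distances mentioned above; at larger distances the room contribution has to be modeled separately.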

IV. CHECKING THE PERFORMANCE OF THE ALGORITHM
The performance of the proposed algorithm was tested for four types of noise: white, pink, brown, and noise typical of classrooms (Table 2). Note that the typical distribution of noise levels $D_{nk}$, dB, over the frequency channels for classrooms is taken from the regulatory document SSN 3.3.6.037-99³, and the parameters of the long-term speech spectrum are taken from [4].
To simplify subsequent calculations, the variance distribution values given in Table 2 are normalized to the variance in the fourth frequency channel ($f_{04}$ = 1000 Hz).
Fig. 2d shows the STI estimates (Fig. 2).
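For the three idealized noise types, a normalized variance distribution of this kind can be obtained analytically. The sketch below assumes seven octave bands with center frequencies from 125 Hz to 8000 Hz and power spectral densities proportional to 1/f^n with n = 0 (white), 1 (pink) and 2 (brown). It does not reproduce Table 2, and the classroom noise, which comes from a regulatory document, is not modeled.

```python
import math

CENTERS_HZ = [125, 250, 500, 1000, 2000, 4000, 8000]  # assumed band set

def octave_band_power(f_center, slope_exponent):
    """Integral of 1/f**slope_exponent over one octave band (arbitrary scale)."""
    f_lo = f_center / math.sqrt(2.0)
    f_hi = f_center * math.sqrt(2.0)
    if slope_exponent == 1:
        return math.log(f_hi / f_lo)
    n = 1 - slope_exponent
    return (f_hi ** n - f_lo ** n) / n

def normalized_levels_db(slope_exponent, ref_hz=1000):
    """Band levels in dB relative to the band centered at ref_hz."""
    ref = octave_band_power(ref_hz, slope_exponent)
    return [10.0 * math.log10(octave_band_power(fc, slope_exponent) / ref)
            for fc in CENTERS_HZ]

for name, n in (("white", 0), ("pink", 1), ("brown", 2)):
    print(name, [round(v, 1) for v in normalized_levels_db(n)])
```

In octave bands, white noise rises by about 3 dB per band, pink noise is flat, and brown noise falls by about 3 dB per band, which is why the three noises mask the speech spectrum so differently.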
Comparing the graphs with each other, we see that the graphs obtained with the perception coefficients of N. B. Pokrovskiy (Fig. 2a) differ fundamentally from the rest in that, over the entire range of considered values of $SNR_0$, the masking properties of white noise turn out to be better than those of brown noise. Meanwhile, the incorrectness of the perception coefficients of N. B. Pokrovskiy was first noted, and corrected, in [11], [12].
The graphs in Fig. 2b and Fig. 2c, constructed using the corrected perception coefficients, indicate that for small $SNR_0$ ($SNR_0 < -8$ dB) the masking properties of white noise are inferior to those of brown noise. It is noteworthy that this result is consistent with the behavior of the STI graphs. The results of subjective articulation tests [13], presented in Fig. 3, also indicate a pronounced tendency for the masking properties of white noise to deteriorate at low signal-to-noise ratios $SNR_0$. The absence of a clear disadvantage of white noise in this case may be explained by shortcomings in the organization of the articulation tests.
As expected, the computation time for analytical modeling turned out to be an order of magnitude shorter than that required for computer simulation and did not exceed 1 s on a PC with a clock frequency of 2.66 GHz, 4 GB of RAM, and a 32-bit OS.

CONCLUSION
A detailed description of the initial and intermediate stages of the speech intelligibility prediction algorithm based on analytical modeling is presented. The performance of the proposed algorithm is tested for four types of noise interference: white, pink, brown, and noise typical of classrooms. The consistency of the results with known similar results indicates the correctness of the proposed components of the analytical algorithm.
A comparison of the speech intelligibility estimates obtained with the "classical" approach with the estimates of the STI speech intelligibility index made it possible to confirm the thesis that white noise has a low masking ability at low signal-to-noise ratios. In the future, it is advisable to verify this thesis additionally by means of articulation tests.

Received by the editorial office on 20 September 2019.

Appendix 1
Analytical Description of Perceptual Coefficients for Seven Octave Frequency Bands
The perception coefficients are described by polynomial dependencies whose coefficients $a'_n$ and $b'_m$ are presented in Tables A1.1 and A1.2. These perceptual coefficients differ from the coefficients given in [12] in that they cover seven octave frequency bands rather than five.