Masking Properties Of The Human Ear Biology Essay

Published: November 2, 2015 Words: 5379

Enhancement of single channel speech using perceptual-decision-directed approach

Abstract

The masking properties of the human ear have been successfully applied to adapt speech enhancement systems, yielding an improvement in speech quality. The accuracy of the estimated speech spectra plays a major role in computing the noise masking threshold. Traditional methods that use power spectral subtraction to roughly estimate the speech spectra provide acceptable performance, but the estimate can be improved further for computing the noise masking threshold. In this article, we aim to find a better spectral estimate of speech using the two-step-decision-directed method. This estimate is then employed to compute the noise masking threshold that adapts a perceptual gain factor. Experimental results show that residual noise can be suppressed efficiently by embedding the two-step-decision-directed algorithm in the perceptual gain factor.

Keywords: Speech enhancement; power spectral subtraction; masking property; decision directed; perceptual gain factor

1. Introduction

Speech enhancement is useful in many applications such as voice communication and automatic speech recognition. Recently, many schemes (Amehraye et al., 2008; Cappe, 1994; Ding et al., 2009; Ephraim and Malah, 1984; Ghanbari and Karami-Mollaei, 2006; Lu and Wang, 2003; 2004; 2007; Lu, 2007; Plapous et al., 2006; Virag, 1999; Udrea et al., 2008; Yu et al., 2008) have been proposed to enhance a speech signal corrupted by additive noise. Although these schemes improve the speech, their main drawback is annoying musical residual noise, which is caused by randomly spaced spectral peaks that come and go in successive frames and occur at random frequencies. Some schemes attempt to reduce the effect of musical residual noise by exploiting the human auditory system (Amehraye et al., 2008; Hu and Loizou, 2004; Lu and Wang, 2004; 2007; Lu, 2007; Schroeder et al., 1979; Virag, 1999). These methods are based on the fact that the human ear cannot perceive residual noise when its level falls below the noise masking threshold (NMT). Only the audible noise components are suppressed, which reduces speech distortion.

Hu and Loizou (2004) derived a perceptual gain factor in the frequency domain. This perceptual gain factor incorporates the masking properties of the human auditory system to make residual noise inaudible. Ghanbari and Karami-Mollaei (2006) proposed an adaptive threshold on a modified hard thresholding function. Ding et al. (2009) used a hybrid Wiener spectrogram filter for noise reduction, followed by a multi-blade post-processor that exploits two-dimensional features of the spectrogram to preserve speech quality and further reduce residual noise. Yu et al. (2008) proposed a non-diagonal audio denoising algorithm based on adaptive time-frequency block thresholding. This algorithm adapts its parameters to the signal by minimizing a Stein estimate of the risk, and experimental results showed that it can improve the quality of an audio signal. Lu (2007) derived a smoothing factor as a second stage to reduce the effect of musical residual noise. An accurate estimate of the a priori SNR is critical for eliminating musical noise. Plapous et al. (2006) proposed a two-step-decision-directed algorithm to improve the a priori SNR estimate of the decision-directed approach. Their experimental results show that this significantly improves the performance of the decision-directed approach.

Based on the above findings, using the noise masking properties of the human ear to adapt a speech enhancement system helps to reduce annoying musical residual noise. However, the residual noise that remains is still audible and degrades the quality of the enhanced speech. In this paper, we propose to improve the performance of the perceptual gain factor (Hu and Loizou, 2004) in the frequency domain. The idea is to improve the estimated speech spectra with the two-step-decision-directed algorithm, yielding a better estimate of the noise masking threshold (NMT). This NMT is then employed to adapt the perceptual gain factor (Hu and Loizou, 2004), whose performance improves accordingly. Experimental results show that the proposed approach significantly improves the performance of the perceptual gain factor (Hu and Loizou, 2004) by removing more residual noise, while speech distortion is kept at an acceptable level. In addition, the proposed approach outperforms the two-step-decision-directed noise reduction algorithm (Plapous et al., 2006) in most experiments.

The rest of this paper is organized as follows. Section 2 briefly reviews the concept of speech enhancement. Section 3 introduces the proposed perceptual-decision-directed approach for speech enhancement. Section 4 presents the experimental results. Conclusions are drawn in Section 5.

2. Brief review of speech enhancement

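A noisy speech signal can be modeled as the sum of clean speech and additive noise in frame m of the time domain, i.e.,

$$y(m,n) = s(m,n) + d(m,n), \tag{1}$$

where $y(m,n)$, $s(m,n)$ and $d(m,n)$ denote the noisy speech, the clean speech and the additive noise at sample $n$ of frame $m$.

In the spectral domain, the spectral estimate of the speech signal is obtained by multiplying the noisy spectrum of a subband by a gain factor. The estimated speech spectra can be expressed by

$$\hat{S}(m,\omega) = G(m,\omega)\,Y(m,\omega), \tag{2}$$

where $Y(m,\omega)$ is the noisy spectrum and $G(m,\omega)$ is the gain factor at frequency bin $\omega$.

In the reconstruction phase, the phase of the noisy speech is not modified. The enhanced speech signal is obtained by the inverse fast Fourier transform (IFFT), given as

$$\hat{s}(m,n) = \mathrm{IFFT}\left\{ |\hat{S}(m,\omega)|\, e^{j\angle Y(m,\omega)} \right\}. \tag{3}$$

The objective of this study is to find an appropriate gain factor that removes more of the added noise while maintaining speech quality at an acceptable level.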
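The gain-and-resynthesize step in (2) and (3) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names `dft`, `idft` and `enhance_frame` are hypothetical, a naive DFT stands in for the FFT, and windowing and overlap-add are omitted. Because the gain is real and non-negative, multiplying the noisy spectrum by it leaves the noisy phase unchanged.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real frame (stand-in for an FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; only the real part is kept."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def enhance_frame(noisy_frame, gain):
    """Scale each bin of the noisy spectrum by a real gain and resynthesize.

    The gain list is assumed to be real, non-negative and symmetric across
    bins, so the phase of the noisy spectrum is reused unchanged.
    """
    noisy_spectrum = dft(noisy_frame)
    enhanced_spectrum = [g * y for g, y in zip(gain, noisy_spectrum)]
    return idft(enhanced_spectrum)
```

With a gain of one in every bin the frame is returned unchanged, which is a convenient sanity check before plugging in a real gain rule.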

3. Proposed perceptual-decision-directed gain factor

To improve the performance of a gain factor for noise reduction, a perceptual-decision-directed approach that estimates the gain factor in three steps is proposed herein. Initially, the decision-directed method (Cappe, 1994; Ephraim and Malah, 1984) is performed to enhance a corrupted speech signal. Although the decision-directed method reduces the effect of musical residual noise, it introduces a frame delay arising from the interpolation used to estimate the a priori SNR. Therefore, the decision-directed method is performed again to improve the estimated a priori SNR by removing the frame delay (Plapous et al., 2006). Together, these procedures form a two-step-decision-directed (TSDD) algorithm. We then apply the TSDD algorithm to estimate the spectra of pre-processed speech, which are employed to estimate the noise masking threshold (NMT). A perceptual gain factor, which serves as the third step, is adapted by the NMT. Finally, the spectra of enhanced speech are obtained by multiplying the spectra of noisy speech by this perceptual gain factor.

3.1. Two-Step-Decision-Directed Algorithm

The spectral estimate of a pre-processed speech signal is obtained by multiplying the spectra of noisy speech by a gain factor, as expressed in (2). The gain factor is determined by the estimated a priori SNR $\hat{\xi}(m,\omega)$, given as

$$G_{DD}(m,\omega) = \frac{\hat{\xi}(m,\omega)}{1+\hat{\xi}(m,\omega)}, \tag{4}$$

where $\xi(m,\omega) = E\{|S(m,\omega)|^2\} / E\{|D(m,\omega)|^2\}$, $S(m,\omega)$ and $D(m,\omega)$ are the spectra of clean speech and of noise, and $E\{\cdot\}$ is the expectation operator.

The a priori SNR is unknown, and is critical to the gain factor given in (4). It can be estimated by the decision-directed approach (Cappe, 1994; Ephraim and Malah, 1984), given as

$$\hat{\xi}(m,\omega) = \alpha\,\frac{|\hat{S}(m-1,\omega)|^2}{E\{|D(m,\omega)|^2\}} + (1-\alpha)\,P[\gamma(m,\omega)-1], \tag{5}$$

where $\alpha$ represents the smoothing factor, which is set to the constant 0.98, and $P[\cdot]$ denotes half-wave rectification. $\gamma(m,\omega)$ is the a posteriori SNR, which is defined as

$$\gamma(m,\omega) = \frac{|Y(m,\omega)|^2}{E\{|D(m,\omega)|^2\}}. \tag{6}$$

When the a posteriori SNR is much larger than 0 dB, i.e., in a speech-dominant frame, the estimate of the a priori SNR given in (5) corresponds to a frame-delayed version of the a priori SNR. When the a posteriori SNR is lower than or close to 0 dB, the estimate of the a priori SNR corresponds to a highly smoothed and delayed version of the a posteriori SNR. Thus the variance of the a priori SNR is reduced compared to that of the a posteriori SNR. Consequently, the effect of musical residual noise is reduced.
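The decision-directed recursion described above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the noise power is assumed known and constant over frames, the gain is assumed to be the Wiener form (a priori SNR divided by one plus a priori SNR), and the previous-frame speech estimate is assumed to start at zero.

```python
def wiener_gain(xi):
    """Wiener-type gain for an a priori SNR value xi (assumed gain rule)."""
    return xi / (1.0 + xi)


def decision_directed(noisy_power, noise_power, alpha=0.98):
    """Track the a priori SNR and the gain per bin, frame by frame.

    noisy_power: list of frames, each a list of |Y|^2 per frequency bin.
    noise_power: list of noise power per bin (assumed known and stationary).
    alpha: smoothing factor (0.98 in the text).
    """
    n_bins = len(noise_power)
    prev_speech_power = [0.0] * n_bins  # assumption: no speech before frame 0
    xi_track, gain_track = [], []
    for frame in noisy_power:
        xi_frame, gain_frame, speech_frame = [], [], []
        for k in range(n_bins):
            gamma = frame[k] / noise_power[k]  # a posteriori SNR
            # Previous-frame speech estimate plus half-wave rectified (gamma - 1).
            xi = (alpha * prev_speech_power[k] / noise_power[k]
                  + (1.0 - alpha) * max(gamma - 1.0, 0.0))
            gain = wiener_gain(xi)
            xi_frame.append(xi)
            gain_frame.append(gain)
            speech_frame.append(gain * gain * frame[k])  # |G * Y|^2
        prev_speech_power = speech_frame
        xi_track.append(xi_frame)
        gain_track.append(gain_frame)
    return xi_track, gain_track
```

For example, `decision_directed([[1.0], [101.0], [101.0]], [1.0])` models a speech onset at the second frame: the a priori SNR estimate at the onset frame is far lower than at the following frame, even though both frames have the same a posteriori SNR. This is the one-frame lag that the text attributes to the decision-directed estimator.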

The delay inherent in the decision-directed algorithm is a drawback, especially in speech transients, e.g., the onset and the offset of a vowel. In addition, this delay introduces a bias in the gain estimate given in (4) and limits the noise reduction performance of a speech enhancement system. Hence, the delayed gain factor generates an annoying reverberation effect. To compensate for this delay, Plapous et al. (2006) proposed a two-step-decision-directed (TSDD) algorithm to improve the estimate of the a priori SNR.

In the first step, the decision-directed algorithm (Cappe, 1994; Ephraim and Malah, 1984) is used to estimate the a priori SNR. This algorithm computes the spectral gain as described in (4)-(6). In the second step, the value of this gain factor is used to estimate the a priori SNR for frame m+1, given as

$$\hat{\xi}_{TSDD}(m,\omega) = \hat{\xi}(m+1,\omega) = \beta\,\hat{\xi}_1(m,\omega) + (1-\beta)\,P[\gamma(m+1,\omega)-1], \tag{7}$$

where $\beta$ is also a smoothing factor. $\hat{\xi}_1(m,\omega)$ is an estimate of the a priori SNR that is obtained from the decision-directed gain factor given in (4)-(6), and can be expressed by

$$\hat{\xi}_1(m,\omega) = \frac{|G_{DD}(m,\omega)\,Y(m,\omega)|^2}{E\{|D(m,\omega)|^2\}}. \tag{8}$$

Observing the second term of (7), the a posteriori SNR $\gamma(m+1,\omega)$ is related to the future frame, which introduces an additional processing delay and may be incompatible with the desired application. Thus, the smoothing factor $\beta$ in (7) is set to unity. In this case, the estimated a priori SNR given in (7) degenerates into the particular case:

$$\hat{\xi}_{TSDD}(m,\omega) = \frac{|G_{DD}(m,\omega)\,Y(m,\omega)|^2}{E\{|D(m,\omega)|^2\}}. \tag{9}$$

Substituting (9) into (4), the gain factor of the two-step-decision-directed algorithm is obtained, given as

$$G_{TSDD}(m,\omega) = \frac{\hat{\xi}_{TSDD}(m,\omega)}{1+\hat{\xi}_{TSDD}(m,\omega)}. \tag{10}$$

The estimated spectra of pre-processed speech can be obtained by

$$\hat{S}_{TSDD}(m,\omega) = G_{TSDD}(m,\omega)\,Y(m,\omega). \tag{11}$$

Because the spectral estimate of speech obtained by the two-step-decision-directed algorithm in (11) is more accurate than that obtained by the power-spectral-subtraction algorithm, the NMT is estimated more accurately. Applying this NMT to adapt the perceptual gain factor (Hu and Loizou, 2004) therefore improves the performance of the perceptual gain factor.
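The second step, with its smoothing factor set to unity, can be sketched for a single frame as follows. This is an assumption-laden illustration, not the reference implementation of Plapous et al.: the function name `tsdd_gain` is hypothetical, the first-step gain is taken as an input, and the gain rule is assumed to be the Wiener form.

```python
def tsdd_gain(noisy_power, noise_power, gain_dd):
    """Refine the a priori SNR of the current frame using the first-step gain.

    noisy_power: |Y|^2 per bin for the current frame.
    noise_power: estimated noise power per bin.
    gain_dd: first-step (decision-directed) gain per bin for the same frame.
    Returns (refined a priori SNR per bin, refined gain per bin).
    """
    xi_tsdd = [
        (g * g * y) / d  # |G_DD * Y|^2 divided by the noise power
        for g, y, d in zip(gain_dd, noisy_power, noise_power)
    ]
    gain_tsdd = [xi / (1.0 + xi) for xi in xi_tsdd]
    return xi_tsdd, gain_tsdd
```

For a bin with noisy power 101, noise power 1 and a first-step gain of 2/3 (a speech onset frame), the refined gain is about 0.98. The second step therefore removes most of the attenuation that the one-frame lag would otherwise apply at the onset.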

3.2. Estimation of noise masking threshold

The noise masking threshold (NMT) is obtained through modeling the frequency selectivity of the human ear and its masking property. The detailed procedure for estimating the NMT used herein is described as follows.

Initially, the spectra of pre-processed speech $\hat{S}_{TSDD}(m,\omega)$ are estimated by the TSDD method. The critical-band energy $B(i)$ of frame $m$ is then computed by

$$B(i) = \sum_{\omega = b_l(i)}^{b_h(i)} |\hat{S}_{TSDD}(m,\omega)|^2, \tag{12}$$

where $b_h(i)$ and $b_l(i)$ represent the upper and the lower frequencies of the $i$th critical band. The upper frequency $b_h(i)$ and the lower frequency $b_l(i)$ of a critical band can be found in (Schroeder et al., 1979; Virag, 1999). Taking into account masking between different critical bands, an excitation pattern $C(i)$ can be thought of as an energy distribution along the basilar membrane. $C(i)$ is determined by convolving the critical-band energy $B(i)$ with the spreading function $SF(i)$, which can be found in (Schroeder et al., 1979):

$$C(i) = B(i) * SF(i). \tag{13}$$

A relative threshold offset $O(i)$, which specifies whether a speech frame is tone-like or noise-like, is imposed to adjust the log critical-band energy. The adjusted log critical-band energy $T(i)$ is evaluated by adding the log energy of the excitation pattern and the offset $O(i)$ in the dB scale:

$$T(i) = 10\log_{10} C(i) + O(i), \tag{14}$$

where the values of the offset $O(i)$ are all negative.

Convolving the critical-band energy with the spreading function given in (13) increases the energy in each critical band. Thus the adjusted log critical-band energy $T(i)$ should be divided by a gain function, which is the gain between the spread energy $C(i)$ and the critical-band energy $B(i)$. A normalized threshold is obtained by subtracting the gain value $G_{dB}(i)$ from the offset spread energy $T(i)$ in the dB scale. The normalized threshold $T'(i)$ is expressed by

$$T'(i) = T(i) - G_{dB}(i), \tag{15}$$

where $G_{dB}(i)$ denotes the gain between the spread energy $C(i)$ and the critical-band energy $B(i)$ at the $i$th critical band. The gain value in the dB scale is expressed by

$$G_{dB}(i) = 10\log_{10}\frac{C(i)}{B(i)}. \tag{16}$$

The normalized threshold $T'(i)$ is compared with the absolute hearing threshold (AHT), which is frequency-dependent and can be closely approximated by the following expression

$$AHT(f) = 3.64\,f^{-0.8} - 6.5\,e^{-0.6\,(f-3.3)^2} + 10^{-3} f^{4}, \tag{17}$$

with f in kilohertz.

Finally, the noise masking threshold (NMT) is determined by selecting the larger value between the absolute hearing threshold $AHT(f)$ and the normalized threshold $T'(i)$. The NMT, $NMT(i)$, is given as

$$NMT(i) = \max\{T'(i),\, AHT(f)\}, \tag{18}$$

where f is chosen to be the central frequency of the $i$th critical band.
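The first and last stages of this procedure, the critical-band energies of (12), the absolute hearing threshold of (17) and the selection of (18), can be sketched as follows. The function names and the band-edge representation are hypothetical. The spreading, offset and renormalization steps are assumed to have already produced the normalized threshold in dB, and that threshold is assumed to be calibrated to the same dB reference as the hearing-threshold curve.

```python
import math


def critical_band_energy(power_spectrum, band_edges):
    """Sum the power spectrum over each critical band.

    band_edges: list of (lowest_bin, highest_bin) index pairs, inclusive.
    The actual band edges are tabulated in the cited references.
    """
    return [sum(power_spectrum[lo:hi + 1]) for lo, hi in band_edges]


def absolute_hearing_threshold_db(f_khz):
    """Widely used closed-form approximation of the threshold in quiet (f in kHz)."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def select_nmt_db(normalized_threshold_db, center_khz):
    """Per band, keep the larger of the normalized threshold and the AHT."""
    return [
        max(t, absolute_hearing_threshold_db(f))
        for t, f in zip(normalized_threshold_db, center_khz)
    ]
```

For instance, `select_nmt_db([10.0, -5.0], [1.0, 1.0])` keeps 10 dB for the first band and replaces the second band's -5 dB with the hearing threshold at 1 kHz (roughly 3.4 dB), since noise below the threshold in quiet is inaudible regardless of masking.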

3.3. Perceptual gain factor

The spectral estimate of the speech signal is obtained by multiplying the noisy spectrum of a subband by a perceptual gain factor $G_p(m,\omega)$. This perceptual gain factor serves as the third step of the proposed approach. The spectral estimate of an enhanced speech signal can be expressed by

$$\hat{S}(m,\omega) = G_p(m,\omega)\,Y(m,\omega). \tag{19}$$

A spectral distortion measure is defined as the difference between the short-term spectra of clean speech $S(m,\omega)$ and of enhanced speech $\hat{S}(m,\omega)$. This spectral distortion specifies the performance of a speech enhancement system, and is given by

$$\varepsilon(m,\omega) = \hat{S}(m,\omega) - S(m,\omega) = \varepsilon_S(m,\omega) + \varepsilon_D(m,\omega), \tag{20}$$

where the spectra of speech distortion and of residual noise are expressed as

$$\varepsilon_S(m,\omega) = [G_p(m,\omega) - 1]\,S(m,\omega), \tag{21}$$

$$\varepsilon_D(m,\omega) = G_p(m,\omega)\,D(m,\omega), \tag{22}$$

where $S(m,\omega)$ and $D(m,\omega)$ represent the spectra of the speech and of the noise signals, respectively.

We assume that the noise signal is additive and uncorrelated with the speech signal. The gain factor can be optimized by minimizing the short-term spectral energy of the speech distortion, subject to the constraint that the short-term spectral energy of the residual noise stays below the noise masking threshold (NMT):

$$\min_{G_p(m,\omega)} E\{|\varepsilon_S(m,\omega)|^2\} \quad \text{subject to} \quad E\{|\varepsilon_D(m,\omega)|^2\} \le T_{NMT}(m,\omega), \tag{23}$$

where $T_{NMT}(m,\omega)$ is the NMT corresponding to frequency bin $\omega$. The values of the NMT are identical within a critical band.

A perceptual gain factor in the frequency domain can be derived as (Hu and Loizou, 2004)

$$G_p(m,\omega) = \frac{1}{1 + \max\left\{\sqrt{\dfrac{E\{|D(m,\omega)|^2\}}{T_{NMT}(m,\omega)}} - 1,\; 0\right\}}. \tag{24}$$

As described in Section 3.1, the estimated spectra of pre-processed speech, $\hat{S}_{TSDD}(m,\omega)$, are obtained by the two-step-decision-directed (TSDD) method. $\hat{S}_{TSDD}(m,\omega)$ is then employed to estimate the NMT, which is used to adapt the perceptual gain factor. Accordingly, the TSDD method can be thought of as being embedded in the perceptual gain factor given in (24), which is named the perceptual-decision-directed gain factor.
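The constrained optimization above has a simple closed form that can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the noise power and the masking threshold are both assumed to be given per bin in linear power units, and the gain is the largest value not exceeding one for which the residual noise power (gain squared times noise power) stays at or below the threshold.

```python
import math


def perceptual_gain(noise_power, nmt_power):
    """Gain that keeps residual noise at or below the masking threshold.

    noise_power: estimated noise power per bin (linear scale).
    nmt_power: noise masking threshold per bin (linear scale, assumed > 0).
    """
    return [
        1.0 / (1.0 + max(math.sqrt(d / t) - 1.0, 0.0))
        for d, t in zip(noise_power, nmt_power)
    ]
```

When the noise power is already below the threshold the gain is one, so the bin is passed through without any speech distortion. When the noise is four times the threshold the gain is 0.5, which scales the residual noise power down to exactly the threshold.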

4. Experimental results

In the following experiments, the speech signals are Mandarin Chinese utterances spoken by five female and five male speakers. Noisy speech signals are obtained by adding white, F16-cockpit, factory, babble (speech-like), helicopter-cockpit, and car noise signals, extracted from the Noisex-92 database, to clean speech. Three SNR levels, 0 dB, 5 dB and 10 dB, are used to evaluate the performance of a speech enhancement system. The minimum statistics algorithm (Martin, 2001) is performed to estimate the noise power for each frequency bin. This algorithm updates the noise estimate in both speech-activity and speech-pause regions, which is the advantage of the minimum statistics approach. The following parameters are used in the experiments: (1) the sampling frequency is 8 kHz; (2) the frame size is 256 samples with 50% overlap; (3) a Hanning window is used; (4) the total number of critical bands is 18, and the central frequency and bandwidth of each critical band can be found in (Virag, 1999).

Objective measures, including the average segmental SNR improvement (Avg_SegSNR_Imp) and the modified Bark spectral distortion (MBSD) (Yang et al., 1998), are used to evaluate the performance of a speech enhancement system. Only the performance in speech-activity regions is evaluated. An informal subjective measure based on the mean opinion score (MOS) and spectrogram comparisons are also conducted. To evaluate the performance of the proposed gain factor, the two-step-decision-directed (TSDD) algorithm (Plapous et al., 2006) and the perceptual gain factor (Hu and Loizou, 2004) are implemented for comparison.

4.1. Noise estimate

The noise estimator plays a major role in determining the quality of a speech enhancement system. If the noise estimate is too low, residual noise increases. Conversely, if the noise estimate is too high, the enhanced speech sounds muffled and intelligibility is lost. Traditional voice activity detectors (VADs) are difficult to tune under non-stationary noise. In addition, applying a VAD to low-SNR speech often results in clipped speech. Thus, a VAD cannot estimate the noise level well in non-stationary and low-SNR environments. Martin (2001) proposed the minimum statistics algorithm to estimate the noise power for each subband. The algorithm does not use a VAD; instead, it tracks the power minimum in each subband to determine the noise estimate. The minimum statistics noise tracking method is based on the observation that, even during speech activity, a short-term power density estimate of the noisy signal frequently decays to values that are representative of the noise level. This method rests on the fundamental assumption that during speech pauses, or within brief periods between words and syllables, the speech energy is close or identical to zero. Thus, by tracking the minimum power within a finite window large enough to bridge high-power speech segments, the noise floor can be estimated. The detailed procedure of the minimum statistics noise estimation algorithm can be found in (Martin, 2001).
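The core idea of minimum tracking can be illustrated with a deliberately simplified sketch for a single subband. This is not Martin's algorithm: it omits the optimal time-varying smoothing and the bias compensation described in (Martin, 2001), and the function name, the fixed smoothing constant and the window length are assumptions chosen for illustration.

```python
from collections import deque


def track_noise_floor(power_frames, smoothing=0.85, window=8):
    """Estimate a noise floor as the sliding minimum of smoothed power.

    power_frames: periodogram values |Y|^2 of one subband, one per frame.
    smoothing: fixed recursive smoothing constant (assumed value).
    window: number of past frames searched for the minimum (assumed value);
            it must be long enough to bridge a speech segment.
    """
    smoothed = None
    history = deque(maxlen=window)  # oldest entries drop out automatically
    floor = []
    for power in power_frames:
        if smoothed is None:
            smoothed = power
        else:
            smoothed = smoothing * smoothed + (1.0 - smoothing) * power
        history.append(smoothed)
        floor.append(min(history))
    return floor
```

With the sequence `[1, 1, 100, 100, 1, 1]` and no smoothing, a window of four frames holds the floor at 1 throughout the burst, whereas a window of two frames lets the estimate jump to 100 on the second burst frame. This is why the window must be large enough to bridge high-power speech segments.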

4.2. Segmental SNR improvement

The amounts of noise reduction, residual noise and speech distortion can be measured by the average segmental SNR improvement (Avg_SegSNR_Imp). The average segmental SNR (Avg_SegSNR) of a test signal is evaluated from the clean speech $s(m,n)$ and the enhanced signal $\hat{s}(m,n)$. It can be expressed by

$$\text{Avg\_SegSNR} = \frac{1}{M}\sum_{m\in\Omega} 10\log_{10}\frac{\sum_{n=0}^{N-1} s^2(m,n)}{\sum_{n=0}^{N-1}\left[s(m,n)-\hat{s}(m,n)\right]^2},$$

where $\Omega$ represents the set of speech-activity frames. M and N denote the number of speech-activity frames and the number of samples per frame, respectively, and m is the frame index.
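The measure can be sketched as follows. The function name and the frame bookkeeping are hypothetical: frames are assumed to be non-overlapping, the indices of the speech-activity frames are assumed to be supplied by the caller, and every active frame is assumed to have non-zero error energy.

```python
import math


def avg_seg_snr(clean, test, frame_len, active_frames):
    """Average segmental SNR in dB over the given speech-activity frames.

    clean, test: sample sequences of equal length.
    frame_len: samples per frame (non-overlapping frames assumed).
    active_frames: indices of the frames that contain speech.
    """
    total_db = 0.0
    for m in active_frames:
        start = m * frame_len
        stop = start + frame_len
        signal_energy = sum(s * s for s in clean[start:stop])
        error_energy = sum(
            (s - t) ** 2 for s, t in zip(clean[start:stop], test[start:stop])
        )
        total_db += 10.0 * math.log10(signal_energy / error_energy)
    return total_db / len(active_frames)
```

The improvement is the difference between two calls, `avg_seg_snr(clean, enhanced, ...) - avg_seg_snr(clean, noisy, ...)`, so a positive value means the enhanced signal is closer to the clean speech than the noisy input was.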

The Avg_SegSNR_Imp is computed by subtracting the Avg_SegSNR of noisy speech from that of enhanced speech. Table 1 presents the performance comparisons in terms of the Avg_SegSNR_Imp for various methods. All of the speech enhancement algorithms provide a larger Avg_SegSNR improvement in low-SNR environments. The best performance is obtained for car noise. This is because the spectra of car noise are mostly concentrated in low-frequency subbands and their magnitudes are stationary, so the noise magnitude can be estimated accurately and the added noise removed efficiently. In most experiments, the proposed approach outperforms the other two methods. In the case of babble noise with low-SNR inputs, the perceptual method performs better than the TSDD and the proposed methods. This may be because the spectral magnitude of babble noise cannot be estimated accurately, since the spectral properties of babble noise are similar to those of the speech signal. The more the noise is reduced, the greater the speech distortion. Thus the TSDD algorithm suffers the most deterioration of speech, yielding the lowest SegSNR improvement. Because the proposed method also employs the NMT to adapt the perceptual gain factor, its performance approaches that of the perceptual method.

(Table 1 is about here)

4.3. Modified Bark spectral distortion

The Bark spectral distortion (BSD) measure has been shown to be a good candidate for an objective quality measure that correlates highly with subjective quality (Wang et al., 1992). Yang et al. (1998) proposed a modified BSD (MBSD), an improved version of the BSD, to evaluate speech quality. The MBSD measure incorporates the concept of noise masking into the BSD measure, so that any distortion below the noise masking threshold (NMT) is not included. Consequently, the noise spectral components below the NMT are considered inaudible and are excluded from the calculation of the MBSD. The MBSD is expressed as (Yang et al., 1998)

$$\text{MBSD} = \frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{K} I(i)\,\left|L_s(m,i) - L_{\hat{s}}(m,i)\right|,$$

where M and K represent the number of speech-activity frames and the number of critical bands. $I(i)$ denotes the indicator of distortion at critical band $i$. $L_s(m,i)$ and $L_{\hat{s}}(m,i)$ represent the Bark spectra of the original speech and of the enhanced speech at critical band $i$ of frame m.

Table 2 presents the performance comparisons in terms of the MBSD. The minimal MBSD corresponds to the best speech quality. As with the results in Table 1, the proposed approach significantly outperforms the perceptual and the TSDD methods for white and F16-cockpit noise. When the proposed approach does not outperform the other two methods, its score is still close to the lower of the MBSD scores of the TSDD and the perceptual methods. This is attributed to the proposed method employing three steps to determine a gain factor. Thus the proposed gain factor not only removes more residual noise than the other two methods, but also preserves speech quality at an acceptable level.

(Table 2 is about here)

4.4. Waveforms

Figure 1 shows an example of waveform plots for comparison. A speech signal uttered by a female speaker was corrupted by white noise with Avg_SegSNR = 0 dB. In Figs. 1(c-e), no clipping is observed in the output waveforms of the three methods. This is because the speech enhancement systems employ the minimum statistics algorithm (Martin, 2001), which does not overestimate the noise level in each subband, so the methods do not over-attenuate the noisy speech signals.

Comparing the waveforms of enhanced speech shown in Figs. 1(c-d), the TSDD method removes more residual noise than the perceptual method. Thus integrating the decision-directed method into the perceptual gain factor can significantly improve the performance of the perceptual method by removing more residual noise. As a result, the enhanced speech of the proposed method sounds less annoying than that produced by the perceptual method. The major reason is that the speech spectra are estimated better by the TSDD method than by the power-spectral-subtraction method. Although the proposed method significantly reduces the amount of residual noise in speech-pause regions, the enhanced speech signal is not severely degraded in speech-dominant regions. Therefore, the speech quality is maintained at an acceptable level.

(Fig. 1 is about here)

4.5. Spectrograms

To obtain more information about residual noise and speech distortion, we analyze the time-frequency distribution of enhanced speech and evaluate the structure of residual noise by observing speech spectrograms. Figures 2 and 3 present spectrogram comparisons for various speech enhancement methods. A speech signal is corrupted by factory (non-stationary) and white (stationary) noise signals with Avg_SegSNR = 0 dB. In Fig. 2, a speech signal uttered by a female speaker is corrupted by factory noise (Fig. 2(b)). In the spectrograms of enhanced speech shown in Figs. 2(c-e), the harmonic spectra of vowels are preserved. Thus none of the three methods over-attenuates the noisy speech in order to remove more residual noise. In addition, the spectrograms reveal the fine spectral structure in speech-activity regions, and no muffling is observed at the output of any of the methods. Comparing the spectrograms of enhanced speech in speech-pause regions, the proposed method shown in Fig. 2(e) removes more background and residual noise than the other two methods shown in Figs. 2(c) and 2(d), so less residual noise remains. Comparing Figs. 2(c) and 2(e), the proposed method better preserves weak harmonic spectra in high-frequency subbands, so the speech quality is well maintained.

In the case of white noise (Fig. 3), the TSDD and the proposed algorithms remove more of the added noise than the perceptual method. Although the amounts of residual noise are comparable for the TSDD and the proposed methods, the TSDD algorithm leaves isolated spectral peaks that are randomly distributed over frequency subbands and successive frames in noise-dominated regions (Fig. 3(c)). This produces the musical effect of residual noise, which is annoying to the human ear. Accordingly, the residual noise of the TSDD method sounds more annoying than that of the proposed method. In addition, speech spectra with weak energy are also removed by the TSDD algorithm, yielding more speech distortion than the proposed method and degrading the speech quality. Although the spectra produced by the perceptual method (Fig. 3(d)) vary more smoothly than those produced by the other two methods, much more residual noise remains than with the TSDD (Fig. 3(c)) and the proposed (Fig. 3(e)) methods, so the residual noise sounds very annoying. The proposed method, in contrast, not only removes more residual noise but also preserves the speech spectra with weak energy. Consequently, employing the TSDD algorithm to estimate the speech spectra used to compute the noise masking threshold (NMT) improves the performance of the perceptual method (Hu and Loizou, 2004) by removing more residual noise.

(Figs. 2 and 3 are about here)

4.6. Listening tests

A subjective measure reflects the way the signal is perceived by listeners and expresses how pleasant the signal sounds. Herein, the mean opinion score (MOS) was used to evaluate the overall impression of residual noise contamination and speech distortion. The subjective listening tests were conducted with twelve listeners. Each listener gave each test signal a score between one and five. Table 3 presents the performance comparisons in terms of the MOS measure. Initially, the clean speech and the noisy speech corrupted by F16-cockpit noise were played back. Then the utterances of enhanced speech produced by the TSDD, the perceptual, and the proposed methods were played to the listeners in random order.

The three speech enhancement methods do not differ noticeably in speech distortion, so speech distortion has only a slight impact on the MOS measure. The major difference among them is the annoyance level of the residual noise. In Table 3, the proposed approach outperforms the other methods in all conditions. The major reason is that the proposed method better reduces both the amount and the magnitude of residual noise. Unlike the TSDD method, the proposed method employs the NMT to adapt the perceptual-decision-directed gain factor given in (24). This makes the spectral peaks of residual noise less isolated than those of the TSDD method. Although the noise content of the enhanced speech is comparable for these two methods, the residual noise produced by the TSDD algorithm sounds more annoying than that produced by the proposed method.

Because the speech spectra are well estimated by the TSDD algorithm, the residual noise can be reduced effectively. This is the major reason why the proposed method significantly improves the performance of the perceptual gain factor, yielding much higher MOS values than the perceptual method.

(Table 3 is about here)

4.7. Discussions

An example of the gain variation contours for subband 7, which is arbitrarily selected, is shown in Fig. 4. In a speech-dominated region with strong energy, the value of the gain factor tends towards unity for all methods, as illustrated in Figs. 4(b-d). This helps to keep speech distortion low. For speech with weak energy, the values of the gain factor for the perceptual and the proposed methods are larger than that of the TSDD method. This enables speech with weak energy, such as unvoiced speech or a weak vowel, to be preserved, so speech quality is maintained. Therefore, employing the masking properties of the human ear to adapt a gain factor preserves more speech components than methods that do not use the masking properties, such as the TSDD algorithm. In a noise-dominant region, the TSDD (Fig. 4(b)) and the proposed (Fig. 4(d)) methods achieve lower gain values than the perceptual method (Fig. 4(c)), yielding less residual noise. Although the perceptual method places the magnitude of residual noise under the noise masking threshold, the residual noise is still annoying to the human ear. This is attributed to the estimation error of the noise magnitude. Therefore, it is necessary to suppress the magnitude of residual noise as much as possible.

The major reason why the TSDD algorithm removes residual noise efficiently is that it applies the Wiener filter twice to estimate the spectra of speech. First, the decision-directed algorithm is used to estimate the a priori SNR, as given in eqs. (7) and (8). The estimated a priori SNR is then refined by the TSDD algorithm, as expressed in eqs. (9) and (11). In this study, we employ the refined spectral estimate of speech to compute the noise masking threshold (NMT), as discussed in Section 3.2. This NMT is then employed to adapt the perceptual gain factor given in (24). Therefore, the proposed approach can be regarded as a third-step perceptual gain factor, where the TSDD algorithm serves as the first two steps. This is also the reason why the proposed method removes more residual noise than the TSDD algorithm, as shown in Figs. 1 and 3.

Although the TSDD method suppresses residual noise well, it determines the gain factor only from the estimated a priori SNR. The tradeoff between residual noise and speech distortion is not considered, so its ability to prevent speech distortion decreases. In this study, we propose not only to employ the improved estimate of the a priori SNR, but also to use the NMT to adapt the gain factor for speech enhancement, which achieves an improved result. The proposed approach can therefore remove more residual noise in noise-dominant regions. This is due to the contribution of the TSDD algorithm, which is embedded in the proposed speech enhancement system. In addition, the proposed method preserves the spectra of speech with weak energy. This is due to the contribution of the perceptual gain factor given in (24), which keeps speech distortion at a low level in the enhanced speech. Consequently, the proposed approach is an improved version of the perceptual gain factor.

5. Conclusions

The spectral estimates of a speech signal play a major role in determining the noise masking threshold (NMT), which is used to adapt the perceptual gain factor. In this article, we employ the two-step-decision-directed (TSDD) algorithm to improve the accuracy of the estimated speech spectra. These spectra of pre-processed speech are then employed to compute the NMT, which is applied to adapt a perceptual gain factor. This significantly improves the ability of the perceptual gain factor to remove residual noise. In addition, the proposed approach improves on the performance of the TSDD algorithm, because the TSDD algorithm does not use the auditory properties of the human ear to adapt the speech enhancement system and does not consider the tradeoff between residual noise and speech distortion. Although the amount of residual noise of the TSDD algorithm is comparable to that of the proposed approach, the enhanced speech produced by the TSDD algorithm sounds more annoying than that of the proposed method, because of the apparent isolated spectral peaks of residual noise. Experimental results show that the proposed approach can not only significantly improve the performance of the perceptual method in removing residual noise, but also ensure enhanced speech quality at an acceptable level. Consequently, the proposed approach can be regarded as an improved version of the perceptual and of the TSDD methods.