鲁棒语音识别技术的研究

Research on Robust Speech Recognition

【Author】 Dong Jing (董婧)

【Advisor】 Zhao Xiaohui (赵晓晖)

【Author information】 Jilin University (吉林大学), Communication and Information Systems, 2007, PhD

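【Summary】 Robust speech recognition is one of the key technologies for moving speech recognition systems from laboratory theory to practical application. Its main goal is to counter the drop in recognition rate caused by mismatch between the training environment and the deployment environment. After summarizing and analyzing a range of existing robust recognition algorithms, this thesis concentrates on the effects of additive noise and studies speech enhancement, pitch extraction, endpoint detection, and the selection of robust feature parameters in depth. The modified Yule-Walker equations formed from the third-order cumulants of the noisy speech are solved recursively by the conjugate gradient method to estimate the production-model parameters and the excitation gain of the clean speech. On this basis, a Kalman-filter speech enhancement algorithm based on higher-order cumulants is proposed; the enhanced speech shows little distortion and is well suited to front-end preprocessing for a recognition system. Exploiting the fact that signal discontinuities propagate across the resolutions of the wavelet transform, and combining this with the circular average magnitude difference function, a wavelet-based circular AMDF pitch extraction algorithm (WCAMDF) is proposed. The application of wavelet multi-threshold estimation to speech enhancement is also studied. Building on the strong noise suppression of the wavelet transform, the wavelet results from pitch extraction and from speech enhancement are each combined with short-time energy and spectral entropy functions for endpoint detection, yielding two robust features for endpoint detection in noisy environments. Finally, feature extraction for robust speech recognition is examined in the feature space, and three MFCC-based improvements are proposed: TEMFCC, LDA-TEMFCC, and HOC-LPC-MFCC. Simulation experiments with the robust recognition algorithms under different noise environments show that additive noise is suppressed and confirm the robustness of the new algorithms.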
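The circular average magnitude difference function named above can be illustrated with a minimal sketch. This covers only the plain circular AMDF on a single frame; the wavelet stage of WCAMDF is not reproduced, and the function names, the search band, and the choice of the deepest valley as the pitch lag are assumptions made for illustration, not the thesis's procedure.

```python
import numpy as np


def camdf(frame):
    """Circular AMDF: D[k] = sum_n |x[(n + k) mod N] - x[n]| for k = 0..N-1."""
    x = np.asarray(frame, dtype=float)
    # np.roll(x, -k)[n] equals x[(n + k) mod N], i.e. a circular shift by k.
    return np.array([np.abs(np.roll(x, -k) - x).sum() for k in range(len(x))])


def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Return (f0_hz, lag) from the deepest CAMDF valley in [fs/fmax, fs/fmin].

    The search band and the argmin rule are illustrative assumptions.
    """
    d = camdf(frame)
    lag_lo = int(np.ceil(fs / fmax))
    lag_hi = min(int(np.floor(fs / fmin)), len(d) // 2)
    lag = lag_lo + int(np.argmin(d[lag_lo:lag_hi + 1]))
    return fs / lag, lag


# Usage: a 200 Hz tone sampled at 8 kHz has a period of 40 samples.
fs = 8000
n = np.arange(320)
tone = np.sin(2 * np.pi * 200 * n / fs)
f0, lag = estimate_pitch(tone, fs, fmin=120.0, fmax=400.0)
```

Because the shift wraps around the frame, every lag is compared over the same number of samples, which is the usual motivation for the circular form over the plain AMDF. Integer multiples of the true period give equally deep valleys, so the search band has to be narrow enough to exclude them.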

【Abstract】 Speech is the most convenient and effective means of human communication. With the rapid development and wide application of computer technology, people increasingly hope to communicate naturally with machines by speech. Automatic speech recognition (ASR) has emerged to meet this need and has made remarkable progress in recent years. It is now moving from laboratory research into real-world applications and may become the leading user interface for future operating systems and application programs.

Most speech recognition systems are designed for clean speech and can accomplish fairly complex recognition tasks with high accuracy in controlled, quiet laboratory environments. However, when an ASR system is used in a real-life situation, background noise inevitably causes a mismatch between training and testing conditions. System performance then deteriorates severely, which is the main obstacle to the commercial use of speech recognition technology. Increasing the robustness of ASR is therefore both significant and necessary. The aim of robust speech recognition is to alleviate the effect of this mismatch and achieve good recognition performance in noisy conditions. The many methods studied in this area can be broadly classified into three categories: speech enhancement in the signal space, robust feature extraction in the feature space, and speech model compensation in the model space. This thesis focuses on the first two, i.e., improving speech recognition accuracy in the signal space and the feature space under additive background noise using several new approaches. The main contributions are as follows.

1. Speech enhancement aims to extract clean speech from a noisy signal while suppressing noise, minimizing speech distortion, and enhancing speech intelligibility.
In robust speech recognition, speech enhancement usually serves as a preprocessor that delivers an almost clean speech signal to the ASR system, so the recognition system itself needs no changes to become robust. Most current enhancement algorithms have important limitations because they target only one given type of noise, and as noise becomes more diverse the techniques grow more and more complex. Moreover, many algorithms are designed with intelligibility in mind, so the enhanced speech may lose useful information, which can degrade ASR performance. To cope with these problems, a Kalman-filter speech enhancement algorithm based on higher-order cumulants is proposed.

The performance of the Kalman-filter algorithm depends mainly on the precision of the clean-speech LPC parameters and the excitation gain. Given the good robustness of higher-order cumulants to Gaussian noise, the LPC parameters of the clean signal can be estimated by solving the modified Yule-Walker (MYW) equations of the third-order cumulants of the noisy signal. The required excitation gain is then obtained approximately from the estimated model parameters and the noise variance.

The enhancement performance is evaluated with three objective measures (power spectrogram, time-domain waveform, and SNR) under nine types of noise at different SNR levels. Simulation results show that the algorithm is simple, effective, and robust in the presence of very complicated noise. Both SNR and perceptual quality improve significantly, and the distortion of the enhanced speech is very small. The algorithm is therefore also well suited to preprocessing for robust speech recognition.
In an isolated-word speech recognition system, experiments show that this cascade improves recognition accuracy at low SNR levels.

2. We propose an adaptive recursive algorithm, based on the conjugate gradient method, for estimating the AR model parameters when solving the third-order cumulant MYW equations. Compared with the estimation errors obtained on noisy AR sequences using RIV, direct inversion, and LMS, this algorithm converges fastest and is the most accurate, and it avoids large-scale matrix inversion. In addition, when the power spectra of a noisy sinusoidal sequence and of a speech signal are reconstructed by parametric spectral estimation, the model parameters estimated by conjugate gradient perform well in envelope fitting, formant sharpness, and resolution even at very low SNR.

3. Pitch detection under noisy conditions is one of the most difficult tasks in speech signal processing. Based on the way signal discontinuities propagate across the different resolutions of the wavelet transform, this thesis presents a new pitch detection method that combines the wavelet transform with the circular AMDF (WCAMDF). The method overcomes the low accuracy, high complexity, and lack of robustness of many existing pitch detection algorithms. Simulation results indicate that the proposed algorithm offers better pitch detection precision for speech under strong background noise, together with low computational complexity, high resolution, and suitability for real-time implementation.

4. The wavelet transform adapts to the signal. This thesis studies multi-threshold estimation of regular signals based on the wavelet transform and its application to speech enhancement, in which the noisy speech is denoised using the wavelet transform. Theoretical analysis and experiments indicate that the SURE translate soft threshold is the best suited to speech signals and gives very good enhancement performance.
Evaluations based on the power spectrogram, time-domain waveform, and SNR show that this method is effective in noisy conditions.

5. Voice activity detection (VAD) plays a very important role in ASR systems. Correct endpoint detection reduces computational cost and shortens run time, and a major cause of recognition errors is incorrect detection of the beginning and ending boundaries of the test utterance. Every recognition system therefore needs VAD that is reliable, accurate, real-time, adaptive, and robust. Based on the wavelet transform, this thesis proposes two novel strategies for accurate and robust endpoint detection in noisy environments.

1) Endpoint detection based on WCAMDF pitch extraction. WCAMDF extracts accurate pitch information despite variations in the noise environment. By using the magnitude envelope of the CAMDF computed during pitch extraction, the proposed algorithm achieves improved robustness in both detection accuracy and recognition performance at low SNR levels, with an average reduction in recognition error rate of more than 21%.

2) Endpoint detection based on wavelet energy-entropy. Detection using basic energy and spectral entropy becomes difficult and inaccurate when the speech is contaminated by colored noise, whereas a key property of the wavelet transform is that the residual noise in the enhanced speech is almost white. The two are therefore coupled closely: instead of computing the energy-entropy feature from the original noisy signal, the feature is computed after the wavelet transform. This modification outperforms basic energy-entropy and improves the discriminability between speech and noise, which makes the threshold easier to set.

The two endpoint detection approaches can run alongside pitch extraction or speech enhancement.
They are real-time, simple, easy to implement, and of low model complexity, which is especially important in large-vocabulary ASR systems where processing power and memory are limited.

6. In real-world ASR applications, robust feature extraction is one of the most crucial issues. It aims to find succinct, salient, and representative characteristics of the noisy speech utterance that allow it to be discriminated. Robust features are needed to deliver acceptable recognition performance across various noisy environments. Mel-frequency cepstral coefficients (MFCC) are widely accepted as a good choice of speech features with reasonable robustness, and many advanced techniques have been built on them. This thesis proposes three improved methods based on MFCC.

Teager energy-entropy MFCC (TEMFCC). Teager energy-entropy features are commonly used to locate the endpoints of an utterance. When integrated with MFCC, they yield an average accuracy increase of 10% over MFCC in the baseline system.

LDA-TEMFCC. Adding Teager energy-entropy increases the dimension of the feature vectors. To overcome this shortcoming, Linear Discriminant Analysis (LDA) is used for classification and dimensionality reduction of the feature vectors. The 20-dimensional LDA-TEMFCC robust features yield a 6% increase in recognition performance over the 24-dimensional MFCC baseline.

HOC-LPC-MFCC. MFCC derived directly from the power spectrum of noisy speech is excessively sensitive to external additive colored noise and generally degrades recognition performance in noisy conditions. Exploiting the strong Gaussian-noise suppression of higher-order cumulants (HOC), HOC-LPC-MFCC feature vectors are developed: the speech power spectrum is reconstructed from the model parameters estimated from the third-order cumulants of the noisy signal, and MFCC is derived from the reconstruction.
The experimental results show that, compared with plain MFCC, the proposed features achieve significant noise robustness in all conditions.
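The Teager energy and spectral entropy quantities that the endpoint detection and TEMFCC items rely on can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how the two quantities are combined or framed, so `frame_features`, its frame length and hop, and the use of the mean Teager energy per frame are hypothetical choices, not the thesis's feature definition.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def spectral_entropy(frame, eps=1e-12):
    """Shannon entropy (in nats) of the frame's normalized power spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    p = power / (power.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())


def frame_features(signal, frame_len=256, hop=128):
    """Per-frame (mean Teager energy, spectral entropy) pairs.

    Hypothetical layout: frame size, hop, and the pairing are assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append((teager_energy(frame).mean(), spectral_entropy(frame)))
    return np.array(feats)
```

For a pure sinusoid A sin(Ωn + φ) the Teager operator returns the constant A² sin²(Ω), so it tracks both amplitude and frequency. Spectral entropy is low for narrowband, voiced-like frames and high for noise-like frames, which is why the two are natural candidates for separating speech from background.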

  • 【Online publication contributor】 Jilin University
  • 【Online publication year/issue】 2007, Issue 03
  • 【Classification number】 TN912.34
  • 【Citations】 14
  • 【Downloads】 1238