
The Research of Monaural Speech Segregation Based on Computational Auditory Scene Analysis

(Original Chinese title: 基于计算听觉场景分析的单通道语音分离研究)

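【Author】 Wang Yu (王雨)

【Advisor】 Lin Jiajun (林家骏)

【Author Information】 East China University of Science and Technology, Control Science and Engineering, 2013, Ph.D.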

【Abstract】 A monaural speech segregation system extracts target speech from noisy background interference within a single channel, and it commonly serves as the front end for speech recognition and speaker recognition. A segregation system based on Computational Auditory Scene Analysis (CASA) uses a computer to simulate how the human ear perceives and tracks target speech, and in this way accomplishes monaural speech segregation. Because its processing of a speech mixture resembles human auditory perception, CASA has become one of the most active research topics in speech segregation in recent years. This dissertation studies CASA in depth, describes the structure and development background of CASA-based monaural speech segregation systems, and proposes an improved segregation system built on the conventional CASA system. The main contributions are as follows:

(1) Effective energy feature extraction with an improved threshold. Energy is an important feature when extracting and segregating the voiced portions of natural speech. The conventional CASA system applies a single constant threshold when computing the effective energy feature. Noise, however, is varied and uncertain: when the distribution of the noise in the mixture is unknown, background noise interferes differently with the effective energy feature in each frequency channel, and a constant threshold cannot remove the noise-dominated units effectively. This dissertation therefore selects the threshold for each channel from that channel's average energy and uses it to extract the time-frequency response energy, which improves the accuracy of effective energy feature extraction.

(2) Iterative pitch estimation based on target-source units. The conventional pitch estimation algorithm does not remove interference units. It computes the pitch directly from the autocorrelation responses of all units in the channels, which introduces errors into the pitch estimates. The proposed algorithm estimates pitch only from units labeled as target. It first discards the units labeled as interference, extracts the pitch from the remaining estimated target units, and then relabels the target units according to the estimated pitch contour. Unit labeling and pitch estimation are iterated until the fundamental frequency of every frame in each voiced segment becomes stable. Experiments show that the algorithm is more robust and accurate than the conventional pitch extraction algorithm under a variety of interference conditions.

(3) Unvoiced speech segregation based on spectral subtraction. After the voiced speech, which carries pitch-period features, has been extracted, the unvoiced speech must still be separated from the residual noise. Because the noise distribution is uncertain and nonstationary, the proposed method uses a distance-weighted residual noise estimation algorithm to obtain the noise energy contained in each unvoiced unit. It then applies spectral subtraction to each unvoiced unit, labels the unit, discards the residual noise units, and extracts the unvoiced speech. The method estimates time-varying residual noise more accurately than the conventional one and improves the performance of unvoiced speech segregation.

(4) Mask smoothing based on morphological image processing. The binary mask obtained after grouping is used for the final speech resynthesis. Because pitch extraction and target unit labeling inevitably contain errors in noisy conditions, the binary mask often contains scattered residual noise particles and broken speech segments, which degrade the quality and intelligibility of the resynthesized speech. To reduce this effect, the dissertation smooths the grouped binary mask by combining morphological operations such as dilation and erosion. This removes the unwanted particles and repairs the broken segments while preserving the detail of the original mask, and it further improves the quality of the segregated speech.
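The mask smoothing described in contribution (4) can be illustrated with a minimal sketch. The abstract only says that dilation and erosion are combined, so the specific choices here are assumptions made for illustration and not the dissertation's method: a 3x3 structuring element, a closing followed by an opening, and the hypothetical function names `dilate`, `erode`, and `smooth_mask`. The mask is taken to be a boolean array of shape (frequency channels, time frames).

```python
import numpy as np


def _neighbourhoods(mask, pad_value):
    """Stack the 3x3 neighbourhood of every T-F unit into a (9, C, F) array."""
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    rows, cols = mask.shape
    return np.stack([padded[r:r + rows, c:c + cols]
                     for r in range(3) for c in range(3)])


def dilate(mask):
    # A unit becomes True if any unit in its 3x3 neighbourhood is True.
    return _neighbourhoods(mask, False).any(axis=0)


def erode(mask):
    # A unit stays True only if its whole 3x3 neighbourhood is True.
    # Padding with True (an assumption) keeps the mask border from being
    # eroded merely because it touches the edge of the array.
    return _neighbourhoods(mask, True).all(axis=0)


def smooth_mask(mask):
    """Closing (fill small holes) followed by opening (drop isolated specks)."""
    mask = np.asarray(mask, dtype=bool)
    closed = erode(dilate(mask))
    return dilate(erode(closed))
```

Closing first repairs small breaks inside a speech segment before the opening removes isolated particles. Running the opening first would let a single missing unit erode much of the surrounding segment. Larger or asymmetric structuring elements may suit real time-frequency masks better, and that choice is left open here.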
