
Design and Implementation of an Audio-Visual Bimodal In-Vehicle Speech Control System

【Author】 严乐贫

【Supervisor】 贺前华

【Author Information】 South China University of Technology, Communication and Information Systems, 2010, Master's thesis

【Abstract】 Applying voice control in the driving environment helps free the driver's hands and eyes, improving both driving safety and driving pleasure. At present, however, the recognition rate achievable from audio information alone in noisy environments is low, which restricts the development of in-vehicle voice control. Visual speech information (lip-reading) can assist audio recognition and raise the recognition rate of a speech recognition system under noise, and because the driver's position is fixed while driving, image capture is convenient, making it practical to exploit visual information in an in-vehicle voice control system. Noise-robust bimodal (audio-visual) speech recognition for in-vehicle voice control has therefore become an important research topic. To accelerate the development of in-vehicle voice control systems, this thesis builds a bimodal in-vehicle voice control simulation system on a PC platform, providing a reference for embedded in-vehicle voice control systems. The main work is as follows:

(1) The basic principles of audio-visual speech recognition and the related techniques are reviewed, and a design for the bimodal in-vehicle voice control simulation system is proposed. The overall architecture adopts medium-vocabulary continuous speech recognition. Mel-Frequency Cepstral Coefficients (MFCC), which reflect human auditory characteristics and are comparatively robust to noise, are chosen as the audio features; Hidden Markov Models (HMM) serve as the acoustic model; pixel-based features of the lip-contour region serve as the visual features. Feature-level and decision-level fusion are compared, and the audio and visual streams are combined with a late (decision-level) fusion strategy.

(2) Driven by the practical needs of in-vehicle voice control, a bimodal database for in-vehicle control speech recognition (BiMoSp) was collected. Existing domestic and international audio-visual databases were analyzed, criteria for building a bimodal database were summarized, and the database was constructed according to these criteria. To reduce the annotation workload, labeling software was designed, and all data in the database were labeled with it.

(3) The bimodal in-vehicle speech recognition and control system (BSVCS) was designed and implemented. The system consists of three subsystems: model training, offline recognition, and online recognition; they are structurally interconnected but functionally independent, each is composed of functional modules, and modules with identical functions are shared across subsystems. The model training subsystem trains the acoustic and visual models in separate audio and visual channels for use by the offline and online recognition subsystems. Audio signal processing is performed by calling the ATK (Application Toolkit for HTK) interface from Visual C/C++. To make algorithm upgrades easier, the video signal processing module is packaged as a dynamic link library. The offline recognition subsystem includes a result-statistics module so that test results can be presented intuitively, and the online recognition subsystem implements a spoken human-machine dialogue interaction flow together with result normalization and optional result handling, providing good human-machine interaction and reducing interference from extraneous speech.

(4) The recognition performance of the simulation system was evaluated in a variety of acoustic environments and the results are discussed. The experiments show that, compared with audio-only speech recognition, bimodal audio-visual recognition is markedly more robust to noise and is better suited to in-vehicle voice control.
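
The decision-level (late) fusion named in item (1) can be illustrated with a small sketch. The thesis does not spell out the exact combination rule, so the weighted sum of per-candidate audio and visual log-likelihood scores below, the function name fuseDecisions, and the stream weight of 0.6 are assumptions used purely for illustration; C++ is used to match the Visual C/C++ environment of the system.

```cpp
#include <iostream>
#include <limits>
#include <map>
#include <string>

// Hypothetical late-fusion scorer: the thesis only states that the audio and
// visual recognition results are combined after recognition (decision-level
// fusion); the weighted log-likelihood rule and weight value are assumptions.
std::string fuseDecisions(const std::map<std::string, double>& audioLogLik,
                          const std::map<std::string, double>& visualLogLik,
                          double audioWeight /* stream weight in [0, 1] */)
{
    std::string best;
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const auto& entry : audioLogLik) {
        auto it = visualLogLik.find(entry.first);
        if (it == visualLogLik.end()) continue;   // score only shared candidates
        double score = audioWeight * entry.second
                     + (1.0 - audioWeight) * it->second;
        if (score > bestScore) { bestScore = score; best = entry.first; }
    }
    return best;
}

int main() {
    // Toy scores for two command words; in the real system these would come
    // from the audio (ATK/HMM) recognizer and the visual recognizer.
    std::map<std::string, double> audio  = { {"open_window", -310.2}, {"close_window", -305.7} };
    std::map<std::string, double> visual = { {"open_window", -118.4}, {"close_window", -131.9} };
    std::cout << fuseDecisions(audio, visual, 0.6) << "\n";   // weight favours audio
    return 0;
}
```

In practice the stream weight would be tuned on held-out data and could be lowered as the estimated acoustic signal-to-noise ratio drops, which is where the noise robustness of the visual stream pays off.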
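
Item (3) notes that the video processing module is packaged as a dynamic link library so the lip-feature algorithm can be upgraded without rebuilding the rest of the system. Below is a minimal sketch of how such a module might be loaded from the Visual C/C++ host; the DLL name LipFeature.dll and the exported function ExtractLipFeatures with its signature are hypothetical, not taken from the thesis.

```cpp
#include <windows.h>
#include <iostream>

// Assumed export of the visual-feature DLL: extracts lip features from one
// grey-scale frame and returns the number of feature values written.
typedef int (*ExtractLipFeaturesFn)(const unsigned char* frame,
                                    int width, int height, float* features);

int main() {
    HMODULE dll = LoadLibraryA("LipFeature.dll");   // swap in a newer DLL to upgrade the algorithm
    if (!dll) { std::cerr << "cannot load visual module\n"; return 1; }

    ExtractLipFeaturesFn extract =
        (ExtractLipFeaturesFn)GetProcAddress(dll, "ExtractLipFeatures");
    if (!extract) { std::cerr << "missing export\n"; FreeLibrary(dll); return 1; }

    unsigned char frame[320 * 240] = { 0 };          // placeholder frame buffer
    float features[32] = { 0 };
    int n = extract(frame, 320, 240, features);
    std::cout << "extracted " << n << " visual features\n";

    FreeLibrary(dll);
    return 0;
}
```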
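
The result-statistics module of the offline recognition subsystem is described only at a high level. Since the system is built on ATK/HTK, one plausible choice is the HTK-style word correctness and accuracy figures obtained from a minimum-edit-distance alignment of the reference and recognized word sequences; the sketch below assumes exactly that and is not the thesis' actual implementation.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Counts of substitutions, deletions and insertions against N reference words.
struct WordStats { int N, sub, del, ins; };

WordStats alignWords(const std::vector<std::string>& ref,
                     const std::vector<std::string>& hyp)
{
    const int n = ref.size(), m = hyp.size();
    // d[i][j] = minimum edit cost aligning ref[0..i) with hyp[0..j)
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 0; i <= n; ++i) d[i][0] = i;
    for (int j = 0; j <= m; ++j) d[0][j] = j;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j) {
            int subCost = d[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            d[i][j] = std::min({ subCost, d[i - 1][j] + 1, d[i][j - 1] + 1 });
        }
    // Trace back through the cost table to count error types.
    WordStats s{ n, 0, 0, 0 };
    int i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && d[i][j] == d[i-1][j-1] + (ref[i-1] == hyp[j-1] ? 0 : 1)) {
            if (ref[i-1] != hyp[j-1]) ++s.sub;
            --i; --j;
        } else if (i > 0 && d[i][j] == d[i-1][j] + 1) { ++s.del; --i; }
        else { ++s.ins; --j; }
    }
    return s;
}

int main() {
    // Placeholder command words standing in for the Chinese control vocabulary.
    std::vector<std::string> ref = { "open", "left", "window" };   // reference transcript
    std::vector<std::string> hyp = { "open", "window" };           // recognizer output
    WordStats s = alignWords(ref, hyp);
    double correct  = 100.0 * (s.N - s.sub - s.del) / s.N;
    double accuracy = 100.0 * (s.N - s.sub - s.del - s.ins) / s.N;
    std::cout << "%Correct=" << correct << "  %Accuracy=" << accuracy << "\n";
    return 0;
}
```

The distinction mirrors HTK reporting: %Correct ignores insertion errors, while %Accuracy also subtracts them, so the latter is the stricter figure for a command-and-control task.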
