基于统计模式识别发音错误自动检测的研究
A Study on Automatic Mispronunciation Detection Based on Statistical Pattern Recognition
【Author】 Zhang Feng;
【Author information】 University of Science and Technology of China, Signal and Information Processing, 2009, Ph.D.
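【Abstract (translated from Chinese)】 Automatic mispronunciation detection is a key technology of computer-assisted language learning (CALL) systems and largely determines their performance. Reliable automatic mispronunciation detection helps a CALL system gauge the learner's command of the language, identify the learner's pronunciation defects, give targeted advice for improvement, and supply matching learning material, thereby effectively raising the learner's language level. This thesis analyzes in depth the mainstream approach to automatic mispronunciation detection, which is based on statistical pattern recognition, carries out targeted research on both the acoustic model and the back-end processing, and builds a mispronunciation detection system with stable performance. The specific work and results are summarized as follows. First, the thesis surveys automatic mispronunciation detection technology and, after analyzing the research background and the state of the art, adopts a strategy based on statistical speech recognition as the basic detection method. In introducing the baseline detection system, it focuses on the algorithms that compute the mispronunciation detection measure (score). To address the shortcomings of the existing scoring algorithms in practical use, the thesis proposes the SLPP algorithm, whose detection performance is clearly better than that of the existing algorithms. In introducing the experimental databases, it analyzes the agreement among several experts' detection results on those databases, which establishes the performance of human mispronunciation detection and shows how challenging the automatic task is. Second, to improve the acoustic model, the thesis introduces adaptation techniques from statistical speech recognition. They are applied to the test data to reduce the mismatch between test and training data, and also to the training data to estimate an effective speaker-independent canonical model. For adaptation on the test data, the thesis introduces the mature MLLR algorithm from speech recognition. Because the goals of mispronunciation detection and speech recognition differ, MLLR does not necessarily improve detection performance, so the thesis proposes SMLLR, an adaptation technique designed for the detection goal. For adaptation on the training data, the thesis introduces the SAT algorithm from speech recognition to produce a canonical acoustic model and improve detection performance. Because a canonical model is even less consistent with the test data, SAT must be combined with SMLLR to improve the system effectively. Third, still on acoustic modeling, the thesis adopts the idea of discriminative training from speech recognition and sets an acoustic-modeling objective function consistent with the goal of mispronunciation detection. By reviewing the discriminative training methods used in speech recognition, it explains how these methods align with the objective of raising recognition accuracy. For the detection task, it then analyzes the task's objective function and the corresponding discriminative training strategies, proposes that discriminative training for mispronunciation detection must be consistent with the detection scoring algorithm, and proposes that the training database needs mispronounced samples in addition to correctly pronounced ones, since otherwise discriminative training may have little effect. In addition, beyond acoustic modeling, the thesis proposes two back-end processing strategies: three-dimensional back-end normalization and machine-learning-based back-end processing. First, from an analysis of expert ratings and experimental data, it proposes introducing, at the speaker level, a feature describing the speaker's overall pronunciation proficiency. Second, from an analysis of text-dependent posterior probabilities, it proposes introducing, at the content level, a phoneme-class feature. Third, from an analysis of interference problems in system use, it proposes introducing, at the time level, a feature based on the scores of the preceding and following context. By introducing these three levels of features, it arrives at the three-dimensional back-end normalization strategy, which greatly improves system performance. This strategy still has some problems, such as the handling of multi-dimensional features. To solve them, we propose a more reliable machine-learning-based back-end strategy that uses an SVM to optimize the handling of multi-dimensional features. Finally, the work above yields a mispronunciation detection system with fairly stable performance. On this basis, the thesis proposes an automatic acoustic model updating strategy for mispronunciation detection, which acquires unlabeled raw data, processes the mispronounced samples, and continually improves system performance. The thesis first analyzes the algorithm that generates the detection measure and shows why mispronunciations need to be modeled. Then, from an analysis of the characteristics of mispronunciations and of unsupervised parameter estimation, it proposes several mispronunciation modeling strategies, among which semi-supervised clustering performs best. Using the reasonably reliable detection system already built together with the mispronunciation modeling algorithm, the thesis then proposes the automatic acoustic model updating strategy, which can process unlabeled raw data, improve the modeling space of the acoustic model, and raise the performance of the detection system.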
【Abstract】 Automatic mispronunciation detection is the key technique of a computer-assisted language learning (CALL) system. With an automatic mispronunciation detection module, a CALL system can evaluate the language learner, analyze the learner's pronunciation defects, and give specific advice and the most suitable training materials in order to improve the learner's pronunciation. This thesis focuses on automatic mispronunciation detection based on statistical pattern recognition and carries out thorough research on the acoustic model and the back-end processing. The specific work and research findings of this thesis are summarized below. Firstly, after a survey of current technology, an automatic mispronunciation detection system based on statistical speech recognition is adopted as the basic strategy, and a brief introduction to this system is given. The thesis also details the algorithms that compute the mispronunciation scoring measure and their defects in actual usage. To eliminate these defects, the SLPP algorithm is proposed. In introducing the experimental databases, the consistency of the experts' mispronunciation judgments on these databases is analyzed. This shows the performance of human mispronunciation detection and indicates that automatic mispronunciation detection is a challenging task. Secondly, in the area of acoustic modeling, to reduce the mismatch between the training and testing data and to build a speaker-independent canonical model, this thesis introduces adaptation technology into the mispronunciation detection system in both testing and training. In testing, speaker adaptation based on maximum likelihood linear regression (MLLR), as used in speech recognition, is introduced. Taking into account the different objectives of speech recognition and mispronunciation detection, a selective maximum likelihood linear regression (SMLLR) strategy is proposed specifically for mispronunciation detection. In training, speaker adaptive training (SAT) from speech recognition is introduced, which is a useful approach to speaker normalization that reduces the overlap within the speaker-independent model caused by variation among the speakers of the training data. The SAT and SMLLR strategies must be used together, because the canonical model alone is even less consistent with the testing data. Thirdly, still in the area of acoustic modeling, besides adaptation technology, this thesis makes use of the notion of discriminative training, originally developed for speech recognition, and analyzes the objective function that is consistent with the target of mispronunciation detection. A review of the various discriminative training methods for speech recognition shows the connection between these methods and the target of speech recognition. Based on an analysis of the target of the mispronunciation detection task and the related objective functions, this thesis proposes that the discriminative training strategy must be consistent with the mispronunciation scoring measure. Furthermore, mispronounced samples are needed in the training database for discriminative training for mispronunciation detection. Fourthly, besides a proper strategy for acoustic modeling, improving the back-end processing can also improve the mispronunciation detection system. In this thesis, three-dimension back-end normalization and machine-learning back-end processing strategies are proposed. The three dimensions are the speaker level, the content level and the time level. Based on an analysis of the expert ratings and experimental data, this thesis proposes a feature for the speaker's overall pronunciation score at the speaker level. Based on an analysis of the content-dependent posterior probability algorithm, it proposes a phoneme-related feature at the content level. Based on problems observed in actual usage, it proposes a context-related feature at the time level. Using these three features, this thesis proposes the three-dimension back-end normalization strategy. To avoid some defects of this strategy, a machine-learning back-end processing strategy is proposed, which can handle a growing set of multiple features. Finally, a reliable mispronunciation detection system can be achieved with the preceding strategies for acoustic modeling and back-end processing. On the basis of this system, the thesis proposes a strategy for automatically updating the acoustic model by means of mispronunciation modeling. The necessity of mispronunciation modeling is shown by an analysis of the algorithms that compute the mispronunciation scoring measure. Several strategies are proposed for modeling mispronunciations. Among them, the semi-supervised cluster modeling strategy based on unsupervised parameter estimation performs best. Consequently, using the reliable system and the mispronunciation modeling algorithm, this thesis proposes a strategy for automatically updating the acoustic model for mispronunciation detection, which can continuously improve the acoustic modeling space and the performance of the system.
【Key words】 Automatic Mispronunciation Detection; Statistical Speech Recognition; SLPP; SMLLR; DT; Back-end Processing; Machine Learning; Semi-supervised Clustering;
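
The abstract names a posterior-based scoring measure and three levels of back-end features (speaker, content, time) but gives no formulas. The following is only a minimal sketch of how such quantities might be assembled, not the thesis's SLPP or normalization method. The function names `log_posterior` and `backend_features`, the uniform phone prior, the dictionary keys, and the boundary handling are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def log_posterior(target, loglikes):
    """Log posterior of the target phone for one segment.

    loglikes maps each candidate phone to its acoustic log-likelihood.
    Assumption: uniform prior over the candidate phones.
    """
    m = max(loglikes.values())
    # log-sum-exp for a numerically stable normalizer
    log_norm = m + math.log(sum(math.exp(v - m) for v in loglikes.values()))
    return loglikes[target] - log_norm


def backend_features(segments):
    """Build one feature tuple per phone segment of a single utterance.

    segments: list of dicts with hypothetical keys
    'speaker', 'phone_class', 'score', in time order.
    Returns (score, speaker mean score, phone class, previous score, next score).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for seg in segments:
        entry = totals[seg["speaker"]]
        entry[0] += seg["score"]
        entry[1] += 1

    feats = []
    for i, seg in enumerate(segments):
        total, count = totals[seg["speaker"]]
        # Assumption: at utterance boundaries, reuse the segment's own score.
        prev_score = segments[i - 1]["score"] if i > 0 else seg["score"]
        next_score = segments[i + 1]["score"] if i + 1 < len(segments) else seg["score"]
        feats.append((seg["score"], total / count, seg["phone_class"],
                      prev_score, next_score))
    return feats
```

Feature tuples of this kind could then be passed to a classifier such as the SVM mentioned in the abstract. How the thesis actually encodes and normalizes the features is not stated in this record.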