
极小化标注的音频分类和句子切分的研究

Research on Label-minimized Audio Classification and Sentence Segmentation

【Author】 赵群 (Zhao Qun)

【Supervisor】 张巍 (Zhang Wei)

【Author information】 Ocean University of China, Computer Software and Theory, 2010, Master's thesis

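【Summary】 Automatic construction of speech corpora plays an important role in trainable speech synthesis. It requires classifying the input audio so that each class can be processed differently, and then splitting the processed audio into sentences that serve as input to the subsequent segment-level segmentation system. Audio classification and sentence segmentation are the key techniques for this task. Existing audio classification and sentence segmentation techniques, however, need large amounts of manually labeled data to train models and evaluate results, and manual labeling is time-consuming and laborious, which substantially raises the cost of building a system. Against this background, research on label-minimized audio classification and sentence segmentation has considerable theoretical and practical value. This thesis therefore presents an in-depth, systematic study of content-based audio classification and of sentence segmentation that does not depend on speech recognition, covering feature selection, label minimization, improvements to key techniques, and applications of the related techniques. The specific work and results are as follows. 1) The main sources of audio information and the semantic content of audio are analyzed in depth. Based on the characteristics of the news-reading audio used, audio is divided into three classes: pure speech, pure music, and speech mixed with music. The features that discriminate between these classes are studied at both the frame level and the segment level. In addition to basic features such as frequency-domain energy, zero-crossing rate, and MFCC parameters, new features are adopted: silence ratio, High-ZCR ratio, and Low frequency energy ratio. The first innovation of this thesis is that, after an in-depth analysis of the advantages of the co-training algorithm in minimizing the amount of labeled data while preserving classification accuracy, co-training based on maximum entropy classification is used for audio classification. Experiments demonstrate the performance of co-training on audio classification. 2) To achieve label minimization, the co-training algorithm based on maximum entropy (Maxent) classification is studied in depth. Co-training is the core of label minimization. The effect of different parameter settings on classification accuracy is compared and, taking time and computational cost into account, the best-performing set of parameters is determined. In addition, the classification mode of the Maxent classifier is adjusted to suit the numerical classification used in audio classification and sentence segmentation. Experiments demonstrate the performance of co-training in minimizing the manually labeled data that is needed and in binary classification, which provides a solid foundation for label-minimized audio classification and sentence segmentation. 3) Following an analysis of the shortcomings of sentence segmentation methods that depend on speech recognition, the important role of prosodic features in sentence segmentation is studied in depth. On this basis, audio is classified at the frame level into vowel, consonant, and pause, and two feature sets built from prosodic features, pause features, and speaking rate are used to segment the audio into sentences on a semantic basis. To make sentence segmentation label-free, a method for generating labeled data, based on forced alignment and speech recognition and equipped with an error-checking mechanism, is introduced to supply labeled data automatically. Co-training based on maximum entropy classification is then used to counter the effect of insufficient labeled data on classification accuracy, achieving label-free, recognition-independent sentence boundary detection. Finally, because it cannot otherwise be determined whether a detected sentence boundary is a true boundary, an error-checking mechanism is proposed: the segmentation result is checked by comparing the corresponding proportions of vowel counts in the text and in the audio after vowel/consonant/pause classification, so that the sentence boundaries that are definitely correct can be identified and used directly in subsequent processing and systems. The second innovation of this thesis is that it makes the sentence segmentation system label-free and proposes an error-checking mechanism for identifying and extracting the true sentence boundaries.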

【Abstract】 Automatic construction of a speech database is particularly important for speech synthesis. It requires distinguishing the category of the input audio so that each category can be treated differently, and segmenting the processed audio into sentences, which serve as the input to the subsequent automatic syllable segmentation system. Audio classification and sentence segmentation are the key technologies for solving these problems. In addition, existing audio classification and sentence segmentation methods require a large quantity of manually labeled data to train the model and test the results. Such data is expensive, time-consuming, and laborious to prepare, which greatly increases the cost of system construction. For this reason, research on label-minimized audio classification and sentence segmentation has high research value and practical use. This thesis therefore studies content-based audio classification and sentence segmentation without speech recognition in depth and systematically, including feature selection, label minimization, improvements to key technologies, and related applications. The detailed research work in this thesis is as follows. (1) The main sources of audio information and the semantic content of audio are analyzed in depth and, based on the characteristics of the news broadcast audio used, audio clips are classified into three classes: pure speech, pure music, and speech mixed with music. Based on in-depth research into the distinguishing characteristics of audio features at the frame level and the clip level, new features are introduced in addition to basic features such as frequency energy, zero-crossing rate, and MFCCs: silence ratio, High ZCR ratio, and Low frequency energy ratio.
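As a rough illustration of two of the clip-level features named above, the following minimal sketch computes a silence ratio and a High ZCR ratio from per-frame values. The abstract does not define these features, so the definitions here are assumptions: a frame counts as silent when its short-time energy falls below a caller-supplied threshold, and as high-ZCR when its zero-crossing rate exceeds 1.5 times the clip mean. All function names and the toy clip are hypothetical. The Low frequency energy ratio is left out because its definition cannot be inferred from the abstract.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing by hop (partial tail dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def high_zcr_ratio(zcrs, factor=1.5):
    """Share of frames whose ZCR exceeds factor times the clip mean (assumed definition)."""
    mean_zcr = sum(zcrs) / len(zcrs)
    return sum(1 for z in zcrs if z > factor * mean_zcr) / len(zcrs)

def silence_ratio(energies, threshold):
    """Share of frames whose energy is below threshold (assumed definition)."""
    return sum(1 for e in energies if e < threshold) / len(energies)

def clip_features(samples, frame_len, hop, silence_threshold):
    """Compute the two clip-level ratios from frame-level ZCR and energy."""
    frames = frame_signal(samples, frame_len, hop)
    zcrs = [zero_crossing_rate(f) for f in frames]
    energies = [short_time_energy(f) for f in frames]
    return {
        "high_zcr_ratio": high_zcr_ratio(zcrs),
        "silence_ratio": silence_ratio(energies, silence_threshold),
    }

# Toy clip: one silent frame, one rapidly alternating frame, one slowly varying frame.
clip = [0.0] * 8 + [1.0, -1.0] * 4 + [1.0] * 4 + [-1.0] * 4
features = clip_features(clip, frame_len=8, hop=8, silence_threshold=0.01)
```

In this toy clip one frame in three is silent and one in three is high-ZCR. Real clips would use windowed frames of a few tens of milliseconds, and the thresholds would need tuning on data.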
The first innovation of this thesis is that, following an in-depth analysis of the advantages of the co-training algorithm in minimizing the amount of labeled data while maintaining classification accuracy, a co-training algorithm based on maximum entropy (Maxent) is used for audio classification. Experimental results demonstrate the performance of co-training in audio classification. (2) To implement label minimization, the co-training algorithm based on the maximum entropy classifier is studied in detail. Co-training is the core of label minimization. By comparing the effect of different parameter settings on classification accuracy and analyzing the cost in time and computation, the optimal set of parameters is determined. Meanwhile, the classification mode of Maxent is adjusted for the numerical classification used in audio classification and sentence segmentation. Experimental results demonstrate the performance of co-training in binary classification and in minimizing the amount of labeled data, which provides a solid foundation for implementing label-minimized audio classification and sentence segmentation systems. (3) Based on an in-depth analysis of the shortcomings of sentence segmentation methods that rely heavily on speech recognition results, and on research into the important role of prosodic features in sentence segmentation, semantic sentence segmentation is performed on the audio by applying vowel/consonant/pause (V/C/P) classification at the frame level and using prosodic features, pause features and rate of speech (ROS) as two feature sets. A labeled-data generation approach with a checking mechanism, based on forced alignment and speech recognition, is introduced to provide labeled data automatically and make sentence segmentation label-free.
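The idea of Maxent-based co-training for a binary task can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the Maxent classifier is approximated by a binary logistic regression trained with batch gradient ascent, each example has two feature views, and in every round each view's classifier labels the unlabeled examples it is most confident about. The function names, the parameters `rounds` and `per_round`, and the toy data are all hypothetical.

```python
import math

def predict_proba(w, x):
    """P(label = 1 | x) for weights w, where w[-1] is the bias."""
    z = w[-1] + sum(wj * v for wj, v in zip(w, x))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, epochs=200, lr=0.5):
    """Binary maximum-entropy (logistic regression) model via batch gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = label - predict_proba(w, x)
            for j, v in enumerate(x):
                grad[j] += err * v
            grad[-1] += err
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def co_train(view_a, view_b, labels, rounds=5, per_round=2):
    """labels[i] is 0 or 1 for labeled examples and None for unlabeled ones."""
    labels = list(labels)
    for _ in range(rounds):
        known = [i for i, l in enumerate(labels) if l is not None]
        unknown = [i for i, l in enumerate(labels) if l is None]
        if not unknown:
            break
        y_known = [labels[i] for i in known]
        for view in (view_a, view_b):
            # Both classifiers are trained on the labeled set from the round start.
            w = train_maxent([view[i] for i in known], y_known)
            scored = sorted(
                ((abs(predict_proba(w, view[i]) - 0.5), i)
                 for i in unknown if labels[i] is None),
                reverse=True)
            # Each view pseudo-labels its most confident unlabeled examples.
            for _, i in scored[:per_round]:
                labels[i] = 1 if predict_proba(w, view[i]) >= 0.5 else 0
    known = [i for i, l in enumerate(labels) if l is not None]
    y_known = [labels[i] for i in known]
    model_a = train_maxent([view_a[i] for i in known], y_known)
    model_b = train_maxent([view_b[i] for i in known], y_known)
    return model_a, model_b, labels

# Toy data: two one-dimensional views, only the first and last examples labeled.
view_a = [[-1.0], [-0.8], [-0.6], [-0.4], [0.4], [0.6], [0.8], [1.0]]
view_b = [[-0.9], [-0.5], [-0.7], [-1.0], [1.0], [0.7], [0.5], [0.9]]
seed_labels = [0, None, None, None, None, None, None, 1]
model_a, model_b, completed = co_train(view_a, view_b, seed_labels)
```

Starting from two labeled examples, the loop fills in the remaining six labels. How the thesis splits its features into views and which parameter settings it found best are not stated in the abstract.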
In addition, Maxent-based co-training is used to solve the problem of insufficient labeled data and to achieve sentence boundary detection without manual labels or speech recognition. Finally, a checking mechanism is proposed to solve the problem that it cannot otherwise be determined whether a detected boundary is a real sentence boundary: the proportion of vowels in the text is compared with that in the audio data after V/C/P classification. The mechanism picks out the real sentence boundaries from the boundaries detected by co-training, so that they can be used directly in subsequent processing and systems. The second innovation of this thesis is the realization of sentence segmentation with zero manual labels, together with the checking mechanism, which can identify and extract the real sentence boundaries.
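One way to read the checking mechanism is sketched below. The abstract gives no exact rule, so everything specific here is an assumption: vowels in the audio are counted as maximal runs of "V" in the per-frame V/C/P labels, the expected position of each sentence end is the cumulative share of vowels in the transcript, and a candidate boundary is accepted when the two shares agree within a hypothetical `tolerance`.

```python
def count_vowel_segments(frame_labels):
    """Count maximal runs of 'V' in a per-frame V/C/P label sequence."""
    count = 0
    previous = None
    for label in frame_labels:
        if label == "V" and previous != "V":
            count += 1
        previous = label
    return count

def verify_boundaries(frame_labels, candidates, text_vowels_per_sentence,
                      tolerance=0.05):
    """Keep candidate boundary frames whose audio vowel share matches the text.

    candidates: frame indices proposed as sentence boundaries.
    text_vowels_per_sentence: vowel count of each transcript sentence, in order.
    """
    total_audio = count_vowel_segments(frame_labels)
    total_text = sum(text_vowels_per_sentence)
    if total_audio == 0 or total_text == 0:
        return []
    # Cumulative vowel share at the end of every sentence except the last.
    expected = []
    running = 0
    for n in text_vowels_per_sentence[:-1]:
        running += n
        expected.append(running / total_text)
    accepted = []
    for frame in candidates:
        observed = count_vowel_segments(frame_labels[:frame]) / total_audio
        if any(abs(observed - share) <= tolerance for share in expected):
            accepted.append(frame)
    return accepted

# Toy example: two sentences with 2 and 3 vowels; frame 7 lies in the pause
# between them, frame 10 lies inside the second sentence.
frame_labels = "PVVCVVPPVVCVVCVVP"
accepted = verify_boundaries(frame_labels, [7, 10], [2, 3])
```

Here the candidate at frame 7 is kept and the one at frame 10 is rejected, because only the former splits the audio vowels in the same 2:3 proportion as the text.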
