节点文献

快速精确的结构化机器学习方法研究

Research on Fast Exact Structured Learning

【作者】 钱线

【导师】 吴立德;

【作者基本信息】 复旦大学 , 计算机应用技术, 2010, 博士

【摘要】 相比于普通的机器学习算法,结构化机器学习可以利用结构信息达到更好的效果,但其时间复杂度要高很多,虽然有快速的近似解法,但精度的损失一定程度上抵消了结构信息带来的好处,因此研究快速精确的结构化机器学习算法成了一个重要的课题。本文中,我们对结构化机器学习中的推断算法以及特征抽取两个重要环节进行改进。首先,我们针对序列标注问题,基于许多实际应用中高阶特征信息的稀疏性特点,提出了稀疏高阶的条件随机场模型和一种新的快速精确的推断算法,它可以同时处理局部特征和稀疏的高阶特征。由于稀疏性的存在,这种新的推断算法是十分高效的。在手写体识别任务上,我们采用词缀特征作为高阶特征,稀疏高阶的条件随机场模型达到了所有公开的实验结果中最高的精度。在中文组织机构名识别任务上,我们将人工抽取的规则转化为高阶特征,并取得了微软亚洲研究院数据集上第二名的成绩。这两个实验表明,在特征集相同的情况下,稀疏高阶的条件随机场模型明显优于其他的方法。其次,我们提出了一种新的特征字符串索引结构以加速特征抽取,从而缩短解码时间。现在许多结构化机器学习方法采用模板生成数以百万千万的特征。复杂的模板可以产生大量复杂的特征,从而提高了精度,但却需要更多特征抽取的时间,大大影响了解码速度。为此,我们提出了两维的Trie结构,该结构可以利用模板之间的相互关系提高特征抽取的速度:一个模板生成的特征字符串是它的扩展模板生成的特征字符串的前缀,因此前一个特征字符串的索引号可以用来检索后一个特征字符串,从而节约了时间。我们将这种新的数据结构用在基于图模型的依存句法分析的任务上。在中文宾州树库上的实验表明,两维Trie的特征抽取速度是传统Trie的5倍,整个句法分析的解码速度是后者的4.3倍。

【Abstract】 Structured learning models owe a great part of their success to the ability in using structured information. However, these methods are more time consuming than non-structured learning models. Though approximate algorithms reduce the computational complexity, they degrade the accuracy to some extent. Therefore, exploring fast extract algorithms has important role in structured learning.In this paper, we improve two aspects of structured learning:inference and feature extraction. First, for sequence labeling tasks, we proposed sparse higher order Conditional Random Fields(SHO-CRFs) based on the characteristics of sparseness of higher order features in many real applications, together with a novel extract tractable inference algorithm which is able to deal with local and sparse higher order features. SHO-CRFs are practically very efficient due to fea-ture sparseness. In optical character recognition task, we use word affixes as higher order features, SHO-CRFs achieve the highest reported accuracy. In Chi-nese organization name recognition task, we achieve the second highest F1 score on Microsoft Research Asia corpus. Both experimental results show that, with the same feature set, SHO-CRFs significantly outperform other state of the art methods.Second, we propose a novel feature string indexing structure for fast feature extraction to reduce decoding time. Many modern structured learning methods adopt templates to generate millions features. Complicated templates bring about abundant features which lead to higher accuracy but more feature extraction time. We proposed two-dimensional Trie(2D Trie), a novel data structure which takes advantages of relationship between templates for fast feature extraction:feature strings generated by a template are prefixes of the features from its extended templates. Experimental results on Chinese Tree Bank 6 corpus show that,2D Trie is about 5 times faster than traditional Trie, leading parsing time 4.3 times faster.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2010年 11期
节点文献中: