中文词法分析的研究及其应用

The Research and Applications of Chinese Lexical Analysis

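【Author】 Sun Xiao (孙晓)

【Advisors】 Gao Qingshi (高庆狮); Huang Degen (黄德根)

【Author information】 Dalian University of Technology, Computer Application Technology, 2010, PhD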

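【Abstract (translated from Chinese)】 In machine translation and other natural language processing tasks, identifying and processing words is the most critical foundational step for Asian languages such as Chinese and Japanese. The problems in this step have still not been fully solved, which limits the accuracy and efficiency of machine translation and other natural language processing tasks. Besides word segmentation, Chinese lexical analysis also includes basic steps such as part-of-speech (POS) tagging and the identification and POS tagging of unknown words (new words); these are the main obstacles to improving its performance and accuracy. First, to address these problems, a new word-lattice-based method for Chinese lexical analysis is proposed that fuses word-level and character-level information. The method uses the system lexicon to build a word lattice containing all candidate segmentation and POS tagging paths. Candidate unknown words and their POS tags are identified at the same time and added to the lattice, which reduces the computational complexity of unknown word identification. A word-based conditional random field (CRF) model, combined with global feature templates defined over the whole input path, then selects the final segmentation and POS tagging result from the lattice. Word-based CRF decoding is faster than character-based CRF decoding and reduces the impact of the label bias and length bias problems; tests on open and closed corpora such as SIGHAN-6 gave satisfactory results. In addition, for comparison, the character-based Chinese word segmentation model is studied further: several external dictionaries and corresponding features are introduced, which further improves its segmentation accuracy. To meet the need for high-efficiency lexical analysis, an integrated Chinese lexical analysis method based on a longest and second-longest matching algorithm is also proposed; because its encoding and decoding are based on a hidden Markov model, training and analysis are fast. Second, for the identification and tagging of unknown words, a hidden-state semi-Markov conditional random field model (Hidden semi-CRF) is proposed, which identifies unknown words and their POS tags simultaneously. The Hidden semi-CRF model combines the strengths of the latent-dynamic conditional random field model (LDCRF) and the semi-Markov conditional random field model (semi-CRF), with lower computational cost and higher identification accuracy than semi-CRF. Unknown words and their POS tags identified by the Hidden semi-CRF model are added to the word lattice and take part in the overall path selection, which improves the overall accuracy of lexical analysis. Finally, the results of Chinese lexical analysis are applied directly to a Super Function-based Chinese-Japanese machine translation system, and the original Super Function is extended in three ways: it is divided into sentence-oriented and phrase-oriented Super Functions, the scope of its variables is widened, and an efficient matching algorithm for searching similar Super Functions is proposed. The extended Super Function reduces the number of Super Functions in the library, speeds up the retrieval of matching Super Functions, and improves translation accuracy and quality.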
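The word-lattice step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy lexicon, the per-word costs, the `unk` fallback tag, and the function names `build_lattice` and `best_path` are all assumptions made for illustration, and a simple additive cost stands in for the word-based CRF with global features.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> list of (POS tag, cost); lower cost is better.
LEXICON = {
    "研究": [("v", 2.0), ("n", 2.5)],
    "研究生": [("n", 3.0)],
    "生命": [("n", 2.5)],
    "生": [("v", 4.0)],
    "命": [("n", 4.0)],
    "的": [("u", 1.0)],
    "起源": [("n", 2.5)],
}
UNKNOWN_COST = 10.0  # assumed penalty for a character missing from the lexicon


def build_lattice(sentence, lexicon, max_len=4):
    """Return edges[start] = [(end, word, tag, cost), ...] for all lexicon matches."""
    edges = defaultdict(list)
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            for tag, cost in lexicon.get(word, []):
                edges[start].append((end, word, tag, cost))
        # Single-character fallback keeps the lattice connected.
        if sentence[start] not in lexicon:
            edges[start].append((start + 1, sentence[start], "unk", UNKNOWN_COST))
    return edges


def best_path(sentence, edges):
    """Dynamic programming over the lattice: lowest total cost from 0 to len(sentence)."""
    n = len(sentence)
    best = [float("inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for start in range(n):
        if best[start] == float("inf"):
            continue
        for end, word, tag, cost in edges[start]:
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = (start, word, tag)
    # Follow back-pointers from the end of the sentence.
    path, pos = [], n
    while pos > 0:
        start, word, tag = back[pos]
        path.append((word, tag))
        pos = start
    return path[::-1]


sentence = "研究生命的起源"
lattice = build_lattice(sentence, LEXICON)
print(best_path(sentence, lattice))
```

In this sketch the lattice holds both the 研究 / 生命 reading and the 研究生 / 命 reading, and segmentation and tagging are chosen jointly by one search over the lattice. A CRF would replace the fixed costs with feature-based scores that also depend on neighbouring words and tags.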

【Abstract】 Words are the smallest meaningful units that can be used independently, and lexical analysis is the basic step for syntactic tagging, semantic tagging and other deeper corpus processing. Most natural language processing systems, such as machine translation, speech synthesis, information extraction and document retrieval, treat the word as the basic processing unit, so correct lexical analysis is of great significance. In machine translation and other natural language processing tasks, the identification of words has been, and still is, problematic for Chinese and other Asian languages such as Japanese. Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese text (Chinese word segmentation) is an essential task in Chinese language processing. Besides word segmentation, Chinese lexical analysis also needs to assign part-of-speech (POS) tags to words and detect unknown words.
First, we propose a pragmatic Chinese lexical analyzer that integrates word-level and character-level information, based on the conditional random fields (CRFs) model. The word lattice, which represents all candidate outputs, is built from the system lexicon. A linear-chain CRF then selects the final token sequence from the word lattice using rich and flexible predefined features. This pragmatic method based on hybrid CRF models offers a solution to long-standing problems in corpus-based or statistical, word-based or character-based Chinese lexical analysis. For comparison, we also extend character-based Chinese lexical analysis: several external dictionaries are added to the system and corresponding features are introduced. We used this model to participate in the SIGHAN-6 bakeoff and obtained satisfactory results. To meet the demand for efficiency, we build an integrated Chinese lexical analyzer based on the maximum matching and second-maximum matching algorithm, with encoding and decoding performed by a hidden Markov model (HMM); this integrated model therefore has higher training and testing speed.
Secondly, for unknown words in real-world text, we propose a hidden semi-CRF model, which combines the strengths of the latent-dynamic CRF (LDCRF) and the semi-CRF. The hidden semi-CRF, which incorporates character-level and word-level features, is invoked when no matching word can be found in the lexicon, and it detects unknown words and their POS tags simultaneously.
Thirdly, based on the results of the pragmatic Chinese lexical analyzer, we build an extended Super Function-based Chinese-Japanese machine translator. We extend the original Super Function in three ways: the Super Function is divided into Super Functions for sentences and Super Functions for phrases; the scope of the variables is extended; and a matching algorithm for Super Functions is proposed. With the extended Super Function, fewer Super Functions need to be stored in the database, and the precision of Chinese-Japanese machine translation is maintained.
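The maximum matching idea mentioned in the abstract can be sketched as follows. This is only a plain dictionary-based illustration under stated assumptions: the word set `WORDS` and the function names are hypothetical, and the abstract does not say how the thesis combines the longest and second-longest candidates with the HMM, so that part is not modelled here.

```python
# Hypothetical toy word list; a real system would load a full lexicon.
WORDS = {"研究", "研究生", "生命", "命", "的", "起源"}


def longest_matches(sentence, start, lexicon, max_len=4, k=2):
    """Return up to k lexicon words starting at `start`, longest first."""
    matches = []
    for end in range(min(len(sentence), start + max_len), start, -1):
        word = sentence[start:end]
        if word in lexicon:
            matches.append(word)
            if len(matches) == k:
                break
    return matches


def forward_max_match(sentence, lexicon, max_len=4):
    """Greedy left-to-right segmentation that always takes the longest match."""
    words, pos = [], 0
    while pos < len(sentence):
        found = longest_matches(sentence, pos, lexicon, max_len, k=1)
        # Fall back to a single character when nothing in the lexicon matches.
        word = found[0] if found else sentence[pos]
        words.append(word)
        pos += len(word)
    return words


sentence = "研究生命的起源"
print(longest_matches(sentence, 0, WORDS))  # longest and second-longest candidates
print(forward_max_match(sentence, WORDS))
```

Greedy longest matching alone picks 研究生 at the start of this sentence, while the second-longest candidate 研究 leads to the more natural reading. Keeping both candidates gives a later statistical model such as an HMM a small set of alternatives to choose between.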
