
中文信息处理关键问题的研究

Research on Key Topics in Chinese Information Processing

【Author】 Zhu Chong (朱冲)

【Supervisor】 Zhang Xiangli (张向利)

【Author Information】 Guilin University of Electronic Technology (桂林电子科技大学), Signal and Information Processing, 2009, Master's thesis

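【Abstract (translated from Chinese)】 The level and volume of automatic computer processing of language and text has become one of the key measures of whether a country has entered the information society. Because of the inherent complexity of Chinese, the state of Chinese Information Processing (CIP) in China lags far behind the pace of the country's economic globalization in the 21st century. How to achieve effective understanding of Chinese natural language has therefore become a widely followed and highly challenging frontier research topic internationally. Addressing open problems in the field, this thesis focuses on lexical analysis and base phrase analysis at the syntactic level, on Chinese semantic processing, and on their application to information retrieval. The main contributions are as follows: 1. At the syntactic level, the thesis studies techniques for Chinese lexical analysis and base phrase analysis. It concentrates on the maximum entropy model, giving the necessary mathematical derivations and pseudocode for the IFS, SGC, GIS, and IIS algorithms. Taking the characteristics of Chinese into account, it proposes a base phrase analysis model that separates phrase boundary detection from phrase labeling, assumes the two steps are independent, and builds a separate maximum entropy model for each. The key to a maximum entropy model is selecting effective features; the thesis gives the feature spaces for both steps together with the feature selection procedure and algorithm. Experiments show that the model reaches 95.27% precision on phrase boundary detection and 96.20% precision on phrase labeling. 2. At the application level, the thesis studies the key problems that must be solved to bring Chinese information processing into information retrieval. It designs a Frequently-Asked Question (FAQ) system based on Latent Semantic Analysis (LSA) and describes the implementation of each submodule in detail. Within this system, a new semantic matching method is proposed for the natural language interface module, and a new topic relevance algorithm for a focused web crawler is proposed for the data collection subsystem. Experiments in the agricultural domain show that the FAQ system outperforms the FAQ-Finder system.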

【Abstract】 The quality and volume of automatic computer processing of language and character information is one of the important standards for judging whether a country has entered the information age. The level of Chinese Information Processing (CIP) in China cannot yet meet the needs of its developing global economy in the 21st century, so achieving effective understanding of Chinese is a real challenge and an active research field. In response to these challenges, this dissertation studies Chinese word analysis, basic phrase analysis, Chinese semantic processing, and the application of CIP to information retrieval systems. The major contributions of this dissertation are as follows: 1. Syntax layer: the dissertation introduces the basic theory, mathematical derivation, and algorithms (with pseudocode) of the maximum entropy (ME) method; these algorithms include IFS, SGC, GIS, and IIS. It then presents a basic Chinese phrase parsing model that separates the prediction of phrase boundary locations from phrase tagging and applies a maximum entropy model to each. The key to an ME model is selecting useful features, so the procedure and algorithms for feature selection, together with the feature space, are given. Experimental results show a precision of 95.27% for phrase boundary prediction and 96.20% for phrase tagging. 2. Application research: a general framework for a Frequently-Asked Question (FAQ) system based on Latent Semantic Analysis (LSA) is designed. It uses a new approach to semantic inference for FAQ mining, and its data-gathering system introduces a new approach to designing a focused web crawler based on an agriculture ontology. Experimental results indicate that the FAQ system outperforms the FAQ-Finder system in the agriculture domain.
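
The abstract names GIS (Generalized Iterative Scaling) among the algorithms used to fit maximum entropy models and treats phrase boundary prediction as a separate classification step. The following is only a minimal sketch of that idea, not the thesis's implementation. It assumes binary features of the form (context predicate, label), uses a hypothetical toy training set with invented part-of-speech predicates and the invented labels "B" (boundary) and "N" (no boundary), and omits the GIS correction feature for brevity.

```python
import math
from collections import defaultdict

def predict_proba(weights, labels, context):
    """Return p(label | context) under a log-linear (maximum entropy) model."""
    scores = {y: sum(weights.get((pred, y), 0.0) for pred in context)
              for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_gis(samples, iterations=200):
    """Fit feature weights with Generalized Iterative Scaling (GIS)."""
    labels = sorted({gold for _, gold in samples})
    # Empirical counts: how often each (predicate, label) pair is observed.
    empirical = defaultdict(float)
    for context, gold in samples:
        for pred in context:
            empirical[(pred, gold)] += 1.0
    # C is an upper bound on the number of features active on any event.
    # Simplification (assumption): no correction feature is added.
    C = max(len(context) for context, _ in samples)
    weights = {feat: 0.0 for feat in empirical}
    for _ in range(iterations):
        # Expected counts of each feature under the current model.
        expected = defaultdict(float)
        for context, _ in samples:
            probs = predict_proba(weights, labels, context)
            for y in labels:
                for pred in context:
                    if (pred, y) in weights:
                        expected[(pred, y)] += probs[y]
        # GIS update: move each weight by (1/C) * log(empirical / expected).
        for feat in weights:
            weights[feat] += math.log(empirical[feat] / expected[feat]) / C
    return weights, labels

# Hypothetical toy data: each context is a set of predicates about a token.
TRAIN = [
    ({"pos=NN", "next=VV"}, "B"),
    ({"pos=NN", "next=NN"}, "N"),
    ({"pos=JJ", "next=NN"}, "N"),
    ({"pos=VV", "next=NN"}, "B"),
    ({"pos=NN", "next=VV"}, "B"),
    ({"pos=JJ", "next=NN"}, "N"),
]

if __name__ == "__main__":
    weights, labels = train_gis(TRAIN)
    probs = predict_proba(weights, labels, {"pos=NN", "next=VV"})
    print(max(probs, key=probs.get), probs)
```

Under the two-step design described in the abstract, a second model of the same form could be trained separately for phrase tagging, with its own feature space; how the thesis actually defines those features is not stated here.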
