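Research on Composite Text Indexing Based on N-gram Analysis and Word Frequency Statistics

【Author】 杲晓锋

【Advisor】 李培

【Author information】 Nankai University, Information Science, 2009, Master's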

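【Summary】 The development of science and technology has brought humanity into an intelligent information society in which information is an important resource, but it has also brought explosive growth and unbounded expansion of information resources. Faced with this volume, information processing has become the key means by which people make effective use of information. An important task in information processing is to produce concise and accurate indexing from the content of the original text, because the quality of indexing determines, to a certain extent, the effectiveness of information processing and therefore the value of the information to its users. Against this background, developing low-cost, high-efficiency indexing methods is crucial. This paper therefore focuses on automatic indexing techniques and methods, takes the automatic indexing of text information as its research object, and combines comparative analysis with experimental analysis to study N-gram indexing and word frequency statistics indexing. On this basis it proposes a new indexing method: composite text indexing based on N-gram analysis and word frequency statistics. The main content is as follows.

First, starting from an introduction to text and automatic indexing, the paper systematically reviews and summarizes the development of automatic indexing research, briefly covering three aspects: the macro-level classification of the basic theories of automatic indexing, a survey of representative methods that were both innovative and influential, and a roadmap of automatic indexing research. It then identifies the problems in the development of automatic indexing and possible solutions, which leads to the research topic of composite indexing.

Second, the paper analyzes and compares word frequency statistics indexing and N-gram indexing fairly comprehensively and systematically from three angles (principle, method, and implementation process), and explains that the two methods are consistent in essence and complementary in procedure. By introducing two tools, conditional probability from statistics and information entropy from information theory, it combines N-gram indexing and word frequency statistics indexing into a single method. The resulting composite text indexing method based on N-gram analysis and word frequency statistics has the advantages of both; the paper describes it in detail and gives a concrete implementation process.

Finally, using experimental analysis and comparative experiments, the paper argues from a practical standpoint for the theoretical correctness of the proposed method and for its feasibility and effectiveness in application; the experimental results support the method. The research therefore has a certain degree of novelty and offers some reference and guidance for others studying the combination of automatic indexing methods.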

【Abstract】 With the development of science and technology, information has become an important resource in the modern information society, and information resources are growing explosively and expanding without limit. Information processing is the key means of making effective use of this volume of information. An important task in information processing is to generate concise and accurate indexing, because the quality of indexing determines, to some extent, the effectiveness of information processing and the value of the information to its users. Against this background, it is very important to develop automatic indexing methods with low cost and high efficiency.

This paper therefore centers on automatic indexing technologies and methods and takes text indexing as its object of study. Combining comparative analysis with experimental analysis, it examines N-gram indexing and word frequency statistics indexing and proposes a new combined method of automatic text indexing based on N-gram analysis and word frequency statistics. The main content is as follows.

First, starting from an introduction to text and automatic indexing, the paper reviews and summarizes automatic indexing research in three respects: the macro-level classification of its basic theory, its representative methods, and its research roadmap. It then points out the problems in the development of automatic indexing and possible solutions, which lead to the research topic of combined automatic indexing.

Second, the paper compares N-gram indexing and word frequency statistics indexing comprehensively and systematically in terms of theory, approach, and implementation process, and shows that the two are consistent in essence and complementary in approach. The author then presents a new combined method of automatic text indexing based on N-gram analysis and word frequency statistics, which unites the two by introducing two tools: conditional probability from statistics and entropy from information theory.

Finally, to verify the theoretical validity of the new method and its feasibility and effectiveness in application, a detailed implementation plan and process were realized as a computer program. Comparative experiments showed that the method has a certain advantage in automatic indexing performance.

The research therefore has a certain degree of novelty, and the method may serve as a reference and guide for further study of combined automatic indexing methods.
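The abstract names the ingredients of the combined method (N-gram analysis, word frequency, conditional probability, and entropy) but does not give its formulas. The Python sketch below is only an illustration of how such statistics can be computed for character N-grams and merged into one ranking. Every function name and the scoring formula are hypothetical assumptions made for this example; they are not the author's method.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every contiguous character n-gram in text (word-frequency step)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def conditional_probability(grams, prefixes, gram):
    """Estimate P(last char | preceding chars) as count(gram) / count(prefix)."""
    prefix_count = prefixes[gram[:-1]]
    return grams[gram] / prefix_count if prefix_count else 0.0


def right_entropy(text, gram):
    """Shannon entropy (bits) of the characters that directly follow gram."""
    size = len(gram)
    followers = Counter(
        text[i + size]
        for i in range(len(text) - size)
        if text[i:i + size] == gram
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


def candidate_terms(text, n=2, min_count=2):
    """Rank n-grams by an ASSUMED score: count * cond. prob. * (1 + entropy).

    The formula is an arbitrary illustration, not taken from the thesis.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that each gram has a prefix")
    grams = ngram_counts(text, n)
    prefixes = ngram_counts(text, n - 1)
    scored = []
    for gram, count in grams.items():
        if count < min_count:
            continue  # drop rare grams, as a plain frequency threshold would
        score = (
            count
            * conditional_probability(grams, prefixes, gram)
            * (1.0 + right_entropy(text, gram))
        )
        scored.append((gram, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `candidate_terms(document_text, n=2)` returns `(gram, score)` pairs, highest score first. In this sketch a high conditional probability suggests the characters of a gram cohere, while a high following-character entropy suggests the gram ends at a natural boundary; how the thesis actually weighs these signals is not stated in the abstract.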

  • 【Online publication submitter】 Nankai University
  • 【Online publication year/issue】 2010, Issue 07