Research on Topic-Oriented Document Summarization Techniques

【Author】 Liu Zhihua (刘治华)

【Advisors】 Wang Jingzhong (王景中); Zhang Huaping (张华平)

【Author information】 North China University of Technology (北方工业大学), Computer Software and Theory, 2011, Master's thesis

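【Abstract (translated from the Chinese)】 With the rapid development of the economy and society, and especially of the Internet, the volume of online information is growing explosively. Quickly obtaining useful information from this mass of data has become an urgent problem. Today's flourishing search engine technology mainly serves general-purpose information retrieval; for specific domains whose content is not suitable for public release, no mature system yet exists. This thesis addresses an important stage of the standardization process, namely the effective mining and retrieval of standards information, by implementing a vertical search engine for massive data. It focuses on the summarization technique used in the search engine (topic-oriented summarization), with the emphasis on balancing three goals: efficient summary extraction, relevance to the query topic, and coverage of the document's main content, so as to meet users' information needs. The work concentrates on improving summarization efficiency, extracting summaries based on keywords, and removing redundancy among summary sentences. For efficiency, the inverted list structure of search engines is used to compute the features of words and sentences, and a double-array Trie stores the segmentation dictionary and the user topic vocabulary in order to speed up word lookup. Keyword extraction is combined with summary extraction: accessor variety and the positional locality of words are introduced to improve the quality of high-frequency and low-frequency words, respectively. For redundancy removal, the thesis proposes the concept of the inclusion degree between sentences, which effectively deduplicates sentences in a document that stand in a containment relation, lowering the redundancy of the summary and improving its quality. In addition, the thesis implements a vertical search prototype system as an application scenario for topic-oriented summarization. Encoding compression, memory swapping, and buffer pool techniques were successfully applied in the working system, which is used for standards retrieval and organization search and is now online at the Institute of Standardization of Hebei Province and the Name Address Center of China Post. Automatic summarization, vertical search, and database connectivity are integrated into a complete, unified solution for managing standards information, bringing summarization and search technology to end users.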

【Abstract】 With the rapid development of the economy, society, and the Internet, the volume of network information is growing explosively. How to quickly obtain useful information from this mass of information has become a pressing problem. Current search engine technology is mainly used for general information retrieval, while for specific fields there is no mature system. This thesis studies summarization within search engine technology, focusing on efficiency, relevance to the query, and reflection of the document's main content. Automatic summarization is the core of the thesis, and the work covers three areas: improving summarization efficiency, introducing keyword extraction into summarization, and removing redundancy between summary sentences. To improve summarization efficiency, we introduce the inverted list structure to compute the features of words and sentences, and use a Double-Array Trie to store the segmentation dictionary and user thesauri, so as to speed up word lookup. Keyword extraction is combined with summary extraction, introducing accessor variety and the positional locality of words to improve the extraction of high-frequency and low-frequency words, respectively. To reduce redundancy between sentences, the thesis proposes the concept of inclusion between sentences. Using sentence inclusion lowers the probability that two sentences, one of which includes the other, are both extracted into the summary, thereby improving summary quality. In addition, the thesis implements a vertical search prototype system and applies query-focused summarization in vertical search. Coding compression, memory exchange, and memory cache techniques were successfully applied in the word segmentation system, and the system has been applied to standards retrieval and organization search and is online at the Institute of Standardization of Hebei Province and the Name Address Center of China Post. During the period of testing.
It integrates automatic summarization, vertical search, and database connectivity, providing a unified solution for standards management. The solution brings summarization and search technology to end users.
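
The abstract names sentence inclusion as the device for removing redundant summary sentences but does not give its formula. The following is only a minimal sketch under an assumed definition: the inclusion degree is taken to be the fraction of the shorter sentence's distinct tokens that also occur in the longer one. The function names `inclusion_degree` and `remove_included` and the `threshold` value are hypothetical illustrations, not the thesis's actual method.

```python
def inclusion_degree(a_tokens, b_tokens):
    """Assumed definition (not from the thesis): share of the shorter
    sentence's distinct tokens that also appear in the longer sentence."""
    a, b = set(a_tokens), set(b_tokens)
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0.0
    return len(shorter & longer) / len(shorter)


def remove_included(ranked_sentences, threshold=0.8):
    """Greedy filter over sentences already sorted by score: keep a sentence
    only if its inclusion degree with every kept sentence is below threshold.
    The threshold of 0.8 is an arbitrary illustrative value."""
    kept = []
    for tokens in ranked_sentences:
        if all(inclusion_degree(tokens, k) < threshold for k in kept):
            kept.append(tokens)
    return kept


# Toy input: pre-tokenized candidate sentences, best-scored first.
ranked = [
    "the system uses an inverted index".split(),
    "the system uses an inverted index for fast lookup".split(),
    "keywords are extracted first".split(),
]
summary = remove_included(ranked)
```

In this toy input the second sentence fully contains the first, so only the first and third sentences are kept. For Chinese text the tokens would come from a word segmenter rather than `str.split`.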
