Research on Key Techniques of Query-focused Multi-document Summarization (面向查询的多文档自动文摘关键技术研究)

【Author】 Zhao Lin (赵林)

【Advisor】 Wu Lide (吴立德)

【Author information】 Fudan University, Computer Application Technology, 2008, PhD

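【Abstract (translated from Chinese)】 With the rapid growth of the Internet and the ever-increasing volume of text, the pressing need to quickly find and obtain useful information from large amounts of data has made automatic summarization increasingly important. Automatic summarization means having a computer automatically condense the main content of one or more texts, so that much of the work users once had to do themselves is handed to the computer, saving the time users spend browsing and reducing their burden. The task touches many aspects of natural language processing, such as text understanding and text generation, and is highly challenging for computers. Against this background, this thesis presents exploratory research on automatic summarization techniques. It investigates in depth both query-focused multi-document summarization and the automatic evaluation of summary coherence. Building on our participation over the past two years in DUC, the international evaluation conference on summarization, we studied and implemented several query-focused multi-document summarization techniques. We adopted a maximum entropy model to build a machine-learning-based summarization system. To further uncover semantic associations among document sentences and between sentences and the query, we proposed a method of semantic expansion within the summarization system. Using the synonym sets (synsets) defined in WordNet and the semantic relations between words, the method semantically expands traditional word-based sentence vectors, thereby incorporating semantic information into the sentences, and system performance improved significantly compared with the system before expansion. The thesis also proposes a query expansion method based on a graph ranking algorithm; integrated into a query-focused summarization system, it effectively addresses the problem that the original query usually carries too little information. The method expands the query using contextual information on the basis of sentence-sentence and sentence-word relations, and can obtain more relevant information with less noise. The summarization system with query expansion performs clearly better than the system without it and achieved the best results to date on the DUC standard evaluation corpus, which demonstrates the effectiveness of the query expansion method. Another major aspect of summarization research is summary evaluation. Current automatic evaluation mainly examines the content coverage of a summary, whereas linguistic quality, such as readability and coherence, is assessed manually. Because manual evaluation consumes a great deal of human effort and lacks objectivity, it cannot be widely applied, so how to evaluate the linguistic quality of summaries automatically is an important research problem. This thesis proposes an automatic evaluation model for summary coherence. We studied the basic entity-based coherence model in depth with respect to features and entity selection: we refined the original entity transition features by taking into account information such as neighbors in the grid and non-adjacent sentences, analyzed the importance of entity selection in the model, and rebuilt the entity grid through latent semantic analysis, thereby improving the original model and obtaining higher accuracy in experiments.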
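
The graph-based query expansion mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the ranking algorithm, so everything below is an assumption made for illustration: the bipartite sentence-word graph, the random walk with restart seeded at the query words, and the hypothetical function name `expand_query` with its parameters `k`, `damping`, and `iters`. It is not the thesis's method, and it models only sentence-word links, not sentence-sentence links.

```python
from collections import defaultdict


def expand_query(sentences, query, k=2, damping=0.85, iters=50):
    """Return up to k expansion words ranked by a random walk with restart.

    sentences: list of tokenized sentences (lists of words).
    query: iterable of query words used as restart seeds.
    Illustrative assumption only; not the algorithm used in the thesis.
    """
    query = set(query)
    # Undirected bipartite graph: sentence nodes <-> the words they contain.
    neighbors = defaultdict(set)
    for i, sent in enumerate(sentences):
        for word in set(sent):
            neighbors[("s", i)].add(("w", word))
            neighbors[("w", word)].add(("s", i))

    seeds = {("w", w) for w in query if ("w", w) in neighbors}
    if not seeds:
        return []  # no query word occurs in the documents

    # The restart distribution puts all of its mass on the query words.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in neighbors}
    score = dict(restart)
    for _ in range(iters):
        new_score = {}
        for node in neighbors:
            # Each neighbor spreads its score evenly over its own edges.
            incoming = sum(score[m] / len(neighbors[m]) for m in neighbors[node])
            new_score[node] = (1 - damping) * restart[node] + damping * incoming
        score = new_score

    # Keep word nodes outside the original query, highest score first.
    candidates = [(s, n[1]) for n, s in score.items()
                  if n[0] == "w" and n[1] not in query]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in candidates[:k]]


docs = [
    ["solar", "panel", "energy"],
    ["solar", "energy", "cost"],
    ["football", "match", "score"],
]
print(expand_query(docs, ["solar"], k=1))
```

In this toy corpus, words that co-occur with the query term in several sentences accumulate more score than words that co-occur once, and words from unrelated sentences receive none. This mirrors the stated goal of gaining relevant terms with little noise.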

【Abstract】 With the rapid development of the Internet and the growing amount of text information, the need to search large volumes of text for useful information has made automatic summarization increasingly important. Automatic summarization means automatically condensing one or more documents into their main content, which saves users considerable time when browsing. The task involves many aspects of natural language processing and is a major challenge for computers. This thesis describes our research on automatic summarization techniques. We have worked extensively on query-focused multi-document summarization and on the automatic evaluation of summary coherence. Building on our participation in the DUC evaluations in recent years, we implemented several summarization systems. We use a conditional maximum entropy (CME) model for the machine-learning-based summarizer. Furthermore, to find the semantic relatedness between sentences and queries, we propose a semantic expansion method and apply it to the summarization system. In this method, sentence vectors are semantically expanded using the synsets and word relations defined in WordNet. Semantic information is thereby incorporated into the sentences, and the performance of the summarization system improves noticeably. We also propose a query expansion method based on a graph-based ranking algorithm, which is integrated into the query-focused summarization system to address the lack of information in the original query. The method uses context information to expand the query and can obtain more relevant information with less noise. The summarization system with query expansion performs significantly better than the system without it, and it achieves state-of-the-art performance on the DUC evaluation data. Another important problem is summary evaluation. Currently, the evaluation of linguistic quality relies on manual assessment, which is time-consuming, so it is important to develop automatic methods. We have studied the entity-based coherence model and improved it in both feature calculation and entity selection. Both changes improve on the base model and yield higher accuracy in the experiments.
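
To make the entity-based coherence model more concrete, the sketch below computes the kind of transition features such models commonly use. It assumes the usual entity-grid representation, in which each entity has one role per sentence (S for subject, O for object, X for other, - for absent). The role inventory, the toy grid, and the hypothetical function name `transition_features` are assumptions; the thesis's refined features and its entity selection based on latent semantic analysis are not reproduced here.

```python
from collections import Counter
from itertools import product

# Assumed role inventory: subject, object, other mention, absent.
ROLES = ("S", "O", "X", "-")


def transition_features(grid, length=2):
    """Return the relative frequency of every role transition of a given length.

    grid: dict mapping each entity to its list of roles, one per sentence.
    """
    counts = Counter()
    total = 0
    for roles in grid.values():
        # Slide a window of `length` consecutive sentences over the entity's row.
        for i in range(len(roles) - length + 1):
            counts[tuple(roles[i:i + length])] += 1
            total += 1
    # Emit every possible transition so that feature vectors align across texts.
    return {
        t: (counts[t] / total if total else 0.0)
        for t in product(ROLES, repeat=length)
    }


# Toy grid for a three-sentence summary with two entities.
toy_grid = {
    "summary": ["S", "S", "-"],
    "query": ["-", "O", "X"],
}
features = transition_features(toy_grid)
print(features[("S", "S")])  # 0.25: one of the four observed transitions
```

A coherence classifier or ranker would then be trained on these vectors. That step is omitted because the abstract does not describe it.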

  • 【Online publication contributor】 Fudan University
  • 【Online publication year and issue】 2009, Issue 03