

Research on Relevant Techniques of Temporal Multi-document Summarization

【Author】 贺瑞芳 (He Ruifang)

【Advisors】 李生 (Li Sheng); 刘挺 (Liu Ting)

【Author information】 Harbin Institute of Technology, Computer Application Technology, 2009, Ph.D.

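【Abstract, translated from the Chinese】 The growth of the Internet has produced an explosion of multimedia information such as text, images, audio, and video. In an era of abundant information but comparatively scarce knowledge, people find themselves in a state of information anxiety. Moreover, as time passes, the related media information is continually updated and evolves. How to acquire and organize information effectively has become a major challenge in information processing. With information compression as its goal, this thesis focuses on text compression techniques.

Temporal multi-document summarization is a new direction in automatic summarization and a natural extension of traditional static multi-document summarization. Instead of a set of related documents from a single time period, it processes related document sets that span several periods. Its main goal is to summarize automatically, at a given compression ratio and from a temporal perspective, how the content of a series of news reports evolves, so that people can obtain information quickly. With the international evaluations DUC 2007 and TAC 2008, related research has drawn growing attention from government, industry, and academia. Temporal multi-document summarization has broad application prospects, including news search engines, competitive business intelligence analysis, and trend prediction, and it can create greater social value by continually meeting people's needs.

The research object of this thesis, series of news reports, has pronounced temporal characteristics, and static multi-document summarization within a single period can be regarded as a special case of temporal multi-document summarization. The research focus is therefore how to solve, in a temporal context, the two major problems of traditional static multi-document summarization: content selection and language quality control. Previous work has paid little attention to temporal information. This thesis concentrates on identifying temporal characteristics and applying them to explore extractive content selection methods for temporal multi-document summarization in depth, aiming to preserve the importance, novelty, and coverage of the summary content. The following problems are studied:

1. Recognizing and normalizing time expressions. Understanding the semantics of text is the ultimate goal of natural language processing, and temporal semantics is indispensable to understanding text. Time expression recognition and normalization are the basis of temporal semantic annotation. This work lays a foundation for content selection and language quality control in temporal multi-document summarization, and it can also support other temporal information extraction applications.

2. Content selection based on a macro-micro importance discriminative model. Following the principle of stepwise refinement, the thesis first assumes that the time slices of a series of news reports are mutually independent. By analyzing the evolving macro-level and micro-level temporal characteristics of the reports, it explores a content selection method for temporal multi-document summarization based on a macro-micro importance discriminative model.

3. Topic-oriented content selection based on evolutionary manifold ranking. Going a step further, a series of news reports evolves continuously along the time axis. Assuming that the content evolution in the current time slice depends on the topic content of earlier time slices, the thesis studies how dynamically enriching the topic description, so that it expresses the changing information needs that arise as the user's interests are updated, affects content selection. It proposes an evolutionary manifold ranking algorithm guided by an iterative feedback mechanism to model the dynamics of topic evolution in a series of news reports, providing a temporally adaptive importance ranking for content selection.

4. Optimizing topic-oriented content selection with spectral clustering. Building on evolutionary manifold ranking, the thesis uses normalized spectral clustering to improve the coverage of content selection and designs a temporal redundancy removal strategy to give the summary content better novelty. It explores an optimized content selection method for temporal multi-document summarization that combines sub-topic ranking with the new redundancy removal strategy. On the Update Summarization task of the international evaluation TAC 2008, the method ranked among the top systems in content selection performance, which demonstrates its strength.

This thesis is a preliminary exploration of temporal multi-document summarization and its content selection techniques. The proposed methods are language independent, achieve some results, and lay a foundation for further research.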
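As a rough illustration of item 1 above (time expression recognition and normalization), the following Python sketch finds a few time expressions and maps them to ISO dates. It is not the thesis's method: the rule set, the names `RELATIVE_DAYS` and `normalize_time_expressions`, and the choice of the report's publication date as the reference date are all assumptions made for this example.

```python
import re
from datetime import date, timedelta

# Hypothetical toy rules; a real recognizer would cover far more patterns.
RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
RELATIVE = re.compile(r"\b(today|yesterday|tomorrow)\b", re.IGNORECASE)


def normalize_time_expressions(text, reference_date):
    """Return (surface string, ISO date) pairs found in text.

    Explicit dates are listed first, then relative expressions, which are
    resolved against reference_date (assumed here to be the publication
    date of the news report).
    """
    results = []
    for match in ISO_DATE.finditer(text):
        year, month, day = (int(g) for g in match.groups())
        try:
            results.append((match.group(0), date(year, month, day).isoformat()))
        except ValueError:
            continue  # looks like a date but is not a valid calendar day
    for match in RELATIVE.finditer(text):
        offset = RELATIVE_DAYS[match.group(1).lower()]
        resolved = reference_date + timedelta(days=offset)
        results.append((match.group(0), resolved.isoformat()))
    return results


pairs = normalize_time_expressions(
    "The storm hit yesterday; talks resume on 2008-09-03.",
    date(2008, 9, 1),
)
```

Anchoring every expression to a calendar date is what would let sentences from different reports be placed on a common timeline before content selection.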

【Abstract】 The development of the Internet has produced explosive growth in multimedia information such as text, images, audio, and video. In an era of abundant information and a relative lack of knowledge, people fall into a kind of information anxiety. As time goes on, the relevant multimedia information is also gradually updated and evolves. How to acquire and organize information effectively has become a challenge in information processing. With information compression as the goal, this thesis focuses on text compression technology.

Temporal multi-document summarization (TMDS) is a new direction in automatic summarization. It is a natural extension of multi-document summarization that captures the evolving information of a single topic over time. Its greatest difference from traditional static multi-document summarization is that it deals with a dynamic collection that goes beyond a single period, that is, a collection of relevant documents across periods. It mainly aims to summarize a series of news reports automatically so as to help people acquire the evolving content efficiently. With the international evaluations DUC 2007 and TAC 2008, the relevant research has received growing attention from industry, academia, and government. TMDS has broad application prospects: it can be used in news search engines, commercial intelligence analysis, and trend prediction, and it can bring great social value by satisfying people's needs.

The research object of this thesis, series of news reports, has strong temporal characteristics. Static multi-document summarization within a single period can be regarded as a special case of TMDS. Therefore, the research focus of TMDS is how to resolve the two difficult problems of static multi-document summarization, content selection and language quality control, in a temporal context. Previous research rarely considers temporal information. This thesis focuses on recognizing temporal characteristics and using them to mine extractive content selection methods for TMDS in depth, while trying to keep the summary content important, novel, and broad in coverage. The main research problems are as follows:

1. Time expression recognition and normalization. Understanding the semantics of text is the ultimate goal of natural language processing, and temporal semantics is necessary for understanding text. Time expression recognition and normalization are the basis of temporal semantic labeling. They build a foundation for content selection and language quality control in TMDS and also support other temporal information extraction applications.

2. Content selection based on a macro-micro importance discriminative model. Following the principle of stepwise refinement, we first assume that the time slices in a series of news reports are independent. A content selection method for TMDS based on a macro-micro importance discriminative model is explored by analyzing the evolving macro and micro temporal characteristics.

3. Topic-oriented content selection based on evolutionary manifold ranking. A series of news reports evolves continuously along the timeline. As a further step, we assume that content evolution in the current time slice depends on the topic content of previous time slices. We study how to enhance the expressive capability of the static query so that it embodies the dynamic evolution of the query, and how these changes influence content selection. We propose evolutionary manifold ranking based on an iterative feedback mechanism in order to model the dynamic characteristics of topic evolution in a series of news reports. It provides a temporally adaptive ranking algorithm for content selection in TMDS.

4. Topic-oriented content selection optimization strengthened by spectral clustering. Building on evolutionary manifold ranking, we adopt normalized spectral clustering to improve content coverage and design a temporal redundancy removal strategy to make the summary content more novel. We explore an optimized content selection method that combines sub-topic ordering with the new redundancy removal strategy. In the update summarization task of TAC 2008, our system ranked among the top systems in content selection performance, which demonstrates the strength of the approach.

This thesis explores TMDS and its content selection and makes some progress. The proposed methods are language independent and lay a foundation for future work.
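The manifold ranking idea behind item 3 can be sketched as follows. This is the basic, static formulation of manifold ranking (iterating f ← αSf + (1 − α)y over a normalized sentence similarity graph), not the evolutionary variant with iterative feedback that the thesis proposes. The function names, the bag-of-words cosine similarity, and the parameter values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def manifold_rank(topic, sentences, alpha=0.6, iterations=100):
    """Score sentences by relevance to the topic via basic manifold ranking.

    Node 0 is the topic description; nodes 1..n are the sentences.
    """
    bags = [Counter(t.lower().split()) for t in [topic] + sentences]
    n = len(bags)
    # Affinity matrix W with a zero diagonal.
    w = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    degree = [sum(row) for row in w]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    s = [[w[i][j] / math.sqrt(degree[i] * degree[j])
          if degree[i] and degree[j] else 0.0
          for j in range(n)] for i in range(n)]
    y = [1.0] + [0.0] * (n - 1)  # only the topic node carries a prior score
    f = y[:]
    for _ in range(iterations):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f[1:]  # scores for the sentences only


scores = manifold_rank(
    "earthquake rescue effort",
    [
        "rescue teams reached the earthquake zone",
        "the stock market closed higher",
        "survivors thanked the rescue teams",
    ],
)
```

In this toy input the first sentence receives the highest score and the off-topic second sentence the lowest. One plausible reading of the abstract is that a temporal variant would carry information from earlier time slices into the topic node before ranking the current slice, but the abstract does not give those details.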
