
基于云模型的中文面向查询多文档自动文摘研究

Chinese Query-Focused Multi-document Summarization Based on Cloud Model

【Author】 陈劲光

【Supervisor】 何婷婷

【Author information】 Central China Normal University, Educational Technology, 2011, PhD

【Abstract】 With the spread of the internet, the amount of information online is enormous and growing constantly. For a simple user query, a search engine typically returns a ranked list of web pages the user might need, much of which is irrelevant or duplicated, so the user must spend considerable effort locating useful results. Query-focused multi-document summarization distills and reorganizes the content of a large set of query-related documents into a short summary of a given length, speeding up information access; the summary is generally expected to be concise, well organized, low in redundancy, and tailored to the user's needs. The technique lowers the difficulty of extracting information from massive data, speeds up acquisition and comprehension, and thus improves the efficiency with which users obtain and exploit information, strengthening their competitiveness in the information society.

The cloud model, proposed by Academician Li Deyi, is a qualitative-quantitative conversion model for handling the fuzziness and randomness of uncertain concepts and the relationship between them. Starting from the uncertainty of natural-language concepts, it underpins research on artificial intelligence with uncertainty. Although the cloud model originates from concepts in natural language, the literature collected so far shows that direct applications of the cloud model within natural language processing itself remain rare.

This dissertation studies query-focused multi-document summarization on Chinese corpora. It first builds an evaluation corpus and human-written reference summaries suitable for open evaluation; on that basis it applies the cloud model to content selection, sentence trimming, and sentence ordering, aiming to generate coherent summaries that are highly focused, concise, and readable; finally, a modified ROUGE tool is used for automatic evaluation of the Chinese summaries. The main research work and results are summarized as follows:

1. A cloud-model-based method for selecting summarization units is proposed, which considers both the randomness and the fuzziness of summarization units to improve the performance of query-focused multi-document summarization. The relevance between a summarization unit and the query is computed first: the relevance scores between the unit and each query word are treated as cloud drops, and by computing the uncertainty of the resulting cloud, units that are genuinely relevant to the query are identified. The query-relevance results are then corrected with importance in the document set: the similarities between a candidate sentence and the other candidate sentences are treated as cloud drops, and the numerical characteristics of the cloud are used to compute sentence importance, so that sentences covering as much of the document set as possible are selected rather than answering the query from a single aspect. To demonstrate the effectiveness of the method, experiments were run on large-scale public English corpora, and the system took part in an international open evaluation of automatic summarization with good results.

2. A Chinese summarization evaluation corpus and a Chinese automatic evaluation tool were built, and on this basis a Chinese query-focused multi-document summarization system based on the cloud model was constructed. The corpus consists of 1,000 documents, 100 document sets with queries, and 400 reference summaries. By modifying the source code of the English evaluation tool ROUGE, automatic ROUGE evaluation of Chinese summaries was realized. Fifty document sets were used as training data, with sentence splitting and word segmentation performed by the latest shared release of the Language Technology Platform from Harbin Institute of Technology; parameter training was then carried out on the test corpus with the help of the Chinese automatic evaluation tool; finally, Chinese summaries were generated with the cloud-model-based unit selection method, completing the Chinese cloud summarization system.

3. A Chinese sentence trimming method based on multi-dimensional clouds and dependency parsing is proposed to further improve summary quality. Trimming rules based on dependency parsing are first defined and applied to every candidate summary sentence, producing multiple compressed candidates; a multi-dimensional cloud then scores each candidate by jointly considering the distribution of words within sentences and across the document set as well as their relevance to the query, so that uncertainty is propagated effectively as the clouds are superposed; finally, the candidate carrying the most information in the shortest length replaces the original sentence, allowing the summary to carry more useful information.

4. A cloud-template method for ordering summary sentences is proposed to make the generated Chinese cloud summaries more coherent. The method treats every document in the set as a template and uses the cloud model to combine the orderings suggested by the individual documents, thereby avoiding both the single-template method's dependence on one document and the majority-ordering method's restriction to pairwise comparisons. The documents are first clustered with an adaptive incremental clustering method based on complex networks to find the sub-topics that contain one or more summary sentences; every document is then regarded as a template, and the cloud formed by these templates determines the relative positions of the sub-topics and of the summary sentences within them; finally, the sub-topics and the sentences inside each sub-topic are ordered in turn, yielding summaries that are more coherent and more readable.
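The unit-selection step can be pictured with a short sketch. The thesis treats the relevance scores between a summarization unit and each query word as cloud drops and scores the unit by the uncertainty of the resulting cloud; the (Ex, En, He) estimates below follow the standard backward cloud generator, while the final combination Ex - lam * He is only an assumed illustration, not the dissertation's actual formula.

```python
# Minimal sketch of scoring a summarization unit with a one-dimensional cloud.
# Assumption: the combination Ex - lam * He is illustrative only; the thesis's
# exact scoring formula is not given in the abstract.
import math
from statistics import mean, variance

def backward_cloud(drops):
    """Estimate the cloud's numerical characteristics (Ex, En, He) from a list
    of cloud drops, using the standard backward cloud generator that does not
    require certainty degrees."""
    if len(drops) < 2:                        # degenerate cloud: no spread
        return drops[0], 0.0, 0.0
    ex = mean(drops)                                                  # expectation
    en = math.sqrt(math.pi / 2) * mean(abs(x - ex) for x in drops)    # entropy
    he = math.sqrt(abs(variance(drops) - en ** 2))                    # hyper-entropy
    return ex, en, he

def unit_score(relevance_to_query_words, lam=0.5):
    """Score one summarization unit from its relevance to every query word:
    a high expectation with low hyper-entropy suggests the unit is genuinely
    and consistently related to the query."""
    ex, en, he = backward_cloud(relevance_to_query_words)
    return ex - lam * he

# e.g. a candidate word whose relevance to three query terms is fairly stable
print(unit_score([0.62, 0.58, 0.41]))
```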

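The Chinese evaluation relies on a modified version of the ROUGE scripts; those modifications are not reproduced here. As a rough illustration only, the sketch below computes ROUGE-N recall over character n-grams, which is one common way to make n-gram matching work on unsegmented Chinese text; the function names and the character-level tokenization are assumptions, not the thesis's implementation.

```python
# Illustrative ROUGE-N recall over Chinese character n-grams (not the thesis's tool).
from collections import Counter

def char_ngrams(text, n):
    """Character-level n-grams; avoids the whitespace tokenization that the
    original English ROUGE depends on."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))

def rouge_n_recall(candidate, references, n=2):
    """Overlapping n-grams between candidate and references, divided by the
    total number of n-grams in the references."""
    cand = char_ngrams(candidate, n)
    overlap = total = 0
    for ref in references:
        ref_counts = char_ngrams(ref, n)
        total += sum(ref_counts.values())
        overlap += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return overlap / total if total else 0.0

print(rouge_n_recall("云模型用于自动文摘", ["本文将云模型应用于自动文摘"], n=2))
```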
【Abstract】 The widespread use of the internet has led to the accumulation of a vast and ever-growing amount of information. For a simple query, a search engine typically returns a ranked series of web pages the user may be interested in. Since a large proportion of the results are repetitive or irrelevant, the user has to spend a lot of time looking for the information they actually need. Query-focused multi-document summarization was proposed to solve this problem: given a set of topic-related documents, a query topic consisting of several complex questions, and a user preference profile, it generates a brief, well-organized, fluent summary that answers the information need. Query-focused multi-document summarization aims to improve the efficiency of obtaining and using information and to increase the utilization of network information, thereby giving the user an advantage in today's information world.

The cloud model, first proposed by Academician Li Deyi, is an effective model for transforming qualitative concepts into quantitative expressions and vice versa. It represents the fuzziness and randomness of uncertain concepts and the relationship between them, and it approaches research on artificial intelligence with uncertainty by starting from the quantitative representation of qualitative concepts in natural language. Unfortunately, to the best of our knowledge, the cloud model has rarely been applied in Natural Language Processing (NLP).

This thesis is concerned with Chinese query-focused multi-document summarization based on the cloud model. First, a large-scale open-benchmark corpus together with human-written reference summaries is constructed. Then, to generate concise and fluent summaries that satisfy the user's needs, the cloud model is applied in the key stages of summarization: content unit selection, sentence compression, and sentence ordering. Lastly, the summaries are evaluated with ROUGE-CN, a modified version of ROUGE that evaluates Chinese summaries automatically.

The main contributions of the thesis are as follows:

First, a summarization unit selection method based on the cloud model is proposed. The cloud model is used to capture both the randomness and the fuzziness of the distribution of summarization units. When computing the relevance between a summarization unit and the query, the relevance scores between the unit and each query word are treated as cloud drops; by measuring the uncertainty of the resulting cloud, units that are genuinely relevant to the query receive higher scores. Importance within the document set is then used to assess each sentence's ability to summarize the set: the similarities between a sentence and all other sentences in the document set are treated as cloud drops that together form a cloud, which is used to find sentences covering as much of the document set as possible and to avoid under-representing it. To demonstrate the effectiveness of the proposed method, experiments were run on large-scale open English benchmark corpora, and the system participated in TAC (Text Analysis Conference) 2010 with satisfactory results.

Secondly, the thesis describes the construction of a large-scale Chinese query-focused multi-document summarization corpus and of the Chinese summarization system built on it. The corpus includes 1,000 documents, 100 document sets with queries, and 400 reference summaries. By modifying the source code of ROUGE, an automated evaluation tool for English, the thesis realizes automated evaluation of Chinese summaries. When constructing the Chinese system, 50 document sets are used as training data to tune the parameters of the summarization unit selection module.

Thirdly, a Chinese sentence compression method based on multi-dimensional clouds and dependency relations is proposed to further improve summary quality. A set of heuristic rules based on dependency analysis trims each sentence and produces multiple compressed candidates. The candidates are then scored by a multi-dimensional cloud model that accounts for the distribution of words among sentences and documents as well as their relevance to the query; compared with a one-dimensional cloud, the multi-dimensional cloud retains uncertainty while the clouds are superposed. The candidate that contains the most information in the shortest length replaces the original sentence in the summary, leaving room for more useful information.

Lastly, a sentence ordering method based on the cloud model is proposed to make the summaries more readable. The method takes every source document in a document set as an ordering template and combines the results of the different templates into a single ordering, so it neither depends on a single document, as the single-template method does, nor is limited to pairwise comparisons, as the majority-ordering method is. All sentences in the document set are first clustered into sub-topics with an adaptive incremental clustering method based on complex networks; every document is then regarded as a template, and together the templates determine the relative positions of sub-topics and sentences; finally, the sub-topics and the sentences within each sub-topic are ordered in turn to generate a more fluent and more readable summary.
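For the sentence compression step, the abstract only states that trimming rules are derived from a dependency parse and that each candidate sentence yields several compressed variants. The sketch below is a hypothetical illustration: it drops whole subtrees attached by relation labels such as "ADV" or "ATT" (LTP-style labels, assumed here), producing multiple candidates that would then be ranked by the multi-dimensional cloud; the thesis's actual rules are not reproduced.

```python
# Hypothetical dependency-based trimming; the droppable relations are assumptions.
from typing import List, Tuple

Token = Tuple[str, int, str]   # (word, head index, dependency relation); root head = -1

def subtree(tokens: List[Token], root: int) -> set:
    """Indices of `root` plus every token that depends on it, directly or not."""
    keep = {root}
    changed = True
    while changed:
        changed = False
        for i, (_, head, _) in enumerate(tokens):
            if head in keep and i not in keep:
                keep.add(i)
                changed = True
    return keep

def trim(tokens: List[Token], droppable=("ADV", "ATT")) -> List[str]:
    """Return the original sentence plus one compressed candidate for every
    droppable modifier subtree; candidates would later be ranked by the
    multi-dimensional cloud score."""
    candidates = ["".join(w for w, _, _ in tokens)]
    for i, (_, _, rel) in enumerate(tokens):
        if rel in droppable:
            removed = subtree(tokens, i)
            candidates.append("".join(w for j, (w, _, _) in enumerate(tokens)
                                      if j not in removed))
    return candidates
```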

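The cloud-template ordering can be pictured as follows: each source document acts as a template, the normalized position at which a sub-topic appears in each document is a cloud drop, and sub-topics are arranged according to the resulting cloud. Ordering by the plain expected position, as in the sketch below, is a simplification assumed here; the dissertation combines the templates through the cloud model's full numerical characteristics.

```python
# Simplified cloud-template ordering of sub-topics by expected position.
from statistics import mean

def order_subtopics(position_drops):
    """position_drops maps a sub-topic id to the normalized positions (0 = start
    of document, 1 = end) at which it appears in each document that mentions it.
    Sub-topics are returned sorted by the expectation (Ex) of their cloud."""
    ex = {topic: mean(drops) for topic, drops in position_drops.items() if drops}
    return sorted(ex, key=ex.get)

# e.g. sub-topic "background" tends to open the documents, "outlook" to close them
print(order_subtopics({
    "background": [0.05, 0.10, 0.08],
    "method":     [0.40, 0.55],
    "outlook":    [0.90, 0.85, 0.95],
}))
```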