文本自动摘要和信息抽取方法及其应用研究

Study on Methods and Their Applications of Text Automatic Summarization and Information Extraction

【Author】 Liu Na (刘娜)

【Advisor】 Lu Mingyu (鲁明羽)

【Author information】 Dalian Maritime University, Computer Application Technology, 2012, PhD

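【Abstract, translated from the Chinese】 As text data, and web page content in particular, continues to surge, how to quickly and automatically extract the main or important information contained in massive text collections has become a hot research problem, which in turn has spurred the rapid development of text-oriented information extraction technology. Text summarization extracts the discourse structure and main information of a text and automatically generates a summary of a single document or of multiple documents, so it can be regarded as a kind of information extraction. Information extraction in the usual sense, by contrast, mainly extracts the specific important information in a text that the user needs. Targeting evidence-based medicine (EBM) web pages, combined with other types of training text, this thesis focuses on methods for automatic text summarization and information extraction. To address unsatisfactory extraction results, unclear topic segmentation, paragraph clustering algorithms that are sensitive to initial values, and the need to set the number of clusters manually, it proposes a series of novel methods and models.

(1) An information extraction method that combines paragraph features with a hidden Markov model (HMM) is proposed. Unlike other information extraction methods, it takes paragraphs rather than words as the unit of study. After preprocessing, web page content is stored as a text sequence of paragraphs; each paragraph is converted into a specific string, and these strings serve as the observable variables of the HMM. Experiments show that extraction with paragraphs as the observation sequence outperforms extraction with words as the observation sequence in both precision and recall.

(2) Documents are segmented by topic in preparation for summary generation. Paragraphs are represented in the vector space model, mutual information is used to measure the association between consecutive paragraphs, and weakly associated paragraphs are taken as segment boundaries. Because manually defined parameters can adversely affect the segmentation to some degree, a genetic algorithm is used to optimize the parameter thresholds that arise during segmentation. Experiments show that topic segmentation combining mutual information with a genetic algorithm achieves good precision.

(3) The basic steps of word-document spectral clustering are analyzed to identify the root cause of its sensitivity to initial values, and a word-document spectral clustering method based on fuzzy K-harmonic means is proposed. The method has two parts. First, the Laplacian matrix used in spectral clustering is processed from the standpoint of matrix similarity so that it satisfies the condition for insensitivity to initial values. Second, by introducing fuzziness, the fuzzy K-harmonic means algorithm replaces K-means so that the clustering result is insensitive to initial values. Experiments show that the method not only makes the clustering result insensitive to initial values but also improves the clustering result to some degree.

(4) Morphological methods are used to determine the number of clusters, and word-document spectral clustering is improved accordingly. Determining the number of clusters takes three steps: first, the matrix produced by word-document spectral clustering is converted into a VAT grayscale image; second, the VAT image is filtered with image processing techniques such as grayscale morphology, image binarization, and the distance transform; third, a signal is built from the filtered VAT image and smoothed, and the number of clusters in the document set is determined from the number of peaks and troughs in the smoothed signal. Experiments show that this method improves the clustering performance of word-document spectral clustering.

(5) Building on the LDA topic model, Titled-LDA, a multi-document summarization algorithm based on topic fusion, is proposed. Because a document's title is a strong indicator of what its summary should contain, separate topic models are built for the title and the body of each document, and the two models are fused. During fusion, adaptive asymmetric learning based on the information entropy of the two forms weights their topic distributions; the fused model appropriately links title and body information and therefore helps improve summary quality. Experiments show that Titled-LDA achieves good results on the DUC benchmark datasets.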
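
Contribution (1) describes turning each paragraph into a symbol and using the symbol sequence as the observations of a hidden Markov model. The following is only a minimal sketch of that idea, not the thesis's implementation: the featurizer `paragraph_token`, the state names (`HEADING`, `RESULT`, `TEXT`), the symbol names, and every probability are hypothetical values chosen for illustration. Decoding uses the standard Viterbi algorithm in log space and assumes a non-empty observation sequence.

```python
import math

def paragraph_token(paragraph: str) -> str:
    """Map a paragraph to a coarse observation symbol (hypothetical features)."""
    words = paragraph.split()
    if len(words) <= 6 and not paragraph.rstrip().endswith("."):
        return "SHORT"      # short line without a final period
    if any(ch.isdigit() for ch in paragraph):
        return "NUMERIC"    # longer paragraph that contains digits
    return "PROSE"          # everything else

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for a non-empty obs list."""
    # best[t][s] = (log-probability of the best path ending in s at step t,
    #               state at step t-1 on that path)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + math.log(emit_p[s][o]), prev)
        best.append(row)
    # Backtrack from the best final state through the stored predecessors.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    return path[::-1]

# Illustrative model parameters (all made up, not taken from the thesis).
STATES = ["HEADING", "RESULT", "TEXT"]
START_P = {"HEADING": 0.6, "RESULT": 0.2, "TEXT": 0.2}
TRANS_P = {
    "HEADING": {"HEADING": 0.1, "RESULT": 0.3, "TEXT": 0.6},
    "RESULT":  {"HEADING": 0.2, "RESULT": 0.5, "TEXT": 0.3},
    "TEXT":    {"HEADING": 0.2, "RESULT": 0.3, "TEXT": 0.5},
}
EMIT_P = {
    "HEADING": {"SHORT": 0.8, "NUMERIC": 0.1, "PROSE": 0.1},
    "RESULT":  {"SHORT": 0.1, "NUMERIC": 0.7, "PROSE": 0.2},
    "TEXT":    {"SHORT": 0.1, "NUMERIC": 0.2, "PROSE": 0.7},
}

PARAGRAPHS = [
    "Aspirin for primary prevention",
    "The trial enrolled 1200 patients over 5 years.",
    "These findings suggest a modest benefit for older adults.",
]

tokens = [paragraph_token(p) for p in PARAGRAPHS]
labels = viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P)
print(list(zip(tokens, labels)))
```

In a real system the transition and emission probabilities would be estimated from labeled pages rather than written by hand, and the paragraph featurizer would be considerably richer than the three symbols used here.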

【Abstract】 With the continuous growth of text data, especially web content, how to quickly and automatically extract the main or important information contained in massive text collections has become a research topic of wide concern, stimulating the rapid development of text information extraction technology. Text summarization extracts the discourse structure and main information of a text and automatically generates a single-document or multi-document summary, so it can be considered a kind of information extraction. In the usual sense, information extraction extracts the specific or important information that a text contains. Targeting Evidence-Based Medicine (EBM) web pages together with other types of training text, this thesis focuses on methods for automatic text summarization and information extraction. To address unsatisfactory information extraction results, unclear topic segmentation, paragraph clustering algorithms that are sensitive to initialization, and the need to set the number of clusters manually, it presents a series of novel methods and models.

(1) This thesis puts forward an information extraction method that combines paragraph features with a hidden Markov model (HMM). The main difference from other information extraction methods is that it takes the paragraph sequence, rather than the word sequence, as its object. After preprocessing, web page content is saved as a text sequence whose unit is the paragraph. Every paragraph is converted into a special token, and these tokens are the observation symbols of the HMM. The experiments show that, in both precision and recall, extraction with paragraphs as the observed sequence is better than extraction with words as the observed sequence.

(2) Paragraphs are represented in the Vector Space Model, and the text is segmented into semantic units by computing the mutual information between consecutive paragraphs. Because the thresholds influence the result, a genetic algorithm is then used to optimize the parameters. The experimental results show that the method improves precision to some degree.

(3) This thesis analyzes the main steps of spectral co-clustering of documents and words, identifies the cause of its sensitivity to initialization, and presents a modified spectral co-clustering method based on fuzzy K-harmonic means. The method consists of two steps. The first constructs a matrix that is insensitive to initialization. The second uses the fuzzy K-harmonic means algorithm instead of K-means to obtain the clustering results; fuzzy K-harmonic means uses a fuzzy weighted distance when calculating the distance between each data point and the cluster centers. The experiments show that the proposed method is not only insensitive to initialization but also improves the clustering results.

(4) This thesis explores a morphology-based method for determining the number of clusters in a given dataset and uses it to modify spectral co-clustering of documents and words. The method has three main steps. First, the matrix generated by spectral co-clustering of documents and words is converted into a VAT grayscale image. Then a sequence of image processing operations (grayscale morphology, image binarization, and the distance transform) filters the VAT image. Finally, a signal is built from the filtered VAT image and smoothed, and the number of clusters is read from its major peaks and troughs. Experiments show that this method improves the clustering results of spectral co-clustering of documents and words.

(5) Based on the LDA topic model, this thesis proposes Titled-LDA, a multi-document summarization algorithm that fuses topic models. Because the title is a strong indicator for summarization, Titled-LDA builds separate topic models for the title and the content of each document. In the fusion stage, the algorithm weights the different topic distributions through adaptive asymmetric learning based on the two kinds of information entropy. The final model thus incorporates title and content information appropriately, which helps summarization performance. The experiments show that the proposed algorithm performs better than other state-of-the-art algorithms on the DUC datasets.
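
The abstract does not give the fusion rule used by Titled-LDA, so the following is only a hedged sketch of the general idea of entropy-based weighting of two topic distributions. It assumes, purely for illustration, that the more peaked (lower-entropy) distribution should receive the larger weight, and that each source's weight is the other source's share of the total entropy. The function names `entropy` and `fuse_topics` and the example distributions are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def fuse_topics(title_dist, body_dist):
    """Fuse two topic distributions over the same topics.

    Assumed rule (not from the thesis): the lower-entropy distribution
    gets the larger weight, so a sharply focused title dominates a
    diffuse body and vice versa.
    """
    if len(title_dist) != len(body_dist):
        raise ValueError("distributions must cover the same topics")
    h_title, h_body = entropy(title_dist), entropy(body_dist)
    total = h_title + h_body
    # If both distributions are deterministic, fall back to equal weights.
    w_title = 0.5 if total == 0 else h_body / total
    w_body = 1.0 - w_title
    # A convex combination of two distributions is again a distribution.
    fused = [w_title * t + w_body * b for t, b in zip(title_dist, body_dist)]
    return fused, w_title

# Made-up example: a focused title and a more diffuse body over 3 topics.
title = [0.90, 0.05, 0.05]
body = [0.40, 0.30, 0.30]
fused, w_title = fuse_topics(title, body)
print(f"title weight = {w_title:.3f}, fused = {[round(p, 3) for p in fused]}")
```

The weights here are asymmetric only in the sense that they differ between the two sources and adapt to each document; whether this matches the adaptive asymmetric learning described in the abstract cannot be determined from the abstract alone.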
