Research on an Automatic Summarization System Based on RSS Source Text (基于RSS源文本的自动文摘系统研究)

【Author】 Liu Qiyuan (刘启元)

【Advisor】 Ye Ying (叶鹰)

【Author information】 Zhejiang University, Information Resources Management, 2012, Master's degree

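【Abstract (translated from the Chinese)】 With the total volume of online information resources growing exponentially, how to retrieve information from massive data and grasp its gist is a problem worth studying. Search engines and RSS push technology have solved the problem of the "source" of information, but have not adequately solved the problem of its "quantity". Automatic summarization is one effective way to compress and refine information. It uses computer technology to automatically extract or generate, from the original document, a short, coherent text that reflects the central content, helping users grasp the gist of the information quickly, accurately, and comprehensively. This thesis holds that news summaries of different topic types have different combination models of text features, so the result of automatic text classification should serve as the prerequisite for automatic summarization. A classification corpus was built through web crawling, web page cleaning, and data storage, and on this basis automatic classification was implemented using different feature selection algorithms and classification algorithms. Two measures of summary sentences, Probability (可能性) and Possibility (可行性), are proposed. Based on a constructed summary corpus, supervised machine learning algorithms based on regression analysis (linear regression and logistic regression) are trained to determine the optimal parameters of the feature combination model for summary sentences. For Chinese text, an improved family of evaluation algorithms, ROUGE-CN, is proposed for measuring the Probability of summary sentences and for evaluating machine summaries. Comparative experiments between summaries produced by the machine-learning-based method, baseline summaries, and Word summaries show that, with automatic classification as a prerequisite, supervised machine learning algorithms based on regression analysis can effectively improve the quality of machine summaries. Taking the combination of online RSS data sources with the regression-based machine learning summarization method as its innovation, the thesis finally designs and implements an automatic summarization system based on RSS source text. The system takes online RSS source text as its data source, extracts metadata from the original articles by regular expression matching, offers a choice of feature selection algorithms, automatic classification algorithms, machine learning algorithms, and compression ratios, combines automatic classification and automatic summarization to produce a class label and a machine summary, and presents the news summary online alongside the original text.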
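The abstract names the ROUGE-CN family of evaluation measures but does not define it. As a rough illustration of the general idea only, the sketch below computes a plain ROUGE-N recall over character n-grams, which avoids the need for Chinese word segmentation. The function names `char_ngrams` and `rouge_n_recall` and the choice of character-level units are assumptions made for this example, not the thesis's actual algorithm.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return a multiset of character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Assumption: character n-grams stand in for words, so no Chinese
    word segmenter is required. This is generic ROUGE-N recall, not
    the ROUGE-CN variant proposed in the thesis.
    """
    ref = char_ngrams(reference, n)
    cand = char_ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is credited at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    machine_summary = "自动文摘系统"
    reference_summary = "自动文摘技术"
    print(rouge_n_recall(machine_summary, reference_summary, n=2))
```

Working at the character level is one simple way to adapt an n-gram overlap measure to unsegmented Chinese text. Whether ROUGE-CN takes this route is not stated in the abstract.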

【Abstract】 As the amount of information keeps growing, it is valuable to work out how to retrieve information and obtain its gist. Search engines and the "push" technology of RSS address the "source" of information but not its "quantity". Automatic summarization technology is one of the best ways to deal with information overload. This thesis assumes that documents on different topics should have different feature combination models, so automatic classification is a prerequisite of the automatic summarization procedure. After a self-built classification corpus was constructed, four feature selection algorithms were used with the Simple Vector Distance classification algorithm to perform automatic classification. Two measures for evaluating summary sentences are proposed: Probability and Possibility. Based on the summary corpus, machine learning algorithms including Linear Regression and Logistic Regression were applied to construct the optimal feature combination model for summary sentences. The thesis also proposes the ROUGE-CN algorithm for Chinese text. The experimental comparison shows that combining automatic classification with regression-based machine learning algorithms improves the quality of machine-generated Chinese news summaries. The innovation of this thesis is the combination of online RSS feeds with machine-learning-based automatic summarization. Finally, an automatic summarization system based on RSS feeds was implemented. The system obtains news text from online RSS feeds, extracts metadata using regex matching, provides users with various options, and then generates the class label and summary.
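To make the regression-based sentence selection described in the abstract more concrete, here is a minimal sketch of extractive summarization: each sentence gets a logistic score from a weighted feature combination, and the top-scoring sentences are kept according to a compression ratio. The three features, the values in `WEIGHTS`, and all function names are hypothetical placeholders. The thesis's actual features and fitted parameters are not given in this record.

```python
import math
import re

# Hypothetical weights for illustration only; in a trained system they
# would be fitted by logistic regression on a labelled summary corpus.
WEIGHTS = {"bias": -1.0, "position": 2.0, "title_overlap": 3.0, "length": 0.5}


def split_sentences(text):
    """Split after Chinese or Western sentence-ending punctuation.

    Splitting on an empty (lookbehind-only) match needs Python 3.7+.
    """
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p.strip() for p in parts if p.strip()]


def features(index, total, sentence, title):
    """Three toy features, each scaled to the range [0, 1]."""
    title_chars = set(title)
    overlap = (
        sum(1 for c in title_chars if c in sentence) / len(title_chars)
        if title_chars else 0.0
    )
    return {
        "position": 1.0 - index / total,           # earlier sentences score higher
        "title_overlap": overlap,                   # share of title characters reused
        "length": min(len(sentence) / 30.0, 1.0),   # capped relative length
    }


def score(feats):
    """Logistic function of the weighted feature sum."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def summarize(title, text, ratio=0.3):
    """Keep the top-scoring sentences, in original order, for a given ratio."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    n = len(sentences)
    scored = [(score(features(i, n, s, title)), i) for i, s in enumerate(sentences)]
    keep = max(1, round(ratio * n))
    chosen = sorted(i for _, i in sorted(scored, reverse=True)[:keep])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    news = "自动文摘技术可以压缩信息。今天天气很好。系统基于RSS源。"
    print(summarize("自动文摘", news, ratio=0.34))
```

Under the abstract's premise that each topic class has its own feature combination model, a system of this kind would presumably hold one weight set per class label and pick it after classification. That extension is left out here to keep the sketch short.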

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication year and issue】 2012, Issue 09