
基于标签路径特征的Web新闻内容抽取研究

Extracting Web News Using Tag Path Features

【Author】 Wu Gongqing (吴共庆)

【Advisors】 Wu Xindong (吴信东); Hu Xuegang (胡学钢)

【Author Information】 Hefei University of Technology, Computer Application Technology, 2012, Ph.D.

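【Chinese Abstract】 Web news content extraction is a very important step in intelligent Web information processing. It underpins research and applications in intelligence acquisition and security, online public opinion monitoring, personalized recommendation services for mobile devices, heterogeneous Web data integration, information retrieval, and search engines. Research on the problems of Web news content extraction therefore has significant research and application value. Case analysis and further study show that many news websites share similar layout structures and styles, and that there is an implicit correlation between the content layout of a Web page and the tag paths of its parse tree. Traditional path expressions are too rigid to adapt to slight changes in HTML document structure during Web information extraction, which lowers extraction accuracy. In addition, Web news pages are massive and heterogeneous, which challenges the generality of hand-built wrappers and of wrappers based on rule learning. This dissertation therefore studies Web news content extraction based on tag path features, in two parts: for specific websites, high-precision extraction models and methods based on path pattern knowledge; for open environments, generic extraction models and methods based on tag path features. The main research content is as follows:

(1) Building on the implicit correlation between Web page content layout and the path patterns of the parse tree, a novel Web information extraction system model is proposed: PP-WNE, a Web news content extraction model based on distinguishing path patterns. On this basis, a special path pattern suited to Web news content extraction, the distinguishing path pattern, is defined, and a method for mining distinguishing path patterns is proposed, which solves the problem of building the extraction pattern knowledge base. With pages randomly selected from Chinese and English websites as the experimental dataset, the results show that, with a properly set noise-tolerance threshold, the extraction method based on path pattern mining reaches an F-score above 98%. The results also confirm the feasibility and effectiveness of path patterns for Web news content extraction.

(2) To optimize the size of the knowledge base in the path-pattern-based extraction model PP-WNE, the distinguishing path pattern covering problem is posed and proved to be NP-complete. To find a near-optimal solution, a special distinguishing path pattern, the minimal distinguishing path pattern, is defined. On this basis, a polynomial-time (ln|n|+1)-approximation algorithm, MPM, is designed for the covering problem, where n is the number of positive examples in the training samples. Experiments on the test datasets show that MPM effectively optimizes the set of distinguishing path patterns and achieves precision, recall, and F-score all above 98% under both node-level and text-level evaluation criteria.

(3) To meet the needs of Web news content extraction in open environments, a text-to-tag-path ratio feature is designed, and its computation by traversing the nodes of the page's parse tree is described. A threshold method, CEPR, is proposed that separates content from non-content based on the histogram of text-to-tag-path ratios, which effectively solves the problem of online Web news content extraction. A weighted Gaussian smoothing method based on path edit distance is also proposed; it effectively improves CEPR's ability to extract short text and solves the problem of filtering non-news content out of news content. CEPR is a fast, general, training-free Web page content extraction algorithm that can handle Web pages of many sources, styles, and languages. Experiments on the CleanEval test datasets show that, in most cases, CEPR outperforms CETR and other extraction methods.

(4) An HTML news page filtering and summarization system, NFaS, is designed and implemented. Within it, an automatic Web news page identification method that combines URL features, page structure features, and content attribute features is proposed and implemented, which effectively solves the problem of automatically identifying Web news pages. Web news content extraction is used to solve the problem of filtering Web news pages. A keyword extraction method based on semantic relations between words builds a word semantic relation graph from lexical chains and extracts high-quality keywords to complete the Web news summarization task. Evaluation on the test datasets confirms the effectiveness of the NFaS system.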
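The (ln|n|+1) approximation ratio stated for MPM in contribution (2) has the same form as the well-known guarantee of greedy set cover. The abstract does not describe how MPM works internally, so the following is only a minimal sketch of generic greedy set cover, not the dissertation's algorithm. It assumes each candidate pattern is represented simply by the set of positive-sample IDs it covers; the function name `greedy_pattern_cover` and the pattern names are hypothetical.

```python
def greedy_pattern_cover(positives, candidates):
    """Greedy set cover (illustrative, not the thesis's MPM algorithm).

    positives:  iterable of positive-sample IDs that must be covered.
    candidates: dict mapping a hypothetical pattern name to the set of
                positive-sample IDs that pattern covers.
    Returns the list of chosen pattern names, in selection order.
    """
    uncovered = set(positives)
    chosen = []
    while uncovered:
        # Pick the pattern that covers the most still-uncovered samples.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidates cannot cover all positive samples")
        chosen.append(best)
        uncovered -= gained
    return chosen


if __name__ == "__main__":
    patterns = {
        "p1": {1, 2, 3},
        "p2": {3, 4},
        "p3": {4, 5},
        "p4": {1},
    }
    print(greedy_pattern_cover({1, 2, 3, 4, 5}, patterns))
```

Choosing the pattern with the largest marginal coverage at each step is what yields the logarithmic bound for plain set cover. Whether MPM selects patterns this way is an assumption here.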

【Abstract】 Web news extraction plays an important role in intelligent Web information processing. It lays a foundation for research and development in information acquisition, information security, online public opinion monitoring, personalized recommendation for mobile users, integration of heterogeneous Web data sources, information retrieval, and search engines. Research on the key issues of Web news extraction therefore has both scientific and practical value.

Many Web news sites have similar structures and layout styles. Our extensive case studies indicate an implicit correlation between Web content layouts and the tag path patterns of the parse trees. Traditional path expressions are too rigid to adapt to slight changes in HTML structure, which reduces the accuracy of information extraction. In addition, the massive and heterogeneous nature of Web news data challenges the generality of handcrafted wrappers and of wrappers built by rule learning. Motivated by these observations, this dissertation studies Web news extraction using tag path features. The research has two components. For specific websites, we focus on highly accurate Web news extraction based on tag path patterns. For an open environment, we put forward a generic Web news extraction model using tag path features.

The main contributions of this dissertation are as follows:

(1) Based on the implicit correlation between Web content layouts and tag path patterns of parse trees, we propose a novel Web news extraction model, PP-WNE, which uses tag path patterns as the extraction knowledge. Based on this model, we define a special tag path pattern suited to Web news extraction, the distinguishing tag path pattern, and design a mining method for distinguishing tag path patterns to construct the extraction knowledge base.
Experimental results show that the Web news extraction method using tag path patterns achieves an F-score above 98% on real-world datasets of pages randomly selected from Chinese and English Web news sites. These results validate the feasibility and effectiveness of our Web news extraction method using tag path patterns.

(2) To optimize the size of the knowledge base in PP-WNE, we pose the distinguishing tag-path-pattern covering problem and prove that it is NP-complete. To obtain a near-optimal solution of this problem, we define a special distinguishing tag path pattern, the minimal distinguishing tag path pattern. We then design a polynomial-time (ln|n|+1)-approximation algorithm, MPM, where n is the number of positive samples. Experimental results show that MPM reduces the size of the set of distinguishing tag path patterns while achieving precision, recall, and F-score all above 98% on real-world datasets under both node-level and text-level evaluation criteria.

(3) To meet the requirements of Web news extraction in an open environment, we design the TTPR feature (Text to Tag Path Ratio) and describe how it is computed by traversing the parse tree of a Web page. We design a threshold method, CEPR, which distinguishes content from non-content using the histogram of TTPR values and effectively solves the online Web news extraction problem. Combining CEPR with a Gaussian smoothing method weighted by tag path edit distances significantly improves its ability to extract short text. CEPR is a fast, general, training-free Web news extraction algorithm. It can extract content from Web pages of many sources, styles, and languages.
The experimental results on the CleanEval datasets show that CEPR outperforms CETR and other state-of-the-art extraction methods in most cases.

(4) An HTML Web News Filtering and Summarization system (NFaS) is designed and implemented. In this system, a Web news page identification method is proposed that combines URL features, structural features, and content features. This method effectively solves the problem of automatically identifying Web news pages. Furthermore, Web news extraction is used to accomplish the task of Web news filtering. Finally, lexical chains are used to represent semantic relations between words, and high-quality keywords are extracted to summarize the Web news. The effectiveness of NFaS has also been evaluated on real-world datasets.
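Contribution (3) computes a text-to-tag-path ratio by traversing the parse tree and then thresholds it to separate content from non-content. The abstract does not give the formula, so the sketch below rests on stated assumptions: the ratio of a tag path is taken to be the total number of text characters found under that path divided by the number of text nodes with that path, and the threshold is passed in as a fixed, hypothetical parameter rather than derived from a histogram. The smoothing step is omitted. All names (`TagPathCollector`, `tag_path_ratios`, `extract_content`) are illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is never page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Records (tag_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP_TAGS):
            self.nodes.append((".".join(self.stack), text))


def collect_text_nodes(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def tag_path_ratios(nodes):
    """Assumed ratio: characters under a path / text nodes with that path."""
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, text in nodes:
        chars[path] += len(text)
        count[path] += 1
    return {path: chars[path] / count[path] for path in chars}


def extract_content(html, threshold):
    """Keep text nodes whose path ratio reaches the (hypothetical) threshold."""
    nodes = collect_text_nodes(html)
    ratios = tag_path_ratios(nodes)
    return [text for path, text in nodes if ratios[path] >= threshold]
```

In this sketch, navigation links tend to share a path with many short text nodes and so receive a low ratio, while body paragraphs share a path with long text nodes and receive a high one. A fixed threshold is a simplification; the abstract states that CEPR chooses the cut from the histogram of ratios.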
