节点文献

网络信息自动化高效抽取技术研究

Research on Automatic and Efficient Technologies for Web Information Extraction

【作者】 宋鑫莹

【导师】 洪小文;

【作者基本信息】 哈尔滨工业大学 , 计算机科学与技术, 2013, 博士

【摘要】 随着互联网爆炸式的发展和普及,网络信息已经成为了一种宝贵的信息数据资源。海量的网络数据使得数据分析与挖掘系统进入了一个新时代,越来越多的网络应用系统需要对来自不同数据源的结构化数据进行抽取、挖掘和整合。然而,由于网页文档的半结构化性质,网页上呈现的数据往往不能被机器自动地抽取和理解,因此,网络信息抽取的研究目标在于提取网页的结构化数据。互联网数据的海量规模与高度异构的特征,为网络信息抽取工作带来了巨大的挑战。本文围绕网络信息的海量规模与高度异构的特征,分数据记录抽取和数据单元抽取两个层次,对自动化、高效抽取网络信息的技术展开了相关研究,研究内容包括以下四个方面:1.针对网络信息高度异构的特点,提出新的自动化的基于锚点树的数据记录的抽取方法(Mining data records Based on Anchor Trees,MiBAT)。首先分析了当数据记录含有一定的不规则内容时(例如用户原创内容)时,现有的基于相似度检测的自动化方法并不能取得理想的抽取效果。本文提出锚点的概念,对应数据记录中的某些关键的数据单元。例如,每个用户创建、发表的帖子记录(例如在线论坛帖子、用户评论等)都含有发帖时间这个关键的数据单元,可以作为由领域约束获得的锚点。本文提出MiBAT方法,利用领域约束检测出锚点,然后围绕包含锚点的DOM(Document Object Model)子树,完成数据记录的自动化抽取工作。实验表明,与以往的自动化的数据记录抽取方法相比,MiBAT方法可以较好的克服数据记录的不规则性,具有较高的抽取准确度。2.针对数据记录层次的网络信息的海量规模的特点,提出快速高效的锚点树的寻找算法。传统的网络信息挖掘算法采用自上而下的枚举DOM子树的方式,按照这种方式设计锚点树寻找算法,MiBAT的时间复杂度为O(n2),其中n是输入网页的DOM树的结点的数量。本文提出一个新的基于标签路径自底向上聚集的锚点树寻找算法,使得MiBAT的时间复杂度降到O(nlogn)。实验表明,新的锚点树寻找算法极大地提高了MiBAT方法的运行效率,同时保持较高的抽取准确度。3.针对网络信息的跨领域异构的特点,提出不依赖领域约束的通用锚点的检测方法。锚点的概念最初由领域约束而来,对应于领域相关的数据单元。在实际应用时,对不同的领域,需要预先指定相应的领域约束,这在某种程度上限制了MiBAT方法的自动化应用。本文对此进行扩展,提出通用锚点的概念及其检测和应用方法。实验表明,应用通用锚点时,MiBAT方法可以应用于不同的领域的信息抽取任务,具有较高的准确度,不需要人为指定领域约束。4.针对数据单元层次的网络信息的海量规模的特点,研究快速高效的DOM树匹配算法,应用在数据单元抽取对齐任务中。现有的广泛应用的树匹配方法的复杂度是O(n2),并不适合海量规模的网络信息抽取任务。本文提出一个新的基于标签路径序列的最长公共子列(Longest Common Subsequence,LCS)的方法。利用LCS问题的稀疏性质,算法复杂度可以达到O(rlogn),其中r等于两棵树上具有相同标签路径的结点对的数量;当两棵树的候选匹配较为稀疏时,r≈O(n),算法的复杂度接近O(nlogn)。实验表明,与现有的广泛应用的DOM树匹配方法相比,本文提出的方法具有更高的运行效率,同时保持较为一致的树匹配准确度和数据单元对齐准确度。综上所述,本文在数据记录抽取和数据单元抽取两个层次上,提出了自动化的、高效的网络信息抽取方法,能够较好的适应网络信息高度异构和海量规模的特点,具有较大的理论价值和实际应用价值。

【Abstract】 The World Wide Web has become an important resource of information due to itsexplosive growth and spread in the past two decades. The tremendous amount of web datahas opened a new era for data analysis and mining systems. More and more web applica-tions need to extract, mine, and integrate data from enormous data sources. However, dueto the semi-structure characteristic of web pages, web data exhibited on web pages is notdirectly consumable by machines. Web information extraction aims at extracting struc-tured data from web pages, which is a very challenging problem due to the large-scaleand highly-heterogeneous characteristic of web information.Aiming at handling the large-scale and highly-heterogeneous characteristics of webinformation, this dissertation studies automatic and efcient technologies for web infor-mation extraction, conducted on two levels of data records and data units respectively.The research content includes:1. Targeting the high-heterogeneity of web information, a novel automatic datarecord extraction method called MiBAT (Mining data records Based on Anchor Trees) isproposed. Existing similarity-based automatic approaches cannot extract web data record-s accurately when a large amount of unstructured content exists (e.g., user-generated con-tent). This paper presents the concept of pivots, which correspond to some key data units.For example, almost every data record created and posted by users (e.g., online forumposts, user reviews, etc.) contains the publication date as a key data unit, which is apivot derived by domain constraints. The proposed MiBAT method detects pivots basedon domain constraints, identifies anchor trees that are DOM (Document Object Model)sub-trees containing the pivots, and finally extracts data records around the anchor treesautomatically. Experimental results show that, compared to existing approaches, MiBATis able to overcome the irregularity of data records caused by unstructured content, result-ing in high accuracy.2. Targeting the large-scale of web information on the level of data records, a fastand efcient anchor tree finding algorithm is proposed. In web mining community, thetraditional mining approach is to enumerate sub-trees in a top-down manner; followingthis approach, the time complexity of MiBAT is O(n2), where n is the number of nodeson the DOM tree of the web page. In this paper, a novel anchor tree finding algorithm is presented based on aggregating tag paths in a bottom-up fashion, which enables MiBAT torun in O(n log n) time. Experimental results demonstrate that the new method significantlyimproves the efciency of MiBAT while remaining high accuracy.3. Targeting the cross-domain high-heterogeneity of web information, the conceptof generic pivots is proposed. The concept of pivots origins from domain constraints,corresponding to some domain-dependent key data units. In real applications, diferentdomain constraints are required to be identified for diferent domains, which limits theapplicability of MiBAT to some extent. To resolve the domain dependency, this paperexpands the concept of pivots and proposes generic pivots. Experimental results suggestthat, when using generic pivots, MiBAT is applicable to diferent domains achieving highaccuracy, without any pre-defined domain constraints.4. Targeting the large-scale of web information on the level of data units, a fastand efcient method for DOM tree matching is proposed for data unit alignment andextraction. The most widely used tree matching algorithm runs in O(n2) time, whichis not appropriate for web-scale processing. This paper proposes a novel tree matchingmethod based on the longest common subsequence (LCS) of the tag path sequences ofDOM trees. By exploring the inherent sparsity of the LCS problem, the proposed treematching method runs in O(r log n) time, where r is the number of pairs of nodes thathave identical tag paths from the two trees; when the matching is sparse, r≈O(n),and the algorithm runs in O(n log n) time approximately. Extensive experimental resultsdemonstrate that, compared to the existing method, the proposed approach significantlyimproves the running efciency and also achieves similar tree matching results as well asdata unit alignment results.In summary, this dissertation presents technologies for automatic and efcient webinformation extraction on both levels of data records and data units, which can well handlethe large-scale and highly-heterogeneous characteristics of web information with boththeoretical and application value.

  • 【分类号】TP393.092;TP391.1
  • 【被引频次】2
  • 【下载频次】395
  • 攻读期成果
节点文献中: