基于Web信息抽取的专业知识获取方法研究

Research on a Specialized Knowledge Acquisition Method Based on Web Information Extraction

【Author】 Hu Yan (胡燕)

【Advisor】 Zhong Luo (钟珞)

【Author information】 Wuhan University of Technology, Computer Application Technology, 2007, PhD

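【Summary】 The rapid growth of the Internet has made it a major resource for disseminating and sharing information worldwide. Data on the Web keeps growing geometrically, yet finding a single useful piece of information is becoming ever harder, and "information overload" has become a pressing problem. Ideally, people could query Web data the way they query a database. How to extract useful information from the vast body of Web data is therefore a problem that many research efforts hope to solve. The scale, heterogeneity and constant change of the Internet set Web information extraction apart from traditional information extraction and bring new challenges. Extraction techniques have multiplied as demand has grown, and many extraction methods have emerged in China and abroad in recent years.

This dissertation targets the subject knowledge database that an intelligent tutoring system needs to build, and studies how to acquire the specialized knowledge of each discipline from the Web automatically, according to user needs. The proposed method is inspired mainly by SRV, which treats information extraction as a classification problem. Combining that idea with existing Web information extraction techniques based on HTML structure, the dissertation constructs a framework for a Web specialized knowledge acquisition system based on Web information extraction and classification, and studies several key techniques within that framework:

1. Batch acquisition and preprocessing of Web pages. Acquiring specialized knowledge from the Web requires collecting a large number of pages on the same topic, and the services offered by current search engines do not meet this need. The dissertation proposes a simple and efficient method that automatically fetches pages from the Web in batches and uses regular expressions to pick out the pages that carry topic content.
2. Web page preprocessing. Based on the meaning of the tags in the HTML document structure, an HTML container tag tree is built. Using the characteristics of noise blocks and topic content blocks, noise nodes are deleted from the tag tree and the topic content block is identified.
3. Topic information extraction from Web pages. Current extraction methods need considerable manual intervention and prior knowledge, and different systems use different description languages. The dissertation therefore adopts an extraction method based on XML mapping: it builds a Jtree from the DOM, obtains extraction paths automatically from the treenode nodes, and learns extraction rules, so that extraction is automated.
4. Chinese text feature representation and text classification algorithms. In the vector space model, the number of feature words and the size of the data search space are closely tied to the efficiency of the classification algorithm. The dissertation proposes a feature word extraction method based on part of speech, which effectively reduces the dimensionality of the feature vector, and two improved KNN algorithms, one based on feature word reduction and one based on data partitioning, which improve the efficiency and performance of classification.
5. Automatic acquisition of the training set. Improving classifier performance requires a high-quality training set. Earlier studies all start from an existing training set, whereas this dissertation proposes generating a high-quality training set automatically through Web mining, which further raises the degree of automation of knowledge acquisition.
6. Organization and storage of information. The extracted specialized knowledge is organized into a form that the user's application system, the intelligent tutoring system, can access directly, and the data is given a preliminary arrangement according to the requirements of that system.

The dissertation studies the key techniques at each stage of specialized knowledge acquisition based on Web information extraction, establishes a knowledge acquisition framework, and achieves a preliminary automation of the whole acquisition process.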
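
Item 2 of the abstract above (building a tree of HTML container tags and deleting noise nodes) can be illustrated with a short Python sketch. It is not the dissertation's algorithm: the set of container tags, the link-density test and its 0.5 threshold are assumptions chosen for illustration, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser

# Assumption: this set of "container" tags is illustrative, not taken from the thesis.
CONTAINER_TAGS = {"body", "div", "table", "tr", "td", "ul", "p"}


class Node:
    """One container element in the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = []  # text that sits directly in this container
        self.text_len = 0   # characters of text anywhere below this node
        self.link_len = 0   # characters of that text inside <a> elements


class ContainerTreeBuilder(HTMLParser):
    """Builds a tree that keeps only container tags (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag in CONTAINER_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in CONTAINER_TAGS and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.current.own_text.append(text)
        # Propagate the character counts up to every enclosing container.
        node = self.current
        while node is not None:
            node.text_len += len(text)
            if self.in_link:
                node.link_len += len(text)
            node = node.parent


def prune_noise(node, max_link_density=0.5):
    """Drop child containers that are empty or dominated by link text."""
    kept = []
    for child in node.children:
        density = child.link_len / child.text_len if child.text_len else 1.0
        if density <= max_link_density:
            prune_noise(child, max_link_density)
            kept.append(child)
    node.children = kept


def collect_text(node):
    """Gather the text of a node, own text first, then its surviving children."""
    parts = list(node.own_text)
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_main_text(html, max_link_density=0.5):
    builder = ContainerTreeBuilder()
    builder.feed(html)
    builder.close()
    prune_noise(builder.root, max_link_density)
    return " ".join(collect_text(builder.root))
```

Calling `extract_main_text(page_html)` on a page whose navigation bar and footer consist only of links returns the text of the remaining content block. Link density is only one plausible noise signal; the abstract does not say which block features the dissertation actually uses.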

【Abstract】 Its rapid development has made the Internet an important resource for disseminating and sharing information worldwide. Data on the Web is growing geometrically, so acquiring a single piece of useful information from the Web is becoming more and more difficult, and "information overload" has become a problem in urgent need of a solution. In the ideal case, people could query data on the Web in the same way as they query a database. How to extract useful information from the vast amount of data on the Web, however, is still a problem that researchers hope to solve. Characteristics such as massive scale, heterogeneity and dynamic change make Web information extraction different from traditional information extraction and bring new challenges. In recent years extraction techniques have been enriched as demand has increased, and many information extraction methods have appeared at home and abroad.

In this dissertation, we investigate methods for automatically acquiring specialized knowledge in each subject from the Web according to user needs, with a view to the subject knowledge database that an intelligent tutoring system has to build. The specialized knowledge acquisition method proposed here is inspired mainly by the idea in SRV of treating information extraction as a classification problem. Together with existing Web information extraction techniques based on HTML structure, we construct the framework of a Web specialized knowledge acquisition system based on Web information extraction and classification, and study several key techniques within this framework. The contents of the dissertation are as follows:

1. Batch acquisition and preprocessing of Web pages. Specialized knowledge acquisition based on the Web requires collecting a large number of Web pages on the same topic, and the services provided by current search engines cannot meet this need. We present a simple and efficient method that automatically acquires Web pages in batches and uses regular expressions to match the pages that contain topic content.
2. Page preprocessing. According to the meaning of the tags in the HTML document structure, an HTML container tag tree is constructed. Based on the characteristics of noise blocks and topic content blocks in the pages, the noise nodes in the tag tree are deleted and the topic content block is determined.
3. Topic information extraction from pages. Current information extraction methods need much manual intervention and much prior knowledge, and different systems use different description languages. We therefore adopt an information extraction method based on XML mapping: we build a Jtree from the DOM, automatically obtain the extraction paths from the tree nodes, and learn the extraction rules, so that information extraction is automated.
4. Chinese text feature representation and text classification algorithms. In the vector space model, the number of feature words and the size of the data search space are closely related to the efficiency of the classification algorithm. We therefore develop a feature word extraction method based on part of speech, which reduces the dimensionality of the feature vector. We also propose two modified KNN algorithms, based on feature word reduction and on data partitioning respectively, which improve the efficiency and performance of classification.
5. Automatic acquisition of the training set. To improve the performance of the classification algorithm, a high-quality training set has to be established. Past research has relied on training sets that had already been built. In this study a high-quality training set is generated automatically by Web mining, which further raises the degree of automation of specialized knowledge acquisition.
6. Information organization and storage. The extracted specialized knowledge is organized into a form that the user's application system, the intelligent tutoring system, can access directly, and the data is given a preliminary arrangement according to the needs of that system.

This dissertation studies the key techniques at each stage of specialized knowledge acquisition based on Web information extraction, establishes a knowledge acquisition framework, and achieves preliminary automation of the whole acquisition process.
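
Item 4 of the abstract (part-of-speech based feature word selection feeding a KNN classifier) can be sketched as below. This is a plain KNN over term-frequency vectors with cosine similarity, not either of the dissertation's two modified KNN algorithms. The input format (tokens already tagged as `(word, pos)` pairs), the tag names `"n"` and `"v"`, and the choice to keep only nouns and verbs are assumptions made for the example; the abstract does not state which parts of speech are retained.

```python
import math
from collections import Counter

# Assumption: keep nouns ("n") and verbs ("v") only. The tag names are hypothetical.
KEEP_POS = {"n", "v"}


def features(tagged_tokens):
    """Term-frequency vector over the tokens whose part of speech is kept."""
    return Counter(word for word, pos in tagged_tokens if pos in KEEP_POS)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Label a tagged document by majority vote among its k most similar training documents.

    `training` is a list of (tagged_tokens, label) pairs.
    """
    q = features(query)
    scored = sorted(
        ((cosine(q, features(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Dropping function words before vectorizing shrinks every vector, which is the effect the abstract attributes to part-of-speech based selection. In practice the tagged input would come from a Chinese word segmenter and part-of-speech tagger, which this sketch leaves out.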

  • 【Classification number】TP393.09
  • 【Citations】26
  • 【Downloads】1743