

Study on Subject Search Technology of Web-oriented Text Mining

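【Author】 Duan Ping (段平)

【Advisor】 Liu Zhijing (刘志镜)

【Author Information】 Xidian University (西安电子科技大学), Computer Application Technology, 2008, Master's thesis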

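【Abstract (Chinese original, translated)】 With the rapid development of the Internet, massive Web data resources have become an important source of knowledge and information. Because Web resources are semi-structured, dispersed, real-time, and heterogeneous, it is difficult for users to obtain truly valuable information from the Web quickly and accurately. The main way to obtain Web information is through search engines, but today's popular general-purpose search engines do not adequately support information structure extraction, classification and filtering of Web text content, or document understanding. How to design search engine technology that is better suited to mining Web resources efficiently has therefore become a research hotspot. This thesis studies subject search engines for Web text mining and the design of such a system. It discusses the core technologies of current Web mining and search engines, and designs and implements Label3, a prototype system for subject-oriented Web information mining and search. The main work is as follows. Subject crawler technology: earlier crawling strategies are improved, a Web crawler search strategy based on a non-greedy genetic algorithm is proposed, and data analysis and performance comparisons of the algorithms are presented. Language filtering and tokenization, and Chinese word segmentation: considering the differences between Latin-script languages and Chinese, the thesis discusses segmentation algorithms for each and, to address the particular characteristics of Chinese, proposes a dictionary-based "word cell" (词元) segmentation algorithm. Web data mining algorithms: the collected Web data are clustered and classified to discover internal relationships in the data and to extract category information from the text, so that users receive better information services. Data indexing and retrieval mechanism: the indexing mechanism uses a distinctive inverted-ordering (倒排序) strategy to build the data index and refine the acquired text information. The query and retrieval service supports category-specific queries over different classes of Web pages, which makes the search results returned to users more precise. Based on these results, the thesis describes the design and implementation details of the prototype system.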

【Abstract】 With the rapid development of the Internet, massive Web data resources have become an important source of knowledge and information. Because Web resources are semi-structured, dispersed, real-time, and heterogeneous, it is hard for users to get truly valuable information from the Web quickly and accurately. The main method of obtaining Web information is the search engine, but popular general-purpose search engines do not adequately support functions such as information structure extraction, classification and filtering of Web text content, and document understanding. How to design a search engine suited to efficient Web data mining has therefore become an active research topic. This study focuses on subject search research and system design oriented to Web text mining. The key techniques of current Web mining and search engines are discussed, and a prototype system named Label3 for subject-oriented Web information mining and search is designed and implemented. The main research tasks are as follows. Subject crawler technology: earlier crawler strategies are improved, a crawler search strategy based on a non-greedy genetic algorithm is proposed, and data analysis and performance comparisons of the algorithms are given. Language filtering and Chinese word segmentation: considering the differences between Latin-script languages and Chinese, we discuss word segmentation algorithms for each language and, based on the particular characteristics of Chinese, propose a dictionary-based word segmentation algorithm named "Word cell". Web data mining algorithms: the collected Web data are clustered and classified to discover the inner relationships of the data and to extract category information from the text, which provides better information services for users. Data index and retrieval scheme: the indexing scheme adopts a distinctive inverted-ordering strategy to build the data index and refine the obtained text information. The query service performs retrieval per category of Web page, so the retrieval results are more accurate. Based on these research results, the details of the prototype system's design and implementation are described in this study.
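
The abstract names two building blocks without giving their details: dictionary-based Chinese word segmentation and an inverted index for retrieval. The sketch below illustrates both in a minimal form. It is not the thesis's "Word cell" algorithm or the Label3 index, neither of which is specified here; it substitutes plain forward maximum matching for segmentation and a token-to-document-id map for the index. The names `segment`, `build_inverted_index`, `search`, `DICTIONARY`, and `DOCS` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def segment(text, dictionary, max_len=4):
    """Forward maximum matching (an assumed stand-in, not the thesis's method).

    At each position, take the longest substring (up to max_len characters)
    found in the dictionary; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_inverted_index(docs, dictionary):
    """Map each token to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in segment(text, dictionary):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query, dictionary):
    """Return ids of documents that contain every token of the query."""
    tokens = segment(query, dictionary)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Toy data, invented for this example only.
DICTIONARY = {"搜索", "引擎", "搜索引擎", "文本", "挖掘", "文本挖掘"}
DOCS = {1: "搜索引擎", 2: "文本挖掘", 3: "搜索文本"}
INDEX = build_inverted_index(DOCS, DICTIONARY)

print(segment("搜索文本", DICTIONARY))
print(search(INDEX, "搜索文本", DICTIONARY))
```

Because maximum matching always prefers the longest dictionary entry, document 1 is indexed under "搜索引擎" only, so a query for "搜索" alone does not retrieve it. A production system would typically index sub-words as well or use a finer-grained segmentation at index time.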
