
Web信息检索若干关联挖掘问题的研究

Research on Several Association Rule Mining Problems for Web Information Retrieval System

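【Author】 沈筱彦 (Shen Xiaoyan)

【Advisor】 陈俊亮 (Chen Junliang)

【Author information】 Beijing University of Posts and Telecommunications (北京邮电大学), Computer Science and Technology, 2009, PhD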

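【Abstract (translated from the Chinese)】 Information explosion is a defining feature of today's information society. Information retrieval technology faces a serious challenge: information on the Internet is updated ever faster, and users demand ever more precise search results. Helping users find the information they need has therefore become a key problem. On the one hand, retrieving pages that contain the needed information purely by query terms is not always the most effective approach. By mining the association relationships among web pages, a user who already knows one page containing the needed information can more easily reach other pages related to it. On the other hand, most users of Web information retrieval systems are ordinary users who find it hard to turn a complex search goal into a few simple query terms. Language also contains many synonyms, abbreviations and associated words; because of this inherent ambiguity, one query term can stand for different information needs, and one information need can be expressed in many ways. Mining the association relationships among query terms therefore helps users construct better queries and retrieve more useful information. Since Chinese Web information retrieval is still far from ideal, this dissertation studies in detail the association relationships among web pages and among Chinese words. The main work is as follows:

1. Taking the link relationships between web pages as the starting point, the dissertation proposes a new algorithm for mining association relationships among web pages. The algorithm is the first to introduce page segmentation into the mining of related pages. It also combines anchor-text similarity with filtering of page template blocks, which improves the precision with which related pages are identified. Considering the size of the page repositories that must be handled in engineering practice, the dissertation also gives the concrete steps for a parallel implementation of the algorithm.

2. Chinese words and their abbreviated forms are frequently used interchangeably, so identifying Chinese abbreviations and the synonymous full forms they correspond to is an important problem in Chinese information retrieval. The dissertation proposes a new algorithm that mines the correspondence between Chinese abbreviations and full forms from the anchor text of web links. It first uses the longest common subsequence algorithm to obtain candidate abbreviation/full-form pairs from anchor text, and then filters the candidates with a support vector machine. Experiments show that the algorithm effectively mines the Chinese abbreviations and corresponding full forms hidden in anchor text, with high accuracy.

3. Effectively mining the association relationships among Chinese words, and obtaining clusters of Chinese words that belong to the same topic, is valuable both for providing diverse search results in Chinese Web information retrieval systems and for constructing related Chinese queries. Starting from the punctuation characteristics of Chinese, the dissertation proposes a new algorithm that uses the coordinate (paratactic) phrases inside Chinese sentences to mine association relationships among Chinese words and to cluster them. Using an approximate algorithm for dense sub-graph mining on a bipartite graph, it can efficiently cluster the coordinate phrases in a massive Chinese corpus. To improve the clustering results further, the dissertation proposes two additional algorithms that effectively mine large numbers of associated Chinese words belonging to the same topic. Experiments show that the proposed algorithms achieve a high clustering success rate and clustering precision, and have good prospects for engineering application.

4. Helping users construct queries that accurately express their search intent is another important direction in information retrieval research. The dissertation proposes a composite algorithm framework that recommends related queries based on the query a user enters. On the one hand, it recommends queries according to their relevance, popularity and effectiveness, helping users narrow their search intent so as to obtain more precise results. On the other hand, it uses click information from query logs, the mined Chinese abbreviation/full-form pairs, the clusters of same-topic Chinese words, Chinese synonym pairs and a Chinese language model to modify the user's query in a reasonable way, so as to obtain more results that satisfy the user's search intent. Experiments show that the framework effectively recommends related queries to users and helps improve the query results of Chinese Web information retrieval systems.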
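The candidate-extraction step in contribution 2 can be illustrated with a small sketch. It is not the dissertation's implementation. It assumes (hypothetically) that anchor texts pointing at the same page are compared pairwise, and that a shorter text counts as an abbreviation candidate when every one of its characters appears, in order, in the longer text, that is, when the longest common subsequence has the same length as the shorter text. The function names `lcs_length` and `candidate_pairs` and the minimum length of two characters are illustrative choices; the support vector machine filter is not shown.

```python
from itertools import permutations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def candidate_pairs(anchor_texts):
    """Return (abbreviation, full_form) candidates.

    Assumption: all anchor_texts point at the same page. A pair is kept
    when the shorter text is a subsequence of the longer one.
    """
    unique = sorted(set(anchor_texts))
    pairs = []
    for short, full in permutations(unique, 2):
        if 2 <= len(short) < len(full) and lcs_length(short, full) == len(short):
            pairs.append((short, full))
    return pairs


if __name__ == "__main__":
    # Hypothetical anchor texts for one target page.
    print(candidate_pairs(["北邮", "北京邮电大学", "招生信息"]))
```

With the hypothetical input above, the only surviving candidate is the pair 北邮 / 北京邮电大学. A subsequence test alone admits many false pairs, which is why a learned filter such as the one the abstract describes would still be needed afterwards.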

【Abstract】 In the current century, the information explosion has become a prominent phenomenon: information is updated at high speed and users' requirements for search results keep rising, so how to obtain useful information from a huge amount of web resources is a vital problem. On the one hand, searching by keywords for pages that contain the required information is not always the most efficient approach. Mining the association relationships among web pages can guide users from one useful page to other useful pages. On the other hand, many novice web users are not good at describing a complex search target correctly in a few simple words. Because language contains many abbreviations, synonyms and associated words, it is inherently ambiguous: the same word can represent different search demands, and the same search demand can be described by different words. It is therefore helpful to mine these association relationships in order to construct effective queries and find the desired information. Since the quality of search results in Chinese web information retrieval systems is still not very good, this dissertation focuses on several association mining problems in web information retrieval systems. The contributions are as follows:

1. Based on an analysis of the link relationships between web pages, a new algorithm for mining related pages is proposed. It introduces an HTML segmentation step into the process of mining related pages for the first time. Combined with other techniques, such as page template filtering and anchor text similarity boosting, the algorithm improves the precision of related pages. To handle the large corpora found in practical engineering projects, the dissertation also describes in detail how to implement the algorithm in parallel.

2. Chinese abbreviations are widely used in Chinese texts for convenience or to save space. Since abbreviations and their original definitions can be substituted freely without changing the meaning of a text, they pose a considerable challenge for web information retrieval. For this reason, an effective and novel approach is proposed to identify Chinese abbreviations and their definitions automatically. First, the longest common subsequence algorithm is used to extract candidate abbreviation-definition pairs from anchor texts. Then a support vector machine model is trained to filter the genuine abbreviation-definition pairs from the candidates. Experimental results show encouraging performance.

3. Mining the association relationships between Chinese words and clustering the words by topic can help a web information system provide diverse search results and generate related queries. In this dissertation, a simple but powerful algorithm for clustering Chinese words is proposed that exploits the characteristics of Chinese punctuation. The algorithm can efficiently cluster paratactic words in a large Chinese corpus by using an approximate algorithm for dense sub-graph mining on a bipartite graph. Two further algorithms are proposed to improve the precision and recall of the word clusters. Many Chinese words within the same topic can be obtained with these algorithms. Experimental results indicate that the approach is well suited to clustering Chinese terms and to practical engineering applications.

4. How to help users construct precise queries that describe their search target is an important research area in web information retrieval. In this dissertation, a composite framework is proposed to suggest related queries for the original queries submitted by users. The framework suggests related queries according to several factors, such as relevance, popularity and effectiveness, in order to narrow the user's target and obtain search results with higher precision. In addition, the framework uses click information in query logs, Chinese abbreviation-definition pairs, Chinese word clusters and Chinese synonyms to modify the original query without changing its meaning, which can help users get more results relevant to their search target. Experiments show that the framework suggests related queries to users effectively and may improve the quality of the search results of a web information system.
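The idea behind contribution 3, using Chinese punctuation to find words that are listed side by side, can be sketched as follows. This is a simplified illustration under stated assumptions, not the dissertation's algorithm. It assumes that the punctuation cue is the enumeration comma 、, keeps only items that have 、 on both sides (a heuristic chosen here because no word segmentation is applied, so the first and last items of a list usually carry surrounding text), and groups words with a plain union-find over shared list membership. That grouping is a much cruder stand-in for the dense sub-graph mining on a bipartite graph that the abstract names. The names `interior_items` and `group_words` are hypothetical.

```python
import re
from collections import defaultdict

# Clause boundaries: common Chinese and ASCII sentence punctuation.
CLAUSE_BREAK = re.compile(r"[。！？；，,.!?;\n]")


def interior_items(text):
    """Return, per clause, the items that have 、 on both sides."""
    lists = []
    for clause in CLAUSE_BREAK.split(text):
        parts = [p.strip() for p in clause.split("、")]
        interior = [p for p in parts[1:-1] if p]
        if len(interior) >= 2:
            lists.append(interior)
    return lists


def group_words(lists):
    """Merge words that share a list, using union-find (connected components)."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for items in lists:
        root = find(items[0])
        for w in items[1:]:
            parent[find(w)] = root

    groups = defaultdict(set)
    for w in list(parent):
        groups[find(w)].add(w)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    # Hypothetical sentences, invented for illustration.
    text = ("超市里有苹果、香蕉、橘子、葡萄等水果。"
            "他喜欢足球、篮球、排球、网球等运动。"
            "桌上放着梨、橘子、西瓜、桃。")
    print(group_words(interior_items(text)))
```

On these invented sentences the fruit words 香蕉, 橘子 and 西瓜 end up in one group and the sports words 篮球 and 排球 in another. Connected components merge topics as soon as a single ambiguous word links two lists, which is one reason a density-based criterion, as described in the abstract, is the more robust choice on a large corpus.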
