Research on Web Text Clustering and Retrieval Technology (互联网文本聚类与检索技术研究)

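【Author】 Meng Xianjun (孟宪军)

【Advisor】 Wang Xiaolong (王晓龙)

【Author information】 Harbin Institute of Technology, Computer Application Technology, 2009, Ph.D.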

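【Abstract (translated from Chinese)】 With the rapid development of Internet technology, the volume of text information on the network grows daily, and people urgently need to obtain information on the Internet more efficiently. Text mining techniques mine knowledge from text data and attempt to solve the current problem of information overload. Because text is the semantic carrier of natural language, introducing natural language processing techniques to mine the semantic features of text in depth can improve the accuracy and efficiency of text mining algorithms. This thesis mainly studies the application of natural language processing techniques to problems in text clustering and information retrieval systems. For text data mining tasks in the search engine and Internet environment, it proposes a series of methods based on natural language processing techniques to improve the effectiveness of text clustering algorithms and the relevance of query results to the query in information retrieval systems. The main content of the thesis covers the following four aspects.

The thesis first proposes a semantic feature dimension reduction method for text clustering based on natural language processing techniques. Text clustering is an unsupervised data mining method, and compared with supervised text classification it usually has no very effective method for feature selection. The influence of different features on the clustering results therefore cannot be effectively controlled; when the dimensionality is too high, the accuracy of the clustering results is easily affected by noise features. The thesis proposes a feature dimension reduction method based on lexical analysis: by extracting the nominal words in the text as clustering features, it reduces the dimensionality of the features in the text collection while maintaining their discriminative power. Because nouns exhibit synonymy, the same meaning can be expressed by different words, which distorts the measurement of text similarity. By using a semantic knowledge dictionary to expand words to their categories, the thesis reduces feature synonymy to some extent, which further reduces the feature dimensionality and improves the accuracy of the clustering results. Experiments show that the feature dimension reduction method based on lexical analysis and semantic knowledge dictionary expansion significantly reduces the size of the text feature space while improving the accuracy of the clustering results.

Given the shortcomings of the linear result list of a search engine, clustering the search results is a more effective way to present them. The document set targeted by search result clustering consists of the summary descriptions (snippets) of the search results. Although these snippets are clear, they are short, and when clustering such a text collection, ordinary document similarity algorithms often fail to produce accurate results because the feature space is sparse. By introducing tolerance rough sets, the thesis uses word co-occurrence information across documents to semantically expand the original result snippets. The correlation between the expanded documents is strengthened, which avoids the drop in clustering accuracy caused by a sparse feature space. For the clustering algorithm, the thesis proposes a new label-based clustering algorithm built on word relatedness computation, which converts the search result clustering problem into the problem of disambiguating the sense of the query terms over the search result set. This algorithm generates labels that are more clearly descriptive and more discriminative, and the results associated with each label are also more consistent in content. Experiments show that the proposed search result clustering algorithm uncovers the different senses of the user's query in the search results, helping users quickly locate the set of documents they need.

Text clustering algorithms usually use the vector space model to represent text formally, and in the vector space model the features are assumed to be unrelated to one another. For text, this assumption discards much valuable information that could measure the similarity between documents, which lowers clustering accuracy. Compared with independent single-word features, sets of words that occur frequently across different documents better reflect the degree of similarity between documents. The thesis uses closed frequent wordsets under contextual constraints to measure the similarity between documents, which better captures the deep latent semantic relations between them. Frequent itemset mining is a classic data mining technique for association analysis. With modifications, the thesis applies a frequent itemset mining algorithm to text collections to mine the frequent wordsets of a document set, and imposes different contextual distance constraints on the discovered frequent wordsets so that the frequent patterns stay semantically consistent, reflecting the ways text differs from structured data. Experiments show that a text clustering algorithm based on this new similarity measure produces more accurate clustering results.

Relevance ranking of search results is one of the important research topics in information retrieval. Unlike traditional text data, Web pages usually carry a large amount of noise unrelated to the topic, which seriously affects the relevance of query results. The thesis therefore uses Web page parsing and content extraction based on content units to first purify each page, reducing the effect of content-irrelevant information on retrieval relevance. At present, the vast majority of information retrieval systems compute relevance on the basis of the full text. However, the full text of a Web page is often inconsistent in what it expresses and contains content unrelated to the topic, which also affects the relevance of query results to some extent. The thesis proposes a method that improves retrieval quality by computing the relevance between the user query and an automatic summary of the purified Web page. Compared with the full text, a summary is the core content extracted from the document; it is concise, accurate, and clear, and better reflects the document's topic. Experiments show that, compared with the full text, summary-based retrieval achieves better accuracy in relevance ranking.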
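The snippet expansion with tolerance rough sets mentioned in the abstract can be illustrated with a minimal Python sketch. It follows the commonly used tolerance rough set formulation, in which two terms are tolerant of each other when they co-occur in at least a threshold number of documents, and a document's upper approximation contains every term whose tolerance class overlaps the document. The function names, the threshold `theta`, and the toy data are assumptions made for illustration; the abstract does not give the thesis's exact formulation.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the terms that co-occur with it in >= theta documents.

    `docs` is a list of token lists. Every term belongs to its own class.
    (Illustrative helper; not taken from the thesis.)
    """
    cooccurrence = defaultdict(int)
    vocabulary = set()
    for doc in docs:
        terms = set(doc)
        vocabulary |= terms
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[(a, b)] += 1
    classes = {term: {term} for term in vocabulary}
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return every term whose tolerance class shares a term with `doc`."""
    terms = set(doc)
    return {term for term, cls in classes.items() if cls & terms}


# Toy snippets, invented for this example.
snippets = [
    ["jaguar", "car"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat"],
    ["car", "speed"],
]
classes = tolerance_classes(snippets, theta=2)
expanded = upper_approximation(["jaguar", "car"], classes)
```

Here the short snippet `["jaguar", "car"]` gains the term `speed`, because `speed` co-occurs with `car` in two snippets. The expanded snippet therefore overlaps with more of the collection than the original one, which is the effect the abstract attributes to the expansion.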

【Abstract】 With the rapid development of the Internet, the volume of text-based information grows day by day, and people urgently need to access that information more efficiently. Text mining tries to solve this problem of "information overload". Text is the semantic carrier of natural language, so introducing natural language processing (NLP) techniques into the text mining process to handle the semantic features of text can be expected to improve text mining algorithms. This thesis focuses on applications of NLP techniques in text clustering and information retrieval. For text mining tasks in the Web and search engine environment, it proposes a series of NLP-based methods to improve the quality of text clustering algorithms and the relevance of search results to the user's query in Web-based information retrieval systems. The thesis consists of the following four parts.

First, the thesis proposes an NLP-based semantic feature reduction method for text clustering. Unlike supervised text categorization, text clustering is an unsupervised data mining method, and few effective feature reduction methods exist for it. The influence of different features on the quality of the clustering results is therefore hard to control. If the dimension of the feature space is too large, the accuracy of the clustering results is easily affected by noise features. The thesis proposes a feature reduction method based on lexical analysis that selects noun features, which significantly reduces the dimension of the feature space while preserving most of its discriminative power. Many nouns are synonymous, so that different words share the same meaning, which causes inaccuracy in the document similarity measure. To solve this problem, the thesis uses a semantic dictionary to map each remaining feature to its higher-level semantic category, leading to a smaller feature space and more accurate clustering results.

To address the deficiencies of the ranked result list returned by a search engine, clustering the search results is a more suitable way to present them. Search result snippets are clear and concise, but short. Similarity measures on such short texts usually give poor results because of the sparseness of the feature space. The thesis uses tolerance rough sets to extend the original feature space to its semantically approximate upper feature space based on word co-occurrences. In the new feature space, the latent similarity between documents is strengthened. The thesis also presents a new label-based search result clustering algorithm built on the correlation between words, which transforms the problem of search result clustering into query sense disambiguation. This method generates more descriptive and discriminative labels for each cluster while keeping documents in the same cluster consistent in content. Experiments show that this clustering method helps users find the different senses of their queries in the search results and easily locate the subset of results that matches their information needs.

The vector space model (VSM) is usually adopted as the text representation in text clustering, where the features are assumed to be independent. This assumption loses a lot of useful information when measuring the similarity between documents. Compared with single independent features, wordsets that occur frequently across many documents are stronger indicators of the similarity between documents. The thesis measures the similarity between documents based on contextually constrained closed frequent wordsets, a more suitable feature unit for reflecting the latent relations between documents. Frequent itemset mining is a technique adopted from data mining, where it is used for association analysis in structured transaction databases. In this thesis it is modified for text clustering and constrained by different contextual proximities to make the wordsets more semantically consistent. Experimental results show that the clustering algorithm based on this new document similarity measure produces more accurate clustering results.

Ranking search results by relevance is an important topic in information retrieval. Unlike traditional text documents, Web pages contain a lot of noise, which strongly affects the relevance of results. In this thesis, Web pages are therefore first purified through page analysis and content extraction based on the concept of the content unit, which reduces the impact of noise present at the structural level of Web pages. Most information retrieval systems base their relevance computation on full-text analysis, but the full text of a Web page often contains inconsistent, topic-irrelevant content, which can also degrade the relevance of results. The thesis proposes a summarization-based method that computes the relevance between the query and a summary of the page instead of the full text. A summary captures the core of the full-text document and is more consistent in representing its topic, being concise, accurate, and clear. Experiments show that summarization-based relevance computation leads to more accurate relevance ranking of search results.
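The idea of measuring document similarity through frequent wordsets under a contextual proximity constraint can be sketched in Python as follows. This is a deliberately simplified illustration, not the thesis's algorithm: it mines only two-word sets, performs no closedness check, and uses Jaccard overlap as the similarity. The names `window`, `min_support`, and all functions are assumptions introduced for the example.

```python
from collections import Counter


def window_pairs(tokens, window):
    """Unordered word pairs whose positions are at most `window` apart."""
    pairs = set()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 1 + window]:
            if first != second:
                pairs.add(frozenset((first, second)))
    return pairs


def frequent_pair_sets(docs, window, min_support):
    """For each document, keep the proximity-constrained pairs that occur
    in at least `min_support` documents of the collection."""
    per_doc = [window_pairs(doc, window) for doc in docs]
    support = Counter(pair for pairs in per_doc for pair in pairs)
    frequent = {pair for pair, count in support.items() if count >= min_support}
    return [pairs & frequent for pairs in per_doc]


def similarity(pairs_a, pairs_b):
    """Jaccard overlap of two documents' frequent pair sets."""
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


# Toy documents, invented for this example.
documents = [
    "text clustering uses frequent wordsets".split(),
    "frequent wordsets improve text clustering".split(),
    "search engines rank web pages".split(),
]
features = frequent_pair_sets(documents, window=2, min_support=2)
score = similarity(features[0], features[1])
```

The first two documents share the nearby pairs (`text`, `clustering`) and (`frequent`, `wordsets`), so they score as similar even though their word order differs, while the third document shares no frequent pair with either. A fuller treatment would mine wordsets of any size and keep only the closed ones, as the abstract describes.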
