基于WordNet的语义相似性度量及其在查询推荐中的应用研究

Research on Semantic Similarity Metric Based on WordNet and Its Application in Query Suggestion

【Author】 Meng Lingling (孟玲玲)

【Advisor】 Gu Junzhong (顾君忠)

【Author Information】 East China Normal University, Computer Application Technology, 2014, PhD

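【Abstract (translated from the Chinese)】 Semantic similarity measurement has long been a research focus in artificial intelligence, psychology, cognitive science and related fields, and it has a very wide range of applications. As an important part of natural language processing, it relies on linguistic knowledge representation, whose most important initial component is the semantic dictionary. A semantic dictionary that can express relations between concepts is an indispensable basic resource for natural language processing. WordNet, developed at Princeton University in the United States, is an outstanding example of a semantic dictionary. Its basic idea is simple and clear, and it is thoroughly formalized. WordNet has become a de facto international standard, and the soundness of its framework is recognized by the lexical semantics and computational lexicography communities.

Meanwhile, with the explosive growth of data, people rely more and more on search systems to obtain information, and query suggestion has become a research focus in search in recent years. It can compensate for the limited expressive power of current Web search and help users express their query intent better. As applied research on query suggestion has deepened, the sparsity of query-term information and the severe lack of internal information have posed many challenges and seriously restricted its further adoption. Extending results from semantic similarity measurement to query suggestion can effectively address key problems such as query-term sparsity, and is an important direction for future development.

Against this background, the dissertation first reviews the state of research on semantic similarity measurement and query suggestion in China and abroad. Representing data at the semantic level and centering on semantic similarity measurement, it builds an information content (IC) model based on the topology of WordNet concepts, proposes a semantic similarity algorithm that combines a concept's own IC with path information, and then applies the algorithm to determining similar queries. The main innovations and contributions are as follows:

1. On parameters for WordNet semantic similarity, the dissertation proposes an IC model based on concept topology. The IC of a concept is a parameter of concept similarity algorithms and has a decisive effect on their performance. The proposed model requires no corpus: the information content of a concept node depends on the topology of the node and its descendants. The IC value is a function of how the node and its descendants are arranged, including the depth of the node, the number of its descendants, and the depth of each descendant. Experiments show that the model clearly outperforms other IC models and distinguishes different concepts effectively, yielding more precise IC values.

2. On WordNet semantic similarity measures, the dissertation proposes a semantic similarity algorithm that combines a concept's own IC with path information. The algorithm reflects both the path information of concept nodes in the semantic taxonomy tree and semantic density information; that is, it takes into account both the IC of the concepts and their paths in the taxonomy tree. Experiments confirm that the algorithm is closer to human judgment and performs better than existing algorithms in the literature.

3. On measuring similar queries, the dissertation proposes a semantics-based similar-query measure. Measuring query similarity is the core problem for subsequent query suggestion. The method represents data at the semantic level and considers both the similarity of users' search terms and the similarity of the content of the documents users click. Building on this method, similar queries are clustered experimentally to form a similar-query expansion dictionary. Experiments show that the algorithm captures similar queries more precisely, laying a good foundation for subsequent query suggestion.

4. On query suggestion methods, the dissertation proposes a topic-based query suggestion method. The method considers factors such as the relatedness between the user's query topic and the queries in the session, the semantic containment relation between the suggested query and the initial query, and their degree of similarity. Experiments confirm that the proposed method captures user query intent more accurately and substantially improves search accuracy.

The results of this work have academic and theoretical value and have been applied, with initial success, in information retrieval. In the future they could be extended to web page classification, question answering systems, advertisement delivery, e-commerce and other information domains, giving them high commercial value and broad application prospects.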
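As an illustration of what a corpus-free IC can look like, the sketch below computes a well-known intrinsic IC, IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of descendants of c and N is the total number of concepts. This is the simpler descendant-count formulation from earlier literature, not the dissertation's model, whose exact formula (which also uses node depth and the depth of each descendant) is not given in the abstract. The taxonomy, `TAXONOMY`, `count_descendants` and `intrinsic_ic` are hypothetical names chosen for this example.

```python
import math

# Hypothetical toy taxonomy (parent -> children); not taken from WordNet.
TAXONOMY = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": ["tree"],
    "dog": [],
    "cat": [],
    "tree": [],
}


def count_descendants(node, taxonomy):
    """Count the nodes strictly below `node` (assumes the taxonomy is a tree)."""
    return sum(1 + count_descendants(child, taxonomy) for child in taxonomy[node])


def intrinsic_ic(taxonomy):
    """Descendant-count IC: 1 - log(hypo(c) + 1) / log(N) for every concept c."""
    total = len(taxonomy)
    return {
        node: 1.0 - math.log(count_descendants(node, taxonomy) + 1) / math.log(total)
        for node in taxonomy
    }


IC = intrinsic_ic(TAXONOMY)
for concept in sorted(IC, key=IC.get):
    print(f"{concept}: {IC[concept]:.4f}")
```

In this formulation the root carries no information and every leaf carries the maximum, so all leaves are indistinguishable. A model that also accounts for depth, as the abstract describes, would presumably be able to separate such concepts.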

【Abstract】 Semantic similarity measurement has been an active research topic in artificial intelligence, psychology, and cognitive science for many years, and it has been applied successfully in many fields. It is a key issue in natural language processing, where its most important foundation is the semantic dictionary. A semantic dictionary that can express the relations between concepts is an indispensable resource. WordNet, developed at Princeton University, is an excellent example. Its basic idea is simple and clear. WordNet has become a de facto international standard, and the soundness of its framework is recognized in the fields of lexical semantics and computational lexicography.

At the same time, with the explosive growth of data, more and more people rely on search engines to obtain information. Query suggestion, which helps users articulate their query intent, has become an active research topic. As query suggestion grows in importance, the sparsity of query information poses many challenges and seriously restricts its further application. Applying semantic similarity measures to query suggestion is an effective solution and an important direction for further research.

Based on the discussion above, the dissertation represents data at the semantic level and focuses on the semantic similarity of concepts. The semantic similarity measure is then applied to measuring the similarity of queries. The main contributions of this dissertation are as follows.

1. The dissertation proposes an information content (IC) model in WordNet based on concept topology. Unlike previous work, the new model is corpus independent. The information content of a concept is a function of the topology of the concept and its descendants. Experiments show that the new model provides more accurate similarity evaluation and performs significantly better than related work.

2. The dissertation proposes an effective algorithm for measuring the semantic similarity of word pairs in WordNet. Unlike previous work, the new algorithm takes into account not only path length but also IC values, which lets it distinguish different concept pairs effectively. The algorithm is evaluated on the Rubenstein and Goodenough data set, a traditional and widely used benchmark. Correlation coefficients between human similarity ratings and the scores of seven algorithms are calculated. The correlation coefficient between the proposed algorithm and human judgment is 0.8820, which shows that the new algorithm significantly outperforms the others.

3. The dissertation proposes a query similarity algorithm based on semantic analysis. Unlike previous work, the new algorithm represents data at the semantic level. It takes full account of keyword and user clickthrough information to mine the relations between queries. Experiments show that clustering queries with the new algorithm captures similar queries more accurately than related work.

4. The dissertation presents a topic-oriented query suggestion algorithm. Unlike previous work, the new algorithm takes full account of the semantic relations between queries, their similarity values, the query context and other factors, and then suggests similar queries to the user. Experiments show that the new algorithm effectively improves the precision of Web search.

The results of this dissertation have high academic value and have been applied successfully in information retrieval. They can also be extended to web page classification, question answering systems, advertisement delivery, e-commerce and other areas, which indicates considerable commercial value and broad application prospects.
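The evaluation described in contribution 2 compares algorithm scores with human similarity ratings through a correlation coefficient. The sketch below shows that computation, assuming Pearson's r (the abstract does not say which coefficient was used). The function name `pearson` and the numbers in the usage lines are hypothetical placeholders, not values from the Rubenstein and Goodenough data set or from the dissertation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only: one entry per word pair.
human_ratings = [0.5, 1.5, 2.5, 3.5]   # hypothetical human judgments
model_scores = [0.1, 0.3, 0.5, 0.7]    # hypothetical algorithm outputs
print(f"r = {pearson(human_ratings, model_scores):.4f}")
```

A coefficient closer to 1 means the ranking and spacing of the algorithm's scores follow the human ratings more closely, which is the sense in which the reported 0.8820 is compared against the other algorithms.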
