Research on Techniques for Semantics-Based Web Knowledge Acquisition
The Research on Web Knowledge Acquirement Based on Semantic Techniques
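【Author】 Guo Yong (郭勇)
【Advisor】 Zhang Weiming (张维明)
【Author information】 National University of Defense Technology, Management Science and Engineering, 2007, Ph.D.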
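【Abstract, translated from the Chinese】 With the rapid development of the Internet, the Web now holds massive, heterogeneous, semi-structured, and dynamic information resources, and more than 80% of this information exists as Web text. How to find and extract valuable information and knowledge patterns from this vast body of Web information has become a pressing problem in information processing. Semantics-based Web knowledge acquisition helps address it: it can make users' online searches more efficient, sort search results into categories, help users locate the knowledge they are after quickly, and extract valuable knowledge from the results. Building on an analysis of the state of the art and the open problems in Web knowledge acquisition, this dissertation studies concept semantics generation, text classification methods, methods for generating typical user session profiles, and concept-based approximate query techniques. The main results are as follows.
(1) Because NMF is simple to implement and both the form and the results of its decomposition are interpretable, a concept semantics generation method based on NMF is proposed. By analogy with image decomposition, a text vector is treated as an image and the value of a feature term as the gray level of a pixel, and NMF is applied to extract the concept semantics of text vectors, which offers a new approach to large-scale text processing. Experimental results and a comparison with related work show that the concept semantics generated by NMF accurately reflect the local features of the samples and help resolve the ambiguity inherent in natural language representation.
(2) The concept semantic vectors generated by NMF are applied to Web text classification. Because the local concept semantic vectors generated by NMF correspond directly to the features of the samples and capture what is distinctive about the texts in each category, they discriminate better than global concept semantic vectors, which capture the features common to all texts. Experiments compare how constructing local versus global concept semantic spaces affects classification results, and they show that classification is more accurate in the local concept semantic space generated by NMF.
(3) Based on the characteristics of NMF when decomposing large-scale text matrices, an NMF-based method for discovering typical user session profiles is proposed. NMF is applied to decompose the term-text matrix to obtain the correlations between terms. On this basis, the notions of semantic vector and weight vector are introduced, and user profiles are extracted by defining the category closeness of semantic vectors. To keep the concept semantic vectors orthogonal and reduce their redundancy, LNMF, a variant of NMF, is chosen for dimensionality reduction, and an LNMF-based algorithm for extracting typical user session profiles is designed. Because the concept semantic vectors obtained by LNMF are as orthogonal as possible, experimental analysis shows that the LNMF method clusters well and is suited to discovering typical user session profiles.
(4) To address the shortcomings of ontology concept approximate queries based on the least upper bound and greatest lower bound of a concept, the best approximation of a concept is defined. Using the implication relations between complex concepts, the notions of multi-element bound and most concise multi-element bound are introduced. Related properties and theorems prove that the best approximation of a concept can be obtained through multi-element bounds, which turns the problem of finding the best approximation into the problem of finding the most concise multi-element bound. On this basis, an ontology concept approximate query method based on the most concise multi-element bound is proposed. It effectively eliminates redundancy in query rewriting and improves both the quality of approximate queries and the efficiency of query rewriting.
(5) An algorithm for computing the most concise multi-element least upper bound of a concept is given. Measures that use an iterative incremental process and the concept hierarchy to reduce the search space and improve efficiency are discussed in detail, the correctness and completeness of the algorithm are proved, and its effectiveness is analyzed.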
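The NMF step described in item (1) can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes the standard multiplicative update rules for the Frobenius-norm objective, and the toy term-document matrix, the term list, and the function name `nmf` are all made up for illustration. Each column of `W` plays the role of one concept semantic vector over the terms, and each column of `H` gives a document's weights on those concepts.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (terms x docs) as W @ H.

    Uses multiplicative updates for the squared Frobenius error
    (an assumed, standard choice; the source does not specify one).
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = V.shape
    # Random positive initialisation keeps every factor non-negative.
    W = rng.random((n_terms, rank)) + eps
    H = rng.random((rank, n_docs)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["goal", "match", "team", "stock", "market", "price"]
V = np.array([
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [3, 1, 0, 1],
    [0, 0, 2, 3],
    [0, 0, 3, 1],
    [0, 1, 1, 2],
], dtype=float)

W, H = nmf(V, rank=2)

# List the three highest-weighted terms of each concept vector.
for k in range(W.shape[1]):
    top = np.argsort(W[:, k])[::-1][:3]
    print(f"concept {k}:", [terms[i] for i in top])
```

Because every entry of `W` and `H` stays non-negative, each document is rebuilt as an additive combination of concept vectors, which is the interpretability property the abstract refers to.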
【Abstract】 With the rapid development of the Internet, the Web has accumulated abundant, heterogeneous, semi-structured, and dynamic information resources, more than 80 percent of which exist as Web text. How to find and extract valuable information and knowledge patterns from these vast resources has become an urgent problem in information processing. Web knowledge acquisition can address this problem effectively: it classifies search results, which makes searching more efficient for Web users, helps them locate the knowledge they are looking for, and extracts valuable knowledge. Based on an analysis of the current state of research and the open problems in Web knowledge acquisition, this dissertation studies the key technologies of concept semantics generation, text classification methods, user profile construction, and concept-based approximate query techniques. The main contributions are as follows.
(1) Because NMF is simple to implement and its decomposition results are interpretable, a concept semantics generation method based on NMF is proposed. By analogy with image decomposition, NMF is applied to extract concept semantics from text vectors, providing a new approach to large-scale text processing. Experimental results and a comparison with related work indicate that the concept semantics produced by NMF accurately reflect the local features of the samples, which helps resolve the ambiguity of natural language expression.
(2) Text classification based on NMF is studied. The local concept semantic vectors produced by NMF have stronger classification ability than global concept semantic vectors, because they correspond directly to the features of the samples and reflect the distinctive characteristics of the texts in each class. An experiment compares how constructing the local and the global concept semantic spaces affects classification results. The results indicate that classification in the local concept semantic space produced by NMF is more precise.
(3) Taking advantage of the efficiency with which NMF decomposes large-scale text matrices, an NMF-based method for constructing typical user session profiles is presented. The term-text matrix is decomposed with NMF to capture the relations between terms. Then the concepts of semantic vector and weight vector are introduced, and the class closeness degree is defined to extract user profiles. To keep the concept semantic vectors orthogonal and reduce their redundancy, LNMF is used for dimensionality reduction. Because the concept semantic vectors obtained by LNMF are as orthogonal as possible, the experimental results show that the LNMF method not only improves filtering precision markedly but also clusters well.
(4) For query reformulation, an ontology concept approximate query method based on the most concise multi-element bound is proposed. First, the best approximation of a concept is defined. Using the implication relations between complex concepts, the multi-element bound and the most concise multi-element bound are defined, which makes it possible to obtain the best approximation of a concept from its multi-element bounds. The problem of finding the best approximation is thus transformed into that of finding the most concise multi-element bound. Related properties and theorems show that the method effectively reduces redundancy in query reformulation and improves the quality and efficiency of approximate queries.
(5) An algorithm for computing the most concise multi-element least upper bound of a concept is proposed. The detailed procedure and the methods used to reduce the search space and improve efficiency are discussed. Finally, the correctness and completeness of the algorithm are proved.
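For orientation, the conventional least-upper-bound approximation that item (4) sets out to improve on can be sketched as follows. This is only the baseline idea, not the dissertation's multi-element bound algorithm, and the toy hierarchy, the concept names, and the function names are hypothetical. Given a concept that the target ontology cannot express, the sketch collects its ancestors that belong to the target vocabulary and keeps only the most specific ones.

```python
from typing import Dict, Set

def ancestors(concept: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every strict ancestor of `concept` in the subsumption hierarchy."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def least_upper_bounds(concept: str,
                       parents: Dict[str, Set[str]],
                       vocabulary: Set[str]) -> Set[str]:
    """Most specific ancestors of `concept` that the target vocabulary contains."""
    uppers = ancestors(concept, parents) & vocabulary
    # Drop any upper bound that is itself an ancestor of another upper bound.
    return {
        u for u in uppers
        if not any(u in ancestors(v, parents) for v in uppers if v != u)
    }

# Hypothetical hierarchy: each concept maps to its direct superconcepts.
parents = {
    "Vehicle": {"Thing"},
    "ElectricDevice": {"Thing"},
    "Car": {"Vehicle"},
    "ElectricCar": {"Car", "ElectricDevice"},
}
# Concepts the target ontology understands; "ElectricCar" is not among them.
vocabulary = {"Thing", "Vehicle", "ElectricDevice", "Car"}

bounds = least_upper_bounds("ElectricCar", parents, vocabulary)
print(sorted(bounds))
```

In this toy case the query concept would be rewritten as the conjunction of the returned concepts. The more general ancestors are discarded because including them would only add redundant conjuncts to the rewritten query.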
【Key words】 concept semantics; NMF; text classification; information extraction; user profile construction; approximate query;