文本挖掘算法及其在知识管理中的应用研究

Text Mining Algorithms and Their Applications in Knowledge Management

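【Author】 宣照国 (Xuan Zhaoguo)

【Advisor】 党延忠 (Dang Yanzhong)

【Author information】 Dalian University of Technology, Management Science and Engineering, 2008, Ph.D.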

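【Summary】 With the arrival of the knowledge economy, knowledge management plays an increasingly important role in society and the economy. Most knowledge management research serves enterprises, and very little addresses scientific research management departments; this dissertation studies knowledge management in China's scientific research management departments. Compared with other fields, knowledge management in these departments has distinctive features. For example, they administer project proposals that contain a large amount of knowledge. Mining and using the knowledge in the proposals can provide decision support for research management at three levels: scientific research as a whole, individual disciplines, and project management. The knowledge is implicit in the proposal content, and mining it raises the following problems: the knowledge representation of proposals cannot rely entirely on a dictionary; the research content of a proposal does not fully match the discipline under which it is submitted; and the structure of the subject code system is not fully consistent with the structure of actual research fields. To address these problems, the dissertation makes the following contributions. First, it proposes the bridge-pattern filtering algorithm (BPFA), which extracts high-frequency words without relying on a dictionary. The algorithm first uses N-gram techniques to obtain the character combination patterns in a text and their frequencies, then removes the bridging frequencies to obtain each pattern's support frequency, and uses the support frequency to identify and extract correct words. Experimental results show that BPFA effectively improves the precision and recall of word segmentation. The algorithm suits Chinese information processing tasks that are sensitive to word frequency. Here it is applied to extract new terms appearing in proposals and add them to the system vocabulary. Second, coarsely classified data contain noisy items whose text content does not match their category label, and this noise degrades the accuracy of text classification. The dissertation proposes a noise correction algorithm for coarsely classified data. It first builds a document association network, takes the categories labeled on the documents as the initial community structure, and measures the quality of the community structure with modularity; optimizing modularity moves noisy items into the correct categories and thus improves data quality. Experimental results show that the algorithm corrects noise in coarsely classified data effectively and robustly. It can be used to preprocess training data for text classification or as a supporting technique in tasks such as building literature databases. The proposals submitted under each subject code are treated as coarsely classified data, and the algorithm moves proposals that do not match their code to the correct code. Code models are then built from the adjusted data to analyze the intension and extension of the research field each code represents and the overlaps between codes. Third, it proposes a fast clustering algorithm based on common connection strength. Community connection strength is defined from the similarity relations between community members, a new similarity measure is defined from the common connection strength of communities, and an agglomerative clustering algorithm is built on this measure. Because the similarity computation takes into account both the internal and the external structural relations of communities, it avoids the clustering errors to which other algorithms are prone in the early stage of clustering. Clustering experiments on both topological (unweighted) and weighted data show that the proposed algorithm is more effective than the others. The algorithm is applied to cluster the proposals into project classes, and the relations between project classes and subject codes are analyzed. Building on these theoretical and methodological results, the dissertation carries out an application study of fund management at the National Natural Science Foundation of China, analyzing the overall state and development patterns of basic scientific research in China and the research status of, and relations among, individual disciplines, to provide decision support for development planning, development strategy, adjustment of the subject code system, and project management.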
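
The first BPFA step summarized above, collecting character combination patterns and their frequencies with N-grams, can be sketched roughly as follows. This is an illustration only, not the dissertation's implementation: the names `ngram_counts`, `support_frequency`, and `max_n` are hypothetical, and because the abstract does not define how the bridging frequency is computed, the sketch substitutes a simple stand-in (subtracting the count of the most frequent pattern that is one character longer and contains the candidate).

```python
from collections import Counter


def ngram_counts(text, max_n=4):
    """Count every character n-gram of length 1..max_n in `text`."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def support_frequency(counts, pattern):
    """Stand-in for the support frequency (an assumption, not the BPFA rule).

    Returns the raw frequency of `pattern` minus the frequency of the most
    frequent counted pattern that is one character longer and contains it.
    Patterns of the maximum counted length keep their raw frequency.
    """
    n = len(pattern)
    longer = [c for p, c in counts.items() if len(p) == n + 1 and pattern in p]
    return counts[pattern] - max(longer, default=0)


text = "知识管理知识管理知识管理"
counts = ngram_counts(text, max_n=4)
print(counts["知识管理"], support_frequency(counts, "知识管理"))  # 3 3
print(counts["知识"], support_frequency(counts, "知识"))          # 3 0
```

In this toy text "知识" never occurs outside "知识管理", so the stand-in leaves it with no support of its own, while the full term keeps its frequency. A real filter would also need a threshold on the support frequency and a treatment of patterns at the maximum length, neither of which the abstract specifies.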

【Abstract】 With the advent of the knowledge-based economy, Knowledge Management (KM) plays a growing role in social and economic life. Most KM research focuses on enterprises, and little work addresses the KM problems of Scientific Management Departments (SMDs). This dissertation studies KM in the SMDs of China. KM in SMDs differs from KM in other domains. For instance, the SMDs of China hold many research proposals that contain a great deal of knowledge. Mining and using the knowledge in these proposals can provide decision support for the SMDs at three levels: scientific research as a whole, individual subject fields, and research projects. The knowledge is contained in the contents of the proposals. To discover it, several problems have to be solved: the knowledge representation of research proposals cannot rely fully on a thesaurus; the contents of a proposal are not completely consistent with the subject field under which it is submitted; and the structure of the subject coding system is not identical to that of the actual research fields. To address these issues, the following three pieces of work are carried out. Firstly, a bridge-connection pattern filtering algorithm is presented for extracting high-frequency words without a thesaurus. The frequencies of the co-occurrence patterns of Chinese characters are counted from the documents. The support frequencies of the patterns are obtained by eliminating the bridge-connection frequencies. Words are identified and extracted more accurately from the support frequencies than from the raw occurrence frequencies. The algorithm is suited to Chinese information processing tasks that are sensitive to word frequencies. Using this algorithm, new terms that do not exist in the thesaurus can be extracted from the proposals and added to it. Secondly, a revision algorithm for noisy texts is presented to reduce the effect of noisy data on text classification results. The algorithm first constructs a document similarity network based on the similarities between document contents. The categories constitute the corresponding community structure in the network, and modularity is used to evaluate the quality of the categories. The noisy texts are revised by optimizing the modularity. The algorithm can be used in preprocessing for text mining or in taxonomy building. In this dissertation, the research proposals assigned to subject codes are regarded as texts with noise. Using the presented algorithm, proposals submitted under the wrong subject code are transferred to the correct one. From the revised data, models of the subject codes are built, and the intension and extension of the research area expressed by each code are determined. Moreover, the relationships between codes are analyzed. Finally, inspired by node similarity in social networks, a new measure named community similarity is defined on the basis of common connecting strengths, and a clustering algorithm is designed on this definition. In the initial stage each document is treated as a cluster. At each step, the two clusters with the largest similarity are merged. Because the relations both between and within clusters are taken into account, some merging errors are avoided and better clustering results are obtained. With this algorithm, the research proposals are clustered into subject categories, and the relations between subject categories and codes are analyzed. Building on these theoretical results, the dissertation studies application issues in the funds management of the National Natural Science Foundation of China. More specifically, it analyzes the overall trends and regularities of basic scientific research, the current state of all the subject fields, and the relations among them. This work provides decision support for establishing development programs and strategies, adjusting the subject coding system, and managing projects.
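
The noise correction idea in the abstract (category labels as the initial community structure, modularity as the quality measure, relabeling by optimizing modularity) can be illustrated with a minimal sketch. It is written under stated assumptions: the abstract does not say which optimization procedure is used, so the sketch uses a plain greedy pass that moves one document at a time to the label giving the highest modularity; the names `modularity`, `correct_labels`, and `max_passes`, and the toy similarity matrix, are hypothetical.

```python
def modularity(adj, labels):
    """Newman modularity Q for a weighted undirected graph.

    `adj` is a symmetric matrix (list of lists) of edge weights and
    `labels[i]` is the community of node i.
    """
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # twice the total edge weight
    q = 0.0
    for i in range(len(adj)):
        for j in range(len(adj)):
            if labels[i] == labels[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


def correct_labels(adj, labels, max_passes=10):
    """Greedy relabeling (an assumed procedure, not the dissertation's):
    move each node to the existing label that maximizes modularity and
    repeat until a full pass changes nothing."""
    labels = list(labels)
    classes = sorted(set(labels))
    for _ in range(max_passes):
        changed = False
        for i in range(len(labels)):
            best_label, best_q = labels[i], modularity(adj, labels)
            for c in classes:
                trial = labels[:i] + [c] + labels[i + 1:]
                q = modularity(adj, trial)
                if q > best_q + 1e-12:
                    best_label, best_q = c, q
            if best_label != labels[i]:
                labels[i] = best_label
                changed = True
        if not changed:
            break
    return labels


# Toy document network: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

noisy = ["A", "A", "B", "B", "B", "B"]  # document 2 carries the wrong label
fixed = correct_labels(adj, noisy)
print(fixed)  # ['A', 'A', 'A', 'B', 'B', 'B']
```

Recomputing modularity from scratch for every trial move keeps the sketch short but costs O(n^2) per trial; a practical version would update the modularity gain incrementally.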

  • 【Classification No.】 F272; F224
  • 【Times cited】 4
  • 【Downloads】 1223