Research on Chinese Word Segmentation and Text Classification in Distributed Text Knowledge Management
【Author】 Li Zhiguo (李志国);
【Advisor】 Wu Zhongfu (吴中福);
【Author information】 Chongqing University, Computer Software and Theory, 2008, PhD
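【Abstract (translated from the Chinese)】 We live in the era of the knowledge economy: following the traditional factors of land, natural resources, capital and labour, knowledge is becoming an important force driving social progress and development. The knowledge economy calls for management models and theories suited to it, together with effective technical means. Against this background, this dissertation studies Chinese word segmentation and text classification, the foundational techniques of text knowledge management, and proposes an architecture for a distributed knowledge management system. The main contributions are as follows.
(1) An adaptive word segmentation algorithm is proposed. The difficulty of Chinese word segmentation lies in resolving ambiguity and recognizing out-of-vocabulary words. Traditional dictionary-matching algorithms depend heavily on how representative the dictionary is and cannot recognize new words effectively, especially in knowledge management for specific industries and domains. Based on a 2-gram statistical model, this dissertation implements a segmentation algorithm that adapts well to the corpus, with running time and accuracy that meet the needs of text knowledge management systems. A divide-and-conquer strategy handles sentence length and word length, and local and global probabilities are combined to recognize new words and resolve ambiguity. This gives good results and lets the algorithm adapt automatically to knowledge management in different industry domains.
(2) A new classification algorithm based on a dimension-reduced proximal support vector machine (PSVM) is proposed. The main difference between the proximal SVM and the standard SVM lies in the constraints of their optimization problems: the SVM reduces classification to a quadratic program with linear inequality constraints, whereas the PSVM reduces it to a quadratic program with linear equality constraints only. The time and space complexity of the algorithm are proved theoretically to be lower than those of the traditional SVM algorithm, and a new learning algorithm is proposed on this basis. Experiments show that the new algorithm performs relatively well compared with the main classification algorithms. Although its accuracy is somewhat lower than that of the standard SVM, it trains faster, so it can meet the demanding requirements of text knowledge management systems that are sensitive to training time and must process large volumes of text, which gives it considerable practical value.
(3) An ontology-based hierarchical text classification algorithm is proposed. The classification problem usually discussed is flat (single-level) classification, whereas hierarchical classification is classification under multi-level category relations. Text knowledge management systems in practice usually target specific industries and domains, and their content is fuzzy enough that a text may belong to several categories. Users place high demands on the relatedness of knowledge and on classification at multiple concept granularities, which requires a better multi-level organization of information. For the multi-level classification problem common in text knowledge management systems, the proposed method uses the system's knowledge ontology and controlled keyword vocabulary, and relies on the similarity between concepts to achieve precise classification, query and retrieval of texts. The method also applies to flat classification.
(4) A distributed text knowledge management system model is proposed. As organizations become more decentralized, effective distributed text knowledge management is becoming one of the trends in knowledge management. The proposed model applies Super-P2P technology to text knowledge management in order to solve the problems encountered by centralized text knowledge management, and the knowledge services provided by the model are studied and discussed.
Building on this work, and with the support of the Shanghai "Pudong Science and Technology Development Fund" and Baosight Software, we implemented eKnow, a workflow-driven text knowledge management system based on Super-P2P. The dissertation summarizes the design ideas, system framework and technical route of eKnow. The system has been applied in several cases and has yielded considerable economic benefits.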
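The 2-gram idea in contribution (1) can be illustrated with a minimal Python sketch. It is not the dissertation's adaptive algorithm: it assumes a toy vocabulary with made-up counts (`UNIGRAM`, `BIGRAM`), add-one smoothing, and a plain dynamic-programming search, and it omits the divide-and-conquer handling and the combination of local and global probabilities described in the abstract. The names `segment` and `max_len` are hypothetical.

```python
import math

START = ""  # sentinel for "beginning of sentence"; real words are never empty

# Toy counts, invented for illustration only (not from the dissertation).
UNIGRAM = {"研究": 10, "研究生": 5, "生命": 8, "命": 1, "起源": 6, "生": 1}
BIGRAM = {("研究", "生命"): 4, ("生命", "起源"): 5}


def segment(text, unigram, bigram, max_len=4):
    """Return the word sequence with the highest smoothed bigram score."""
    vocab_size = len(unigram)
    total = sum(unigram.values())

    def log_prob(prev, word):
        # Add-one smoothing, so unseen words and pairs keep a small probability.
        if prev == START:
            return math.log((unigram.get(word, 0) + 1) / (total + vocab_size + 1))
        pair = bigram.get((prev, word), 0)
        return math.log((pair + 1) / (unigram.get(prev, 0) + vocab_size + 1))

    # best[i][w] = (score, previous word) of the best path that ends at
    # character position i with w as its last word.
    best = [dict() for _ in range(len(text) + 1)]
    best[0][START] = (0.0, None)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Multi-character candidates must be known words; single
            # characters are always allowed so that a path always exists.
            if len(word) > 1 and word not in unigram:
                continue
            for prev, (score, _) in best[start].items():
                cand = score + log_prob(prev, word)
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, prev)

    # Walk the back-pointers from the end of the text to the start.
    words = []
    pos = len(text)
    word = max(best[pos], key=lambda w: best[pos][w][0])
    while pos > 0:
        words.append(word)
        prev = best[pos][word][1]
        pos -= len(word)
        word = prev
    return words[::-1]


print(segment("研究生命起源", UNIGRAM, BIGRAM))
```

On the ambiguous string 研究生命起源, the sketch prefers 研究 / 生命 / 起源 over 研究生 / 命 / 起源 because the toy bigram counts favor the first reading. A dictionary-only longest-match segmenter would pick 研究生 first, which is the kind of ambiguity the statistical model is meant to resolve.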
【Abstract】 We are in the era of the knowledge economy. Knowledge is joining the traditional factors of land, natural resources, capital and labour as a major force promoting social progress and development. The knowledge economy requires matching management models, theories and techniques. To meet this challenge, this dissertation studies Chinese word segmentation and text classification, and also presents a distributed knowledge management architecture. The main contributions are as follows.
(1) An adaptive Chinese word segmentation algorithm is presented. Recognizing new words and resolving ambiguity are the key problems in Chinese word segmentation. The results of traditional dictionary-based matching algorithms depend largely on how representative the dictionary is, so these algorithms cannot recognize new words effectively, especially in professional domains. The algorithm in this dissertation is based on a 2-gram statistical model and meets application requirements for both accuracy and efficiency. Long sentences and long terms are handled by a divide-and-conquer strategy, while local and global probabilities are combined to identify new words.
(2) A classification algorithm based on proximal support vector machines (PSVM) is proposed. The main difference between PSVM and the standard SVM lies in the constraints of the optimization problem: the SVM treats classification as a quadratic programming problem with linear inequality constraints, while PSVM treats it as a quadratic programming problem with linear equality constraints only. This dissertation describes a new PSVM training algorithm based on dimensionality reduction, which trains faster and needs less memory. Experiments on several data sets show that the new algorithm performs better under time-sensitive conditions, at the cost of some loss of accuracy compared with the standard SVM.
(3) A new ontology-based hierarchical text classification algorithm is presented. Text classification generally refers to flat text classification, whereas hierarchical text classification deals with classification under multi-level categories. Text knowledge management systems usually serve specific fields, and their content is ambiguous enough that a text may belong to several classes. Users demand text relevance and classification at multiple concept granularities, so better means of organizing hierarchical text are needed. The algorithm implements multi-granularity concepts in hierarchical classification by using the knowledge ontology and controlled keywords. It can also handle flat classification.
(4) A distributed knowledge management model based on Super-P2P is presented to address the problems of centralized knowledge management. As organizations become more distributed, effective distributed knowledge management has become one of the trends in knowledge management.
Based on the above research, eKnow, a Super-P2P-based text knowledge management software suite with integrated workflow, has been developed with the support of the Shanghai Pudong Science and Technology Development Fund and Baosight Co. Ltd. Its design ideas, system architecture and technical framework are summarized. The software has been used in several cases with substantial economic benefits.
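The remark in contribution (2) that PSVM needs only equality constraints has a practical consequence: in the standard linear PSVM formulation, training reduces to solving one small linear system instead of running a quadratic programming solver. The Python sketch below illustrates that standard formulation only. It does not include the dimensionality-reduction step of the dissertation's algorithm, and the names `psvm_train`, `psvm_predict`, `nu`, `A_DEMO` and `D_DEMO` are assumptions made for illustration.

```python
import numpy as np

# Toy, linearly separable data invented for illustration.
A_DEMO = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
                   [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
D_DEMO = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])


def psvm_train(A, d, nu=1.0):
    """Fit a linear PSVM.

    Minimizes nu/2 * ||xi||^2 + 1/2 * (w.w + gamma^2)
    subject to the equality constraints D (A w - e gamma) + xi = e,
    where D = diag(d) and e is a vector of ones.
    A has shape (m, n); d holds labels in {+1, -1}.
    """
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])  # E = [A, -e]
    # Eliminating xi gives the linear system (I/nu + E^T E) z = E^T D e,
    # with z = [w; gamma] and D e = d.
    z = np.linalg.solve(np.eye(n + 1) / nu + E.T @ E, E.T @ d)
    return z[:n], z[n]


def psvm_predict(A, w, gamma):
    """Classify rows of A by the sign of A w - gamma."""
    return np.where(A @ w - gamma >= 0, 1, -1)


w, gamma = psvm_train(A_DEMO, D_DEMO)
print(w, gamma, psvm_predict(A_DEMO, w, gamma))
```

The system solved here has size (n + 1) by (n + 1), where n is the number of features, regardless of the number of training documents. This is one reason reducing the feature dimension, as the dissertation proposes, would directly lower training time and memory for high-dimensional text data.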
【Keywords】 Knowledge Management; Chinese Word Segmentation; Text Classification; Hierarchical Text Classification; Ontology