

Research and Realization of Clustering Guided Web Chinese Text Classification Based on SVM

【Author】 Zhang Junyan (张俊艳)

【Advisor】 Lin Shiping (林世平)

【Author information】 Fuzhou University, Computer Application Technology, 2004, Master's thesis

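【Abstract, translated from the Chinese】 With the rapid development of the Internet, the volume of online information keeps growing. To provide efficient and accurate information services, the vast and disorderly information on the network must be organized and classified sensibly. Taking Web text processing as its setting, this thesis studies text clustering and classification methods in some depth at both the theoretical and the applied level. The thesis first describes the overall model of a text classifier, which covers five aspects: information preprocessing, feature representation, feature extraction, extraction of classification patterns by text mining (involving text clustering and classification), and quality evaluation of the patterns. It then introduces the theory and key techniques of word segmentation, feature extraction, text clustering, and text classification, in particular the extraction of clustering-guided, SVM-based classification patterns. Finally, a Chinese text classifier is constructed and implemented, and its performance is tested on real examples. The focus of the thesis is the extraction of classification patterns under the guidance of text clustering. Unlike traditional classifiers, when class information is lacking we use clustering in place of manual classification by domain experts to obtain suitable class information for building the classifier, with good results. In the clustering part, the k-means algorithm is improved to overcome its bias, so that its results are more evenly distributed and better reflect the regularity of a cluster, which improves classification accuracy. To address the high dimensionality and sparsity of the experimental data, we propose two clustering algorithms, HSMBK and HSSCA. (1) HSMBK applies the symmetric partition principle; adopts a new similarity measure, the Boolean feature sparse dissimilarity; applies the idea of selecting the best elements to the computation of cluster centers, yielding a new center computation that reduces the influence of outliers; and, following a heuristic approach, proposes the JW criterion as a basis for choosing the value of K. (2) HSSCA works in two phases: the first phase gathers the data into small clusters without requiring the number of clusters to be specified; the second phase merges the small clusters with agglomerative clustering to reach the required number of clusters. It adopts another new similarity measure, the set-based Boolean feature sparse dissimilarity. After experimental comparison of the three clustering algorithms, HSMBK, which gives the best clustering results, is chosen to guide the extraction of classification patterns. In the classification part, the thesis analyzes in theory the advantages of support vector machines for text classification, studies two specific SVM algorithms, C-SVC and V-SVC (ν-SVC), and validates them on examples. Finally, the design and implementation of the SVM-based Web Chinese text classifier are described in detail.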

【Abstract】 With the rapid development of the Internet, the volume of network information is growing quickly. To make information services more efficient and precise, the information on the Internet must be organized and classified sensibly. This thesis focuses on Web text information processing and studies text clustering and classification in depth at two levels, theory and application. First, a model of an automatic text classification system is described, covering five aspects: information preprocessing, feature representation, feature extraction, extraction of the classification model with text mining techniques (involving text clustering and classification), and evaluation of model quality. Second, the thesis introduces the theory and key techniques of word segmentation, feature extraction, text clustering, and text classification, especially the extraction of a clustering-guided classification model based on SVM. Finally, we construct a Chinese text classifier, implement it in code, and test it on real data. The core of the thesis is the extraction of the clustering-guided classification model. Unlike traditional classifiers, our work proceeds without class labels or class information: clustering replaces manual classification as the source of class information, and the results are good. In the clustering part, we modify k-means to overcome its bias, so that the clustering result is more evenly distributed and better reflects the character of each cluster; the modified algorithm improves classification accuracy. Because the data are high-dimensional and sparse, we propose the HSMBK and HSSCA algorithms to cope with the problem. (1) HSMBK uses the bisecting partition principle and adopts a new similarity measure, the "binary feature sparse dissimilarity". We apply the idea of selecting the best elements to the computation of cluster centers, which reduces the effect of isolated points. We also propose the JW rule, based on a heuristic idea, to guide the choice of K. (2) HSSCA has two phases: first, it gathers the data into small sub-clusters, without requiring the number of clusters in advance; second, it uses an agglomerative clustering algorithm to merge these sub-clusters until the required number of clusters is reached. It adopts another new similarity measure, the "set-based binary feature sparse dissimilarity". We validate the three clustering algorithms experimentally and select the best one, HSMBK, to guide the extraction of the classification pattern. In the classification part, we analyze in theory the advantages of the Support Vector Machine (SVM) for text classification. Two classical SVM algorithms, C-SVC and ν-SVC, are studied further, and their performance is compared on real data. Finally, we present in detail the design of the Web Chinese text classifier based on SVM.
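The two-phase scheme described for HSSCA (small clusters first, then agglomerative merging down to the required number) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract does not define the sparse dissimilarity measure, so Jaccard dissimilarity on sets of Boolean features is used here as a stand-in, the first phase is assumed to be a simple leader-style pass with a threshold, and all function names and the toy documents are hypothetical.

```python
from itertools import combinations


def jaccard_dissimilarity(a, b):
    """Stand-in measure (assumption): 1 - |a & b| / |a | b| on Boolean feature sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def phase_one(docs, threshold):
    """Leader-style pass (assumption): form small clusters without fixing K."""
    clusters = []  # each cluster is a list of document indices
    leaders = []   # feature set of the first document placed in each cluster
    for i, doc in enumerate(docs):
        for c, leader in enumerate(leaders):
            if jaccard_dissimilarity(doc, leader) <= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing leader is close enough, so start a new small cluster.
            clusters.append([i])
            leaders.append(doc)
    return clusters


def cluster_dissimilarity(docs, c1, c2):
    """Average-link dissimilarity between two clusters of document indices."""
    total = sum(jaccard_dissimilarity(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def phase_two(docs, clusters, k):
    """Agglomerative pass: merge the closest pair until k clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_dissimilarity(docs, clusters[p[0]], clusters[p[1]]),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a always holds, so index a stays valid
    return clusters


# Toy documents represented as sets of Boolean features (terms present).
docs = [
    {"svm", "kernel", "margin"},
    {"svm", "kernel", "margin", "classifier"},
    {"svm", "classifier", "hyperplane"},
    {"cluster", "kmeans", "centroid"},
    {"cluster", "kmeans", "centroid", "outlier"},
    {"cluster", "outlier", "agglomerative"},
]
small = phase_one(docs, threshold=0.4)
final = phase_two(docs, small, k=2)
print(small)
print(final)
```

In a clustering-guided setup of the kind the abstract describes, the cluster index of each document would then serve as its class label when training the SVM classifier; the threshold and the linkage rule above are illustrative choices only.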

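  • 【Online publication contributor】 Fuzhou University
  • 【Online publication year and issue】 2004, Issue 03
  • 【Classification number】 TP393.09
  • 【Times cited】 3
  • 【Downloads】 445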