

Research and Implementation of Chinese Web Page Multi-class Classification Based on SVM

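【Author】 Wang Xufeng (王绪峰)

【Advisor】 Tao Yuehua (陶跃华)

【Author information】 Yunnan Normal University (云南师范大学), Computer Software and Theory, 2007, Master's thesis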

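【Summary】 With the rapid development of Internet technology, people have moved from an era of information scarcity to a digital era of extreme information abundance. In this digital era, people can obtain more and more digital information. Most of it is semi-structured or unstructured data, so quickly and effectively finding the information one needs is very difficult. For this reason, researchers have proposed automatic classification of Chinese web pages and studied its applications; research on Chinese web page classification has both theoretical significance and practical value. Automatic classification can build a separate database of web pages for each category, which improves the recall and precision of Chinese search engines. It can also build classified information resources automatically and offer users a categorized directory. In addition, good automatic classification benefits the subsequent relevance-ranking process. While studying the traditional support vector machine (SVM) classifier model, and drawing on existing web page classification techniques, this thesis makes a fairly systematic study of how SVM multi-class classifier models are constructed and proposes an SVM-based algorithm for constructing a multi-class classifier model. On this basis, it offers some reflections and observations on classification-oriented content acquisition for Chinese web pages, Chinese word segmentation, feature selection for Chinese web pages, and SVM classifiers for Chinese web pages.
(1) Based on the structure and characteristics of Chinese web pages, the thesis analyzes which information components of a page contribute to classification, uses the text in the title and in the tags of the body part to approximate the page's topical content, and designs an algorithm for extracting that text.
(2) It studies Chinese word segmentation and feature extraction in depth, systematically analyzes Chinese word segmentation methods, introduces the word segmentation system of the Information Retrieval Laboratory at Harbin Institute of Technology, adopts an improved χ² (chi-square) estimate as the feature selection method, and describes the feature selection algorithm.
(3) It makes an in-depth theoretical study of SVM multi-class classification and analyzes earlier methods of constructing SVM multi-class classifiers. Using the distance formula that the kernel function induces in the high-dimensional space, it computes the shortest distance between classes and introduces a weighted undirected complete graph to describe the distance structure among classes in that space. On the principle that the class or set of classes that is easiest to separate should be separated first, it proposes a method for constructing an SVM-based multi-class classifier.
(4) Building on the above, a complete classification system, CWPMCS, was constructed and tested, and the experimental results were analyzed and evaluated. The results show that the system developed in this thesis achieves relatively high classification accuracy, higher than that of the K-nearest neighbor (KNN) classification method.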
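Item (2) above names an improved χ² estimate as the feature selection method but does not say what the improvement is. The sketch below therefore shows only the standard χ² statistic, computed from a 2x2 term/class contingency table, and keeps the terms with the highest score over all classes. The function names, the max-over-classes scoring, and the toy tokens in the usage note are illustrative assumptions, not the thesis's algorithm.

```python
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Standard chi-square score of a term for one class.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # the term or the class covers all (or none) of the documents
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs is a list of token lists (already segmented), labels the matching
    class labels. Scoring a term by its maximum over classes is an assumption
    made for this sketch.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    term_df = Counter()                   # document frequency per term
    term_class_df = defaultdict(Counter)  # document frequency per term and class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[term][label] += 1

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cls, size in class_sizes.items():
            n11 = term_class_df[term][cls]
            n10 = df - n11
            n01 = size - n11
            n00 = n_docs - df - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

For example, with four toy documents labeled `sports` or `finance`, `select_features(docs, labels, 2)` keeps the two terms that occur in exactly one class and drops a term that is spread evenly across both.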

【Abstract】 With the rapid development of Internet technology, people have moved from an era of information scarcity to a digital era of extreme information abundance. In this era, people can obtain more and more digital information, including text, numbers, graphics, pictures, sound, and even video. Most of this information is semi-structured or unstructured data, so finding the information one needs in it is very difficult. Researchers have therefore proposed automatic web page classification and studied its applications; research on Chinese web page classification has both theoretical significance and practical value. Automatic web page classification can build a separate database for each category, which improves the recall and precision of Chinese search engines. It can also build classified information resources automatically and offer users a categorized directory, and the quality of the classification affects the subsequent relevance-ranking step. While studying the traditional support vector machine (SVM) classifier model, this thesis draws on existing web page classification techniques, makes a fairly systematic study of how SVM multi-class classifier models are constructed, and proposes an algorithm for constructing multi-class classifier models based on SVM. On this basis, it offers some reflections and observations on content acquisition for Chinese web pages, Chinese word segmentation, feature selection for Chinese web pages, and SVM classifiers for Chinese web pages.
(1) Based on the structure and characteristics of Chinese web pages, the thesis analyzes which information components of a page contribute to classification. It uses the title and the text in certain tags of the page body to approximate the page's topical content, and designs an algorithm for extracting that text.
(2) It studies Chinese word segmentation and feature selection in depth, systematically analyzes Chinese word segmentation methods, introduces the Chinese word segmentation system of the Information Retrieval Laboratory at Harbin Institute of Technology, adopts CHI (chi-square) estimation as the feature selection method, and describes the feature selection algorithm.
(3) It makes an in-depth theoretical study of SVM multi-class classification methods and analyzes earlier methods of constructing SVM multi-class classifiers. It uses the distance formula induced by the kernel function in the high-dimensional space to compute the distance between every pair of classes, and introduces a weighted undirected complete graph to describe the distance structure among the classes. From this it proposes a method for constructing multi-class classifier models based on SVM.
(4) On the basis of the above analysis, a fairly complete classification system (CWPMCS) was built and tested, and the experimental results were analyzed and evaluated. The results show that the classification system developed in this thesis achieves relatively high classification accuracy, higher than that of the K-nearest neighbor (KNN) classification method.
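Item (3) above can be illustrated with a small sketch. The identity d(x, y)² = K(x, x) − 2K(x, y) + K(y, y) gives the distance between two samples in the kernel-induced feature space, and the shortest such distance over all cross-class pairs serves as the edge weight between two classes in a weighted undirected complete graph. The abstract does not spell out how the graph is used to order the splits, so the rule below (separate first the class whose nearest neighboring class is farthest away) is only one plausible reading of "easiest to separate first". The RBF kernel, its `gamma` value, and all function names are likewise assumptions.

```python
import math
from itertools import combinations


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between x and y in the feature space induced by the kernel."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def class_distance_graph(samples_by_class, kernel=rbf_kernel):
    """Weighted undirected complete graph over the classes.

    The weight of edge (a, b) is the shortest kernel-space distance between
    any sample of class a and any sample of class b.
    """
    graph = {}
    for a, b in combinations(sorted(samples_by_class), 2):
        graph[(a, b)] = min(
            kernel_distance(x, y, kernel)
            for x in samples_by_class[a]
            for y in samples_by_class[b]
        )
    return graph


def most_separable_class(graph):
    """Pick the class whose closest neighboring class is farthest away.

    This ordering rule is an assumption made for the sketch, not the
    construction method proposed in the thesis.
    """
    nearest = {}
    for (a, b), dist in graph.items():
        nearest[a] = min(nearest.get(a, math.inf), dist)
        nearest[b] = min(nearest.get(b, math.inf), dist)
    return max(sorted(nearest), key=nearest.get)
```

Under this reading, a multi-class classifier would train one binary SVM to split off the selected class, remove it from the graph, and repeat on the remaining classes.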

  • 【Classification number】TP393.092
  • 【Citation count】3
  • 【Download count】124

