Patent Training Sample Pruning Based on a Supervised Clustering Algorithm

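【Author】 Huang Minzhang (黄闽樟)

【Advisor】 Bao-Liang Lu (吕宝粮)

【Author information】 Shanghai Jiao Tong University, Computer Software and Theory, 2010, Master's thesis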

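【Summary】 We live in an age of information explosion, and every industry has accumulated large, even massive, amounts of data. According to statistics from the World Intellectual Property Organization, patent documents contain 90% to 95% of the world's inventions each year, the number of applications worldwide grows by more than 1 million per year, and the cumulative total is now close to 40 million. Making full use of these patent documents for technical innovation can save 60% of the time and 40% of the research funding. Every patent is assigned, according to its content, to an International Patent Classification (IPC) code. Because the data volume is so large, relying entirely on experts for classification consumes a great deal of manpower and resources, which has spurred research on automatic patent classification. Naive Bayes, nearest neighbor, decision trees, and support vector machines have all been applied to text classification with some success. Patent classification, however, is a large-scale, imbalanced, hierarchical, multi-label text classification problem, and most traditional classification methods cannot handle a problem this complex. Even the best-performing classifier, the support vector machine, is trained by solving a quadratic programming problem, so its training time grows nearly quadratically with the number of training samples. For this reason, Bao-Liang Lu and his collaborators proposed the min-max modular network, whose most notable feature is its parallel, modular structure. Its basic idea is "divide and conquer": decompose a large-scale problem into independent small-scale problems, solve each of them separately, and then combine the results into a solution of the large-scale problem. This thesis introduces a supervised clustering algorithm based on the min-max modular network with a Gaussian zero-crossing function, uses it to reduce the size of the training data, and applies it successfully to patent classification. The main contributions are as follows.
1) We analyze the characteristics of the min-max modular network with a Gaussian zero-crossing function: a high degree of modularity, the ability to output "unknown", and the ability to learn incrementally.
2) We analyze the characteristics of the receptive field of this network and, based on the receptive field, cluster the training samples during learning to remove redundant samples.
3) After clustering, some clusters may contain very few samples, and these samples may be noise. We post-process the samples with noise removal and cluster merging algorithms.
4) We run patent classification experiments on the NTCIR-5 patent database and compare performance with and without clustering. The results show that the proposed clustering algorithm removes redundant samples and, with a smaller training set, maintains or even improves generalization ability.
5) Through experiments, we also verify the incremental learning ability of the min-max modular network with a Gaussian zero-crossing function.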

【Abstract】 We are living in an era of information explosion; all walks of life have accumulated a great deal of data, even massive amounts. According to statistics from WIPO, patent documents contain 90% to 95% of the outcome of the world's annual inventions. Patent applications worldwide increase by more than 1 million every year, and the total has accumulated to nearly 40 million. If we take full advantage of these patent documents, we can save 60% of the research time and 40% of the research funding needed for a technical innovation. Each patent is classified into a specific category of the International Patent Classification (IPC) according to its contents. In the past, patents were classified manually, which relies heavily on domain experts and is time-consuming and inefficient. Automatic patent classification is therefore of great importance, and a variety of studies on it have arisen. Methods such as naive Bayes, nearest neighbor, decision trees, and support vector machines have all been applied to text classification and have achieved some success.
Patent classification is a large-scale, imbalanced, hierarchical, and multi-label text classification problem. Most traditional classification methods cannot handle such a complex problem. Even the best-performing classifier, the support vector machine, struggles with it: its training requires solving a quadratic programming problem, so the training time grows nearly with the square of the number of training samples. Therefore, Bao-Liang Lu and his collaborators proposed the min-max modular network, whose most notable feature is its parallel, modular structure. The basic idea of the network is "divide and conquer": a large-scale problem is divided into a number of independent small-scale problems, these small-scale problems are solved in parallel, and their solutions are then combined into a solution of the large-scale problem.
The contribution of this thesis is to introduce a supervised clustering algorithm based on the min-max modular network with a Gaussian zero-crossing function. We use this algorithm to prune the training samples and successfully apply it to the classification of patent data. The main contributions of this thesis are as follows:
1) We analyze the features of the min-max modular network: high modularity, the ability to output "unknown", and incremental learning ability.
2) We analyze the features of the receptive field of the min-max modular network and propose a supervised clustering method based on the receptive field to prune the training samples.
3) After clustering, some clusters may contain few samples, and some of these samples may be noise. We use a noise removal and cluster merging algorithm to post-process the network.
4) We conduct a series of experiments on NTCIR-5 patent data and compare performance with and without clustering. The results show that the clustering algorithm can be used as a preprocessing method to prune the training samples while maintaining or even improving generalization ability.
5) We also conduct an experiment on the patent data to verify the incremental learning ability of the min-max modular network.
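The abstract does not give formulas, so the following Python sketch is only an illustration of the ideas it names, not the thesis's algorithm. It assumes the commonly cited form of the Gaussian zero-crossing discriminant between one positive and one negative sample, combines the pairwise modules with MIN then MAX, answers "unknown" (here `None`) when the combined output is close to zero, and prunes a sample as redundant when the current network already classifies it correctly. The function names and the parameters `lam` and `theta` are hypothetical.

```python
import numpy as np

def gzc(x, ci, cj, lam=0.5):
    """Assumed Gaussian zero-crossing discriminant for a positive sample ci
    and a negative sample cj (ci and cj must differ, otherwise d is zero)."""
    d = lam * np.linalg.norm(ci - cj)
    return (np.exp(-(np.linalg.norm(x - ci) / d) ** 2)
            - np.exp(-(np.linalg.norm(x - cj) / d) ** 2))

def m3_output(x, pos, neg, lam=0.5):
    """Combine the pairwise modules: MIN over negatives, then MAX over positives."""
    return max(min(gzc(x, ci, cj, lam) for cj in neg) for ci in pos)

def classify(x, pos, neg, theta=0.1, lam=0.5):
    """Return 1, -1, or None ("unknown") when x lies outside every receptive field."""
    out = m3_output(x, pos, neg, lam)
    if out > theta:
        return 1
    if out < -theta:
        return -1
    return None

def prune(samples, labels, theta=0.1, lam=0.5):
    """Illustrative pruning rule (an assumption): keep a sample only if the
    network built so far does not already classify it correctly."""
    pos, neg = [], []
    for x, y in zip(samples, labels):
        if pos and neg and classify(x, pos, neg, theta, lam) == y:
            continue  # already covered by existing representatives: redundant
        (pos if y == 1 else neg).append(x)
    return pos, neg

# Two tight groups of points; one representative per group is kept.
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.1, 0.0],
              [3.9, 0.0], [0.0, 0.1], [4.0, 0.1]])
y = [1, -1, 1, -1, 1, -1]
pos, neg = prune(X, y)
print(len(pos), len(neg))
```

Because samples are processed one at a time and only uncovered ones are added, the same loop also hints at how such a network can learn incrementally; the result depends on the order in which samples arrive.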