Research on the Application of Two Types of Bionic Algorithms to Text Classification

【Author】 宁再早 (Ning Zaizao)

【Advisor】 贾瑞玉 (Jia Ruiyu)

【Author Information】 Anhui University, Computer Application Technology, 2011, Master's degree

【Abstract】 With the development of information technology, the amount of information available to users keeps growing, and most of it is text. Text mining, a technology for managing this unordered data efficiently and putting it to effective use, has become an active research field over the past few decades, and text classification is one of its important research directions. Since the 1990s, statistical and machine learning methods have been introduced into text classification and have gradually replaced the earlier automatic classification methods based on knowledge engineering. A large body of literature has also emerged that studies the key techniques of text classification in depth, mainly text preprocessing, feature selection, text representation models, classification methods, and classification performance evaluation. Faced with the massive data brought by the growth of the Internet, existing text processing methods all run into difficulties: large data volumes, high-dimensional feature terms in the vector space model, long preprocessing and computation times, noisy data sets, and low classification accuracy. This thesis studies feature selection methods and classification algorithms for text classification.

The good point set genetic algorithm uses the theory of good point sets from number theory to redesign the crossover operator of the genetic algorithm, yielding a random search algorithm directed toward the "family" whose ancestors are high-fitness schemata. Compared with the standard genetic algorithm, it improves accuracy and speed and avoids premature convergence. The covering algorithm takes a geometric view: it maps the input sample vectors onto a sphere in a higher-dimensional space and, through training, covers each class with as few regions as possible to form a classification network model. Particle swarm optimization is an evolutionary algorithm that simulates the migration of bird flocks. Like the genetic algorithm, it starts from random initial solutions, searches iteratively for the optimum, and uses fitness to evaluate solution quality, but its iterations involve no crossover or mutation operations. It is easy to implement, accurate, and fast to converge.

This thesis combines the principle of the good point set genetic algorithm, which searches for better samples among ancestors with high-fitness schemata, with the simplicity and effectiveness of the K-nearest-neighbor algorithm, and proposes a feature selection method based on the good point set genetic algorithm. The covering algorithm handles high-dimensional data well but faces a conflict between classification accuracy and generalization ability. To address this, the thesis combines the covering algorithm with particle swarm optimization and proposes an improved particle swarm optimization covering algorithm. Finally, a text classification system is built. Comparative experiments on three groups of data, with performance evaluated by the F1 measure, show that the proposed algorithms can effectively improve classification accuracy and efficiency.
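The abstract's description of particle swarm optimization (random initial solutions, fitness-based evaluation, no crossover or mutation) can be illustrated with a minimal sketch. This is the common textbook global-best variant with an inertia weight, not the improved particle swarm optimization covering algorithm proposed in the thesis, whose details the abstract does not give. The function name `pso_minimize`, the parameter values (`w`, `c1`, `c2`, swarm size, bounds), and the toy `sphere` fitness function are all assumptions chosen for illustration.

```python
import random


def pso_minimize(fitness, dim, n_particles=20, n_iters=100,
                 lower=-5.0, upper=5.0, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best PSO; every default value here is an illustrative assumption."""
    rng = random.Random(seed)
    # Random initial positions and zero initial velocities.
    pos = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its personal best; the swarm shares a global best.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia plus pulls toward personal and global bests.
                # There is no crossover or mutation step anywhere in the loop.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the search bounds.
                pos[i][d] = min(upper, max(lower, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy fitness function (an assumption): sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=3)
```

In a text classification setting, the fitness function would instead score a candidate solution by classifier quality (for example, an error rate on held-out documents); how the thesis encodes covering parameters as particles is not stated in the abstract.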

  • 【Online Publication Contributor】 Anhui University
  • 【Online Publication Issue】 2012, Issue 04