针对不平衡文本的分类方法研究

Research on Imbalanced Text Classification

【Author】 杨鸿骏

【Supervisor】 徐国爱

【Author information】 Beijing University of Posts and Telecommunications, Computer Technology, 2014, Master's

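【Summary】 With the rapid development of Internet technologies represented by Web 2.0, the unstructured and highly free-form nature of Internet text has brought new challenges to text classification, including the problem of imbalanced text classification. Imbalanced text refers to a text space in which the number of samples differs significantly between classes. Traditional text classification methods suffer a marked performance drop on imbalanced text classification problems; in particular, the classification performance for minority classes deteriorates rapidly as the skew increases. Yet in typical imbalanced text classification applications such as "illegal web page identification" and "spam identification", predicting and identifying minority-class members is actually the more meaningful task. To address the performance drop in imbalanced text classification, and especially the difficulty of classifying minority classes, this thesis studies commonly used imbalanced text classification methods and on that basis completes the following work. First, it proposes an imbalanced text classification algorithm based on synonym expansion. The method is a data-level minority-set compensation method. Unlike traditional oversampling, it introduces the concept of a synonym vector to obtain a clustered representation of the text feature space; starting from the linguistic properties and statistical regularities of synonym usage, it predicts and compensates features through the relationship between the minority-set synonym vector and the actual synonym vector. Experimental results show that the method effectively improves imbalanced text classification performance. Second, it designs a method for reconstructing a synonym dictionary modeled on the "哈工大同义词词林" (the HIT Tongyici Cilin thesaurus); the resulting dictionary carries contextual features and allows precise control over the dictionary's dimension. Third, to meet the decision requirements that arise during synonym expansion, it proposes the concepts of "left-side expansion" and "feature pre-training", which solve the problem of delimiting the boundary of expansion. Fourth, it designs and implements a unified system that can handle imbalanced text while also performing conventional text classification. The system supports multiple feature selection methods and multiple classification algorithms, and users can quickly define the system's classification strategy through centralized configuration.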

【Abstract】 With the rapid development of Web 2.0 and other Internet technologies, Internet texts are now largely free from restrictions on content and structure. This brings new challenges for text classification, among them imbalanced text classification. Imbalanced text refers to text collections in which the number of samples differs markedly between classes. The performance of traditional text categorization methods, especially on minority classes, often deteriorates dramatically when the training set is imbalanced. Yet in many applications of imbalanced text classification, such as the identification of illegal web pages or junk mail, categorization performance on minority classes matters far more than performance on majority classes. Based on a study of existing imbalanced text classification methods, this thesis completes the following tasks to address these problems.
1. The design of an imbalanced text classification method based on synonym expansion. This is a minority-compensating method that works at the data level, as oversampling methods do. Unlike traditional oversampling, it uses synonym vectors to build a clustered representation of the feature space. Drawing on the linguistic rules and statistical regularities of synonym usage, it implements feature prediction and feature compensation. The experimental results show that categorization performance improves with this method.
2. The design of a new method for generating a synonym dictionary based on the thesaurus TongYiCi CiLin, which is developed by the Harbin Institute of Technology Center for Information Retrieval (HIT-CIR). The method ensures that the new dictionary is context-adaptable and, at the same time, provides precise control over the dictionary's dimension.
3. An expansion rule and an expansion judging method. With the left-side expansion rule and the feature pre-selection method, the problem of deciding the expansion boundary is solved.
4. The design and implementation of an imbalanced text classification system. The system handles imbalanced text in addition to ordinary text classification. It provides a variety of feature selection methods and classification algorithms, and users can define their own classification strategies in a configuration file.
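The abstract does not give the algorithm itself, so the following is only a toy illustration of the general idea behind synonym-based compensation of a minority class: words are mapped to synonym groups, and groups that the minority class uses often enough contribute their not-yet-seen synonyms as extra features. The dictionary `SYNONYM_GROUPS`, the function names, and the threshold `min_group_count` are all hypothetical; this is not the thesis's method, its synonym-vector formulation, or its left-side expansion rule.

```python
from collections import Counter

# Hypothetical toy synonym dictionary: group id -> words in that group.
# (Not taken from TongYiCi CiLin or from the thesis.)
SYNONYM_GROUPS = {
    "g_spam": ["spam", "junk"],
    "g_mail": ["mail", "email", "message"],
    "g_free": ["free", "complimentary"],
}
WORD_TO_GROUP = {w: g for g, ws in SYNONYM_GROUPS.items() for w in ws}


def group_counts(docs):
    """Count how often each synonym group occurs across tokenized docs."""
    counts = Counter()
    for doc in docs:
        for token in doc:
            group = WORD_TO_GROUP.get(token)
            if group is not None:
                counts[group] += 1
    return counts


def expand_minority(docs, min_group_count=2):
    """Append unseen synonyms of frequently used groups to each minority doc.

    min_group_count is an assumed threshold: a group must occur at least
    this often in the minority set before its synonyms are added.
    """
    counts = group_counts(docs)
    seen = {token for doc in docs for token in doc}
    expanded = []
    for doc in docs:
        extra = []
        for token in doc:
            group = WORD_TO_GROUP.get(token)
            if group is None or counts[group] < min_group_count:
                continue
            for syn in SYNONYM_GROUPS[group]:
                if syn not in seen and syn not in extra:
                    extra.append(syn)
        expanded.append(doc + extra)
    return expanded


# Two tokenized minority-class documents.
minority = [["free", "spam", "mail"], ["junk", "mail"]]
print(expand_minority(minority))
```

In this sketch the group for "mail" occurs twice in the minority set, so its unseen synonyms are appended to both documents, while the group for "free" occurs only once and is left unexpanded.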
