
Design and Implementation of NNTCS, a Neural-Network-Based Text Classification System (基于神经网络的文本分类系统NNTCS的设计和实现)

【Author】 刘钢 (Liu Gang)

【Advisor】 范植华 (Fan Zhihua)

【Author Information】 Graduate School of the Chinese Academy of Sciences (Institute of Software), Computer Application Technology, 2003, Master's degree

【Abstract】 Text classification is the basis and core of text mining and has become a research hotspot in data mining and web mining in recent years. It plays an important role in traditional information retrieval, in building web-site index architectures, and in Web information retrieval.

This thesis first studies the common solutions to several key problems in text classification, describes the core techniques and system architecture of typical text classification systems, and outlines the range of applications. It then presents NNTCS, an automatic text classification system based on neural networks, and discusses in detail the implementation of feature extraction, dimension reduction, hierarchical classification, and classifier training.

The first step in NNTCS is Chinese word segmentation: feature terms are extracted from each document and their term frequencies are counted. The system uses a neural network as the classifier; the term frequencies form the original feature vector, whose components correspond one to one with the neurons of the network's input layer. During training, the labeled training documents are fed to the network and the error back-propagation (BP) algorithm adjusts the weights; the final fixed weights are stored as classification knowledge. During classification, the feature vector of an unlabeled document is fed to the fixed-weight network and its output is compared with a predefined threshold to decide the class.

NNTCS also adopts Latent Semantic Indexing (LSI), a technique widely used in information retrieval, to map the original vector space into an abstract k-dimensional semantic space, which greatly reduces the dimensionality and improves training speed and performance. Neural networks are common in general pattern recognition but rarely used in text classification, mainly because the huge vector space limits network performance; LSI's dimension reduction removes this obstacle, so the two techniques complement each other.

The training stage also incorporates a genetic algorithm (GA) to optimize the initial weights of the network. Thanks to its global-search property, the GA helps the network avoid local convergence, so the strengths of both methods are fully exploited.

Finally, an open test of NNTCS shows that the system achieves high average recall and high average precision.
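The abstract describes building the original feature vector from the term frequencies of segmented Chinese documents, one vector component per input-layer neuron. The thesis's own code is not reproduced here; the following is only a minimal Python sketch of that step, in which the vocabulary handling and the assumption that documents arrive as already-segmented token lists are illustrative choices of this summary, not details taken from NNTCS.

```python
from collections import Counter
from typing import Dict, List

def build_vocabulary(segmented_docs: List[List[str]]) -> Dict[str, int]:
    """Map every feature term seen in the corpus to a vector index."""
    vocab: Dict[str, int] = {}
    for tokens in segmented_docs:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab

def term_frequency_vector(tokens: List[str], vocab: Dict[str, int]) -> List[float]:
    """Raw term-frequency vector: one component per vocabulary term,
    matching the one-to-one correspondence with input-layer neurons."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(tokens).items():
        idx = vocab.get(term)
        if idx is not None:
            vec[idx] = float(count)
    return vec

# Toy, already-segmented documents; a real system would first run a
# Chinese word segmenter over the raw text (not shown here).
docs = [["神经网络", "文本", "分类", "文本"], ["遗传", "算法", "优化"]]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(d, vocab) for d in docs]
```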

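The LSI step maps this high-dimensional term space into an abstract k-dimensional semantic space. A common way to realize LSI is a truncated singular value decomposition of the term-document matrix; the NumPy sketch below illustrates that general idea (the matrix layout, the value of k, and the fold-in of new documents are assumptions of the sketch, not specifics of NNTCS).

```python
import numpy as np

def lsi_project(term_doc: np.ndarray, k: int):
    """Truncated SVD of the term-document matrix A (terms x documents).

    Returns k-dimensional document vectors plus the factors needed to
    fold new documents into the same semantic space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_vectors = (np.diag(s_k) @ Vt_k).T   # one k-dim row per training document
    return doc_vectors, U_k, s_k

def fold_in(query_tf: np.ndarray, U_k: np.ndarray, s_k: np.ndarray) -> np.ndarray:
    """Project a new term-frequency vector into the k-dim semantic space."""
    return query_tf @ U_k / s_k

# Toy example: 5 terms, 4 documents, reduced to k = 2 dimensions.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 3., 0., 1.],
              [0., 0., 2., 2.],
              [1., 0., 0., 1.]])
doc_vecs, U_k, s_k = lsi_project(A, k=2)
new_doc = fold_in(np.array([1., 0., 2., 0., 1.]), U_k, s_k)
```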
  • 【CLC Number】 TP311.52
  • 【Cited by】 3
  • 【Downloads】 388
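For the classifier itself, the abstract describes a neural network trained with error back-propagation whose output is compared against a threshold to decide the class, with a genetic algorithm supplying good initial weights. The sketch below is a deliberately small stand-in: one hidden layer, plain random initialization in place of the GA, and a single output neuron, none of which reflect the actual topology of NNTCS.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyBPClassifier:
    """One-hidden-layer network trained with error back-propagation.

    In NNTCS the initial weights would come from a genetic algorithm;
    here they are simply drawn at random (an assumption of this sketch).
    """

    def __init__(self, n_in, n_hidden=8):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, 1))

    def _forward(self, X):
        self.h = sigmoid(X @ self.W1)      # hidden-layer activations
        return sigmoid(self.h @ self.W2)   # single output neuron

    def fit(self, X, y, lr=0.5, epochs=500):
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        for _ in range(epochs):
            out = self._forward(X)
            # Propagate the output error backwards and adjust both weight layers.
            delta_out = (out - y) * out * (1.0 - out)
            delta_hid = (delta_out @ self.W2.T) * self.h * (1.0 - self.h)
            self.W2 -= lr * self.h.T @ delta_out
            self.W1 -= lr * X.T @ delta_hid
        return self

    def classify(self, X, threshold=0.5):
        """Run the fixed-weight network and compare its output with a threshold."""
        return (self._forward(X) >= threshold).astype(int).ravel()

# Hypothetical usage with the k-dimensional LSI vectors from the previous sketch:
#   clf = TinyBPClassifier(n_in=doc_vecs.shape[1]).fit(doc_vecs, labels)
#   predictions = clf.classify(new_doc.reshape(1, -1))
```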