
中文Web文档倾向性自动分类研究

Automatic Classification Research on Chinese Web Document Orientation

【Author】 Hu Rong (胡蓉)

【Advisor】 Tang Changjie (唐常杰)

【Author information】 Sichuan University, Computer Applications, 2003, Master's thesis

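【Summary】 Retrieving useful information as quickly as possible from a vast and disorderly body of text has always been a major goal, and a major difficulty, of information processing. Automatic text classification, an important research direction in information processing, aims to determine a text's category automatically from its content. Research on English text classification is by now relatively mature internationally, whereas Chinese text classification, shaped by the Chinese language environment and its semantics, raises its own particular problems and difficulties and has become a research topic in its own right.

Within it, orientation analysis of Chinese text is a new and challenging research area. To keep network security robust, we proposed an experimental project: an orientation classifier for Chinese Web documents. Earlier screening relied on simple keyword matching and manual processing and was inefficient; this project therefore aims to make the screening of Chinese Web documents real-time and efficient.

In the course of the research, we systematically examined each stage of automatic classification of Chinese Web documents and the specific implementation techniques: building the corpus; word segmentation of Chinese Web documents; the choice of index terms; the design of the weighting scheme and the construction of the segmentation system SMCW; the study of feature selection methods and of various classification methods; and finally the proposed architecture of the Chinese Web document orientation classification system SCUSCTC (SCU Smart Chinese Text Classifier) and its implementation in Java. We also ran detailed experiments on the final classification results and the intermediate segmentation results. The system's distinguishing features are: 1) intelligent and accurate classification: combining domain knowledge with linguistic knowledge makes classification much more precise than earlier mechanical matching; 2) fast and timely classification: carefully designed algorithms together with efficient implementation techniques deliver both quality and throughput; 3) standard, general-purpose output: the system outputs XML, which eases the exchange and further processing of information and helps integration with different databases and application systems.

Finally, the results of this thesis and system are: 1) it studies methods and techniques for orientation classification of Chinese Web documents under present-day network conditions and provides a prototype system that can be used for research and has some practical value; 2) it provides related papers and development documentation, which will greatly help later research; 3) it carries out a practical study of a Chinese Web document classifier deployed on a gateway; 4) it draws up the performance requirements for Chinese Web document orientation classification and the tests and evaluation of the related parameters; 5) it implements real-time orientation classification of Chinese Web documents that meets certain speed and accuracy requirements.

Future work should consider the following: 1) standardization of the data set; 2) higher accuracy of the segmentation system, with better ambiguity handling and unknown-word recognition; 3) sound semantic analysis; 4) using user feedback to update the training set dynamically; 5) quantitative analysis of how different components of the classifier affect system performance, using a suitable model to compare and evaluate classification systems; 6) natural language understanding problems, such as the "quotation" problem; 7) recognizing disguised sensitive words.

The thesis is organized as follows. Part 1 is the introduction. Part 2 describes the problem that text classification solves and introduces its performance evaluation methods and threshold selection principles. Part 3 describes text representation models, their methods, and a comparison. Part 4 introduces feature extraction methods. Part 5 discusses different text classification methods, the key techniques of an automatic classification system: Naive Bayes, kNN, decision trees, and SVM. Part 6 gives the system's test data and experimental results. Part 7 is the conclusion.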
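Feature 3 above, XML as the system's output format, could be sketched roughly as follows. This is only an illustration: the record does not give the actual SCUSCTC schema, so the element names (`document`, `category`, `score`), the `url` attribute, and the class `XmlResultSketch` are all hypothetical.

```java
/** Builds a small XML record for one classification result (hypothetical schema, not SCUSCTC's). */
class XmlResultSketch {

    /** Escapes the five characters that have predefined entities in XML. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Serializes one result; element and attribute names are assumptions for illustration. */
    static String toXml(String url, String category, double score) {
        return "<document url=\"" + escape(url) + "\">"
             + "<category>" + escape(category) + "</category>"
             + "<score>" + score + "</score>"
             + "</document>";
    }
}
```

A plain, escaped text format like this is what lets downstream databases and applications consume the classifier's results without knowing anything about its internals.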

【Abstract】 Since the 1990s, as the volume of information available on the Internet has continued to grow, there has been increasing demand for tools that help people find, filter, and manage these resources more efficiently. Text categorization (TC), the assignment of free-text documents to one or more predefined categories based on their content, is an important component of many information management tasks. Because Chinese text classification is shaped by the context and semantics of the Chinese language, it has become a research field of its own with particular difficulties, and within it Chinese text orientation analysis is an especially new and challenging area.

With the development of modern network techniques, the network has become an essential tool for communication. To maintain the robustness of network security, we started a project to build an experimental orientation classifier for Chinese Web documents. Earlier approaches to this screening were time-consuming and costly, which limited their applicability. Our classifier is intended to meet the requirements of real-time operation and high accuracy.

In this thesis, we survey the state of the art in Chinese text categorization, from building the corpus, the word segmentation system for Chinese Web documents, the selection of index terms, and the design of term weights, to the structure of SCUSCTC (SCU Smart Chinese Text Classifier) and its implementation in Java. Finally, we give a thorough analysis of the experimental results and identify the main advantages and features of SCUSCTC: 1) intelligent and accurate classification; 2) high speed and real-time operation; 3) XML as a standard, general-purpose output format.

The main contributions of this thesis are: 1) a study of the methods and techniques of Web text classification in a modern network setting, together with a practical prototype system; 2) related papers and development documents for further research; 3) a practical study of Web text classification on a gateway; 4) the performance requirements for Web text classification and the evaluation of related parameters; 5) an implementation of a real-time Web text classification system (SCUSCTC) that meets certain speed and accuracy requirements.

Further research should consider the following issues: 1) standardizing the corpus; 2) improving the accuracy of the Chinese word segmentation system, including ambiguity handling and recognition of words that do not appear in the dictionary; 3) performing semantic analysis; 4) dynamically updating the training set from user feedback; 5) quantitatively analyzing how different factors influence system performance, and using an appropriate model to compare and evaluate Web text classification systems; 6) natural language understanding; 7) detecting disguised sensitive words.

This thesis is divided into seven chapters, with Chapter 1 as the introduction. In Chapter 2 we formally define TC and introduce performance measures and thresholding strategies for TC. Chapter 3 describes the steps needed to transform raw text into a representation suitable for the classification task. Feature selection methods are surveyed in Chapter 4. In Chapter 5 we describe four methods that have been successfully applied to text categorization: kNN, Naive Bayes, decision trees, and SVMs. In Chapter 6 we describe our own work using the "Korean and World Cup Corpus", while Chapter 7 concludes the thesis and discusses open issues and possible avenues of further research for TC.
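Of the four methods named for Chapter 5, Naive Bayes is the easiest to illustrate in a few lines. The following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, not the SCUSCTC implementation. It assumes documents have already been segmented into words; the class name `NaiveBayesSketch`, the labels, and the English tokens in the usage note are hypothetical placeholders for segmented Chinese words.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Multinomial Naive Bayes over pre-segmented tokens (illustrative sketch only). */
class NaiveBayesSketch {
    private final Map<String, Integer> docCounts = new HashMap<>();               // documents per class
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>(); // term counts per class
    private final Map<String, Integer> totalTerms = new HashMap<>();              // total tokens per class
    private final Set<String> vocabulary = new HashSet<>();                       // all training terms
    private int totalDocs = 0;

    /** Adds one training document that has already been segmented into words. */
    void train(String label, List<String> tokens) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalTerms.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Returns the class with the highest log posterior, or null if nothing was trained. */
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior: share of training documents that carry this label.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int denom = totalTerms.getOrDefault(label, 0) + vocabulary.size();
            for (String t : tokens) {
                // Add-one smoothing keeps unseen terms from zeroing the product.
                score += Math.log((counts.getOrDefault(t, 0) + 1.0) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

For example, after `train("neutral", Arrays.asList("football", "match", "goal"))` and `train("negative", Arrays.asList("bad", "attack"))`, calling `classify(Arrays.asList("football", "goal"))` returns `"neutral"`. Working in log space avoids numeric underflow when documents contain many tokens.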

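  • 【Online publication submitter】 Sichuan University
  • 【Online publication year/issue】 2004, Issue 01
  • 【Classification number】 TP393.09
  • 【Citation count】 11
  • 【Download count】 361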