Establishing Labels for Chinese Texts from the Mayor's Public Telephone Line (市长公开电话汉语文本标签的确立)

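【Author】 Zhang Xiaoming (张晓明)

【Advisor】 Hao Lizhu (郝立柱)

【Author information】 Heilongjiang University (黑龙江大学), Applied Mathematics, 2010, Master's thesis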

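【Abstract (摘要)】 With the rapid development of computer networks and the public's growing awareness of political participation and self-protection, information processing has become an indispensable tool for obtaining useful information. Many cities have set up mayor's public telephone service platforms, and as a result the volume of document information from all sectors grows sharply every day. Classifying this information by traditional manual means is time-consuming and faces more and more difficulties. In particular, when the functions of the responsible government departments are adjusted, assigning the information promptly and accurately to the handling units after the adjustment becomes a problem in urgent need of study. Automatic text classification is a research hotspot and core technology in information retrieval and data mining, and machine-learning-based automatic text classification systems are an important research direction in information processing. Text classification here refers to the process of automatically determining the category of a text from its content under a given classification scheme. Based on the practical problem of classifying Chinese texts from the Changchun mayor's public telephone line, this thesis introduces the concept of automatic text classification and the mayor's public telephone system; summarizes and studies the key techniques involved in text classification, including word segmentation, feature selection, and feature extraction; discusses the classification of text labels based on semi-supervised learning; studies Chinese text classification based on the EM algorithm, random forests, and the Boosting algorithm; implements text classification programs for the three algorithms in C++; and analyzes the experimental results.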

【Abstract】 With the rapid development of computer networks and the continuous improvement of people’s awareness of political participation and self-protection, information processing has become increasingly important for obtaining useful information. Many cities have established mayor’s public access lines, and as a result government bodies and institutions accumulate a large number of documents every day. Manual classification is too inefficient to cope with the many new problems that arise. In particular, when government functions are adjusted, a method is urgently needed to categorize texts promptly and accurately for the reorganized institutions. Automated text categorization is one of the hotspots and key techniques in information retrieval and data mining. Text categorization based on machine learning, the automated assignment of natural language texts to predefined categories based on their contents, is a task of increasing importance. Based on the practical problems of the Changchun mayor’s public access line project, this paper introduces the definition of automated text categorization and the mayor’s public telephone system. It also summarizes and studies several key techniques of text categorization, including word segmentation, feature selection, and feature extraction, and mainly discusses how to label documents based on semi-supervised learning, including the EM algorithm, random forest, and boosting algorithm. We implement the three classification approaches in C++ and analyse the results.

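  • 【Online publication contributor】 Heilongjiang University (黑龙江大学)
  • 【Online publication year and issue】 2010, Issue 12
  • 【Classification number】 TP391.1
  • 【Downloads】 24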