

Research on the Key Techniques of Text Mining

【Author】 Chen Xiaoyun

【Advisor】 Hu Yunfa

【Author Information】 Fudan University, Computer Software and Theory, 2005, Ph.D.

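【Abstract (translated from the Chinese)】 Faced with a vast sea of electronic information, how to help people effectively collect and select the information that interests them, and how to help users discover potentially useful knowledge in an ever-growing volume of information, have become hot topics in information technology. Data mining is the research field that emerged to address this problem. Since its emergence in the 1990s, data mining has been studied in considerable depth, covering association analysis, classification, clustering, trend analysis, and other areas. Because the great majority of real-world information resources exist as unstructured data, whereas data mining generally targets structured data such as the contents of relational databases, mining unstructured information has emerged as a further research topic following data mining.

Among common kinds of unstructured data such as text, images, and video, text is the most widely used form; it appears in digital libraries, product catalogs, newsgroups, medical reports, and organizational and personal home pages. Text mining techniques are widely applied in natural language understanding, automatic text summarization, information extraction, information filtering, information retrieval, and related fields, and therefore have higher commercial value than data mining.

This thesis takes text data as its object of study and investigates several key techniques of text mining, chiefly text feature extraction and feature selection, text association analysis, and associative text classification, and proposes more effective text mining algorithms. Its research work and contributions include the following:

(1) Document-frequency feature evaluation functions with a minimum term frequency threshold are used to reduce the proportion of noise features and improve the quality of text classification. At present, text feature selection generally relies on feature evaluation functions, which differ according to whether they use term frequency or document frequency. Observing that noise features generally have low term frequency, we propose a document-frequency method with a minimum term frequency threshold for feature selection. Experiments applying the method to three feature evaluation functions (mutual information, information gain, and the χ² statistic) show that the threshold effectively reduces the proportion of noise features in the feature set, and that as the threshold rises the feature sets produced by the different evaluation functions tend to converge.

(2) To address the difficulty of determining a minimum support threshold in text association analysis, algorithms for mining the N most frequent itemsets are proposed. Frequent itemset mining is an important step in text association analysis, yet users persistently find it hard to define a suitable minimum support threshold. This thesis proposes algorithms for mining the N most frequent itemsets based on a strategy of dynamically adjusting the minimum support threshold; the user controls the size of the result by specifying N, the number of frequent itemsets to produce. During mining, the minimum support threshold is raised continually according to the results found so far, which shrinks the search space and improves mining performance. Following this strategy, two algorithms are proposed for mining the top N frequent itemsets: an Apriori-like algorithm and IntvMatrix, an algorithm based on an inverted matrix.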
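The threshold-raising strategy in contribution (2) can be illustrated with a small sketch. This is not the thesis's Apriori-like algorithm or IntvMatrix, whose details the abstract does not give; it is a minimal level-wise search written under the assumption that the threshold may be raised to the N-th best support found so far. The function name `top_n_frequent_itemsets` and the toy documents are hypothetical, and items are assumed to be strings so they can be sorted.

```python
import heapq
from itertools import combinations


def top_n_frequent_itemsets(transactions, n):
    """Return up to n (support, itemset) pairs with the highest support.

    Hypothetical helper: a level-wise, Apriori-like search in which the
    minimum support threshold is raised once n results are known.
    """
    if n <= 0:
        return []
    transactions = [frozenset(t) for t in transactions]
    heap = []          # min-heap holding the best n (support, items) so far
    min_support = 1    # permissive start; only ever raised

    def offer(itemset, support):
        nonlocal min_support
        entry = (support, tuple(sorted(itemset)))
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
        if len(heap) == n:
            # The n-th best support seen so far is a safe lower bound.
            min_support = max(min_support, heap[0][0])

    # Level 1: count single items.
    level = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            level[key] = level.get(key, 0) + 1
    for itemset, support in level.items():
        offer(itemset, support)

    k = 1
    while level:
        # Drop itemsets that fell below the raised threshold.
        survivors = {s for s, c in level.items() if c >= min_support}
        candidates = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            # Apriori pruning: every k-subset must itself have survived.
            if len(union) == k + 1 and all(
                frozenset(sub) in survivors for sub in combinations(union, k)
            ):
                candidates.add(union)
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                level[cand] = support
                offer(cand, support)
        k += 1
    return sorted(heap, reverse=True)


# Toy "documents" as sets of terms (invented for illustration).
docs = [
    {"text", "mining", "data"},
    {"text", "mining"},
    {"text", "data"},
    {"mining", "data"},
    {"text", "mining", "rules"},
]
top4 = top_n_frequent_itemsets(docs, 4)
```

The caller supplies only N. Because every subset of an itemset is at least as frequent as the itemset itself, raising the threshold to the N-th best support seen so far cannot discard any itemset whose support exceeds that value; ties at the N-th position are broken arbitrarily (here, by tuple order).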

【Abstract】 With the rapid development and spread of the Internet, the volume of electronic information has grown enormously. How to collect and find the information that interests a user, and how to discover latent, useful knowledge quickly, accurately, and completely, have become hot topics in information science and technology. Data mining (DM) is a research field that arose to solve this problem. Since the concept emerged in the 1990s, research on DM has gone quite deep and has covered association analysis, classification, cluster analysis, trend analysis, and so on. Structured data, such as the contents of relational databases, is the main object of DM research, but in practice the majority of information exists as unstructured data; some data indicate that unstructured data accounts for 80% of existing information sources. Mining unstructured information has therefore followed DM as a new challenge.

Among common kinds of unstructured data such as text, images, and video, text is the most widely used form. It is common in digital libraries, product catalogs, newsgroups, medical reports, and organizational or personal home pages, and text mining is broadly applied in natural language understanding, text summarization, information extraction, information filtering, information retrieval, and related fields, so its commercial value is higher than that of DM.

This thesis studies the key techniques of text mining, including text feature extraction and feature selection, text association analysis, and associative text classification. Several methods and techniques are presented with the aim of improving speed, precision, and stability. Our primary contributions are as follows.

(1) The thesis presents feature evaluation functions based on document frequency with a minimum term frequency threshold, which reduce the proportion of noise features and improve the quality of text categorization. At present, feature evaluation functions are the main method of selecting text features for text categorization. These functions differ in that some use term frequency and others use document frequency. Experiments show that mutual information, information gain, and the χ² statistic are each more effective with a minimum term frequency threshold than with plain document frequency.

(2) Research on mining the top N most frequent itemsets in a text collection. Frequent itemset mining is an important step in text association analysis, but it is very difficult to determine a suitable minimum support threshold. The paper presents a strategy
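Contribution (1) above can be made concrete with a small sketch. The abstract does not spell out how the minimum term frequency threshold enters the document-frequency counts, so the code assumes one plausible reading: a document counts toward a term's document frequency only when the term occurs at least `min_tf` times in that document. The function name `chi_square_scores`, the parameter `min_tf`, and the toy corpus are hypothetical. The score is the standard two-class χ² statistic, N(AD - CB)² / ((A+C)(B+D)(A+B)(C+D)).

```python
from collections import Counter


def chi_square_scores(docs, labels, target, min_tf=1):
    """Chi-square score of each term with respect to class `target`.

    Assumption: a document "contains" a term only if the term occurs
    at least `min_tf` times in it (min_tf=1 is plain document frequency).
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    df_all = Counter()   # documents containing the term, any class
    df_pos = Counter()   # documents containing the term, target class
    for tokens, y in zip(docs, labels):
        for term, count in Counter(tokens).items():
            if count >= min_tf:
                df_all[term] += 1
                if y == target:
                    df_pos[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_pos[term]       # target class, term present
        b = df - a             # other classes, term present
        c = n_pos - a          # target class, term absent
        d = n - n_pos - b      # other classes, term absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


# Toy corpus (invented): "the" and the stray "goal" in the third
# document play the role of low-frequency noise.
corpus = [
    ["goal", "goal", "match", "the"],
    ["goal", "goal", "team", "the"],
    ["vote", "vote", "party", "goal"],
    ["vote", "vote", "law", "the"],
]
classes = ["sport", "sport", "politics", "politics"]

plain_df = chi_square_scores(corpus, classes, "sport", min_tf=1)
with_threshold = chi_square_scores(corpus, classes, "sport", min_tf=2)
```

On this toy corpus, setting `min_tf=2` leaves only "goal" and "vote" as candidate features, because every other term occurs at most once per document. Whether this matches the thesis's exact definition is not stated in the abstract.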

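  • 【Online Publication Contributor】 Fudan University
  • 【Online Publication Year/Issue】 2005, Issue 07
  • 【Classification Code】 TP311.13
  • 【Citation Count】 76
  • 【Download Count】 4423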
