
Research on Key Problems in Web Text Mining

【Author】 何慧 (He Hui)

【Advisor】 郭军 (Guo Jun)

【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2009, PhD

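【Abstract (translated from the Chinese)】 With the rapid development of the Internet and communication networks, web text has become a principal carrier of information and an indispensable source of information in daily life, and the research significance and practical value of text mining have become increasingly prominent. Meanwhile, with the arrival of the Web 2.0 era, more and more online digital content is created by users. The large-scale production and spread of user-generated content has made short text computing, Web text information extraction, and text sentiment analysis hot topics in Web text mining research. To address these problems, this dissertation carries out the following work:

(1) Short text computing based on statistical language models. Short texts contain few characters, use non-standard language, and exist in enormous numbers. For these characteristics, this dissertation proposes a short text clustering algorithm based on N-gram feature extraction and RPCL (Rival Penalized Competitive Learning). First, character-level N-gram feature extraction is performed, that is, Chinese chunks are extracted from the unsegmented corpus. A Chinese chunk may be a single character, a word, or a character string, so chunks not only express the semantic information of a short text but also preserve word order and the dependencies between characters. A candidate chunk set is then obtained through statistical substring reduction and mutual information filtering. Finally, RPCL, a neural network clustering algorithm, is used to cluster the short texts. Experimental results show that the algorithm clusters short texts effectively and effectively reduces the feature dimensionality.

(2) Web text information extraction for advertisement recommendation and sentiment analysis. For compound word extraction in advertisement recommendation, this dissertation proposes a semi-supervised Chinese compound word extraction algorithm based on hidden Markov models. Starting from a small number of seed compounds and a BEMI (Begin, End, Middle, Independent) template, a hidden Markov model identifies compounds that carry the same or similar information as the seed compounds. The algorithm learns by bootstrapping, enlarging the compound list through self-learning. Experimental results show that the algorithm meets the information extraction needs of keyword recommendation in advertising systems, with high precision and acceptable recall. For sentiment word extraction in text analysis, this dissertation proposes a Chinese sentiment word extraction algorithm based on maximum entropy and an LMR (Left, Middle, Right) template. A sliding window is set over the text, the LMR template marks word positions, and words, their relative positions, and part-of-speech information serve as features for identifying and extracting sentiment words. Experimental results show that the algorithm achieves high recall and precision, and that sentiment word extraction is robust under certain feature combinations.

(3) Supervised and semi-supervised text sentiment classification. For the large volume of popular music and of original or adapted music by Internet users, this dissertation proposes a sentiment classification method for song lyrics. First, word statistics on the lyrics corpus show that the distribution basically follows Zipf's law, though it differs slightly from the word distribution of a general Chinese classification corpus (the 863 Program text classification test data). Because classifying the emotion expressed by lyrics differs from classifying ordinary text by topic, more features that express emotional color need to be extracted. Within the N-gram framework, this dissertation applies three preprocessing methods (different N-gram templates, stop word removal, and part-of-speech filtering) to extract more emotional semantic features from lyrics, and proposes a classification algorithm using maximum entropy models with Gaussian and exponential priors to model the emotional features of lyrics. Experimental results show that maximum entropy models with Gaussian and exponential priors are well suited to lyric sentiment analysis. For the shortage of labeled data in practical sentiment classification, this dissertation proposes a text sentiment classification algorithm based on semi-supervised learning. It assumes that a sentiment manifold exists in the space and treats the texts to be classified as points sampled from this manifold. First, a graph is built from the neighborhood information of these points, with the weights of the edges between each point and its neighbors given by the linear weighted representation of that point by its neighbors; the graph is then treated as a probability transition matrix over which the class labels diffuse to complete the sentiment classification. Experiments on movie review and Chinese lyrics corpora show that the algorithm performs well on text sentiment classification.

(4) Text opinion retrieval. Taking as its main thread the topic-oriented Chinese opinion retrieval task of COAE2008, in which the author participated in 2008, this part presents the submitted system PRIS-SAS. The system works in two stages. After preprocessing such as encoding conversion and word segmentation, PRIS-SAS first uses the Indri retrieval system to index the corpus and performs ad-hoc retrieval with the topic words of the task. It then uses the text sentiment classification algorithms of this dissertation to build a sentiment orientation model and a polarity model, judges the orientation of the retrieved relevant texts, and reranks the retrieval results. Evaluation metrics on the COAE2008 dataset show that the opinion retrieval system designed in this dissertation achieves a high level of performance.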
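The chunk extraction step in contribution (1), character-level N-grams followed by statistical substring reduction and mutual information filtering, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the `min_freq` and `min_pmi` thresholds, and the choice of the lowest pointwise mutual information over all two-way splits as the filtering score are assumptions made for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, max_n):
    """Yield every character n-gram of length 1..max_n from unsegmented text."""
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def substring_reduction(counts):
    """Drop a gram if a longer counted gram contains it with the same frequency."""
    return {
        gram: freq
        for gram, freq in counts.items()
        if not any(
            len(other) > len(gram) and gram in other and other_freq == freq
            for other, other_freq in counts.items()
        )
    }


def min_split_pmi(gram, counts, totals):
    """Lowest pointwise mutual information over all two-way splits of gram."""
    p_gram = counts[gram] / totals[len(gram)]
    scores = []
    for k in range(1, len(gram)):
        left, right = gram[:k], gram[k:]
        p_left = counts[left] / totals[len(left)]
        p_right = counts[right] / totals[len(right)]
        scores.append(math.log(p_gram / (p_left * p_right)))
    return min(scores)


def extract_chunks(texts, max_n=3, min_freq=2, min_pmi=0.0):
    """Return candidate chunks (hypothetical thresholds) with their frequencies."""
    counts = Counter(g for t in texts for g in char_ngrams(t, max_n))
    # Total number of n-gram occurrences per length, used to estimate probabilities.
    totals = Counter()
    for gram, freq in counts.items():
        totals[len(gram)] += freq
    chunks = {}
    for gram, freq in substring_reduction(counts).items():
        if freq < min_freq:
            continue
        if len(gram) > 1 and min_split_pmi(gram, counts, totals) < min_pmi:
            continue
        chunks[gram] = freq
    return chunks
```

Calling `extract_chunks(["北京大学", "北京邮电"])` keeps only the chunk `北京`: the single characters `北` and `京` are absorbed because they occur exactly as often as the longer string that contains them, and the remaining grams fall below the assumed frequency threshold. The quadratic substring check is acceptable for a toy corpus but would need an index for a large one.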

【Abstract】 With the rapid development of the Internet and communication networks, web documents have become one of the major modern information media as well as an indispensable information source in people’s lives, and text mining has become a technology of great research and practical significance. With the arrival of Web 2.0, more and more users are involved in generating information, and the Internet is filling with personal, opinionated content. Such content is meaningful and valuable for many applications, such as e-commerce, online communities, network information security, and web search engines. However, processing these texts with traditional text mining poses enormous challenges. In this dissertation, three problems are investigated: short text computing, web text information extraction, and text sentiment analysis. The main contributions of this dissertation are summarized as follows:

(1) Short text computing based on statistical language models. We introduce an algorithm to cluster Chinese short texts based on N-gram feature extraction. Targeting the characteristics of Chinese short texts, the algorithm employs N-gram feature extraction, statistical substring reduction, and mutual information filtering to capture Chinese chunks from texts; these chunks reflect the semantic structure of the text and the dependencies between characters. The RPCL algorithm, which does not need to know the exact number of clusters in advance, is then applied to cluster the texts with high precision. Experimental results show that, compared with traditional methods, this approach remarkably reduces the dimensionality and effectively improves the performance of Chinese short text clustering.

(2) Web text information extraction for keyword recommendation systems and sentiment analysis. For keyword recommendation in advertising, we propose a semi-supervised Chinese compound extraction approach based on HMM and bootstrapping.
First, we define a set of tags BEMI {beginning, end, middle, independent}, which indicate the position of a word within a compound. We then employ an HMM to extract compounds automatically with the BEMI tagging algorithm. We rank the compounds extracted from the corpus by word frequency and length in descending order, and add the top N compounds to the seed compound list. In this way the algorithm learns more Chinese compounds from the corpus by bootstrapping. Experimental results show that this approach achieves much higher performance than an unsupervised one. Unlike compounds extracted by traditional methods, these Chinese compounds carry category information and can be used as features in text classification and clustering. Because of its extensibility and versatility, the approach can also be applied to keyword recommendation in advertising for different kinds of advertisers. For word-level sentiment analysis, we propose an algorithm based on the Maximum Entropy (ME) model and an LMR template. The LMR template tags word positions, and words, word positions, and POS tags are used as features in ME. A text window slides over the text, and the sentiment of the word in the M position is labeled. Experimental results show that this algorithm performs well in sentiment word extraction and is robust under some feature combinations.

(3) Text sentiment classification based on supervised and semi-supervised learning. Most pop songs have accompanying lyrics, which play an essential role in understanding songs semantically. Analysis of lyrics is therefore a natural complement to acoustic methods for music retrieval. One basic aspect of music retrieval is music emotion classification by learning from lyrics. This problem differs from traditional text classification in that more linguistic or semantic information is required for better emotion analysis. We examine the lyrics corpus against Zipf’s Law using the word as the unit, and the results roughly obey Zipf’s Law.
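A check of this kind can be sketched as a least-squares fit of log frequency against log rank, where a slope close to -1 indicates an approximately Zipfian distribution. The function name and the fitting procedure are assumptions for illustration; the abstract does not say how the dissertation measured agreement with Zipf’s Law.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank) for a token list."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

In practice `tokens` would be the segmented words of the lyrics corpus, and the fitted slope could be compared with the slope obtained on a general-purpose corpus.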
We therefore study three kinds of preprocessing methods (different N-grams, stop word removal, and POS-based filtering) and a series of language grams under the well-known N-gram language model framework to extract more semantic features. In addition, we improve the Maximum Entropy model with Gaussian and exponential priors to model features for music emotion classification. Experimental results show that the feature extraction methods improve music emotion classification accuracy, and that ME with priors obtains the best results. Since labeled data in sentiment classification is scarce, we also introduce a novel semi-supervised learning algorithm for this situation. We assume that there is a sentiment manifold structure and that documents are sampled from this manifold. We create a graph on both labeled and unlabeled data, constructed linearly from the data points’ neighborhood information. Labels are then spread through the graph, which is regarded as a probabilistic transition matrix during the spreading process. This algorithm is capable of learning sentiment manifold structures within texts. Promising experimental results are obtained on lyrics and movie review data.

(4) Opinion retrieval. Following the Chinese Opinion Analysis Evaluation (COAE2008), we discuss text opinion retrieval. Our sentiment analysis system, PRIS-SAS, employs a two-stage approach. After preprocessing, the corpus provided by COAE2008 is indexed by the Indri retrieval system, which is used for ad-hoc retrieval. A sentiment model and a polarity model trained by ME with priors are then used to classify the texts returned by Indri, and the retrieval results are reranked according to the classification results. Experiments on the COAE2008 datasets show that the system proposed in this dissertation is a state-of-the-art opinion retrieval system.
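The graph-based label spreading in contribution (3) can be illustrated with a simplified sketch. It is not the dissertation's algorithm: for brevity it uses row-normalised cosine similarities over the k nearest neighbours in place of the linear neighbourhood weights described above, and the function names, `k`, `alpha`, and the iteration count are all assumptions. Feature vectors are assumed to be non-negative (for example, term counts).

```python
def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0


def transition_matrix(points, k):
    """Row-normalised similarity graph over each point's k nearest neighbours."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((cosine(points[i], points[j]), j) for j in range(n) if j != i),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in sims)
        for s, j in sims:
            matrix[i][j] = s / total if total else 0.0
    return matrix


def propagate(points, labels, n_classes, k=2, alpha=0.9, iters=100):
    """Spread known labels over the graph; labels[i] is a class index or None."""
    n = len(points)
    P = transition_matrix(points, k)
    # One-hot rows for labeled points, all-zero rows for unlabeled points.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        # Each point absorbs its neighbours' scores and is pulled back to its seed label.
        F = [
            [
                alpha * sum(P[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]
```

With one labeled document per class and a handful of unlabeled ones, `propagate` assigns each unlabeled document the class whose seed is reachable through its neighbourhood. Replacing the cosine weights with weights that reconstruct each point linearly from its neighbours would bring the sketch closer to the construction the abstract describes.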

  • 【Classification number】TP311.13
  • 【Citation count】15
  • 【Download count】2364