
Data Feature Extraction of Blogs and Filtering of Splogs Based on Classification

(Original Chinese title: 博客数据特征提取与基于分类的垃圾博客过滤)

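【Author】 Yan Rui (闫瑞)

【Advisor】 Cao Xianbin (曹先彬)

【Author Information】 University of Science and Technology of China, Computer Software and Theory, 2009, Master's thesis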

【Abstract】 With the rapid development of the Internet, blogs have become a new mode of online communication after email, BBS, and QQ/ICQ, and have quickly become part of daily life as a basic Internet service. As the blogosphere has grown, spam blogs (splogs) have spread to every corner of it. The large number of splogs seriously degrades the accuracy of information retrieval and worsens the user experience, so identifying splogs accurately has become a pressing problem in information retrieval. In the information security field, opinion analysis of blog content has become a new research focus, but a large number of splogs seriously distorts its results and reduces their correctness and credibility. Splogs must therefore be filtered out before further analysis and retrieval.

Building on existing splog feature extraction, this thesis proposes using part-of-speech analysis to extract additional blog features. In Chinese grammar a sentence consists of subject, predicate, and object; colloquial writing also contains many elliptical sentences that have only a subject and a predicate, or only a predicate. Most blog authors write about things that interest them or record their moods and recent circumstances, and they use many adjectives and modal particles in the body text to express themselves. Splogs, by contrast, are usually written only to raise click-through rates or to raise a page's importance in search engines by adding links and keywords, so they contain many nouns, especially industry-specific proper nouns. Extracting part-of-speech features from blog posts therefore greatly increases the complementarity among features and improves splog classification and filtering.

The thesis further designs a dynamic ensemble classification algorithm for splog filtering. The algorithm first constructs a tree-structured ensemble classifier to support classification, and then uses a dynamic adjustment strategy to train it. Compared with existing methods based on a single classifier or a simple ensemble, it adaptively adjusts the ensemble structure according to the distribution of the samples, which effectively mitigates the effect of sparse features and highly imbalanced samples on classification performance. Splog filtering experiments show that the algorithm achieves good precision and recall.

Finally, the thesis designs and implements a prototype information retrieval system based on blog content and applies the splog filtering algorithm to it, with good results.
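As a rough illustration of the part-of-speech features described in the abstract, the sketch below computes the share of nouns, verbs, adjectives, and modal particles in a post that has already been segmented and tagged. It is not the thesis's implementation: the function name `pos_ratio_features`, the `TAG_GROUPS` mapping, and the PKU/ICTCLAS-style tag prefixes (`n`, `v`, `a`, `y`, `e`) are assumptions made for this example, and the two tagged posts are invented.

```python
from collections import Counter

# Assumed coarse tag groups, keyed by PKU/ICTCLAS-style tag prefixes.
# The tag set actually used in the thesis is not stated in the abstract.
TAG_GROUPS = {
    "noun": ("n",),
    "verb": ("v",),
    "adjective": ("a",),
    "modal_particle": ("y", "e"),
}


def pos_ratio_features(tagged_tokens):
    """Return the fraction of tokens that fall in each coarse tag group.

    tagged_tokens is a list of (word, tag) pairs produced by any
    segmenter and tagger; tokens with other tags only count in the total.
    """
    counts = Counter()
    for _word, tag in tagged_tokens:
        for group, prefixes in TAG_GROUPS.items():
            if tag.startswith(prefixes):
                counts[group] += 1
                break
    total = len(tagged_tokens)
    return {g: (counts[g] / total if total else 0.0) for g in TAG_GROUPS}


# Invented examples: a diary-style sentence and a keyword-stuffed one.
personal = [("今天", "t"), ("我", "r"), ("真", "d"), ("开心", "a"), ("啊", "y")]
spam = [("手机", "n"), ("批发", "v"), ("手机", "n"), ("报价", "n"), ("手机", "n")]

personal_features = pos_ratio_features(personal)
spam_features = pos_ratio_features(spam)
```

In this toy input the keyword-stuffed post has a much higher noun ratio than the diary-style one, which is the kind of contrast the abstract attributes to splogs. Ratios like these would be appended to the existing feature vector before classification.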
