
Research on Theories and Methods of Information Filtering under Web 2.0

【Author】 Li Dongfang

【Supervisor】 Yu Nenghai

【Author Information】 University of Science and Technology of China, Signal and Information Processing, 2009, PhD

【摘要】 The Internet has developed rapidly in recent years. With the continued advance of technologies such as Web 2.0, the Internet carries ever more applications and information activities, and people depend on it more than ever. In the Web 2.0 era, on the one hand, media types on the Internet have diversified: the auditory and visual information carried by multimedia complements traditional text, greatly enriching Internet content and the browsing experience. Effectively filtering information across multiple media types is therefore an important task of Web 2.0 information filtering. On the other hand, users are now the center of the Internet, which exhibits social and dynamic characteristics, and large amounts of dynamic data emerge. These data greatly enrich Internet content and provide numerous information sources. Learning user habits from this user-generated data and mining the hot information within it has become an important research topic. In addition, massive user participation brings massive data to the Internet, so adapting traditional algorithms to such data volumes is another important research topic.

This thesis focuses on information filtering under Web 2.0. We analyze the challenges facing information filtering in this setting and study three problems: unified filtering across multiple media types, learning algorithms for massive data, and mining the rich feedback data of Web 2.0 users, proposing theories and methods for each. The main research contents and innovations are as follows:

1. For the coexistence of multiple media types in the Web 2.0 era, we propose an information filtering method that combines features from multiple media. For the problem of filtering advertising images on the Internet, we combine textual information in web pages with image content information, integrate the SVM and AdaBoost learning algorithms, and filter advertising images effectively. We extract rich media content features, related page layout features, and text features, and propose an AdaBoost-based feature selection method to screen and integrate the feature set. We also build a large-scale experimental dataset to validate the algorithm. The results confirm the soundness of the chosen feature set and the feasibility of the feature selection algorithm, and we compare the classification effectiveness of the individual features.

2. Based on Normalized Cut, we propose a fast spectral clustering algorithm, FSC, for rapidly clustering massive text data on the Internet. We analyze the difficulties of applying spectral clustering to large-scale text clustering and give solutions. FSC first uses the GSASH algorithm to quickly represent large-scale, high-dimensional text data as a graph, then uses the AMG numerical method to iteratively reduce the large eigenvalue system of the spectral analysis to a smaller one, obtaining an approximate solution. We analyze the validity of this approximation theoretically. Experiments show that FSC retains the advantages of spectral clustering while reducing the complexity to O(n log n), making it applicable to large-scale text clustering.

3. Based on a heat diffusion model, we propose an information-heat evaluation and mining algorithm for the Web 2.0 environment. Targeting the social and dynamic characteristics of the Web 2.0 Internet, we model the Internet, treating users' information activities as heat activities, build an Internet heat diffusion model, use user feedback to evaluate the heat of information on the Internet, and mine the hot topics within it. We give a detailed definition of the heat model and prove its stability and the convergence of the algorithm. Experiments show that the algorithm simulates information activities on the Internet well.
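The AdaBoost-based feature selection described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the thesis's actual implementation: the decision-stump weak learner and the idea of scoring each feature by the total boosting weight (alpha) accumulated by stumps built on it are assumptions made here for illustration.

```python
import numpy as np

def stump_predict(x, thresh, polarity):
    """Decision stump on one feature: +1 on one side of thresh, -1 on the other."""
    return polarity * np.where(x >= thresh, 1, -1)

def adaboost_feature_ranking(X, y, rounds=10):
    """Score features by the total AdaBoost weight of stumps built on them.

    X: (n_samples, n_features) feature matrix; y: labels in {-1, +1}.
    Returns {feature_index: accumulated alpha}; a higher score means the
    feature was chosen more often / more decisively by the boosting rounds.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)                       # sample weights
    scores = {}
    for _ in range(rounds):
        best = None                               # (err, feat, thresh, polarity)
        for f in range(d):
            for thresh in np.unique(X[:, f]):
                for polarity in (1, -1):
                    pred = stump_predict(X[:, f], thresh, polarity)
                    err = w[pred != y].sum()      # weighted training error
                    if best is None or err < best[0]:
                        best = (err, f, thresh, polarity)
        err, f, thresh, polarity = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        pred = stump_predict(X[:, f], thresh, polarity)
        w = w * np.exp(-alpha * y * pred)         # up-weight misclassified samples
        w = w / w.sum()
        scores[f] = scores.get(f, 0.0) + alpha
    return scores
```

Features whose accumulated alpha is near zero were never useful to the ensemble and can be dropped from the full feature set.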

【Abstract】 The Internet has developed rapidly in recent years. As technologies such as Web 2.0 advance, more and more information activities and applications are carried on the Internet, and people depend on it more than ever. In the Web 2.0 era, on one hand, there are diversified media formats on the Internet. Auditory and visual information, combined with traditional text information, greatly enriches the contents of the Internet and improves the user experience. Filtering multimedia information is thus an important task of Web 2.0 information filtering. On the other hand, users have become the center of the Internet. Vast amounts of information are consumed and created by users; this user-created information enriches the Internet and provides people with many information sources. Besides, the huge number of users and user actions brings the Internet vast amounts of data, and modifying traditional machine learning algorithms to fit large-scale computing circumstances is a difficult research topic.

We focus on the study of information filtering in the Web 2.0 era. We analyze the challenges of information filtering in Web 2.0 and study the problems of filtering various media types, large-scale machine learning algorithms, and mining user feedback, proposing theoretical analyses and solutions for these problems. The main research contents and innovations of this thesis are as follows:

1. We propose a unified information filtering algorithm based on multiple features of multiple media types in the Web 2.0 era. For the advertising image detection problem, we use features such as image content and the image's surrounding text, and integrate machine learning algorithms such as SVM and AdaBoost. The filtering results demonstrate the effectiveness of our algorithm. The feature set combines media content features, web page visual layout features, and text features, which are verified to be useful for classifying advertising images. Moreover, we propose a feature selection algorithm based on AdaBoost that selects useful features out of the original full feature set. We construct a large dataset to verify our algorithm; the experimental results demonstrate that the feature selection algorithm is feasible and reasonable. In addition, we compare the classification effectiveness of each feature.

2. We propose a fast spectral clustering algorithm (FSC) based on Normalized Cut, which can perform clustering on large-scale text corpora. We analyze the bottleneck of applying spectral clustering to large-scale text corpora and propose solutions. FSC first uses the GSASH method to build a graph from a large-scale text corpus, then uses the AMG method to iteratively reduce a large-scale eigenvalue system into a smaller one, obtaining an approximate solution. We verify FSC from both theoretical and experimental perspectives. The experimental results demonstrate that FSC reduces the complexity to O(n log n) while keeping the good performance of spectral clustering.

3. We propose a hot topic evaluation and mining algorithm based on a heat diffusion model for the Web 2.0 environment. First, we model the Internet under Web 2.0 according to its dynamic and social properties. Second, we regard the information activities on the Internet as heat activities and model them with a heat diffusion model. We use the feedback of web users as heat input, evaluate the heat of information on the Internet, and mine the hot topics. This thesis gives a detailed definition of the heat diffusion model and proves its stability and convergence. The experimental results demonstrate that our algorithm can simulate information activities on the Internet.
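The heat-diffusion idea in contribution 3 can be illustrated with a minimal discrete sketch. Everything here is an assumption for illustration, not the thesis's model: the graph, the row-normalised diffusion operator, the decay factor, and the treatment of user feedback as a constant per-step heat source are all choices made for this sketch.

```python
import numpy as np

def heat_diffusion(adj, source, alpha=0.2, decay=0.1, steps=100):
    """Iterate a discrete heat-diffusion step on an item graph.

    adj:    symmetric adjacency matrix of the graph
    source: heat injected into each node per step (e.g. user feedback counts)
    Each step, a node keeps (1 - alpha) of its heat, receives alpha times the
    average heat of its neighbours, loses a fraction `decay`, and then absorbs
    the external source. With 0 < decay < 1 the update is a contraction, so
    the heat vector converges to a stable fixed point; the hottest nodes are
    the "hot topics" under this toy model.
    """
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1.0
    A_norm = adj / deg[:, None]                   # row-normalised adjacency
    h = np.zeros(len(adj))
    for _ in range(steps):
        h = (1.0 - decay) * ((1.0 - alpha) * h + alpha * (A_norm @ h)) + source
    return h
```

For example, on a three-node path graph with all user feedback injected at one end, the heat ranking after convergence decreases with distance from the feedback source, mirroring the intuition that activity diffuses outward from where users interact.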
