节点文献

基于群智能和随机索引的网络聚类算法研究

Research on Web Clustering Algorithms Based on Swarm Intelligence and Random Indexing

【作者】 万淼

【导师】 王枞;

【作者基本信息】 北京邮电大学 , 信号与信息处理, 2011, 博士

【摘要】 聚类(Clustering)是将数据对象划分为有意义的组(或簇)的过程。作为数据挖掘中的一种重要的技术,聚类分析在很多领域中都扮演着重要的角色。尤其是,随着当今世界各种信息的数据量不断增大、研究问题的复杂度不断增加,现有的聚类分析技术也面临着越来越多的挑战,因此,研究新的聚类算法已经成为数据挖掘、机器学习、统计学和生物学等多个相关研究领域中的前沿和热点问题之一。群居昆虫的社会性行为,诸如寻找最好的食物源、搭建结构最优的巢穴、孵卵、保护幼虫、守卫种群等都表现出群体层面的宏观智能行为。群体智能(Swarm Intelligence,简称为SI)是为了解决复杂优化问题而创立的一类分布式智能范式体系,其灵感最初就源于对自然界中昆虫群体的观察,通过模拟自然界生物的这种群体行为来实现人工智能。因为聚类策略在多个领域应用的重要性,一些基于群体智能思想的优化算法,如蚂蚁种群优化和粒子群优化等,已经被引入数据挖掘领域,来解决聚类问题。由于聚类用的评价标准函数(Criterion Functions)通常是非凸的或者是非线性的,传统的聚类方法,特别是k均值(k-means)算法,具有对聚类的初始值敏感并且容易使搜索陷入局部最优的缺点。随着实际应用中数据集的维度不断增长,寻找标准函数的最优解是一个"NP-难”的问题。Web用户在浏览网站时,会根据他们不同的信息需求或潜在的任务和目的,而表现出多种多样的行为,这些行为都被Web访问日志跟踪并且记录下来。Web日志挖掘可以通过发现和分析网络用户访问行为的特征和规律,从而达到识别网站的潜在客户、提高对用户的服务质量的目的。基于聚类算法的Web日志挖掘与用户行为分析方法发展的较晚,并且在Web聚类技术中,目前比较常见的是针对Web会话和Web页面内容的聚类方法研究,针对Web用户浏览模式的聚类方法的研究还相对较少。而且,目前已有的Web用户行为分析和聚类技术只关注用户在页面级别的浏览行为,而对于Web用户活动之间的潜在联系或隐含特征很少关注,并且对与特定浏览模式之间隐藏或无法观察的因素也研究的很少。因此,需要研究和开发新的Web用户聚类技术和用户建模技术,发掘用户行为中潜在的隐藏信息,从而有助于有效地改进用户聚类技术的性能。Web用户行为聚类的结果可以用于各种途径的高级应用任务,例如Web缓存和预取。目前有很多Web挖掘方法被用于提高从Web访问日志中预测用户访问模式的准确率,以便高效地对Web对象进行预取。目前在预取领域,现有的这些技术大部分都仅仅局限于对单个用户请求的预测,而关于对群体用户的请求预测方面的研究还较少。本论文的主要创新工作可以归纳如下:(1)针对现有的聚类算法大多存在局限于单一类型的数据集、在搜索时容易陷入局部最优并难以在高维数据集上达到理想效果的问题,本论文在已有的混沌蚂蚁群(Chaotic Ant Swarm,简写为CAS))算法的基础上受蚂蚁混沌搜索和蚁群全局智能优化等行为的启发,根据数据聚类应用的特点,提出了一种新的基于蚂蚁混沌行为的聚类算法(简称为CAS-C算法))。本论文拓展了混沌蚂蚁群算法的应用领域,大量的数值仿真对比实验结果表明了本论文所提的CAS-C算法具有对中心初值不敏感、能够找到全局最优解、具有较高的算法稳定性和准确率的优点。本论文所提的算法更适合于对真实的数据集进行聚类。(2)菌群觅食(Bacterial Foraging,简写为BF)优化算法是一种基于细菌群体行为和进化过程的优化搜索算法,但目前它还不够完善,菌群觅食优化算法的改进及参数调整是目前研究的一个重要问题,尤其是,基于菌群觅食行为的聚类算法方面目前的研究还很少。本论文受菌群觅食行为的启发,提出了一种新的基于菌群觅食优化思想的聚类算法(简称为BF-C算法),通过模仿细菌觅食过程,寻找聚类的最优中心。本论文同时对算法中的各个参数在数据聚类领域的设置进行了详细地讨论与分析。与其他全局优化算法相比,本论文所提出的BF-C算法具有易于理解、计算简单、收敛速度快的优点,但其趋化步长由于缺少对环境的自适应性,需要根据具体应用问题的不同而需要进行具体的讨论。(3)应用传统的数据挖掘方法进行Web用户行为识别时,具有初值敏感、容易陷入局部最优和在高维数据的挖掘上性能有所下降的缺点。本论文针对Web聚类技术中目前面临的这些问题,将所提出的基于蚂蚁混沌行为的CAS-C聚类算法应用到Web日志分析与用户聚类当中,以发现用户的浏览模式,从而提高Web用户聚类的性能。为了检验所提方案的有效性和可行性,本论文将基于CAS-C的Web用户聚类结果与目前在Web挖掘领域广泛应用的两种算法(k值聚类算法和FCMdd算法)的Web用户聚类结果进行了比较。大量的计算机数值仿真实验表明了使用我们所提出的CAS-C算法能够获得凝聚度和分散度更好的Web用户聚类结果,可以有效地识别用户的公共兴趣。(4)在对Web用户日志进行分析和挖掘的过程中,需要对Web用户的浏览行为进行形式化的表示,这个过程一般被称为用户建模。目前已有的Web用户行为分析和聚类技术只关注用户在页面级别的浏览行为,而对于Web用户活动之间的潜在联系或隐含特征却很少关注,并且对与特定浏览模式之间隐藏或无法观察的因素也研究甚少。因此,我们提出基于随机索引的用户建模方式,借助自然语言处理领域“上下文”的概念,对URL进行分段索引建模。这样,在用户建模的过程中,能够将浏览模式中的隐藏信息加入其中,进而有效地指导Web用户聚类算法,改进聚类的效果。我们通过聚类实验比较了这两种建模方式:特征向量方法和随机索引方法,大量相关的聚类实验的结果表明了随机索引建模方式的优越性。(5)本论文所提的聚类算法可以用于各种高级应用任务,例如Web缓存和预取。同时,为了检验我们用户聚类算法的聚类效果,本文基于随机索引建模方法和CAS-C算法,提出了一种新的群体用户的行为预测和网页预取方案,通过建立用户公共档案,总结用户的共同兴趣,并且基于用户聚类结果,建立群体用户的网页预取规则,预取用户未来可能点击的网页,并存入网站的缓存中。为了使实验结果具有说服力,我们仍然选取经典的k均值聚类算法和在Web挖掘领域广泛应用的FCMdd算法作为比较算法。大量的预取实验结果表明了在随机索引用户模型的帮助下,基于CAS-C的Web用户聚类方案能够获得较高的网页预取的准确率。

【Abstract】 Clustering divides data into meaningful or useful groups (clusters) without any prior knowledge. It is a key technique in data mining and has become an important issue in many fields. In particular, with the amount of all kinds of information and data in the world increasing and the study problems becoming more and more complex, the existing clustering techniques are also facing increasing challenges. So the study about new clustering algorithms is an important issue in the research fields including data mining, machine learning, statistics, and biology.The social insects’behavior such as finding the best food source, building of optimal nest structure, brooding, protecting the larva, guarding, etc. show intelligent behavior on the swarm level. Swarm Intelligence (SI) is an innovative distributed intelligent paradigm for solving optimization problems that originally took its inspiration from the biological examples. It can achieve artificial intelligence by simulating the natural biological behaviors. As the importance of clustering strategies in many fields, global optimization methods based on swarm intelligence have been applied to solve clustering problems. Since criterion functions for clustering are usually non-convex and nonlinear, traditional approaches, especially the k-means algorithm, are sensitive to initializations and easy to be trapped in local optimal solutions. As the increasing numbers and dimensions of data sets, finding solution to the criterion functions of the clustering has become an NP-hard problem.Users of a Web site usually exhibit various types of behaviours associated with their information needs and intended tasks by clicking or visiting Web pages. These behaviours can be traced in the Web access log files of the Web site that the user visited. Web usage mining, which captures navigational patterns of Web users from log files, could detect and analyze the characteristics of Web user behavior patterns of access to a Web site, and therefore identify potential customers and improve the quality of service to users. Clustering techonology is a newly developed paradigm in Web usage mining and Web user behavior analysis. Therein, current Web clustering methods are mostly based on Web sessions Web page content, while there are relatively few approaches to clustering Web users’ navigation patterns. Moreover, the conventional Web usage mining techniques for analyzing user behavior only capture stand alone user behaviours at the page view level, but cannot identify the intrinsic characteristics of Web user activities, nor quantify the underlying and unobservable factors associated with specific navigational patterns. Thus, it is necessary to develop new Web user clutering and modeling methodologies to identify the latent factors or hidden relationships among users’navigational behavior and improve the performance of clustering technology effectively.The results produced by Web user clustering can be used in various advanced applications, for example, Web prefetching and catching. Many techniques, including Web Mining approaches, have been utilized or improving the accuracy of predicting user access patterns from Web access logs, making the prefetching of Web objects more efficient. Most of these techniques are, however, limited to predicting requests for a single user only. Predicting groups of users’interest have caught little attention in the area of prefetching.The main works of the dissertation could be summarized as follows:(1) Most existing clutering algorithms have some limitations, such as, limited to a single type of data set, easy to fall into local optimum during search process, and difficult to achieve encouraging results on high-dimensional data sets. To overcome these drawbacks of traditional clustering techniques, according to the characteristics of data clustering applications and on the basis of the existing chaotic ant swarm (CAS) algorithm, in this thesis we propose a clustering algorithm (referred to as the CAS-C algorithm) based on behaviors of ants’chaotic activities. Our work extends the application fields of the chaotic ant swarm algorithm. Numerical simulation experiments show that the proposed CAS-C algorithm has advantages such as not sensitive to initialized centers, finding a global optimum clustering result, and suitable to high-dimentional data and clusters with different shapes.(2) The Bacterial Foraging (BF) algorithm is a new stochastic search technique and optimization model based on the foraging behavior of bacteria swarm. However, as a new kind of intelligent bionic algorithm, BF is still not good enough. The algorithm improvement and parameter adjustment are important issues in the present study of the Bacterial Foraging optimization, where study about clustering based on bacteria foraging behavior is especially rare. Inspired by bacterial foraging behavior, this thesis proposes a new clustering algorithm (called, the BF-C algorithm) based on bacterial foraging optimization. Meanwhile, the thesis also gives out detailed investigations and analysis on setting BF-C parameters in data clustering. Compared to other global optimization-based clustering techniques, the BF-C algorithm is easier to understand, more fast and simple. However, the chemotactic step size is sensitive to the envorionment changement, and needs to be investigated for different condition-settings.(3) Traditional clustering methods are sensitive to the initial values and may get trapped in a local optimal easily. According to the problems and characteristics of traditional Web user clustering techniques, in this theis we introduce the clustering algorithm based on chaotic ant swarm to Web log analysis and user clustering to discover user navigation patterns, and as a result, improve the performance of Web user clustering. To evaluate the effect of the proposed methodology, the clustering results based CAS-C are compared to two methods that are widely used in Web mining (the k-means algorithm and the FCMdd algorithm). Large amount of numerical simulation experimental results show that our proposed CAS-C approach could get more compact and well-separated cluster clusters, and can effectively identify common interests of users.(4) During the process of analyzing and mining Web user access logs, Web user navigation behaviors need to be processed and fomalized to a certain form. Generally this process is called as user modeling. Current Web user behavior analysis and clustering techniques only capture stand alone user behaviours at the page view level, but cannot identify the intrinsic characteristics of Web user activities, nor quantify the underlying and unobservable factors associated with specific navigational patterns. Thus, we propose a Web user modelling approach based on Random Indexing (RI), segmenting and index modeling URL with the concept "context" in natural language processing. Thus, in the user modeling process, hidden information under the browse patterns could be mixed in, and furthermore, help the Web users clustering algorithm effectively and improve clustering results. Clustering experiments are conduct for two kinds of user modeling techniques, the feature vector method and the Random Indexing method, to show the superiority of the RI-based user model.(5) The results produced by our Web user clustering algorithm can be used in various advanced Web applications, such as Web caching and prefetching. Meanwhile, in order to evaluate the results of the proposed Web user clustering approach, we present a program of predicting behaviors of grouped users and Web page prefetching. Common interests of users are summarized through common user profile creation. Furthermore, based on results of Web user clustering, we establish prefetch rules for group users and put pages that users may click in the future into the cache of the Web site. To make our experimental results more convincing, our clustering and prefetching approaches are also compared to the k-means algorithm and the FCMdd algorithm. Numerical experimental results of prefetching show that with the help of the RI-based Web user model, the Web user clustering technique based on CAS-C could get higher accuracy of Web page prefetching.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络