Research on a Dynamic Web Page Recommendation System Based on Web Mining

【Author】 Duan Lijun (段利君)

【Advisor】 Zhong Yiping (钟亦平)

【Author information】 Fudan University, Computer Application Technology, 2010, Master's degree

【Abstract】 Extracting user access patterns with Web mining techniques has significant practical value: it supports prefetching pages while a user browses, recommending products in e-commerce, and improving the organizational structure of a website. In today's information explosion, however, both website content and user browsing behavior change constantly, which places new demands on the design of web page recommendation systems. To predict the page a user is likely to visit next, a recommendation system has to look back at the pages already browsed. Because sequential patterns take the page browsing sequence into account, this thesis builds on the theory of sequential patterns. In research on mining user browsing patterns from sequences, the Markov model and the PLSA model are the most widely used. Analysis in this thesis finds that both models fall short in adapting to rapid changes in website content and user browsing behavior.

The thesis first reviews the state of research in this field, in China and abroad, and the general workflow of Web data mining. For Web log preprocessing, it presents a method for filtering log data. For web page clustering, it analyzes the existing clustering methods and then proposes two URL-based methods that apply when the website is well organized: one based on the distance between URLs and one based on a path tree. Because the URL-distance algorithm does not adapt to a dynamically growing page structure, the thesis mainly adopts the path-tree method. For sequential pattern mining, it analyzes the shortcomings of PLSA and proposes the RTA algorithm, which is based on the path tree. It then describes how the recommendation system is updated.

Next, the thesis analyzes users' habits when visiting a website and, on that basis, presents a design for the web page recommendation system. Finally, it evaluates the recommendation system by hit ratio and reports how the number of recommended pages, the support threshold, and the sliding-window length each relate to the hit ratio. The results are compared with experiments based on the PLSA algorithm and indicate that, under certain conditions, the RTA algorithm outperforms PLSA.
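The evaluation described in the abstract can be illustrated with a small sketch. It is not the RTA algorithm, whose details the abstract does not give; it assumes a plain frequency table keyed by the last `window` pages as a stand-in recommender, and it assumes hit ratio means the fraction of predictions whose recommended list contains the page actually visited next. All function and parameter names (`build_transitions`, `recommend`, `hit_ratio`, `top_n`, `min_support`) are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(sessions, window=2):
    """Count which page follows each run of `window` consecutive pages."""
    table = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - window):
            key = tuple(session[i:i + window])
            table[key][session[i + window]] += 1
    return table


def recommend(table, recent, top_n=2, min_support=1):
    """Return up to `top_n` next pages seen at least `min_support` times."""
    counts = table.get(tuple(recent), Counter())
    return [page for page, n in counts.most_common(top_n) if n >= min_support]


def hit_ratio(table, sessions, window=2, top_n=2, min_support=1):
    """Assumed definition: hits / predictions over all sliding-window positions."""
    hits = total = 0
    for session in sessions:
        for i in range(len(session) - window):
            recs = recommend(table, session[i:i + window], top_n, min_support)
            total += 1
            hits += session[i + window] in recs
    return hits / total if total else 0.0


# Toy sessions; the page paths are made up for illustration.
train = [["/a", "/b", "/c"], ["/a", "/b", "/c"], ["/a", "/b", "/d"]]
test = [["/a", "/b", "/c"], ["/a", "/b", "/d"]]
table = build_transitions(train, window=2)
ratio_one_page = hit_ratio(table, test, window=2, top_n=1)
ratio_two_pages = hit_ratio(table, test, window=2, top_n=2)
```

Varying `top_n`, `min_support`, and `window` in this sketch corresponds to the three parameters whose relationship to the hit ratio the thesis reports.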

  • 【Online publication contributor】 Fudan University
  • 【Online publication year and issue】 2011, Issue 03