Research on Data Preprocessing Algorithms in Web Log Mining

【Author】 Zhu Hexiang (朱鹤祥)

【Advisor】 Li Rui (李瑞)

【Author information】 Dalian Jiaotong University, Computer Application Technology, 2010, Master's thesis

【Abstract】 The rapid growth of the Internet, and especially the worldwide spread of the Web, has made an enormous amount of information available online. Web mining extracts useful knowledge from this data: analyzing users' overall access behavior, access frequency, and the content they visit yields general knowledge about how groups of users behave, which can be used to improve the design of Web services. More importantly, understanding and analyzing these user characteristics helps in carrying out targeted e-commerce activities. Web log mining applies data mining techniques to Web server logs to discover valuable patterns of site usage, and its results are applied to personalized services, Web site design, and business decision-making. Data preprocessing plays an essential role in Web log mining; within it, user identification and session identification are the main steps, and they form the foundation of the whole process. This thesis studies how to improve the accuracy of user identification and session identification algorithms. It systematically reviews the progression from data mining through Web data mining to Web log mining, focuses on Web log mining techniques and their steps, and examines the process and methods of data preprocessing, including user identification and session identification. The main contributions are as follows. First, an active-user-based user identification algorithm is proposed, which uses the IP address together with a finite user inactivity time to identify distinct users in the log. Experimental results show that it performs better than the basic user identification algorithm, even on small log files. Second, a definition of session identification is given and the traditional method based on a preset time interval is optimized; the algorithm is described in detail on the basis of its data structures, and experiments show that the quality of the identified sessions improves.
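The abstract does not give the thesis's algorithms in detail, so the following is only a minimal sketch of the traditional preset-time-interval session identification that the thesis takes as its starting point, not the optimized method it proposes. The function name `split_sessions`, the `(ip, timestamp, url)` record layout, and the 1800-second default are all assumptions made for illustration.

```python
from collections import defaultdict


def split_sessions(requests, timeout=1800):
    """Split log requests into sessions using a fixed inactivity timeout.

    `requests` is an iterable of (ip, timestamp_in_seconds, url) tuples.
    The record layout and the 1800 s default are illustrative assumptions,
    not values taken from the thesis.
    """
    # Group requests by IP address, used here as a rough stand-in for a user.
    by_ip = defaultdict(list)
    for ip, ts, url in requests:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, hits in by_ip.items():
        hits.sort(key=lambda hit: hit[0])  # order each IP's requests by time
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            # A gap longer than the timeout closes the current session.
            if ts - last_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((ip, current))
    return sessions
```

Each returned entry pairs an IP address with the ordered list of URLs in one session. An approach like the active-user idea mentioned in the abstract would additionally track when each IP was last seen, which this sketch does only implicitly through `last_ts`.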
