

Research on User Session Identification and Clustering Technology of Web Log Mining

【Author】 Zhu Jinhua (朱晋华)

【Supervisor】 Chen Junjie (陈俊杰)

【Author information】 Taiyuan University of Technology (太原理工大学), Computer Application Technology, 2008, Master's thesis

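【Abstract (translated from the Chinese original)】 With the rapid growth of the Internet in traffic, scale and complexity, the network has become a platform on which people exchange and process information. Faced with such a huge volume of information, users find it hard to discover personalized information effectively. Web mining emerged in response, and Web log mining is an important part of that field: it applies data mining techniques to Web server logs and analyzes the log files to discover the browsing patterns of users visiting a site. Web log mining is generally divided into three stages: data preprocessing, pattern discovery, and pattern analysis.

Data preprocessing comes first, because real-world data are mostly incomplete, noisy and inconsistent, and arrive in many different formats. Incorrect input can lead a mining algorithm to wrong or inaccurate results, and mining algorithms usually expect data in a fixed format, so raw data must be processed into a form the algorithms can use. The tasks of data preprocessing are to repair incomplete and inconsistent data, remove noise, convert existing data into a format usable by the mining algorithm, extract the useful data, and integrate multiple data sources. Preprocessing is a major component of the whole data mining process: its output is the input of the mining algorithm and directly affects the quality of the mining. It is therefore an important research direction in Web log mining. Preprocessing is carried out while the log files are converted into database files, and comprises four phases: data cleaning, user identification, session identification, and transaction identification.

This thesis studies the main tasks of data preprocessing in depth and proposes a new method for session identification in Web log preprocessing, together with transaction identification based on users' browsing interest. The method combines several parameters (the user's download time, the user's interest in the page content, the amount of information on the page, and the numbers of links into and out of the page) to obtain an access-time threshold for each user on each page, and then identifies user sessions with this personalized threshold. After session identification, pages the user is not interested in and pure link pages are deleted according to the time the user spent on each page and the interest in it, and the user's Web access transactions are redefined as the final, effective sequence of page visits.

Experiments show that the proposed method can identify sessions that contain long page views, and can also assign to the next session pages that fall below a fixed threshold. A large proportion of the sessions it finds are real sessions, close to the user's actual purpose. Deleting irrelevant link pages according to the user's interest in the pages yields new Web access transactions, which provide good data for the subsequent cluster analysis and improve clustering efficiency.

After preprocessing, a mining technique such as clustering or classification can be chosen according to the specific requirements. This thesis analyzes clustering techniques and the content and methods of current Web clustering, and finds groups of similar users by clustering the Web transactions that users access.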

【Abstract】 With the swift growth of the Internet in traffic, scale and complexity, the Web has become an effective platform on which people communicate and process information. Given the tremendous amount of information on the network, discovering personalized information effectively has become a difficulty for users. Web mining emerged to meet this need, and Web log mining is an important part of the Web mining research field. It applies data mining techniques to Web server logs and analyzes the log files to discover users' patterns of accessing sites. Web log mining consists of three processes: data preprocessing, pattern discovery, and pattern analysis.

In Web log mining, the first process is data preprocessing, because most real-world data are incomplete, noisy and inconsistent, and their formats vary. For a data mining algorithm, incorrect input may lead to wrong or inaccurate results; moreover, such algorithms usually process data in a fixed format, so real-world data must be converted into a form the mining algorithm can use. Data preprocessing has to repair incomplete and inconsistent data, eliminate noisy data, transform existing data into a format usable by the mining algorithm, extract useful data, and integrate multiple data sources. Data preprocessing is a main part of the whole data mining process: its result is the input of the mining algorithm, so it directly influences mining quality. The technique of data preprocessing is therefore an important research topic in Web log mining. Data preprocessing is performed when log files are transformed into database files. It includes four phases: data cleaning, user identification, session identification, and transaction identification.

This thesis studies the main tasks of data preprocessing in depth and puts forward a new method for session identification in Web log preprocessing, together with transaction identification according to users' browsing interest. The method combines parameters such as the user's download time, the user's interest in a page, the amount of information on the page, and the numbers of links into and out of the page to calculate an access-time threshold for every user on every Web page, and then divides sessions according to this individual threshold. After session identification, it deletes pages that the user is not interested in, as well as link pages, according to the user's visiting time and interest in each page, and redefines the Web transaction as the effective sequence of page visits.

Experiments show that the method can identify sessions in which users spend a long time on pages, and can also assign pages that fall below a fixed threshold to the next session. A large proportion of the sessions it discovers are real sessions, close to users' actual visiting intentions. At the same time, it deletes irrelevant link pages according to users' interest in pages and forms new Web transactions, which provides valuable data for cluster analysis and improves clustering efficiency.

After data preprocessing, a mining technique such as clustering or classification can be selected according to the specific requirements. This thesis analyzes clustering techniques and the content and methods of current Web clustering. By clustering Web transactions, similar users can be found.
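The session identification step described in the abstract can be illustrated with a minimal Python sketch. The abstract names the inputs to the per-page threshold (download time, interest, amount of information, links in and out) but does not give the formula, so `page_threshold` below is a hypothetical stand-in, not the thesis's method; the `PageStats` fields, their scales, the `base` constant and the sample data are likewise assumptions made for illustration. The sketch handles the request stream of a single, already identified user.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Request = Tuple[float, str]  # (timestamp in seconds, page URL)


@dataclass
class PageStats:
    """Per-page inputs named in the abstract; units and scales are assumed."""
    download_time: float  # seconds needed to fetch the page
    interest: float       # this user's interest in the page, in [0, 1]
    info_amount: float    # amount of information on the page, in [0, 1]
    links_in: int         # number of pages linking to this page
    links_out: int        # number of pages this page links to


def page_threshold(stats: PageStats, base: float = 600.0) -> float:
    """Hypothetical threshold in seconds; NOT the formula from the thesis.

    Interesting, content-rich pages are allowed a longer stay, while
    navigation pages dominated by outgoing links are allowed a shorter one.
    """
    content_weight = 0.5 + stats.interest + stats.info_amount
    hub_ratio = stats.links_out / (stats.links_in + stats.links_out + 1)
    return stats.download_time + base * content_weight * (1.0 - hub_ratio)


def split_sessions(
    requests: List[Request],
    threshold_for: Callable[[str], float],
) -> List[List[Request]]:
    """Start a new session whenever the time spent on the previous page
    exceeds that page's own threshold."""
    sessions: List[List[Request]] = []
    for ts, url in sorted(requests):
        if sessions:
            prev_ts, prev_url = sessions[-1][-1]
            if ts - prev_ts <= threshold_for(prev_url):
                sessions[-1].append((ts, url))
                continue
        sessions.append([(ts, url)])
    return sessions


# Illustrative data only: a link-heavy index page and a content page.
stats: Dict[str, PageStats] = {
    "/index": PageStats(1.0, 0.1, 0.1, links_in=0, links_out=9),
    "/article": PageStats(2.0, 0.9, 0.8, links_in=3, links_out=0),
    "/about": PageStats(1.0, 0.2, 0.2, links_in=1, links_out=1),
}
requests: List[Request] = [
    (0, "/index"), (30, "/article"), (1230, "/index"), (1330, "/about"),
]

sessions = split_sessions(requests, lambda url: page_threshold(stats[url]))
for number, session in enumerate(sessions, start=1):
    print(number, [url for _, url in session])
```

With these made-up numbers, the 1200-second stay on `/article` remains inside one session, whereas the 100-second gap after the second visit to `/index` opens a new one. A single fixed page-stay threshold of, say, 600 seconds would decide both cases the other way, which is the contrast the abstract draws between personalized and fixed thresholds.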

  • 【Classification number】 TP393.09
  • 【Times cited】 3
  • 【Downloads】 313

