Web日志挖掘及其实现 (Web Log Mining and Its Implementation)

Research and Realization on Web Log Mining

【Author】 Liu Bin (刘滨)

【Advisor】 Yang Jing (杨静)

【Author Information】 Harbin Engineering University, Computer Application Technology, 2007, Master's

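【Summary】 With the development of Internet technology, WWW applications have multiplied and Web sites have become increasingly widespread. In today's fiercely competitive network economy, only those who win users gain a competitive advantage. Because customers' browsing behavior is now digitized, it has become possible to study customer behavior in depth by collecting large volumes of browsing data. How to seize this opportunity and extract valuable knowledge and information from these seemingly "meaningless" and tedious data has become one of the most pressing problems today. Data mining techniques for Web sites emerged to solve this problem. This thesis focuses on log mining technology and its steps, and studies the data preprocessing process and the solutions to its difficult points, including user identification and path completion. It describes in detail Apriori, the classic association rule algorithm. Building on a study of several improved Apriori algorithms, the thesis improves Apriori by reducing the database and modifying the join method, proposes the I_Apriori algorithm, and proves in theory that the space and time complexity of I_Apriori are lower than those of Apriori. To verify the space and time complexity of the proposed I_Apriori algorithm and to put the studied techniques into practice, the thesis takes the Harbin Engineering University 50th anniversary celebration Web site as the log mining target and analyzes the preprocessed log files with both Apriori and I_Apriori. The experimental results show that I_Apriori improves on Apriori in both space and time complexity. To make the comparison general, Apriori and I_Apriori were each run on the same log files under different minimum support values; the results show that I_Apriori is more efficient than Apriori across these values. Finally, analysis of the log files with I_Apriori revealed problems in the structure and content of the Web site, and solutions are given.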

【Abstract】 With the development of Internet technology, the WWW has become more and more popular, and as a result many Web sites are being built. In the fiercely competitive Internet economy, only those who attract customers can survive. Customer behavior has become digital, which makes it possible to collect large amounts of data in order to investigate that behavior further. One of the most important problems we face is how to find valuable and understandable information in this seemingly meaningless and tedious data. Web data mining technology is the method for solving this problem. This thesis focuses on Web log mining technology and its process. It investigates data preprocessing, the methods used in it, and solutions to its problems, including user identification and path completion. The classic association rule algorithm, Apriori, is introduced. After an investigation of several improvements to Apriori, the I_Apriori algorithm is proposed, which is based on reducing the scale of the database and improving the join step. In theory, the time complexity and space complexity of I_Apriori are lower than those of Apriori. To demonstrate the efficiency of I_Apriori and to put the investigated technologies into practice, the logs of the Harbin Engineering University (HEU) 50th anniversary celebration Web site are processed and analysed with I_Apriori and Apriori respectively. The result of this experiment shows that I_Apriori is better than Apriori in both time complexity and space complexity.
To make the comparison more general, the same logs are analysed by I_Apriori and Apriori under different minimum support (minsupp) values. The result of this experiment shows that I_Apriori is more efficient than Apriori across the different minsupp values. Finally, the logs of the Web site are analysed by I_Apriori. With the help of the result, the shortcomings of the Web site are identified and improvements are proposed.
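For orientation, the join and prune steps of the classic Apriori algorithm named in the abstract can be sketched as follows. This is a minimal textbook sketch, not the thesis's I_Apriori (whose database reduction and modified join are not specified in this record). The function name `apriori`, the use of an absolute support count for `min_support`, and the treatment of each user session as a set of page paths are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    min_support is an absolute count (an assumption made for this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan the transactions to count the support of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical sessions: each one is the set of pages a user visited.
sessions = [
    {"/index", "/news", "/history"},
    {"/index", "/news"},
    {"/index", "/history"},
    {"/news", "/history"},
    {"/index", "/news", "/history"},
]
frequent_sets = apriori(sessions, min_support=3)
```

Each pass rescans every transaction, which is the cost that database-reduction variants of Apriori aim to lower.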

  • 【Classification Code】TP311.13
  • 【Times Cited】6
  • 【Downloads】432