

Research of Data Mining Based on Web Log

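【Author】 Tian Haishan (田海山)

【Advisor】 Peng Yuqing (彭玉青)

【Author information】 Hebei University of Technology (河北工业大学), Computer Application Technology, 2003, Master's degree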

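【Abstract (translated from the Chinese)】 In recent years the Internet has grown at an astonishing pace, and more and more institutions, groups, and individuals publish and search for information on it. Although the Internet holds a massive amount of data, the Web is unstructured and dynamic, and Web pages are far more complex than plain text documents, so finding the data one wants is like looking for a needle in a haystack. Websites cannot cluster their users and pages, and therefore cannot offer tailored services to particular users. In addition, there is a gap between a website's topology and what its users expect. Some users also have limited hardware resources and browse the Web on handheld computers; how to implement page prefetching for them is a further topic worth studying. How can these problems be solved? One approach is Web mining, which combines traditional data mining techniques with the Web. Web mining is the process of extracting interesting, potentially useful patterns and hidden information from Web documents and Web activity. It can be applied in many areas, such as mining the structure of search engines, identifying authoritative pages, classifying Web documents, classifying Web logs, and intelligent querying. This thesis first introduces the definition, tasks, and classification of Web mining, together with its model and processing workflow. It then proposes a data structure suited to Web log mining and a corresponding algorithm. The data structure is a user/page (User_URL) association matrix that represents users' page accesses. The mining algorithm uses matrix clustering (Matrix Cluster) and supports user clustering, page clustering, identification of frequent access paths, and access prediction. Finally, the thesis summarizes the remaining shortcomings of the work and points out research directions, application prospects, and challenges for Web mining. Experiments show that applying the algorithm to the Web logs of a campus network gives good results. Applied to an e-commerce website, the algorithm can also be used to build an adaptive website (Adaptive Website) that offers personalized services to individual customers and ultimately provides strong support for merchants' decisions.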

【Abstract】 The Internet has been developing at an incredible speed in recent years, and more and more institutions, groups, and individuals publish and look up information on it. There is a mass of information on the Internet, but the Web is unstructured and dynamic, and the composition of a Web page is more complicated than that of a text document, so finding the data one wants is as difficult as looking for a needle in a haystack. A website cannot cluster its users and Web pages, so it cannot provide special services for particular users. Besides, the organization of a website's content may be quite different from the organization its visitors expect. What is more, some users have limited hardware resources and browse Web pages on palmtops (such as Palm Pilot, Pocket PC, and Handspring devices), so how to prefetch Web pages for them is also worth researching. How can these problems be resolved? Web mining, which combines classical data mining technology with the Web, is an appropriate approach. Web mining is the process of extracting interesting, potentially useful patterns and hidden information from Web documents and Web activities. It can be applied in several fields, such as mining the structure of search engines, identifying authoritative Web pages, classifying Web documents, classifying Web logs, and intelligent querying. The thesis first introduces the definition, tasks, and classification of Web mining, as well as its model and process. Then a data structure suited to Web log mining and a corresponding algorithm are put forward. The data structure is a User_URL matrix, which records the information about users' accesses to Web pages. The mining algorithm uses matrix clustering to cluster users and Web pages, identify frequent access paths, and predict accesses. Finally, the thesis summarizes the shortcomings that remain in the work and points out the directions, prospects, and challenges of Web mining. Experimental results show that the algorithm is effective when applied to the Web log of a campus network. In addition, applying the algorithm to an e-business website makes it possible to construct an adaptive website that provides personalized service to individual customers and, ultimately, gives merchants strong support for decision making.
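
The User_URL matrix described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the thesis's matrix clustering algorithm, so the grouping step here is a simple greedy cosine-similarity stand-in chosen for illustration only, not the author's method. The function names, the `(user, url)` record format, and the 0.5 threshold are all assumptions.

```python
from collections import defaultdict
from math import sqrt


def build_user_url_matrix(records):
    """Build a user/page matrix from (user, url) log records.

    Returns (users, urls, matrix), where matrix[i][j] is the number of
    times users[i] requested urls[j]. The record format is an assumption;
    real Web logs need parsing and session cleaning first.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user, url in records:
        counts[user][url] += 1
    users = sorted(counts)
    urls = sorted({url for visits in counts.values() for url in visits})
    matrix = [[counts[u].get(p, 0) for p in urls] for u in users]
    return users, urls, matrix


def cosine(a, b):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_rows(labels, rows, threshold=0.5):
    """Greedy stand-in for matrix clustering (NOT the thesis's algorithm).

    Each row joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of row indices
    for i, row in enumerate(rows):
        for cluster in clusters:
            if cosine(rows[cluster[0]], row) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return [[labels[i] for i in cluster] for cluster in clusters]


# Hypothetical log records, for illustration only.
records = [
    ("u1", "/index"), ("u1", "/news"),
    ("u2", "/index"), ("u2", "/news"), ("u2", "/news"),
    ("u3", "/lib"), ("u3", "/lib/search"),
]
users, urls, matrix = build_user_url_matrix(records)

# Rows of the matrix describe users; columns describe pages.
user_clusters = cluster_rows(users, matrix)
page_clusters = cluster_rows(urls, [list(col) for col in zip(*matrix)])
print(user_clusters)
print(page_clusters)
```

Clustering the rows groups users with similar access patterns, and clustering the columns of the same matrix groups pages visited by similar sets of users, which is why a single matrix can serve both user and page clustering.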

  • 【Classification number】 TP311.13
  • 【Citations】 11
  • 【Downloads】 336