Web挖掘技术及其在互联网中的应用研究

Research on Web Mining Technologies and Its Application on Internet

【Author】 Wang Wei

【Advisor】 Jiang Mingyan

【Author information】 Shandong University, Communication and Information Systems, 2013, Master's thesis

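【Abstract (translated from Chinese)】 As information technology continues to advance, computing and communication technologies are driving the informatization of modern society while also influencing and changing everyday life. Information technology has also brought explosive growth in data, and there is an urgent need for solutions that can use and process massive data effectively. Data mining emerged against this big-data background. Web mining, a branch of that field, addresses the effective organization and use of the massive data on the World Wide Web. Because Internet technology changes rapidly while Web mining developed relatively late, this thesis takes Web mining as its core subject and analyzes in depth its applications on the Internet.

The thesis first introduces the research background, current state, technical difficulties, and future directions of Web mining, and explains related concepts such as data mining and machine learning in detail. It then turns to the implementation process and application scenarios of Web mining, covering the core steps of text preprocessing and the technical background of two applications: topic detection and tracking, and user behavior analysis.

As an important application of Web content mining, topic detection and dynamic tracking aims to detect previously unknown topics and track the subsequent development of known ones. For news-report texts on online media, this thesis implements a hybrid clustering solution. The solution adjusts the topic model hierarchically based on "contribution degree", which makes it better suited to building Internet news topics and substantially improves efficiency. Experiments on real Internet news data show that, compared with the K-Means algorithm, the solution significantly improves accuracy and recall, and the resulting topic tree model is clearly hierarchical.

For Chinese microblog texts, this thesis proposes an incremental fuzzy clustering solution based on topic words. The solution first proposes a spam-filtering scheme tailored to the textual characteristics of microblogs. It then uses two factors, timeliness and term frequency, to assign topic words weights adapted to microblogs. Finally, it uses incremental fuzzy clustering to detect bursty topics. Experiments on real microblog data show that the solution effectively detects breaking events and hot topics, with reasonably good time efficiency.

As an important application of Web usage mining, user behavior analysis aims to understand user habits and interests and to assess users' satisfaction with a product, so that the product and the user experience can be improved. For evaluating user satisfaction with search engines, this thesis describes an automated solution based on user behavior. The solution first describes the preprocessing of raw Web logs, that is, deriving concrete user action data from the logs and extracting features. It then proposes a recommendation technique based on the CURE algorithm for selecting samples, which are labeled manually. Finally, it uses dynamic modeling to build the user satisfaction model. Experiments on real search engine data show that the machine-learning-based automated evaluation approaches the level of manual evaluation and meets the requirements of practical use, and that the dynamic model, through multi-model construction, automatic updating, and feedback correction, effectively extends its life cycle and improves the continuity of learning.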

【Abstract】 With the arrival of the information age, computer and communication technologies are not only promoting the informatization of modern society but also influencing and even changing modern life. However, information technologies have also brought explosive growth in the amount of data, and people urgently need a technical solution for using and processing massive data effectively. Data mining technologies arose in this big-data context. As a branch of data mining, Web mining targets massive Internet data in particular. Because the Internet changes quickly and Web mining technologies started late, this thesis studies Web mining technologies and their application on the Internet.

The thesis first introduces the research background, current state of research, technical difficulties, and future development directions of Web mining, and explains data mining, machine learning, and other related concepts. It then focuses on the implementation process and application scenarios of Web mining, briefly introducing Web text preprocessing and two related applications: topic detection and tracking, and user behavior analysis.

As one of the most important applications of Web content mining, topic detection and dynamic tracking aims to detect unknown topics and track the latest development of known topics.

For the Internet news topic detection and tracking problem, the thesis proposes a solution based on a hybrid clustering algorithm. The solution applies the concept of contribution to build a hierarchical topic model with better efficiency, and the model is especially well suited to Internet news. Experiments on real Internet data show that the solution achieves higher accuracy and recall than the traditional K-Means method, and that the generated topic tree model has a clearer hierarchical structure.

For the Chinese microblog topic detection and tracking problem, the thesis proposes a solution based on an incremental fuzzy clustering algorithm. The solution first introduces a set of anti-spam filtering rules based on the characteristics of microblog text. Then, considering the timeliness and frequency of keywords, it proposes a keyword weight computation method. Finally, the core incremental fuzzy clustering algorithm completes the detection process. Experiments on real microblog data show that the solution detects sudden incidents effectively, with the capacity to process large volumes of data and low time complexity.

As one of the most important applications of Web usage mining, user behavior analysis aims to understand users' habits and interests and to evaluate user satisfaction in order to further improve user experience.

For the evaluation of user satisfaction with search engines, the thesis proposes an automatic solution based on user behavior analysis. The solution first introduces a log preprocessing method that includes user behavior transformation and feature extraction. It then proposes a recommended sample tagging method based on the CURE algorithm. Finally, it generates a dynamic model of user satisfaction. Experiments on real search engine data show that the proposed automatic evaluation method based on machine learning comes close to the level of manual evaluation and meets the requirements of practical application. In addition, with mechanisms for multiple model construction, automatic updating, and error feedback, the dynamic modeling method effectively extends the model's life cycle and supports continuous learning.
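The microblog pipeline summarized in the abstract (weight keywords by timeliness and frequency, then cluster posts incrementally) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract gives no formulas, so the exponential half-life decay, the cosine similarity, the `threshold` value, and all function names here are assumptions made for illustration. The sketch also uses hard single-pass assignment, which is simpler than the fuzzy (soft-membership) clustering the thesis describes.

```python
import math
from collections import Counter


def keyword_weights(tokens, age_hours, half_life_hours=6.0):
    """Weight each term by its frequency, discounted by the post's age.

    The half-life decay is an assumed stand-in for "timeliness".
    """
    decay = 0.5 ** (age_hours / half_life_hours)
    return {term: count * decay for term, count in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def incremental_cluster(posts, threshold=0.3):
    """Single-pass clustering over (tokens, age_hours) pairs.

    Each post joins the most similar existing topic if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new topic.
    Because weights are age-discounted, recent posts contribute more to a
    topic's centroid than old ones.
    """
    topics = []  # each topic: {"centroid": {term: weight}, "members": [post index]}
    for index, (tokens, age_hours) in enumerate(posts):
        vector = keyword_weights(tokens, age_hours)
        best_topic, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vector, topic["centroid"])
            if sim > best_sim:
                best_topic, best_sim = topic, sim
        if best_topic is not None and best_sim >= threshold:
            # Fold the post into the topic by accumulating its term weights.
            for term, weight in vector.items():
                best_topic["centroid"][term] = best_topic["centroid"].get(term, 0.0) + weight
            best_topic["members"].append(index)
        else:
            topics.append({"centroid": dict(vector), "members": [index]})
    return topics
```

Calling `incremental_cluster` on a stream of tokenized posts returns the detected topics in order of first appearance. Posts are processed one at a time and never revisited, which is what makes the approach incremental. A fuzzy variant would instead give each post a membership degree in several topics.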

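  • 【Online publication contributor】 Shandong University
  • 【Online publication year and issue】 2013, Issue 11
  • 【Classification numbers】 TP311.13; TP391.1
  • 【Times cited】 3
  • 【Downloads】 425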