Research and Application of Data Mining Based on Web Literature

【Author】 龚真平

【Advisor】 黄文培

【Author Information】 Southwest Jiaotong University, Educational Technology, 2011, Master's thesis

【Abstract】 With the popularization of higher education, university enrollment has risen from several hundred thousand to several million, and the state funds a large number of research projects, so tens of thousands of documents are produced every year. As Web literature accumulates, it becomes difficult to find useful information in the mass of literature data, and the literature therefore does little to improve work efficiency. The main purpose of this thesis is to use data mining techniques to find useful information in large volumes of literature data and so guide further work. To select data mining algorithms suited to large volumes of literature data, the thesis first briefly introduces the theory of data mining, gives the general procedure and formula for computing text similarity, and analyzes and compares several clustering algorithms, identifying some of their shortcomings. Based on the principles for evaluating clustering quality and the idea of incremental clustering, a cohesion-based incremental clustering algorithm is designed to address these shortcomings, and its parameters are then tuned through experiments. By consulting the relevant literature and analyzing the detection results of the PaperPass software, a method for computing document similarity is derived so that plagiarism can be checked. The similarity algorithm is further improved on the basis of the vector space approach to text similarity. To obtain large amounts of Web literature data, the thesis studies web crawlers and designs and implements a focused crawler for literature. To apply the above algorithms and give users a working platform, a data mining system based on Web literature is designed. The thesis analyzes the goals and characteristics of the system, selects the technical approach, completes the division and design of the system architecture, functions, and main modules, and designs the system database. Finally, the deployment and operation procedure is given and the relevant functions are demonstrated.
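
The abstract names two techniques without giving their details: text similarity computed in a vector space, and incremental clustering driven by cohesion. The following Python sketch illustrates the general idea only. It is not the thesis's algorithm: the whitespace tokenization, the raw term-frequency weighting, the use of average cosine similarity to a cluster's members as the "cohesion" score, and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Build a term-frequency vector (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_cluster(docs, threshold=0.3):
    """Assign documents one at a time (hypothetical cohesion rule).

    A document joins the cluster whose members it is most similar to on
    average, provided that average reaches `threshold`; otherwise it
    starts a new cluster. Returns clusters as lists of document indices.
    """
    clusters = []  # each cluster is a list of (index, vector) pairs
    for i, text in enumerate(docs):
        vec = tf_vector(text)
        best, best_score = None, 0.0
        for cluster in clusters:
            score = sum(cosine_similarity(vec, v) for _, v in cluster) / len(cluster)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best.append((i, vec))
        else:
            clusters.append([(i, vec)])
    return [[i for i, _ in cluster] for cluster in clusters]


docs = [
    "data mining clustering algorithm",
    "clustering algorithm for data mining",
    "web crawler design",
    "focused web crawler",
]
result = incremental_cluster(docs)
print(result)
```

Because documents are processed in arrival order and earlier assignments are never revisited, new literature can be added without re-clustering the whole collection. The cost is that the result depends on both the order of the documents and the chosen threshold, which is presumably why the abstract mentions tuning the parameters by experiment.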
