Research on the Method of Microblog Data Acquisition and Analysis (微博客数据的获取与分析方法研究)

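【Author】 田董涛 (Tian Dongtao)

【Advisor】 王根英 (Wang Genying)

【Author Information】 Beijing Jiaotong University, Communication and Information Systems, 2012, Master's thesis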

【Abstract】 Microblogging is a new form of social networking that developed rapidly after blogging and has gained considerable influence in the information media field. For traditional forms of social networks, data acquisition and analysis techniques have matured, but the acquisition of microblog network data and the study of microblog network characteristics remain incomplete. This thesis studies the characteristics and role of microblogging and two techniques for acquiring microblog data. Taking Sina Weibo as an example, it designs and implements a microblog data acquisition and analysis system, and simulates and analyzes the characteristics of the microblog network. The main aim is to analyze the acquired microblog data and derive from it the characteristics of the microblog network. The specific work is as follows:

1. The techniques for acquiring data with web page crawlers are studied, including the basic principles and workflows of general web crawlers, focused web crawlers, web page preprocessing, and text classification.

2. The workflow for acquiring data through the microblog platform's SDK is studied in depth. This technique obtains user data by calling the API provided by the microblog platform, and calling the API requires authentication of the user's identity. The authentication scheme mainly used at present is OAuth. This method has simple steps and fetches data with high accuracy and efficiency, so it is the method used in this thesis to acquire microblog data.

3. The SDK program is improved with the aims of simplifying the authentication steps, raising acquisition efficiency, and avoiding repeated crawling. Repeated experiments show that the improved program can acquire microblog data effectively over long periods. The data acquired in this way serves as the data set for studying microblog network characteristics.

4. The overall framework of the microblog data acquisition and analysis system is designed, together with its database, functional modules, and interface, and the basic functions of microblog data acquisition and analysis are implemented. The system allows the microblog network to be studied in greater depth.

5. The microblog network topology and the in-degree and out-degree distributions of its nodes are analyzed. The analysis shows that the microblog network has small-world, scale-free, and high-clustering properties.
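As a rough illustration of items 3 and 5, the following Python sketch shows one way a crawl can avoid fetching the same user twice and how in-degree and out-degree distributions can be tallied from the collected follow edges. It is not the thesis's program: the function names, the `fetch_followees` callable (a stand-in for an authenticated API call), and the toy graph are all assumptions made for this example.

```python
from collections import Counter, deque


def crawl_follow_edges(seed, fetch_followees, max_users=1000):
    """Breadth-first crawl that fetches each user at most once.

    fetch_followees is a hypothetical callable (user id -> iterable of
    followee ids); in a real system it would wrap an authenticated API call.
    """
    visited = {seed}      # users already queued, so none is fetched twice
    queue = deque([seed])
    edges = []            # (follower, followee) pairs
    while queue:
        user = queue.popleft()
        for followee in fetch_followees(user):
            edges.append((user, followee))
            if followee not in visited and len(visited) < max_users:
                visited.add(followee)
                queue.append(followee)
    return edges


def degree_distributions(edges):
    """Return (in_dist, out_dist), each mapping degree -> number of nodes."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # A Counter returns 0 for missing keys, so nodes with no incoming or
    # outgoing edges are counted under degree 0.
    in_dist = Counter(in_deg[n] for n in nodes)
    out_dist = Counter(out_deg[n] for n in nodes)
    return in_dist, out_dist


# Toy follow graph (made up for illustration): user -> users they follow.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

fetch_log = []  # records which users were fetched, to show there are no repeats


def toy_fetch(user):
    fetch_log.append(user)
    return TOY_GRAPH.get(user, [])


edges = crawl_follow_edges("a", toy_fetch)
in_dist, out_dist = degree_distributions(edges)
```

The visited set is what prevents repeated crawling here: a user is queued only the first time it is seen, even when several crawled users follow it. On real data, a scale-free network would show up as a heavy-tailed degree distribution, which is usually checked on a log-log plot of these counts.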

  • 【Classification Code】TP393.092
  • 【Times Cited】11
  • 【Downloads】1302