基于微博平台的事件趋势分析及预测研究

On Trends Analysis and Prediction Based on Micro-Blogging Platforms

【Author】 Tian Ye (田野)

【Advisor】 He Yanxiang (何炎祥)

【Author Information】 Wuhan University, Computer Software and Theory, 2012, Ph.D.

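【Abstract, translated from the Chinese】 Social networking services are computer applications that have risen rapidly in recent years and gradually reached user groups across society. Microblogging is an important one of them and has developed rapidly in the last few years. The broad coverage of its user base, the user-generated nature of its content, and the timeliness of its information dissemination have made the microblogging platform an important medium for spreading news. The platform's huge user base and massive volume of content give researchers good data for mining information about groups of users. This thesis attempts to use the massive text resources of the microblogging platform to extract a variety of feature data, to compute and analyze event trends (a kind of social content that is hard to quantify in traditional research), and to predict the future trend of an event outside the sample range from a trend model built on the data within it. Through this work, the thesis aims to demonstrate the feasibility of computing on non-deterministic social content that is difficult to describe formally.

The thesis studies several key problems in event trend analysis and prediction on microblogging platforms: the definition and computation of group behavior; sample regression analysis of event trends and models for predicting future trends; methods for identifying and acquiring event-related microblog content; extraction of user features and post text features on the platform; and the formal description of event trends and the extraction of feature indicators. The main work and contributions are summarized as follows:

1. A social computing method based on group behavior is proposed. Features of sample users are first extracted and classified to obtain the corresponding indicators and the methods for computing them. The feature values of a large number of users are then aggregated to obtain the overall features of the user group, so that the group as a whole is quantified directly. The results show that computing group features in this way is feasible.

2. An algorithm for event trend analysis and future trend prediction on the microblogging platform is proposed, together with its detailed procedure. The values of the indicators related to the event trend are first computed from the data within the sample range, and a regression model is built on the sample data through regression analysis. The best-fitting model is then analyzed: the values of the regression function over the unit time interval before the prediction point are computed, and the future trend at the prediction point is calculated with a fusion model based on difference slopes. Experiments on a real corpus show that the method can assist human decision-making, that its absolute difference from the actual data is small, and that it performs well in experiments on relative values such as the proportion of sentiment.

3. A method for extracting event content is proposed. It combines the MACD algorithm (Moving Average Convergence and Divergence) and the LDA algorithm (Latent Dirichlet Allocation), used respectively to capture the content of bursty events and to expand the text related to known events. MACD is applied to the change in term frequency per unit time slice of microblog text. The convergence and divergence of the short-period and long-period moving averages identify content whose discussion volume surges in the platform's text stream, and thereby the events likely to become hot topics. LDA is used to compute the event-related "bag of words" and the weight with which each related word is associated with the event. Combinations of several of these words supplement the keyword queries and so expand the set of event-related content that is extracted. Experiments show that the extraction method is clearly effective.

4. A set of formal definitions for content on the microblogging platform is proposed, along with a simple and efficient method for recognizing user features and methods for defining event features and establishing event trend indicators. Non-numerical social content such as user groups and event trends is first quantified: concrete numerical values are computed for indicators of the platform system, the networks the platform involves, its users, and their messages, so that social trend content that was not previously quantified is described by a computable mathematical model. On this basis, drawing on the analysis of individual and group characteristics in sociology, communication studies, and psychology, a rule set is constructed from the feature values of labeled users in the sample data. The rule set then serves as the filtering criterion: key users and spam users on the platform are distinguished by the relationships among the values of a test user's key features, which supports the computation and analysis of the research objects well.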
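The prediction procedure in item 2 of the abstract (fit a regression model on the sample, evaluate it over the unit interval before the prediction point, then combine slopes to estimate the next value) can be illustrated with a minimal Python sketch. The abstract does not specify the regression family or the fusion model, so everything here is an assumption for illustration only: an ordinary least-squares line stands in for the best-fit model, a weighted average of two slopes stands in for the fusion model, and the names `fit_line`, `predict_next`, and `weight` are hypothetical.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = a + b*t for t = 0, 1, ..., n-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b


def predict_next(ys, weight=0.5):
    """Estimate the value one time unit after the sample (needs len(ys) >= 2).

    ASSUMPTION: the "fusion" is a simple weighted average of
    (1) the slope of the fitted model over the last unit interval and
    (2) the last observed difference in the raw data.
    The thesis's actual fusion model is not given in the abstract.
    """
    a, b = fit_line(ys)
    n = len(ys)
    # Regression function values at the two ends of the last unit interval.
    model_slope = (a + b * (n - 1)) - (a + b * (n - 2))
    observed_slope = ys[-1] - ys[-2]
    blended = weight * model_slope + (1 - weight) * observed_slope
    return ys[-1] + blended
```

For example, `predict_next(daily_post_counts)` would return an estimate for the next day's value of a trend indicator, where `daily_post_counts` is a hypothetical list of per-day indicator values. Setting `weight` closer to 1 trusts the fitted model more than the most recent observed change.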

【Abstract】 Social networking services (SNS) have risen rapidly in recent years and gradually penetrated user groups all over the world. Microblogging is one of their important applications and has developed rapidly in the last few years. High user coverage and the timeliness of content production and information dissemination make the microblogging platform a major news medium. The huge number of users on the platform and its massive content provide an effective corpus for mining information about groups of users. This thesis attempts to extract various features from the massive text resources of the microblogging platform and to calculate and analyze event trends, a topic within social computing that is difficult to quantify in traditional research. We model the trend on the sample data and predict the future trend beyond the range of the sample. The thesis aims to demonstrate the feasibility of computing on social content.

We discuss several key issues of event trend analysis and prediction on the Weibo platform, including the calculation of group behavior; regression analysis of bursty event trends and modeling of future trends; recognition and acquisition of event-related microblog content; extraction of user characteristics and post text features on the microblogging platform; and the formal definition of event trends. The main research and results are summarized as follows:

1. We present a social computing framework based on group behavior. Within this framework, we first define the indexes of user features, then build an overall portrait of the features of massive numbers of users, and thereby quantify groups of users. Experimental results show the feasibility of computing group features.

2. We present a framework for event trend analysis and prediction, with its methods in detail. Based on the sample data, we calculate the value of each trend index and obtain a sample-based regression model. We then calculate the future trend with a fusion model. Results show that the method is a good aid to human decision-making. In addition, the absolute difference between the predicted and actual data is small, and the method also gives good results on relative values such as the proportion of emotion.

3. We present an event extraction method. It combines the MACD algorithm (Moving Average Convergence and Divergence) and the LDA algorithm (Latent Dirichlet Allocation), which are used respectively to find the content of emergencies and to expand the words related to known events. With MACD, we calculate the change in term frequency per unit time slice in the microblog text and use the convergence and divergence between the short-period and long-period moving average lines to recognize bursty content. LDA is used to calculate the event-related "bag of words" and the weight of each related word in the event. Experimental results show that this is an effective method for extracting the key content.

4. We present a set of formal definitions of related content on the microblogging platform, covering the platform, the users' network, user data, and the data items involved in the platform. We also present a simple but effective feature recognition method for classifying users. These formal definitions support the calculation and analysis in the study well.
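The MACD-based burst detection described in item 3 can be sketched as follows. This is a minimal Python illustration of the standard MACD construction (short-period EMA minus long-period EMA, compared with a signal-line EMA) applied to per-time-slice term counts. It is not the thesis's implementation: the function names, the period lengths, and the fixed `threshold` rule for flagging a burst are assumptions made for the example.

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = []
    for v in values:
        # Seed with the first value, then smooth recursively.
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def macd_histogram(counts, short, long, signal):
    """MACD histogram of a term-frequency series, one value per time slice.

    MACD line   = short-period EMA - long-period EMA
    signal line = EMA of the MACD line
    histogram   = MACD line - signal line
    """
    short_ema = ema(counts, short)
    long_ema = ema(counts, long)
    macd = [s - l for s, l in zip(short_ema, long_ema)]
    signal_line = ema(macd, signal)
    return [m - s for m, s in zip(macd, signal_line)]


def burst_slices(counts, threshold, short=12, long=26, signal=9):
    """Indices of time slices whose histogram value exceeds `threshold`.

    ASSUMPTION: a fixed threshold on the histogram is used as the burst
    rule; the abstract only says that convergence and divergence of the
    two moving averages identify surging content.
    """
    hist = macd_histogram(counts, short, long, signal)
    return [i for i, h in enumerate(hist) if h > threshold]
```

For a term whose count per slice is `[5, 5, 5, 5, 5, 5, 40, 60, 5, 5]`, calling `burst_slices(counts, threshold=1.0, short=3, long=6, signal=2)` flags the two slices where the count jumps, because the short-period average pulls away from the long-period one there and falls back below it afterwards. The defaults of 12, 26, and 9 are the periods conventionally used with MACD and would need tuning to the slice length of a real text stream.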

  • 【Online Publication Contributor】 Wuhan University
  • 【Online Publication Issue】 2012, Issue 10
  • 【Classification Code】 TP393.09
  • 【Citations】 28
  • 【Downloads】 9295