

【Author】 Xie Qianlong (谢乾龙)

【Advisor】 Xu Weiran (徐蔚然)

【Author Information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2013, Master's

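【Summary】 With the rapid growth of online public opinion, and especially with the popularity of microblogs, the platform on which the public openly expresses influential opinions on public affairs is shifting rapidly, and public opinion analysis on microblog platforms has become a hot research direction. Taking a microblog public opinion analysis system as its starting point, this thesis focuses on the retrieval system and the burst topic detection algorithm for the microblog platform. The main contents are as follows:
1. The thesis first studies retrieval models and burst topic detection algorithms in depth, analyzes the research results and current state of related technologies in China and abroad, and then introduces several classic retrieval models and burst topic detection algorithms.
2. A short-text topic retrieval system is designed and implemented for microblog data. For a user-specified topic, the algorithm expands the query with a query expansion algorithm based on word activation force (WAF) and, through "two-pass retrieval", returns microblog posts that are both highly relevant and highly timely. The algorithm was applied in the 2011 TREC Micro-blog Track, where it took second place.
3. A burst feature detection algorithm based on a state automaton is proposed; the preprocessing and the automaton's model parameters are optimized for microblog data, which is short, linguistically irregular, noisy, and large in volume. A burst topic clustering algorithm is also proposed, which improves the term-frequency vector representation of feature words and introduces lexical features based on word activation force (WAF), making the clustering more accurate and the resulting burst topics more readable. Finally, experiments verify the feasibility of the algorithms.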

【Abstract】 With the rapid development of online public opinion, and especially with the rise of microblogs, the platforms on which people publicly voice opinions about public affairs are shifting rapidly, and public opinion analysis on microblogs has become a popular research direction. This thesis focuses on a retrieval system and a burst topic detection algorithm for the microblogging environment. The main work covers three aspects:
1. The thesis first surveys current research on text retrieval and burst detection, and then introduces several classic retrieval models and burst detection algorithms.
2. A short-text retrieval system is implemented on a microblog corpus. For a given topic, the system expands the query with an algorithm based on word activation force (WAF) and uses a "two-pass retrieval" step to return microblog posts that are both highly relevant and timely. The algorithm was used in the 2011 TREC Micro-blog Track, where it ranked second.
3. Several optimizations are proposed to adapt the classical automaton-based burst detection algorithm to microblogs. They follow two main directions. First, because microblog posts differ from traditional web pages, the preprocessing and the model parameters are adjusted. Second, the topic clustering method is improved by refining the term-frequency vector representation of feature words and by introducing WAF-based lexical features into the similarity between words. Finally, experiments verify the feasibility of the algorithm.
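
The automaton-based burst detection mentioned in item 3 can be illustrated with a minimal sketch. It assumes a generic two-state automaton in the style of classical burst detection (a baseline state and a burst state, decoded with dynamic programming), not the thesis's optimized model, whose preprocessing and parameter choices the abstract does not give. The function name `detect_bursts` and the parameters `s` and `gamma` are hypothetical and chosen only for illustration.

```python
import math


def detect_bursts(hits, totals, s=2.0, gamma=1.0):
    """Label each time slot 0 (baseline) or 1 (burst) for one feature word.

    hits[t]   -- number of posts in slot t that contain the word
    totals[t] -- number of all posts in slot t
    s         -- assumed ratio between the burst rate and the baseline rate
    gamma     -- assumed scaling of the cost of entering the burst state
    """
    n = len(hits)
    if n == 0 or sum(hits) == 0:
        return [0] * n

    # Baseline rate from the whole series; the burst rate is s times higher.
    p0 = min(max(sum(hits) / sum(totals), 1e-9), 0.999)
    p1 = min(s * p0, 0.9999)
    up_cost = gamma * math.log(n)  # entering a burst is penalized, leaving is free

    def fit_cost(p, r, d):
        # Negative binomial log-likelihood of r hits among d posts at rate p
        # (the binomial coefficient is the same for both states, so it is dropped).
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    cost = [0.0, math.inf]  # the automaton starts in the baseline state
    back = []               # back[t][state] = best previous state
    for r, d in zip(hits, totals):
        new_cost, pointers = [], []
        for state, p in enumerate((p0, p1)):
            from_base = cost[0] + (up_cost if state == 1 else 0.0)
            from_burst = cost[1]
            pointers.append(0 if from_base <= from_burst else 1)
            new_cost.append(min(from_base, from_burst) + fit_cost(p, r, d))
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest state sequence backwards from the last slot.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states


if __name__ == "__main__":
    # Toy series: the word spikes in slots 3 to 5.
    print(detect_bursts([1, 1, 1, 10, 12, 11, 1, 1], [100] * 8))
```

Slots labeled 1 mark the periods in which the word counts as a burst feature. In a pipeline like the one the abstract describes, such burst features would then be passed to the topic clustering step.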
