

Hot Topic Detection Based on Microblog Data

【Author】 Sun Li (孙励)

【Advisor】 Wang Xiaojie (王小捷)

【Author Information】 Beijing University of Posts and Telecommunications, Computer Science and Technology, 2013, Master's

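【Summary】 As an emerging form of Internet media, microblogging has gradually become a platform on which a broad user base expresses opinions and shares information. Millions of posts are published every day, and the volume is so large that users cannot browse them all. At the same time, microblog topics spread quickly and widely and carry considerable social influence. Extracting hot topics from microblog data and returning the important posts therefore helps users quickly grasp what society is paying attention to, and is of great value to all kinds of microblog users who need to learn key information quickly. Current microblog platforms, however, are built on user relationships, so users receive only content related to them and cannot directly obtain hot-topic information for the whole microblog network; mining hot topics from microblog data and returning them to users can thus provide a better user experience. Although microblog platforms already offer applications such as hot-topic lists, these involve a large amount of manual editing, so the resulting hot topics are not objective, and they judge topic heat mainly by discussion frequency, which makes it hard to reflect the real situation. This thesis first surveys domestic and international techniques for topic detection and heat assessment. Then, drawing on an analysis and summary of microblog hot topics and a study of existing hot-topic applications, it proposes a hot topic detection method based on the LDA model. The method first works from the content features of microblogs: it extracts repeated strings with an N-gram incremental model, filters out junk strings using statistical features (absolute term frequency, relative term frequency, mutual information, and adjacency information entropy) to identify and extract new microblog words, and uses the result to improve the accuracy of word segmentation. It then uses the LDA model to mine the thematic information in the microblog data, treating each theme as a topic to obtain a list of candidate topics while also determining the relationships among topics, words, and documents. Finally, using the output of the GibbsLDA++ tool, it treats each word together with the topic it belongs to as a single unit, called a single-sense word unit, computes topic heat from the weights (that is, the heat) of these units, and ranks topics by heat to obtain the hot topics. The method is based on the temporal and content features of microblogs, is well targeted, and excludes manual editing, so the mined topics are more objective; experiments verify its effectiveness for new word identification and topic detection. To give users a fuller understanding of hot topics, the thesis further proposes a method for returning related microblogs based on the relevance between microblog content and topic and on the value of the publisher. It improves on the current platform mechanism, which judges the relevance between a microblog and a topic only by keyword matching, and evaluates the value of a microblog by combining the direct factors affecting it (the microblog's own comment and repost counts) with the indirect factor (the publisher's influence). The returned topic-related microblogs are ranked accordingly, so that users can quickly learn about the events behind a hot topic and about representative user discussion at a small reading cost.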

【Abstract】 As an emerging Internet medium, the microblog has gradually become a platform on which a large number of users express their views and share information. Millions of microblogs may be posted each day, and this volume of information makes it difficult for users to browse all of them. At the same time, microblog topics spread quickly, reach a wide audience, and have a strong social influence. Detecting hot topics in microblog data and returning the relevant important microblogs can therefore help users quickly grasp what is of public interest, which is of great value to all kinds of microblog users who need to understand key information quickly. Meanwhile, because microblog platforms are built on user relationships, users receive only the microblog information related to them and cannot directly obtain the hot topic information of the entire microblog network, so mining hot topics from microblog data can provide a better user experience. Although microblog platforms now offer applications such as hot topic lists, these involve a great deal of manual editing and use term frequency as their main measure, so it is difficult for them to reflect the true situation. This thesis first surveys topic detection and heat assessment technologies at home and abroad. Then, based on an analysis of hot topics in microblog data and a study of existing microblog hot topic applications, it proposes a hot topic detection method based on the LDA model, which can fully exploit the thematic information in the text, addresses the shortcomings of existing methods, and does not rely on traditional clustering methods. First, starting from the content features of microblogs, the method uses an N-gram model to extract repeated strings, then uses statistical features, including absolute and relative term frequency, mutual information, and adjacency information entropy, to filter out junk strings and extract new microblog words, so as to improve the accuracy of word segmentation.
It then uses the LDA model to mine the thematic information in the microblog data and treats each theme as a topic, yielding a list of candidate topics while also determining the distribution of topics over words and the distribution of documents over topics. Finally, using the results of the GibbsLDA++ tool, it treats each word together with its topic as a whole unit, called a single-word unit, and calculates the weights of the single-word units in order to compute the heat of each topic and find the hottest topics. The method uses both the temporal features and the content features of microblogs, is more targeted, and rules out manual editing, so the detected topics are more objective. Its validity is verified by experiments on both new word identification and topic detection. To give users a more comprehensive understanding of the hot topics, the thesis also proposes a method for returning topic-related microblogs based on the relevance between microblog content and topics as well as on word matching. It then combines the direct and indirect factors that affect the value of microblog content in order to assess the value of each microblog effectively and rank the returned microblogs, so that users can quickly understand the events related to a hot topic and the focus of most users' discussion at a small reading cost.
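
The new-word filtering step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`ngram_stats`, `entropy`, `min_pmi_over_splits`, `candidate_new_words`) and all thresholds are hypothetical, and normalizing every n-gram count by the total character count is a simplifying assumption. The sketch counts character n-grams, then keeps strings that are frequent, internally cohesive (high pointwise mutual information), and free to appear in varied contexts (high left and right adjacency entropy).

```python
import math
from collections import Counter, defaultdict


def ngram_stats(texts, max_n=4):
    """Count character n-grams (n <= max_n) and their left/right neighbours."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + n < len(text):
                    right[gram][text[i + n]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-character distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def min_pmi_over_splits(gram, counts, total_chars):
    """Lowest pointwise mutual information over all two-part splits of gram.

    Assumption: every probability is estimated as count / total_chars.
    """
    p_gram = counts[gram] / total_chars
    best = float("inf")
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total_chars
        p_right = counts[gram[k:]] / total_chars
        best = min(best, math.log(p_gram / (p_left * p_right)))
    return best


def candidate_new_words(texts, min_freq=2, min_pmi=1.0, min_entropy=0.5, max_n=4):
    """Return repeated strings that pass frequency, PMI and entropy filters.

    The thresholds are illustrative values, not taken from the thesis.
    """
    counts, left, right = ngram_stats(texts, max_n)
    total_chars = sum(len(t) for t in texts)
    result = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue  # too short or too rare to be a repeated string
        if min_pmi_over_splits(gram, counts, total_chars) < min_pmi:
            continue  # parts co-occur little more than by chance
        if min(entropy(left[gram]), entropy(right[gram])) < min_entropy:
            continue  # always seen in the same context: likely a fragment
        result.append(gram)
    return result


if __name__ == "__main__":
    posts = ["a微博b", "c微博d", "e微博f"]
    print(candidate_new_words(posts))
```

Taking the minimum of the left and right entropies is one common design choice: a string whose neighbour on either side is fixed is usually part of a longer word, so it is rejected even when its frequency and mutual information are high.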

