

Research on Micro-blogging Community-finding

【Author】 Yu Hang (禹航)

【Advisor】 Yu Xin (余鑫)

【Author information】 Huazhong University of Science and Technology, Communication and Information Systems, 2011, Master's

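【Abstract (translated from Chinese)】 As one of today's most popular Internet applications, microblogging is winning over users at a remarkable pace. According to statistics from October 2010, Sina Weibo alone had more than 50 million users, while Twitter had passed 200 million, making it one of the most widely used Internet applications in the world. Faced with a user base this large, network administrators and ordinary users alike confront a new problem: how to find the group of people relevant to them to interact with, which is the traditional notion of a community. To address this problem, we draw on data mining theory to look for an effective community mining algorithm. Unlike traditional community mining, the algorithm targets microblog users, whose data are based on real information and are very large in scale; this requires a domain model that differs considerably from earlier ones and places stricter demands on time complexity. Based on the characteristics of microblogging, we build a clustering model that uses the naive Bayes model as its foundation, construct a probabilistic scoring mechanism between users and communities, and propose an approach to assigning users to communities. In addition, to meet the need to find central nodes in community mining, we study a user importance algorithm: we abstract a microblog user into a model and give a scoring criterion over multiple variables. To validate these algorithms, we carried out offline experiments. Using a web crawler, we collected data on about 700,000 users from Sina Weibo, a well-known Chinese microblogging provider, and ran the user importance and community mining algorithms on this data set; both produced good results. We also use community mining techniques to study how to find "opinion leaders" on microblogging platforms, which should make network analysis and management more targeted.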

【Abstract】 As one of the most popular Internet applications, microblogging has attracted a huge number of users. According to a report from Sina in October 2010, Sina Weibo had more than 50 million users, and Twitter had more than 200 million. The rapid growth of microblogging raises a new question: how can a user find relevant people to interact with among so many others, that is, find a community? We look for an answer using data mining techniques. The task differs from traditional community finding because microblogging has its own characteristics: the data are real and very large in scale, so the algorithm must be more efficient and the domain model must differ considerably from earlier ones. We study community finding, build a clustering model based on the naive Bayes model, and propose a way to quantify the relationship between a user and a community. We also study user influence in microblogging: we build a model from user data and quantify each user's influence with a score over multiple variables. To validate these algorithms, we ran experiments on offline data. Using a web crawler, we collected data on about 700,000 users from Sina Weibo. The experiments indicate that our algorithms are effective. In addition, we try to find opinion leaders. We propose a user-influence algorithm, and a demo based on it shows that it is effective at finding opinion leaders.
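The abstract does not give the thesis's actual model, so the following is only a minimal sketch of how a naive Bayes score between a user and a community might be computed. The feature lists (interest tags), the seed data, the add-one smoothing, and the function names `train`, `score`, and `assign` are all assumptions made for illustration, not the author's method.

```python
import math
from collections import Counter


def train(labeled_users):
    """Count priors and per-community feature frequencies.

    labeled_users: list of (community, [feature, ...]) pairs.
    This seed-data format is hypothetical, chosen only for illustration.
    """
    prior = Counter()
    feature_counts = {}
    vocab = set()
    for community, features in labeled_users:
        prior[community] += 1
        feature_counts.setdefault(community, Counter()).update(features)
        vocab.update(features)
    return prior, feature_counts, vocab


def score(features, community, prior, feature_counts, vocab):
    """Return log P(community) + sum of log P(feature | community).

    Add-one (Laplace) smoothing is an assumption of this sketch.
    """
    total_users = sum(prior.values())
    counts = feature_counts[community]
    denom = sum(counts.values()) + len(vocab)
    log_p = math.log(prior[community] / total_users)
    for f in features:
        log_p += math.log((counts[f] + 1) / denom)
    return log_p


def assign(features, prior, feature_counts, vocab):
    """Assign a user to the community with the highest score."""
    return max(prior, key=lambda c: score(features, c, prior, feature_counts, vocab))


# Hypothetical seed users with interest tags as features.
seed = [
    ("sports", ["football", "nba", "running"]),
    ("sports", ["nba", "tennis"]),
    ("tech", ["python", "linux", "android"]),
    ("tech", ["android", "python"]),
]
model = train(seed)
best = assign(["nba", "running"], *model)
```

Working in log space avoids numeric underflow when a user has many features, and each user is scored in time proportional to the number of features times the number of communities, which matters at the scale the abstract describes.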

  • 【Classification number】TP393.092
  • 【Citation count】6
  • 【Download count】578

