
Research on Social Network Mining in the Web Environment

Mining Web Social Networks

【Author】 Lin Chen (林琛)

【Advisor】 Wang Wei (汪卫)

【Author information】 Fudan University, Computer Software and Theory, 2009, PhD

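【Abstract, translated from the Chinese】 Social network research is an important tool for understanding social phenomena, predicting human behavior, and analyzing social structure. Since the arrival of Web 2.0, a vast population of Web users, frequent interaction among them, and massive volumes of Web content have formed a huge Web social network, making social network mining in the Web environment a new focus of attention in information technology. Mining social networks on the Web plays an important role in understanding the behavior patterns of Web users and in improving Web applications such as recommendation, information retrieval, and online public opinion monitoring, which in turn improves user experience and productivity.

Social network mining in the Web environment faces several main problems. First, social networks on the Web are implicit and fuzzy. Second, Web data contains massive user-generated content with rich semantics. Third, Web data contains large amounts of junk content and junk links. Fourth, Web data is highly heterogeneous and varied in type, so a social network on the Web cannot be described with a single type of node and a single type of relation. Research on Web social network mining must address these problems.

This dissertation targets text data on the Web. To address implicit and fuzzy social networks, rich semantics, junk content, and multi-mode social networks with multiple relation and node types, it analyzes user behavior and adopts matrix-based, generative-model-based, and Markov-chain-based methods for modeling Web social networks. The goals are to extract implicit social networks, understand their semantics, identify junk content, assess data quality, and mine multi-mode social networks, and to support Web applications such as expert search.

The objects of study are Web forums and data from the enterprise and academic domains. Web forums built on threaded discussions are a valuable, massive knowledge base on the Web, and enterprise and academic data contain a great deal of specialized knowledge; both are important targets for data mining and knowledge discovery. Web forums contain large amounts of junk content, while enterprise and academic data contain many types of entities and relations. For these two data sources, the work and contributions of this dissertation are as follows.

- User behavior analysis. In Web forums, users post to take part in discussions and thereby interact closely with other users. To better understand the social and posting behavior of forum users, this dissertation carries out extensive statistical analysis. It finds that forum users differ greatly in the quantity and quality of their posts, and shows that the reply, friend, and acquaintance relations in the forum social network have a clear effect on the diffusion of interests and of expert knowledge among forum users.

- Modeling forum data based on sparse coding. In threaded discussions, structure and semantics change together and influence each other. Because existing work generally models semantics and structure separately, the matrix-based SMSS model is proposed to model both simultaneously. In addition, semantics and structure in threaded discussions are sparse: each post covers only a few topics and replies to only a few posts in its thread. L1 regularization terms are therefore introduced into the model to constrain structure and semantics. The model extracts fairly accurate social networks, handles the rich semantics and data quality problems of Web social networks reasonably well, and achieves good results in applications such as junk content identification and expert search.

- Modeling forum data based on generative models. Because the SMSS model's solutions for junk content identification and expert search are rather direct and simple, this dissertation also proposes generative-model-based methods for modeling forum data. A regularization term reflecting the structural relations among posts is added to the PLSA optimization objective, to capture the way structure and semantics change together and influence each other. Because the LDA model cannot accurately characterize junk topics, junk topics are introduced and distinguished from meaningful topics. Because forum authors differ in posting quality, authors' posting patterns are introduced to constrain the post generation process. Because existing expert search models estimate the probability of unobserved words inaccurately, the topics learned in the above models are used to extend the process by which an expert generates a query. To handle noisy authors who post a lot but with low quality, authors' posting pattern information is introduced into expert ranking. These models are successfully applied to semantic interpretation, junk content identification, and expert search.

- Modeling multi-mode social networks based on Markov chains. The enterprise and academic domains contain many types of entities, such as authors, papers, and personal homepages, and many types of relations, such as citation and collaboration. To make better use of type information and to adjust how strongly each type influences the result, this dissertation addresses expert search on multi-mode networks and proposes a framework for extracting multi-mode networks from Web data. Transition probability matrices are generated automatically from text for a given query, and experts are ranked with a Markov chain. To compute the probability of reaching expert nodes under a Markov process on a multi-mode network, a Markov random walk on multi-mode networks is proposed and proved to be ergodic and irreducible. Motivated by the practical needs of expert search in scenarios such as Enterprise and academia, the problem of expert search within communities is posed and a solution is provided. These models achieve good results in expert search and in expert search within communities.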
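
The sparsity idea mentioned above (each post covers only a few topics) can be illustrated with a generic L1-penalized reconstruction. The sketch below is not the SMSS model, whose formulation is not given in this record; it is a plain lasso-style problem solved with iterative soft-thresholding (ISTA). The names `sparse_code`, `topics`, `lam`, and `step`, and the toy data, are assumptions made for illustration only.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sparse_code(x, topics, lam=0.1, step=0.1, iters=500):
    """ISTA for min_a 0.5 * ||x - sum_k a[k] * topics[k]||^2 + lam * ||a||_1.

    x: word-weight vector of one post (hypothetical representation).
    topics: list of topic vectors over the same vocabulary.
    step must not exceed 1 / (largest eigenvalue of the topic Gram matrix).
    """
    n_topics, n_words = len(topics), len(x)
    a = [0.0] * n_topics
    for _ in range(iters):
        # Residual of the current reconstruction.
        recon = [sum(a[k] * topics[k][j] for k in range(n_topics))
                 for j in range(n_words)]
        resid = [recon[j] - x[j] for j in range(n_words)]
        # Gradient step on the squared error, then shrinkage for the L1 term.
        for k in range(n_topics):
            grad = sum(resid[j] * topics[k][j] for j in range(n_words))
            a[k] = soft_threshold(a[k] - step * grad, step * lam)
    return a


# Toy post: strongly about topic 0, with a trace of topic 2.
topics = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
post = [1.0, 0.0, 0.05, 0.0]
weights = sparse_code(post, topics)
```

With the L1 penalty, the weak topic 2 component falls below the threshold and is set exactly to zero, so the post is represented by a single topic. This is the kind of behavior a sparsity constraint is meant to encourage.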

【Abstract】 Social network analysis is widely recognized as an important tool for understanding human behavior and analyzing social structure. In the age of Web 2.0, more and more users join Web communities. With the content contributed by numerous users and the frequent communication and collaboration among them, the Web has become a huge social network holding large volumes of social content. As a result, recent years have witnessed an emerging research trend in mining Web social networks. Such research has proved helpful in capturing Web user behavior patterns, enhancing the performance of Web applications (such as recommender systems, information retrieval systems, and public opinion monitoring systems), improving user experience, and increasing working efficiency.

However, mining Web social networks is challenging, because virtual interactions among Web users differ from actual interactions among people, and Web content differs from traditional content. In general, the following factors prevent researchers from fully exploiting Web social networks. First, Web social networks are implicit, while traditional social networks are explicit. Second, the wide availability of social content created by Web users offers abundant semantics. Third, with various kinds of social interactions and heterogeneous social actors, Web social networks are multi-mode networks. Finally, since users are encouraged to contribute content, there is a large amount of junk content and junk links, and content quality varies widely.

To address these challenges, this dissertation focuses on mining Web text data, with the goals of extracting implicit social networks, revealing semantics, identifying junk content, measuring content quality, and mining multi-mode social networks.
Several techniques based on matrices, generative models, and Markov chains are proposed and applied to Web applications including expert search, junk identification, and text clustering.

The first part of this dissertation addresses mining social networks in Web forums. Threaded discussions are a popular way for Web users to exchange information, and they are used in a wide range of Web applications, including Web forums, instant messaging, chat rooms, and Web logs (blogs). They are therefore valuable data sources for knowledge mining. This research addresses three aspects of mining large-scale social networks in Web forums.

- User behavior analysis. In Web forums, users exchange ideas and opinions by posting comments and discussions. By analyzing the diverse posting behavior and social behavior of forum users, this contribution reveals that reply, friend, and acquaintance ("knows") relations significantly affect the diffusion of interest and expertise in Web forums.

- Modeling forum data based on matrices. Semantics and structure are coupled in threaded discussions: a reply indicates shared topics, and vice versa. To model this property, a matrix-based SMSS model is proposed that models the semantics and structure of threaded discussions simultaneously. The model imposes two sparsity constraints, which force a sparse reconstruction of each post in the topic space and a sparse approximation of each post from previous posts. The SMSS model is successfully employed in three applications: social network extraction, junk identification, and expert search.

- Modeling forum data based on generative models. Inspired by the intuition behind the SMSS model, generative models are presented to model the semantics and structure of threaded discussions. In particular, a PLSA-style model with a regularizer is presented to extract the reply relationships; LDA-style models are presented to distinguish junk topics from meaningful topics; and user posting patterns are learned so that both the quantity and the quality of related posts are taken into account when ranking experts.

The second part of this dissertation focuses on mining multi-mode social networks. For the problem of finding experts in multi-mode networks, an ergodic Markov chain model for multi-mode networks is presented. Finding experts within communities is also studied, to meet information needs in enterprise and academic environments.
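
The idea of ranking experts with a Markov chain on a multi-mode network can be sketched as a random walk with restart over a typed graph. This is a generic illustration under stated assumptions, not the dissertation's model: the restart (teleport) step, the per-relation weights in `type_weight`, the function name `rank_experts`, and the toy graph are all hypothetical. A restart step is one common way to make such an iteration converge; the dissertation's own ergodicity proof is not reproduced here.

```python
from collections import defaultdict


def rank_experts(edges, type_weight, query_scores, damping=0.85, iters=100):
    """Power iteration for a random walk with restart on a typed graph.

    edges: list of (source, target, relation_type) triples.
    type_weight: relation_type -> non-negative weight (hypothetical knob
        for adjusting how strongly each relation type influences the walk).
    query_scores: node -> non-negative query relevance, used as the
        restart distribution (must have a positive total).
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)} | set(query_scores))
    # Accumulate weighted out-links per node.
    out = defaultdict(dict)
    for s, t, rel in edges:
        out[s][t] = out[s].get(t, 0.0) + type_weight[rel]
    # Restart distribution, normalised to sum to 1.
    total = sum(query_scores.values())
    restart = {n: query_scores.get(n, 0.0) / total for n in nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for s in nodes:
            links = out.get(s, {})
            z = sum(links.values())
            if z == 0.0:
                # Dangling node: return its mass to the restart distribution.
                for n in nodes:
                    nxt[n] += damping * p[s] * restart[n]
                continue
            for t, w in links.items():
                nxt[t] += damping * p[s] * w / z
        p = nxt
    return p


# Toy two-mode network: authors and papers, with three relation types.
edges = [
    ("alice", "paper1", "writes"), ("paper1", "alice", "written_by"),
    ("bob", "paper2", "writes"), ("paper2", "bob", "written_by"),
    ("paper2", "paper1", "cites"),
]
weights = {"writes": 1.0, "written_by": 1.0, "cites": 2.0}
scores = rank_experts(edges, weights, {"paper1": 1.0, "paper2": 1.0})
experts = sorted(["alice", "bob"], key=scores.get, reverse=True)
```

In this toy graph both papers match the query equally, but `paper1` is cited by `paper2`, so more probability mass reaches its author and `alice` ranks above `bob`. Changing the weight of `"cites"` changes how much citations influence the ranking.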

  • 【Online publication submitter】 Fudan University
  • 【Online publication issue】 2010, Issue 12