Research on Online Communities Oriented User Information Mining and Its Applications

(Original Chinese title: 面向在线社区的用户信息挖掘及应用研究)

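【Author】 Jing Liu (刘璟)

【Advisors】 Hsiao-Wuen Hon (洪小文); Ting Liu (刘挺)

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2014, PhD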

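【Abstract, translated from the Chinese】 In recent years, as online communities of all kinds have grown, a massive amount of user information has accumulated on the web, including user account information (e.g., usernames), user demographic information (e.g., gender and age), user social relations (e.g., friend relations and reply relations), and user-generated content. This information can help enterprises understand and target their customers, can support better personalized information systems for users, and can help sociologists understand human behavior. Mining user information from online communities is therefore key to building new social applications and to understanding human behavior.

Mining user information in online communities faces three challenges: the unstructured-data challenge, the cross-community challenge, and the no-measurement challenge. The unstructured-data challenge is that user information appears in unstructured form on many different types of web pages, and the diversity and dynamics of page layouts make automatic extraction difficult. The cross-community challenge is that one user's information is fragmented across different communities, which makes it hard to understand a user comprehensively. The no-measurement challenge is that user attributes such as influence and expertise lack an explicit, direct measure, which makes it hard to apply them directly. This thesis studies these three challenges and explores an application of user information. Its main contributions are as follows.

(1) For the unstructured-data challenge, this thesis studies username extraction from web pages that contain user-generated content. It proposes a weakly supervised learning method that uses a small number of usernames made up of statistically rare strings to collect and label a large amount of training data automatically, which removes the need for the manually labeled training data that current supervised methods require. The method relies only on features extracted from a single page, which removes the dependence of existing methods on multi-page features. Experiments show that the method significantly outperforms the supervised method that uses only single-page features and performs comparably to the supervised method that uses multi-page features.

(2) For the cross-community challenge, this thesis studies cross-community user linking. It divides the problem into two steps: (a) disambiguation of identical usernames, which decides whether accounts with the same username belong to the same natural person; and (b) resolution of different usernames, which collects all the different usernames that one natural person uses. This thesis focuses on the first task. It first conducts a user survey and an analysis of About.me data to show quantitatively why the disambiguation task matters; this is the first work to study quantitatively how people use usernames. It then proposes a method that obtains training data automatically from the language-model probability of a username, and it verifies on a Yahoo! Answers dataset that the assumption behind the method is reasonable. The method removes the need for manually labeled data in current supervised methods. Experiments show that classifiers learned from the automatically labeled training set are effective.

(3) For the no-measurement challenge, this thesis takes user expertise estimation as an example of measuring user information, and studies the estimation of user expertise in question answering communities. It proposes a competition-based method that converts expertise estimation into the problem of estimating players' skill levels from the outcomes of a series of two-player contests. Unlike link-analysis methods, the proposed method can model heterogeneous information such as question-answer relations and answer quality in a unified way. Unlike answer-quality methods, which treat every question equally, it models the difficulty of each contest. Experiments show that the competition model is significantly better than both link-analysis methods and answer-quality methods at estimating the expertise of active users.

(4) From an application perspective, and building on structured, measured, and cross-community linked user information, this thesis studies difficulty estimation for crowdsourcing tasks, taking question difficulty estimation in question answering communities as the example. Using the measured user expertise, it proposes a user-competition model for estimating question difficulty. The expertise measure guides the difficulty estimate and resolves the inability of previous methods to handle observations that are partial-order relations. Experiments verify the effectiveness of the proposed model. Finally, this thesis uses cross-community user linking information to study question difficulty estimation across communities.

In summary, this thesis works on the unstructured-data, cross-community, and no-measurement challenges of user information mining, and, from an application perspective, applies structured, measured, and cross-community linked user information to crowdsourcing task difficulty estimation. The research has produced some preliminary results, which we hope will be a useful reference for other researchers in the field. As user information mining technology matures, we believe it will bring greater benefit to social applications and to research in social computing.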

【Abstract】 In recent years, with the development of various online communities, a huge amount of user information has accumulated on the web, including user account information (e.g., usernames), user demographic information (e.g., gender, age, and location), user social relations (e.g., friend relations and reply relations), and user-generated content. User information can help enterprises understand their clients and target new clients more accurately, can be used to build better personalized information systems, and can help sociologists understand human behavior. Hence, technologies for mining user information from online communities are key to building new social applications and to understanding human behavior.

However, mining user information from online communities faces several challenges: the unstructured-data challenge, the cross-community challenge, and the no-measurement challenge. The unstructured-data challenge is that user information in online communities is presented on web pages in an unstructured way, and the diversity and dynamics of web page layouts make it difficult to extract the information automatically as structured data. The cross-community challenge is that different aspects of a user's information are distributed across different online communities, which makes it difficult to understand all aspects of a user. The no-measurement challenge is that there is no explicit measurement of user characteristics (e.g., user influence levels and user expertise levels), which makes it difficult to apply the user information directly. This thesis focuses on these three challenges and explores an application of user information. Its main contents are summarized as follows.

(1) To address the unstructured-data challenge, this thesis studies the problem of extracting usernames from web pages that contain user-generated content and proposes a weakly supervised learning approach. The approach uses a small number of statistically rare usernames to collect and label large-scale training data automatically, which removes the need for the manually labeled training data that previous work requires. The approach relies only on single-page features, whereas previous work requires multi-page features. The experimental results show that the proposed approach significantly outperforms the state-of-the-art approach with single-page features and performs comparably to the state-of-the-art approach with multi-page features.

(2) To address the cross-community challenge, this thesis studies the problem of linking users across multiple online communities. We divide the problem into two tasks: (a) the alias-disambiguation task, which differentiates users who share the same username; and (b) the alias-conflation task, which finds all the different usernames used by one natural person. This thesis focuses on the alias-disambiguation task. We first analyze the importance of the alias-disambiguation step quantitatively by conducting a survey and an experimental analysis on an About.me dataset. To the best of our knowledge, this is the first study to quantify how people use usernames. We then present an approach that creates a training set automatically by leveraging the n-gram probability of a username, and we verify the effectiveness of this approach on a Yahoo! Answers dataset. The approach removes the need for the manually labeled training data that previous work requires. We also verify the effectiveness of the classifiers trained on the automatically generated training data.

(3) To address the no-measurement challenge, this thesis studies the problem of estimating user expertise scores as an example of measuring user characteristics. Specifically, it considers the problem of estimating the relative expertise scores of users in community question answering services and proposes a competition-based method. The method casts the estimation of user expertise scores as the problem of estimating the relative skill levels of players in two-player games. Compared with link-analysis-based approaches, the proposed method models the question-answer relation and answer quality information simultaneously in a unified way. Compared with answer-quality-based approaches, it accounts for the difficulty levels of different competitions rather than weighting all questions equally. The experimental results show that the competition-based model significantly outperforms both link-analysis-based and answer-quality-based approaches on a dataset of active users.

(4) From an application viewpoint, this thesis studies the problem of estimating the difficulty levels of crowdsourcing tasks based on structured, linked, and measured user information. Specifically, it studies the estimation of question (i.e., crowdsourcing task) difficulty levels in community question answering services and proposes a user-competition-based approach that leverages the measurement of user expertise levels. This measurement helps address the inability of previous work to handle partial-order observations. The experimental results show the effectiveness of the proposed model. Finally, this thesis studies the problem of calibrating question difficulty scores across communities by leveraging linked user information.

In conclusion, this thesis addresses the unstructured-data, cross-community, and no-measurement challenges, and also studies an application of structured, linked, and measured user information: estimating the difficulty levels of crowdsourcing tasks. The research has achieved some preliminary results, which we hope will be helpful to other researchers in this area. We believe that the development of user information mining technologies will help build new social applications and advance research in social science.
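
The competition view in contribution (3) can be illustrated with a minimal sketch. The abstract does not name the rating model, so the code below substitutes a plain Elo update as a stand-in; it is not the thesis's method. The mapping from a question thread to contests (the best answerer beats the asker and every other answerer) is an assumption, and the field names `asker`, `best_answerer`, and `other_answerers` are hypothetical.

```python
from collections import defaultdict


def pairwise_outcomes(thread):
    """Turn one resolved question thread into (winner, loser) pairs.

    Assumption (not stated in the abstract): the best answerer beats
    the asker and beats every other answerer of the same question.
    """
    best = thread["best_answerer"]
    pairs = [(best, thread["asker"])]
    pairs += [(best, user) for user in thread["other_answerers"]]
    return pairs


def elo_scores(threads, k=32.0, base=1500.0, epochs=1):
    """Estimate relative expertise with Elo updates (stand-in model)."""
    scores = defaultdict(lambda: base)
    for _ in range(epochs):
        for thread in threads:
            for winner, loser in pairwise_outcomes(thread):
                # Probability that the winner was expected to win.
                gap = (scores[loser] - scores[winner]) / 400.0
                expected = 1.0 / (1.0 + 10.0 ** gap)
                # An upset (low expected) moves both scores further.
                delta = k * (1.0 - expected)
                scores[winner] += delta
                scores[loser] -= delta
    return dict(scores)


if __name__ == "__main__":
    threads = [
        {"asker": "a", "best_answerer": "b", "other_answerers": ["c"]},
    ]
    print(elo_scores(threads))
```

Because each update depends on the current score gap, beating a highly rated opponent counts for more than beating a newcomer, which is one simple way to reflect the idea that contests differ in difficulty. Adding each question as a pseudo-player would extend the same sketch toward the difficulty estimation in contribution (4), though that extension is likewise an assumption here.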
