
基于用户特征的社交网络数据挖掘研究

Research on User Features Based Data Mining in Social Networks

【Author】 Lian Jie (廉捷)

【Advisors】 He Dequan (何德全); Liu Yun (刘云)

【Author information】 Beijing Jiaotong University, Information Network and Security, 2014, PhD

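【Summary】 Data is one of the most valuable resources on the Internet. Massive data sets hold great potential value, and mining them in depth matters for e-commerce, enterprise decision-making and promotion, and information diffusion and prediction. With the development of Web 2.0 applications and mobile terminals, social networks are ever more widely adopted and used. Compared with traditional network applications, social networks are marked by strongly user-centered activity, diverse network features, rich data content, close group interaction, and rapid information diffusion. Traditional research methods and models struggle to describe user behavior in social networks accurately, and therefore struggle to deliver data mining and analysis suited to the characteristics of these networks. In view of this, the dissertation draws on interdisciplinary research methods and addresses the effectiveness and performance problems that existing algorithms and models show when applied to social networks. It studies data mining methods for social networks from five angles: Internet data collection and processing, empirical analysis of social network data, user influence and behavior analysis, personalized recommendation algorithms, and machine learning based information prediction algorithms. The work was supported by the National Natural Science Foundation of China (No. 61172072, 61271308), the Beijing Natural Science Foundation (No. 4112045), and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20100009110002). The main research contents are as follows:

1. Internet data collection and preprocessing. To meet the specific requirements that data mining research places on sample accuracy and model processing performance, the dissertation proposes a complete scheme for data crawling and processing. First, it optimizes a Nutch-based distributed web crawler so that the crawler runs in a parallel, synchronized mode, which improves crawling performance. It then focuses on webpage parsing algorithms and proposes a rule-based parsing model and a wrapper-based parsing model. The rule-based model has simple logic and broad applicability, which suits the processing of massive numbers of webpages; the wrapper-based model achieves high extraction accuracy and yields structured information for pages from the same website. Finally, it studies fast webpage de-duplication and automatic summarization algorithms in order to reduce the number and dimensionality of sample features and to improve data quality.

2. Empirical analysis of microblog network features and user features. Online data from Sina Weibo is analyzed along several dimensions, including user features, microblog features, and temporal and evolutionary features, and the main factors acting on user influence and microblog propagation are discussed. On this basis, the dissertation proposes a user weight model for microblog social networks. The model is a weighted combination of a user activity feature and a HITS-based user influence feature. Guided by the data analysis, it also improves the implementation of the HITS algorithm and reduces the computation time that the traditional HITS model spends on iteration. Social networks place more emphasis on interaction between people, and the analysis of user authority provides a theoretical basis for further study of information recommendation and propagation mechanisms.

3. Personalized user recommendation in social networks. Because traditional recommendation algorithms cannot adequately describe user preferences in social networks, the dissertation proposes a microblog recommendation algorithm based on statistical features. The algorithm is a weighted combination of three features: the user's preference for microblog content, the influence of the microblog author, and the interaction relationship between users. Its logic is simple and its computational performance is high, which suits application-level work on online microblog platforms. To improve recommendation accuracy further, the dissertation adopts NBI, a recommendation model based on bipartite networks, optimizes the model's initial matrix and the link weights used in the computation, and adds the effect of social-network-specific user features on microblog preference, thereby achieving personalized microblog recommendation. Experimental results show that the algorithm recommends better than either the NBI model or a model based on a single preference feature.

4. Information prediction based on machine learning. Using real data from the microblog network, the dissertation analyzes the main features that affect user link relationships and microblog propagation, and builds an SVM-based user link prediction model and a logistic regression based user re-tweet model. To improve prediction performance and make the models practical in big-data settings, it makes a preliminary exploration of parallel parameter training for the relevant machine learning models and proposes an algorithm that optimizes the weight of the slack variable in the SVM model, which improves prediction accuracy. Finally, it takes the output of the user re-tweet model as the prior probability of each individual's decision and uses Monte Carlo simulation to model how a microblog propagates through the social network. By combining a microscopic model of individual decisions with a global simulation, the method can predict the macroscopic propagation trend of a piece of information and can also reveal key user nodes along the propagation path, which offers a reference for research on information propagation prediction.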
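The bipartite-network recommendation idea named in item 3 can be illustrated with a minimal sketch. It shows only plain, unweighted NBI (network-based inference) resource diffusion on a toy user-tweet network, which is the baseline the dissertation reports improving, not its modified initial matrix or link weights. The function name `nbi_scores` and the toy data are hypothetical.

```python
from collections import defaultdict


def nbi_scores(user_items, target_user):
    """Score unseen items for target_user with two-step resource diffusion."""
    # Invert the bipartite graph: which users collected each item.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    # Step 1: every item the target collected holds one unit of resource
    # and splits it equally among all users who collected that item.
    user_resource = defaultdict(float)
    for item in user_items[target_user]:
        share = 1.0 / len(item_users[item])
        for user in item_users[item]:
            user_resource[user] += share

    # Step 2: every user splits the received resource equally among
    # the items that user collected.
    item_resource = defaultdict(float)
    for user, resource in user_resource.items():
        share = resource / len(user_items[user])
        for item in user_items[user]:
            item_resource[item] += share

    # Only items the target has not collected are recommendation candidates.
    seen = user_items[target_user]
    return {item: r for item, r in item_resource.items() if item not in seen}


# Toy data (assumed): users mapped to the tweets they re-tweeted.
toy = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"c"}}
print(nbi_scores(toy, "u1"))  # {'c': 0.25}
```

In this toy network the only unseen tweet for `u1` is `c`, which receives resource through the shared tweet `b` and user `u2`. A weighted variant would replace the equal splits in both steps with link-specific weights.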

【Abstract】 Data is one of the most valuable resources on the Internet; it carries important information and can be put to many uses. Data mining therefore matters for e-commerce, enterprise strategy and promotion, and information diffusion and prediction. With the development of Web 2.0 techniques and mobile terminals, social network services (SNS) are growing in popularity and use in daily life. Compared with traditional network applications, social networks place more emphasis on the user's proactive role and feature diverse network characteristics, large volumes of information, close user interaction and reciprocity, and fast message propagation. Traditional approaches and models are inadequate for describing user behavior in such networks, and so are insufficient for data analysis and mining in social networks. In view of this, the dissertation applies interdisciplinary methods and theories to data mining in social networks, covering data retrieval and preprocessing, network analysis, user influence and behavior, personalized recommendation, and machine learning based prediction, in order to improve the efficiency and effectiveness of existing algorithms and models. The work is supported by the National Natural Science Foundation of China (No. 61172072, 61271308), the Beijing Natural Science Foundation (No. 4112045), and the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20100009110002). The main contributions of the dissertation are as follows:

1. We study information retrieval and data preprocessing techniques. To meet the specific requirements that data mining places on data accuracy and on the computational performance of the models, we propose an integrated framework for data retrieval and processing. First, we optimize the Nutch-based distributed web crawler with a parallel, synchronized operating architecture, which improves crawler efficiency. We then study information extraction methods and propose two webpage extraction models, one based on rules and one based on wrappers. The rule-based model is simple and broadly applicable, which suits the processing of massive numbers of webpages. The wrapper-based model achieves high extraction accuracy and yields structured information for pages from the same website. Finally, we study fast webpage de-duplication and automatic summarization algorithms in order to reduce the number and dimensionality of sample features and to improve data quality.

2. We empirically analyze the features of the SINA Microblog social network, including user characteristics, tweet features, and temporal and evolutionary features, and discuss the dominant factors that act on user influence and tweet dissemination. Building on these statistical results, we establish a user weight model for microblog networks. The model is a weighted combination of user activity and a HITS-based user influence factor. We improve the HITS algorithm by eliminating the iteration in the node authority calculation. Interaction among users is one of the defining features of social networks, so the analysis of user authority provides a basis for research on information recommendation and diffusion.

3. We study personalized recommendation algorithms in social networks. Because existing recommendation models can hardly describe user preferences in social networks, we introduce a tweet recommendation algorithm based on statistical features for microblog networks. The algorithm combines content similarity, author influence, and user reciprocity. It has low computational complexity and suits real online microblog systems. To improve recommendation accuracy further, we adopt NBI, a recommendation model based on bipartite networks, for tweet recommendation. We improve the traditional NBI model by optimizing its initial matrix and the link weights used in the resource allocation process, and we combine the improved model with user features that capture the distinct characteristics of social network recommendation. The experimental results show that the proposed model is more effective than either the traditional NBI model or a recommendation model based on a single preference feature.

4. We propose information prediction approaches based on machine learning. Drawing on the empirical analysis of the SINA Microblog network, we identify and quantify the features that make up the feature vector of each data sample. We then establish a user link prediction model based on SVM and a user re-tweet model based on logistic regression. To improve classification accuracy and make the models practical for big data, we make a preliminary exploration of parallel parameter training for the relevant machine learning models, and we optimize the weight of the slack variable in the SVM model. Finally, we take the output of the user re-tweet model as the prior probability of each individual's decision and use the Monte Carlo method to simulate tweet propagation in SINA Microblog. The method combines a microscopic model of individual decisions with a global simulation of information dissemination. It can therefore predict the general trend of a topic and also discover the key users along the diffusion path. These approaches offer a reference for research on information dissemination and prediction.
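The Monte Carlo step in item 4 can be pictured with a minimal sketch. It assumes a follower graph and a per-user re-tweet probability that stands in for the output of a re-tweet model; the spreading rule (each exposure is an independent trial, as in an independent-cascade model), the function names, and the toy data are all assumptions for illustration, not the dissertation's actual simulation.

```python
import random


def simulate_cascade(followers, retweet_prob, seed_user, rng):
    """Run one random cascade and return the set of users who posted or re-tweeted."""
    active = {seed_user}
    frontier = [seed_user]
    while frontier:
        next_frontier = []
        for user in frontier:
            for follower in followers.get(user, ()):
                if follower in active:
                    continue
                # Assumed rule: each exposure is an independent Bernoulli trial
                # with the follower's own re-tweet probability.
                if rng.random() < retweet_prob.get(follower, 0.0):
                    active.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return active


def estimate_spread(followers, retweet_prob, seed_user, runs=1000, seed=0):
    """Average many cascades: mean cascade size and per-user activation frequency."""
    rng = random.Random(seed)
    total_size = 0
    hits = {}
    for _ in range(runs):
        active = simulate_cascade(followers, retweet_prob, seed_user, rng)
        total_size += len(active)
        for user in active:
            hits[user] = hits.get(user, 0) + 1
    return total_size / runs, {user: count / runs for user, count in hits.items()}


# Toy data (assumed): "a" posts; "b" and "c" follow "a"; "d" follows "b".
followers = {"a": ["b", "c"], "b": ["d"]}
probs = {"b": 0.6, "c": 0.1, "d": 0.5}
mean_size, frequency = estimate_spread(followers, probs, "a")
print(mean_size, frequency)
```

The mean cascade size approximates the macroscopic trend, while users with a high activation frequency who also lead to many further activations are candidates for key nodes on the diffusion path.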
