节点文献

社会网络社团挖掘若干关键技术研究

Key Technologies Research on Community Detection in Social Networks

【作者】 林旺群

【导师】 吴泉源;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2012, 博士

【摘要】 社会网络泛指由世界上具有复杂连接关系的、与研究目的相关的事物或对象所组成的一种网络,广义上的社会网络包括人际关系网络、酶蛋白质网络、生物链网络、新陈代谢网络以及以互联网为代表的各种社交网络等。社会网络可以抽象为一个具有复杂连接关系和结构的可演化图。通常,社团是一组内部连接紧密,外部连接稀疏的节点集合。社会网络社团挖掘对于加强对社会网络的拓扑结构理解、功能分析和行为预测具有重要的理论意义及实用价值,已被广泛应用于恐怖组织识别、新陈代谢路径预测、蛋白质相互作用网络分析、搜索引擎开发和网络舆情监控等诸多领域。本文针对社团挖掘的不同应用场景,分别就多重图模型下的社会网络、含有不完全连接信息的社会网络、二部图模型下的社会网络和多类型节点组成的社会网络的社团挖掘关键技术进行研究,主要贡献有:1、提出了一种基于层次化子图树的多重图社会网络社团并行挖掘方法。根据多重图数学模型,社会网络中任意两个节点之间允许有多条连接边,每条边对应于一个连接属性,赋以一个自然数的权重。对于社会网络图中的任一子图,本文运用该数学模型,通过综合该子图各边权重之和与节点个数引入子图密度的概念,并基于从权重较低的边出发进行逐步分解的思想,构造了子图密度逐层递增的层次化子图树HAT,提出了一种针对静态社会网络的层次化社团并行分解更新算法。进而,根据动态社会网络在相邻时间段内的局部变化原理,提出了针对动态社会网络的层次化社团并行分解动态更新算法。模拟数据集上的实验表明,以上算法在社团挖掘的准确度上优于GN等同类算法;数千万边和节点规模的真实数据集上的实验表明,以上算法适于高度并行计算,可以有效应用于大数据社会网络的社团挖掘。2、提出了基于节点亲近度和马氏距离的不完全信息社会网络社团挖掘方法。假设考察的社会网络各节点具有相同的属性,每个节点对应于一个n维属性值向量;节点之间的连接信息具有不完全性,即某些局部区域的某些连接未知。本文选取具有完全连接信息的某些区域作为学习的样本区域,首先引入节点亲近度的概念,任意一对节点的亲近度由各自连接的邻居数和公共邻居数的某个经验公式给出,亲近度越接近于1,节点对越亲近,为0的节点对视为不亲近。接着对样本区域的每个节点的n维值向量,乘以某个矩阵,将它们变换到一个称为马氏空间的新空间中,使越亲近的节点对,其值向量在马氏空间中的距离,即马氏距离越小。于是,问题便转化为能够迭代求解的数学规划问题:寻找马氏变换矩阵,使所有亲近的节点对,其马氏距离的亲近度之加权和达到最小,同时使不亲近的节点对的马氏距离之和达到最大。最后,通过以上学习过程得到的马氏矩阵,将整个社会网络中的所有节点变换到马氏空间中,经贪婪搜索,给出一种基于节点马氏距离的聚类算法,以及一种-近似社团启发式分解方法。经数千个节点规模的真实数据集的实验,表明本文提出的方法和解决方案是有效的。3、提出了基于图像变换的二部图社会网络重叠社团挖掘方法。二部图社会网络指由两组子网组成的社会网络,边连接仅存在于不同子网的节点之间,子网内部没有任何边连接。其社团挖掘问题是要找出那些分属两个子网的两个社团,称为社团对,尽管社团内部各节点互不直接相连,却与共同的对方社团的节点有密切关联。设两组子网分别包含m个和n个节点,二部图连接关系可以用一个m×n的0/1关系矩阵或黑白图像描述。其社团对的挖掘问题转化为:如何通过整行(列)图像的不断交换与调整,使图像中的黑色像素点最大程度的聚合在一起。本文首先基于哈夫曼图像编码理论,以信息量最小化为目标,改进了一种行(列)贪婪搜索与交换的双聚类迭代算法,使整幅图像分解成若干矩形块,找出那些黑像素点密度较高的块,也就挖掘出了相应的行社团和列社团。接着,进一步针对某些行(列)节点可能同时属于多个不同的行(列)社团的应用场景,提出了一种基于增益函数的重叠社团模式搜索双聚类算法,其核心思想是,对于某一其它行(列)社团的节点能否同时属于本社团,取决于该节点的加入能否使本行(列)社团得到正收益,为此引入增益函数,它是新老社团所对应的块状区域密度差异的加权和。模拟数据集和真实数据集上的实验表明,算法具有良好的二部图社团挖掘效果。4、提出了一种具有多类型节点的社会网络社团挖掘方法。在多类型节点组成的社会网络中,连接既存在于同类节点之间,也存在于不同类节点之间,而社团组成仅限于同类节点。具有n类节点的社会网络的连接关系可以用n×(n+1)/2幅黑白图像描述。本文首先将第一类节点相关的n幅图像拼接成一长幅图像,与第二类节点相关的n-1幅剩余图像也拼成一长幅图像,以此类推。接着把社团挖掘问题转化为与二部图相似的长幅图像的行和列的交换问题,其差异主要在于列交换时必须注意保持各列的语义一致性,其复杂性主要表现在需要综合各长幅图像的黑像素点聚集度。本文运用哈夫曼图像编码理论,以最大化压缩描述上述所有图像为目标,分别在社团数量已知和未知条件下,提出了相应的多聚类贪婪搜索算法。模拟数据集和真实数据集上的实验表明,算法在聚类精度上优于NMF等同类算法。

【Abstract】 A social network is a social structure made up of a set of entities such asindividuals or organizations, and the dyadic ties between these entities. Generally, asocial network is related with the research goals and can be used to describe differentrelationships in different fields. Commonly, social networks include human interactingnetworks, enzymatic protein networks, biologic chain networks, metabolic networks anddifferent kinds of virtual interacting networks of the Internet. The social network can bedescribed by a dynamical evolution graph, in which each node stands for an entity andeach edge between two nodes stands for the interaction between them. Communities areimportant components of the social networks. Usually, the community is formed with aset of nodes, in which the nodes are densely connected internally and loosely connectedexternally. Community detection is critical for the topological structures understanding,function analysis, and node actions prediction of the social networks. Hence, the socialnetwork has important theoretical and practical value on research and applications.Moreover, community detection is commonly used in many applications such asterrorist organization identification, metabolic pathway prediction, search enginedevelopment and public sentiment monitoring, etc. In this paper, according to differentapplication environment, we focus on community detection in multigraph socialnetwork, bipartite social network, incomplete information social network and the socialnetwork with multi-type nodes respectively. The contributions of this paper include:Firstly, a parallelizable method based on hierarchical subgraph trees forcommunity detection in multigraph social network is proposed. According to themathematical definition of multigraph, the multigraph is permitted to have multipleedges between any two nodes, and each edge is permitted to attach with a naturalnumber weight. We use the multigraph to describe the social networks. Based on themultigraph model, the density of any subgraph in the social network is defined bysimultaneous considering the sum of the weight and the total number of the edges in thesubgraph. In order to generate the hierarchical subgraph tree (called HAT), aparallelizable algorithm for generating hierarchical communities in static social networkis proposed. Considering the dynamical social network only changes in a small localarea in the short continuous time, a parallelizable algorithm for generating hierarchicalcommunities in dynamic social network is designed. Through the experiments on realand synthetic datasets, we demonstrate our algorithms are more accurate than thecompared algorithms such as GN etc. Further experiments on big data social network,which has more than ten millions of nodes and edges, validate our algorithms arescalable and fit for community detection in the big data social networks.Secondly, a community detection method based on closeness and Mahalanobis distance for the incomplete information social network is presented. We assume eachnode in the incomplete information social network is attached with a set of attributes,which is described by a vector. We also assume there are many local completeinformation regions in the incomplete information network. Moreover, in each localcomplete information region, the complete link information is provided, but other linkinformation outside of the local complete information areas is unknown or incomplete.The main idea of detecting the community in incomplete information networks is asfollows. We first learn a global distance metric from the local information regions withcomplete link information. Then, we use the global metric to measure the distancebetween any pair of nodes in the network. Because the metric is learned from thestructure of the network, the distance should also reflect the missing link structure in thenetwork. In order to learn such a global distance metric, the closeness between any pairof nodes is defined by considering how many neighbors they shared and how manyneighbors they have respectively. The value of closeness ranges from0to1, and thelarger value of closeness between any two nodes, the closer between them from theview of link structure. Subsequently, the closeness between any pair of nodes in eachlocal complete information region can be computed. Consequently, the problem oflearning the global distance metric is transformed into learning a Mahalanobis matrixwhich satisfies: for any pair of nodes in any local complete information region, thehigher value of closeness between them, the short distance between them after beingprojected in the Mahalanobis space. Finally, a distance-based community detectionalgorithm for incomplete information social network is presented. To speed up theclustering process, a heuristic method for generating the-approximation communityis designed. The experiments on real data sets demonstrate the effectiveness andefficiency of the proposed methods.Thirdly, an overlapping community detection method based on imagetransformation for bipartite social network is presented. The bipartite social network ismade up of two sub-networks. Particularly, in bipartite social network, edges can onlyexist between nodes from different sub-networks, and not exist between the nodes fromthe same sub-networks. The problem of community detection in bipartite social networkis to discover the community pairs which exist in different sub-networks of the bipartitesocial network. Assume there are m nodes in one sub-network and n nodes in anothersub-network. Then, the bipartite social network can be described by a0/1relationalmatrix or a white and black image. Then, the community detection problem in bipartitesocial network is transformed into how to adjust the rows and column of the white andblack image for clustering the black pixel together as much as possible. Based on theHuffman coding theory for image, we try to compress the total information which isused to describe the image by adjusting the positions of rows and columns. Then, a greedy co-clustering algorithm, which can discover the row clusters and column clusters,is improved. The blocks, which are at the crossing of the row clusters and columnclusters, either have a very high or a very low density of pixel. In order to detect theoverlapping row clusters and column clusters, a gain function is defined. Next, anoverlapping co-clustering algorithm based on the gain function is designed. The mainidea of this algorithm is that: for any row (column), if it can be put into more than onerow (column) clusters, it should improve the value of the gain function, otherwise, itshould not be put into the new row (column) cluster. The experiments on synthetic andreal data sets validate the effectiveness and efficiency of the proposed methods.Fourthly, a community detection method for social networks with multi-type nodesis presented. In the social networks with multi-type nodes, the edges exist betweennodes from not only the same type but also different type. Assume there are n types ofnodes in the social network, we can use n(n+1)/2images at the most to describe therelationships between different types of nodes in the social network. We firstconcatenate the n images, which share the same type of nodes. Then, for the second typeof nodes, the left n-1images are concatenated together as the first type of nodes. Thisprocess is repeated until all of the images are dispatched to different concatenatedimages. According to the Huffman coding theory for image, the problem of communitydetection in social networks with multi-type of nodes is transformed into a new problemwhich is how to compress the set of concatenated images by adjusting the positions ofrows and columns in the concatenated images. Since the new problem was proved to beNP-hard, two greedy algorithms, which are used to detect communities when thenumber of communities are provided and not provided, are presented respectively.Experiments on synthetic and real datasets demonstrate the presented algorithms aremore effective than the compared methods such as NMF.

节点文献中: