节点文献

基于聚类分析的网络用户兴趣挖掘方法研究

Clustering Based Net User Interest Mining

【作者】 马力

【导师】 焦李成;

【作者基本信息】 西安电子科技大学 , 电路与系统, 2012, 博士

【摘要】 网络应用的深入发展使网络信息服务系统的服务模式从集中统一的被动型向分布式个性化的主动型演进。实现这种服务模式转换的一个前提条件是对网络用户需求规律的深入理解,进而依据这些规律指导信息服务系统的信息资源组织与调整,使用户的需求信息与系统提供的尽可能一致。网络用户兴趣作为网络用户信息需求规律的一种形态,是构造新一代信息服务系统中资源组织自适应机制的工作基础。本文围绕用户兴趣模式提取这一目标,以用户访问的网页中文文本信息为对象,利用复杂网络理论、图论、随机过程理论、人工免疫网络原理及中文语义计算等方法与技术,较为深入的研究基于文本聚类的用户兴趣挖掘算法及相关问题,以期在降低聚类算法的计算复杂度,实现软聚类及探索新的处理方法等方面进行有益的尝试。主要研究内容包括下述四个方面:(1)用户兴趣挖掘模型。网络用户兴趣模式是用户个体和用户群体使用网络行为规律的描述,网络兴趣挖掘模型则是获取用户兴趣模式的一组规范处理流程。针对Web用户访问Web站点的行为过程,本文依据全信息理论中的信息过程模型,提出了一种网络用户兴趣挖掘概念模型,其核心是从信息认知角度描述挖掘用户兴趣模式的处理过程,这种信息认知是由语法认识和语义认知二个层次来描述。该挖掘模型的重要特点是将多层次多角度的用户兴趣处理过程统一到一个框架中。为了具体指导网络用户兴趣挖掘工作,文本给出基于聚类分析的用户兴趣模式及迁移模式的挖掘模型。应用实践表明所提出两个模型是合理的。(2)文本聚类中的降维处理算法。针对文本特征集维数较大这一典型问题,利用小世界网络模型具有描述自然界和人造系统的动态属性和结构特征之间关系的特点,本文采用K-最近邻耦合方式构造文本词语网络图,该文本词语网络中的节点表示文本中的词语,边表示词语间的某种空间距离上的相邻关系。引入词语聚类系数变化量和平均最短路径变化量度量词语的重要性。通过计算词语的这两个变化量来确定词语是否存在小世界特征,进而实现特证词的选择。该方法的特点从基于空间距离的文本组织结构中选择特征词。实验结果表明该方法是有效的,为文本特征提取提供一条新的解决途径。(3)文本聚类算法研究。虽然已有许多成熟聚类方法较好地实现文本聚类分析,但由于词语的多义性,文本特征的稀疏性以及文本类别分布的多样性,使得聚类结果很难保证生成文本类别与人们所期望的类保持高度一致。为此,仍需从多种技术途径研究聚类算法。鉴于传统基于优化方法的聚类算法普遍存在需要事先知道聚类类别数,对类边界不清晰的数据处理不当及易陷入局部极大等问题,将人工免疫系统(ArtificalImmune System,AIS)方法引入到文本聚类处理之中,提出自适应多克隆聚类算法,其主要处理环节是引入重组算子来增加抗体种群中个体的多样性,以扩大解的搜索范围,避免过早出现早熟现象;引入非一致变异算子增强局部求解的自适应性,优化局部求解性能,加快解的收敛速度;用亲和度函数调节聚类类别。另外,利用Markov链证明算法的收敛性。针对文本数据,对上述算法进行适当的裁剪,提出基于人工免疫网络的文本聚类算法,实验结果表明算法聚类的有效性高。现实生活中许多事物都可以用一个复杂的网络来描述,在这些实际网络中都存在着一个共同的性质:社团结构。复杂网络中的社团结构发现本质上就是网络上节点的聚类处理,本文将复杂网络理论中的方法引入到文本聚类分析中,提出基于社团结构发现的文本聚类算法,利用知网(Hownet)语义相似度计算公式,定义文本相似性度量方法,依据文本相似性构造文本关联图,利用称为Newman聚类算法实现文本的聚类分析。这种方法的特点是可处理大规模问题。针对目前的大多数文本聚类算法都将文本进行严格归为一类和计算复杂度高的问题,考虑后缀树模型能有效的表示特征词间的关系、具有增量式更新以及遍历时间短等特点,本文将后缀树模型引入文本聚类中,提出了基于语义计算的后缀树聚类算法,该算法通过对特征词语义相似度和权重的判断构建后缀树,选择基类节点构造基类连通图,求解树连通性以便实现聚类处理。为了降低算法的时间和空间复杂度,进一步提出基于语义后缀网的聚类算法,本算法的改进之处是:通过计算特征词间的语义相似度来构建后缀网,使后缀网的节点数和分支数减小,并通过特征词的权重判断来选择基类。实验结果表明这两种算法都能实现文本的软聚类,时间复杂度小,且聚类的类簇标识可读性强。(4)网络用户兴趣模式及变迁模式发现。用户兴趣模式实际形式是用一组有显著类别的特征词集合组成。本文通过计算文本簇中的大部分文本中出现同一个词语或者出现一类词义相似的词语的词频来选择生成用户兴趣模式的。用户兴趣的迁移模式是用户兴趣模式随时间动态变化的一种描述。针对文本存在多主题性这一问题,提出了一种基于隐马尔可夫原理的用户兴趣序列获取方法,该方法以用户访问序列和用户兴趣为对象,建立基于用户兴趣序列的隐马尔可夫模型,采用其解码问题相关算法实现用户最优兴趣序列的获取。采用序列模式挖掘算法获得用户兴趣序列的频繁模式。这些频繁模式就是用户兴趣的迁移模式,其本质是一种具有顺序特征的用户兴趣关联规则。为了提高挖掘效率,采用基于频繁链表-存取树(FlaAT)结构的挖掘算法获取频繁模式,该算法的优点是处理速度快且能通过更新FlaAT结构实现序列的增量式挖掘。实验表明所提方法是可行的,挖掘出的用户兴趣迁移模式不仅能够表现出用户兴趣的变化,也能够反映出用户兴趣之间的关联和变化规律。

【Abstract】 With the development of net application, the service model transforms from theintegration and uniform model to the contribution and personalization model. To realizethis model conversion, one precondition is having in-depth understanding of demandrules to net users, results in guiding the organization and adjustment of informationresources of information service systems according to these rules, and making theinformation of user requirements and system supply as far as consistent. As one form ofnet user information demand rule, the net user interest is the foundation to construct anew generation system of information service which has the character ofself-configuration for the resource organization.To make beneficial attempt such as reducing the computational complexity ofclustering algorithms and realizing the soft clustering and exploring new methods tosolve the clustering problems, this paper conducts deeper research on user’s interestmining algorithm based on text clustering algorithm and other related questions,focusing on the goal to extract the user interest, making Chinese text information ofusers accessing webpage as object, and using theories and techniques like complexnetwork theory, graphic theory, stochastic process theory, artificial immune networktheory and Chinese semantic calculation. The main content contains the following foursections:(1) The interest mining model. The network user interest model is the description for thebehavior law of individual user and user groups using net, and the mining model is a setof standardized processes for getting the user interest model. According to the behaviorprocess of Web users accessing to Web sites, this paper proposed a concept model of netuser’ s interest excavation, which was based on the model of information processcontained in the theory of the full information. The core of this model is a processingprocedure to describe and mine user’s interest mode from the Angle of informationcognition, which is described by grammar and semantic cognition. The importantfeature of this mining model is unifying the user’s interest process which is multi-leveland multi-perspective to a frame. To guide mining task in detail, this paper gave themining model of user’s interest mode and migration mode upon clustering analysis. Andat last, we use the experiment to demonstrate the rationality of the models.(2) Dimension reduction algorithm in text clustering. Focused on the typical problemsof the big dimension number, we used the features of description the dynamic properties and structure factors between the nature and the artificial system of the model ofsmall-word net and we used the K-nearest coupling algorithm to construct the textwords network diagram, and in this diagram, the nodes stands for the words in the textand the ledges stands for the neighbor relationship on distance. We also exposed theimportance of the changing of the clustering number and the shortest path length inmeasuring the words. Through calculating the variation of the words, we can verifywhether it has the feature of small-word network and to realize the selection of thefeature words. The results of experiments demonstrated that the method is rationalityand this is a new way to extract the feature of the text.(3) On the research of the clustering algorithm. Although there are many goodalgorithms for realizing the analysis of text clustering, it is hard to guarantee theconsistency between the text types and the requirements, because of the ambiguity ofthe words and the sparse of the text feature. So, it is necessary to research the clusteringalgorithm from other technical ways. Based on a simple description of the basicprinciple of biology immune and colonel process, a poly-colonel clustering algorithmwith self-adaptive feature is put forward. The main idea of the algorithm is to putvarious operators in artificial immune system into clustering process and adjustclustering numbers automatically by affinity function. The recombination operator isintroduced to increase the diversity of antibody group so as to broaden the search scopeof the global optimization solution and avoid early mature phenomenon of the group.And the non-consistent mutation operator is introduced to enhance the adaptability andoptimize the performance of local solution seeking; meanwhile convergence of thealgorithm is speeded up. The experimental result shows that reasonable clustering couldbe realized by the proposed algorithm.In this paper we introduce the method of complex network theory to textclustering analysis, based on the algorithm for detecting community structure incomplex networks, a new method of clustering algorithm is proposed. In the method,we define text similarity measure methods through HotNet similar calculation formula.Structure an association diagram according to the text document similarity by using aclustering algorithm named Newman to cluster texts and analysis. This method isappropriate for dealing with large-scale problems.Focused on the problems of strict classification and the high calculation complexityof the normal algorithms, we considered the feature of suffix tree in expressing therelation between the different words, the short ergodic time, and the increment refresh process, and brought the suffix tree model into the text clustering, and we also exposedthe suffix tree clustering algorithm based in the semantic calculation and the clusteringalgorithm based on the suffix net. The results of the experiments showed that bothalgorithms can realize the soft clustering and have the features of small time complexityand strong readable of class cluster identification.(4) On finding the interest model and drift pattern of net user. The actual form ofthe user interest model is composed with a group of feature-words which have acharacter of significant category. The method, calculating the frequency of the samewords or the similarly words in the most texts, was used in generating the interest of thenet user. It is a dynamic expression with the time going that the interest drift pattern ofnet user. Focused on the problem of multiple themes of text, a method for getting theinterest sequence based on Hidden Markov was proposed in this paper. In this method,the Hidden Markov Model of net user interest was created with the objects of the accesssequence and interest, using the decoding problem related algorithm to obtain the bestinterest sequence. Through sequential pattern mining algorithm to get the frequentsequence mode which is the interest drift pattern. The essence of the pattern is a kind ofinterest related rules with the feature sequence. In order to improve the miningefficiency, a mining algorithm based on Frequent Link-Access Tree (FLaAT) was usedto mine the frequent mode,, and this algorithm has some advantages, such as fastprocesses speed and the incremental mining through refreshing the structure sequenceof FlaAT. Experiments show that the proposed method is viable, the interest pattern digout can not only show the interest changes, but also can reflect the relationship and thechange rules between the interests.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络