Research on the BIRCH Clustering Algorithm Based on Latent Semantic Indexing and Immunological Learning

【Author】 Yue Shuzhen (岳淑珍)

【Advisor】 Guan Yi (关毅)

【Author information】 Harbin Institute of Technology, Computer Science and Technology, 2009, Master's thesis

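【Abstract (from the Chinese)】 The Internet has become an important part of daily life. The information stored on it is massive and constantly changing. Users expect personalized services, and service providers need a basis for deciding which personalized services to offer; user interest mining emerged to meet this need. User interest mining records and analyzes users' interests and builds applications around a computational model that describes those interests. To keep the user interest model usable and accurate, we choose implicit modeling: without interrupting the user's browsing, the system collects information that reflects the user's interests, builds a user model from it, and infers the user's interests. This thesis uses log files that record users' searches and page visits. Processing has three stages: preprocessing, user interest modeling, and application. To handle a large collection of web documents that grows incrementally, the system's main modeling technique is BIRCH clustering, whose processing time is linear. After preprocessing steps such as log filtering and main-text extraction, web documents represented with the traditional vector space model tend to have high-dimensional, sparse feature vectors. This thesis therefore adds an improved latent semantic indexing (LSI) step. Comparative experiments show that processing time drops markedly and that, with suitably chosen BIRCH parameters and LSI k value, the method yields better clustering results for the data and application domain at hand. The experiments also confirm that LSI can reduce the dimensionality of the text representation while preserving the main semantic structure, extracting the most meaningful dimensions of the resulting latent semantic space as features. Validity measurement is key to evaluating the results, and the choice of validity function is critical to how well results are judged. For different data, BIRCH clustering needs optimized parameters to produce better results. This thesis studies the artificial immune network algorithm and explores introducing its adaptive mechanism into the tuning of BIRCH parameters, then uses the tuned parameters to set the validity function best suited to the application domain and the characteristics of the data. Using these techniques, this thesis builds a user model and, on top of it, develops user clustering and friend recommendation applications. Manual checking indicates that the model describes and computes the different interest areas of different users reasonably well.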

【Abstract】 The Internet has become an important part of people's lives, and the information it presents to users is vast and constantly changing. Internet users expect personalized services, and servers that want to supply them need an accurate description of users' interests to support decision-making; user interest mining technology emerged to meet this need. User interest mining records and analyzes users' interests, models them, and develops applications based on the model. Considering the availability and accuracy of the model, we choose an implicit modeling approach to predict users' interests: it does not alter normal browsing and reading patterns, and instead mines log files that record searches and accesses. The implicit user interest mining system consists of three steps: preprocessing, user interest modeling, and application. To better handle massive and incrementally growing collections of network documents, the system adopts BIRCH clustering, whose processing time is linear, as the mining technique. After preprocessing, the feature vectors of network documents under the Vector Space Model are high-dimensional and sparse, so we combine latent semantic indexing (LSI) with BIRCH. Comparison experiments show that the method produces more effective clustering results when the BIRCH parameters and the value of k in LSI are chosen suitably, and they validate that LSI can retain the main semantic structure while selecting the most informative dimensions. Measuring the validity of clustering results is one key to clustering. For different data sets, the most suitable parameters must be found to obtain better performance. This thesis studies the artificial immune network algorithm and explores introducing its adaptive mechanism into the tuning of BIRCH clustering parameters in order to obtain an optimized validity function. We model user interest with the above techniques and develop user clustering and friend recommendation applications; manual checking indicates that the model is a promising way to describe the different interest areas of different users.
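
The pipeline described in the abstract (vector space model, then LSI, then BIRCH, with parameters chosen against a validity function) can be sketched as follows. This is a minimal illustration and not the thesis implementation. It assumes scikit-learn, TF-IDF weighting for the vector space model, and the silhouette coefficient as a stand-in validity function. A plain sweep over candidate thresholds replaces the artificial immune network tuning, which the abstract does not specify in detail. The names `lsi_birch`, `sweep_threshold`, and `DOCS` are hypothetical.

```python
from sklearn.cluster import Birch
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

# Hypothetical toy corpus standing in for preprocessed web documents.
DOCS = [
    "football match goal league team",
    "team wins league football final",
    "goal scored in the football match",
    "python code compiler software bug",
    "software bug fixed in python code",
    "compiler error in software code",
]


def lsi_birch(docs, k, threshold, branching_factor=50):
    """Cluster documents: TF-IDF vectors -> LSI (truncated SVD) -> BIRCH."""
    # Vector space model: sparse, high-dimensional term weights.
    tfidf = TfidfVectorizer().fit_transform(docs)
    # LSI: keep the k strongest latent dimensions.
    reduced = TruncatedSVD(n_components=k, random_state=0).fit_transform(tfidf)
    # Unit-length rows, so Euclidean distance tracks cosine similarity.
    reduced = normalize(reduced)
    # n_clusters=None keeps the raw CF-tree subclusters as the final clusters.
    model = Birch(threshold=threshold, branching_factor=branching_factor,
                  n_clusters=None)
    labels = model.fit_predict(reduced)
    return reduced, labels


def sweep_threshold(docs, k, thresholds):
    """Return (score, threshold, labels) with the best silhouette, or None."""
    best = None
    for t in thresholds:
        reduced, labels = lsi_birch(docs, k, t)
        n_clusters = len(set(labels))
        # The silhouette coefficient needs 2 <= clusters <= samples - 1.
        if n_clusters < 2 or n_clusters >= len(docs):
            continue
        score = silhouette_score(reduced, labels)
        if best is None or score > best[0]:
            best = (score, t, labels)
    return best


if __name__ == "__main__":
    print(sweep_threshold(DOCS, k=2, thresholds=[0.3, 0.6, 0.9]))
```

In this sketch `k` and `threshold` play the roles of the LSI k value and the BIRCH parameter mentioned in the abstract. An adaptive search such as the immune network approach would replace the fixed candidate list in `sweep_threshold`.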
