Research on a Network Information Filtering System Based on Genetic Algorithm and Fuzzy Clustering
【Author】 Lu Hongju (陆宏菊);
【Advisor】 Liu Peiyu (刘培玉);
【Author information】 Shandong Normal University, Computer Software and Theory, 2008, Master's thesis
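【Abstract (translated from the Chinese)】 With the development and spread of the Internet, more and more commercial and everyday activities are carried out online, and the network is becoming ever more closely tied to people's lives. The network is double-edged, however: while enjoying its convenience, people are inevitably exposed to large amounts of harmful information. In addition, because the Internet is inherently open, dynamic, and heterogeneous, users find it hard to obtain the information they need accurately and quickly. Harmful and useless information that does not match the user's needs must therefore be filtered out of the vast body of dynamic information, so that irrelevant information is kept to a minimum. Network information filtering has consequently become an active research topic.

Obtaining the user's interest profile and classifying the documents to be filtered against that profile are the key techniques of network information filtering. They are commonly implemented with methods from text classification, such as Rocchio, k-nearest neighbor, Bayes, support vector machines, and the genetic algorithm (GA). In network information filtering, GA is applied mainly to obtain the user's interest profile, and its effectiveness depends on the fitness function. Current fitness functions mostly evaluate the population by computing the similarity of individuals. Such evaluation concentrates on how similar the individuals are; it does not evaluate the class attributes of individuals, nor does it consider the typicality of features or the class information they carry, so the resulting user model does not filter very well.

After Zadeh proposed fuzzy set theory in 1965, fuzzy methods began to be applied to clustering problems, an approach known as fuzzy cluster analysis. Fuzzy clustering yields the degree of uncertainty with which a sample belongs to each class and so expresses the intermediate nature of class membership; that is, it describes a sample's relation to the classes in terms of uncertainty and reflects the real world more objectively. Introducing fuzzy clustering into the evaluation step of GA-based information filtering therefore takes fuller account of the non-absolute class membership of each feature, the typicality of features, and the class information they carry. Individuals are evaluated on their class attributes, which yields a more accurate user interest profile.

This thesis introduces the idea of fuzzy clustering into the genetic algorithm and evaluates the individuals of a GA-based information filtering system from the viewpoint of fuzzy clustering. It proposes a genetic algorithm based on fuzzy clustering, applies the algorithm to information filtering, and implements an information filtering system based on the genetic algorithm and fuzzy clustering. The effectiveness of the algorithm is then verified in that system. The specific work is as follows:

1. Fuzzy clustering is integrated into the genetic algorithm to evaluate individuals. Before the fitness is computed, the training texts are represented as vectors using the feature subset selected by the individual. The vectors are then clustered with the direct clustering method based on the fuzzy similarity matrix, and the fitness is computed from the quality of the clustering. This evaluates an individual by its ability to determine the class of a text and gives more weight to the typicality of features and the class information they carry.

2. The algorithm is made more robust to interference. The fitness function combines two assessments of the fuzzy clustering result: its accuracy and its compactness. The function includes a parameter w; adjusting w lowers the sensitivity of the fitness function to noisy texts in the training set and thus improves robustness.

3. A network information filtering system based on the genetic algorithm and fuzzy clustering is implemented. The proposed algorithm learns from the training texts, evaluates the individuals of the population, and obtains the user's interest profile after a set number of generations. An improved Sim function then compares the documents to be filtered against the profile and classifies them, which completes the filtering. The effectiveness of the method is verified with this system.

By evaluating the individuals of the population from the viewpoint of fuzzy clustering, the thesis proposes a genetic algorithm based on fuzzy clustering. Experiments show that the algorithm improves markedly in both precision and the F1 measure.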
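The evaluation described in items 1 and 2 can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything concrete here is an assumption: cosine similarity stands in for the fuzzy similarity matrix, clusters are the connected components at a cut level `lam`, and the fitness is taken to be `w * accuracy + (1 - w) * compactness`. The names `cosine`, `direct_clusters`, `fitness`, and `lam` are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def direct_clusters(sim, lam):
    """Link documents i and j whenever sim[i][j] >= lam and return the
    connected components as lists of document indices."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= lam:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def fitness(individual, docs, labels, lam=0.5, w=0.7):
    """Assumed form: w * clustering accuracy + (1 - w) * compactness."""
    # The individual is a 0/1 mask selecting a feature subset.
    keep = [k for k, bit in enumerate(individual) if bit]
    vecs = [[d[k] for k in keep] for d in docs]
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    clusters = direct_clusters(sim, lam)

    # Accuracy: share of documents carrying their cluster's majority label.
    correct = 0
    for c in clusters:
        counts = {}
        for i in c:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        correct += max(counts.values())
    accuracy = correct / n

    # Compactness: mean similarity over same-cluster document pairs.
    pairs = [sim[i][j] for c in clusters
             for a, i in enumerate(c) for j in c[a + 1:]]
    compactness = sum(pairs) / len(pairs) if pairs else 0.0
    return w * accuracy + (1 - w) * compactness

# Toy data: features 0 and 1 separate the classes, features 2 and 3 do not.
docs = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
labels = [1, 1, 0, 0]
good = fitness([1, 1, 0, 0], docs, labels)
bad = fitness([0, 0, 1, 1], docs, labels)
```

On this toy data the individual that selects the discriminative features scores higher than the one that selects the uninformative features, which is the behavior the abstract attributes to the clustering-based evaluation. In this assumed form, lowering `w` shifts weight from label agreement toward compactness.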
【Abstract】 With the development of the Internet, more and more commercial and daily activities are carried out online, and the network has become closely tied to people's daily lives. Every coin has two sides, however: while users enjoy the convenience of the Internet, they are also exposed to harmful information. In addition, because the Internet is open, dynamic, and heterogeneous, it is hard to find the information one needs. A method is therefore required to reduce irrelevant information according to the user's information needs, and information filtering has become an active research field.

The key technique of network information filtering is to obtain a user profile that expresses the user's interests and to use it to classify documents from the Internet. Text classification techniques are often used for this, such as Rocchio, k-nearest neighbor, naïve Bayes, support vector machines, and the genetic algorithm (GA). In information filtering, GA is applied to obtain the user profile, and its effectiveness is determined by the fitness function. At present, fitness functions are usually based on computing the similarity of GA individuals. This kind of evaluation pays more attention to the similarity of individuals than to the class attributes of individuals and features or to the typicality of features, so the resulting user profiles do not perform well.

After Zadeh put forward fuzzy set theory in 1965, fuzzy methods began to be applied to clustering problems. Because fuzzy clustering (FC) yields the degree of uncertainty with which a sample belongs to each class, and so expresses the intermediate nature of class membership, it reflects the real world better. Using fuzzy clustering to evaluate individuals in a GA-based network information filtering system therefore takes fuller account of the non-absolute class membership of each feature, the typicality of features, and the class information they carry. At the same time it evaluates the class attributes of individuals to some degree, which yields a more accurate user profile.

This thesis uses fuzzy clustering to evaluate GA individuals in information filtering, proposes a genetic training algorithm based on FC, applies the algorithm in a network information filtering system based on GA and FC, and uses that system to verify the validity of the FC-based GA. The main work is as follows:

1. GA is combined with FC to evaluate GA individuals. Before the fitness is computed, the training set is expressed as vectors according to the individual. The vectors are then clustered with the direct clustering method based on the fuzzy similarity matrix, and the fitness is computed by evaluating the clustering result. This method evaluates an individual by its ability to judge the class of a text and pays more attention to the typicality of features and their class attributes.

2. The robustness of the training algorithm to interference is improved. The fitness function combines the correctness and the compactness of the fuzzy clustering result. It includes a parameter w that can lower the sensitivity to outliers in the training text set, which improves the algorithm's resistance to interference.

3. A network information filtering system based on GA and FC is implemented. The system trains with a simulated annealing genetic algorithm, evaluates individuals by fuzzy clustering, obtains the user profile after a certain number of generations of iterative training, and classifies information against the profile using the improved Sim function, which completes the filtering process. The experimental results verify the validity of the FC-based GA.

By evaluating GA individuals with fuzzy clustering, this thesis presents a genetic algorithm based on fuzzy clustering. Experiments show that it has a clear advantage in precision and in the F1 measure.
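The filtering step and the reported metrics can be sketched in the same spirit. The abstract does not define the improved Sim function, so plain cosine similarity is used here as a stand-in, and the acceptance `threshold` is likewise an assumption. The names `cosine_sim`, `filter_docs`, and `precision_recall_f1` are hypothetical; only the precision, recall, and F1 definitions are standard.

```python
from math import sqrt

def cosine_sim(profile, doc):
    """Stand-in for the thesis's improved Sim function (not given in the
    abstract): plain cosine similarity between profile and document."""
    dot = sum(p * d for p, d in zip(profile, doc))
    norm = sqrt(sum(p * p for p in profile)) * sqrt(sum(d * d for d in doc))
    return dot / norm if norm else 0.0

def filter_docs(profile, docs, threshold=0.5):
    """Return True for each document accepted as matching the profile."""
    return [cosine_sim(profile, d) >= threshold for d in docs]

def precision_recall_f1(predicted, actual):
    """Standard precision, recall, and F1 over boolean decisions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy profile and documents over two features.
profile = [1.0, 0.0]
docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
accepted = filter_docs(profile, docs, threshold=0.7)
scores = precision_recall_f1(accepted, [True, False, False, True])
```

The threshold trades precision against recall: raising it accepts fewer documents, which tends to raise precision and lower recall, and F1 summarizes the balance between the two.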
【Key words】 Information Filtering; Genetic Algorithm; Fuzzy Clustering; Fitness Function; Similarity;
- 【Online publication contributor】 Shandong Normal University 【Online publication issue】 2008, No. 08
- 【Classification No.】 TP18; TP393.09
- 【Times cited】 4
- 【Downloads】 204