基于遗传禁忌算法的网络信息过滤模型研究

Research on Network Information Filtering Model Based on Genetic Taboo Algorithm

【Author】 Jiang Peipei (姜沛佩)

【Advisor】 Liu Peiyu (刘培玉)

【Author Information】 Shandong Normal University, Computer Software and Theory, 2011, Master's thesis

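【Abstract (translated from the Chinese)】 With the development and application of the Internet, online information has grown rapidly and is rich in content and varied in kind. The network is a double-edged sword, however: while it brings convenience, it also inevitably exposes people to large amounts of undesirable information. Moreover, because the network is inherently open, dynamic, and heterogeneous, users find it hard to obtain the information they need accurately and quickly, so automatically extracting information that meets a user's personalized needs from a dynamic information stream has become extremely important. Network information filtering technology emerged to address these problems. Information filtering extracts information according to user needs and blocks undesirable information; it mainly studies the acquisition and representation of network information, the construction of user templates, and the classification of documents to be processed. This thesis covers every stage of network information filtering and, taking the precision and recall of the filtering model as its two technical indicators, does the following work:

1. An in-depth study of network information filtering models and their key technologies. The thesis discusses typical information filtering models and their related algorithms, focusing on the key technologies involved in network information filtering: network data acquisition, word segmentation, feature selection algorithms, weight calculation, text representation models, and classification algorithms.

2. A network information filtering model based on the genetic taboo algorithm. The thesis examines the basic principles and applications of the genetic algorithm. Building on an analysis of its strengths, and to address its weaknesses of poor "hill-climbing" ability and "premature" convergence, the thesis introduces the taboo search algorithm, which has stronger hill-climbing ability, to improve the crossover operator. The resulting taboo crossover operator improves the search ability of the traditional genetic algorithm. In the classification stage of the filtering model, the traditional naive Bayes classifier used in the model cannot handle single-category words, so the thesis improves it to make it more robust and adaptable.

3. A text summarization method that applies word combination to sentence extraction. A text usually contains many sentences, but some of them do not express its topic, and these redundant sentences lower the quality of the user template produced by genetic training. Text summarization, as an information compression tool, compresses the text, removes redundant sentences, and extracts the most concise content. To further improve template quality, the thesis introduces text summarization to optimize the corpus. Because the lexical analysis system used during extraction segments words with low accuracy, semantic relations between feature terms are lost. The thesis therefore proposes formulating correction rules based on part of speech and using them to normalize the segmented sentences, so that semantically related words in a sentence are linked. The improved method extracts more concise and more accurate content.

4. Design and implementation of a network information filtering model based on the genetic taboo algorithm. The system first preprocesses the training corpus with the improved text summarization method, then trains on the texts with the genetic taboo algorithm to form the optimal user template, and finally classifies the texts under test with the improved classification algorithm. The result is a multi-level, multi-strategy, modular network information filtering system based on the genetic taboo algorithm. Testing shows that the system runs reliably, stably, and efficiently and filters network information effectively.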
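The taboo crossover idea in item 2 can be illustrated with a minimal Python sketch. The abstract does not give the operator's details, so everything here is an assumption made for illustration: chromosomes are bit tuples (for example, which feature terms enter the user template), the hypothetical `tabu_crossover` tries several one-point crossovers and rejects children found in a fixed-length taboo list unless they beat the best fitness seen so far, and `evolve` wraps this in a plain genetic loop with tournament selection, elitism, and bit-flip mutation.

```python
import random
from collections import deque


def one_point_crossover(a, b, rng):
    """Classic one-point crossover returning a single child."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def tabu_crossover(a, b, fitness, tabu, best_fitness, rng, trials=5):
    """Return the best admissible child among several crossover trials.

    A child on the taboo list is admissible only if it beats best_fitness
    (an aspiration rule). If every trial is rejected, the fitter parent
    is returned unchanged.
    """
    candidates = []
    for _ in range(trials):
        child = one_point_crossover(a, b, rng)
        score = fitness(child)
        if child not in tabu or score > best_fitness:
            candidates.append((score, child))
    if not candidates:
        return max(a, b, key=fitness)
    _, child = max(candidates, key=lambda pair: pair[0])
    tabu.append(child)  # deque(maxlen=...) discards the oldest entry itself
    return child


def evolve(fitness, length=12, pop_size=20, generations=40, seed=0):
    """Toy genetic loop that uses the taboo crossover above (illustrative only)."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(length)) for _ in range(pop_size)]
    tabu = deque(maxlen=15)
    best = max(pop, key=fitness)

    def pick():
        # Tournament selection: the fittest of three random individuals.
        return max(rng.sample(pop, 3), key=fitness)

    for _ in range(generations):
        nxt = [best]  # elitism: the current best always survives
        while len(nxt) < pop_size:
            child = tabu_crossover(pick(), pick(), fitness, tabu, fitness(best), rng)
            # Bit-flip mutation with a small per-gene probability.
            child = tuple(bit ^ (rng.random() < 0.02) for bit in child)
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best
```

Calling `evolve(sum)` runs the loop on a toy fitness that counts the set bits. In a real filter the fitness would instead score a candidate user template against training texts, which the abstract leaves unspecified.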

【Abstract】 With the development and application of the Internet, network information is growing rapidly and is rich in content and varied in form. However, every coin has two sides: while enjoying the convenience of the Internet, users are also inevitably exposed to undesirable information. In addition, because the Internet is open, dynamic, and heterogeneous, it is hard for users to find the information they need, so automatically extracting information that meets a user's personalized demands from a dynamic information flow has become more important than ever. Network information filtering technology emerged to solve these problems. A network information filter extracts the information the user needs and blocks undesirable information; research in this area focuses on the acquisition and representation of information, the construction of user templates, and text classification. This thesis covers each stage of network information filtering and, taking the precision and recall of the filtering model as its two main indicators, makes the following contributions:

1. This thesis studies in depth the filtering models used in network information filtering and their key technologies. It first discusses typical information filtering models and related algorithms. It then examines the key technologies used in network information filtering, such as network data acquisition, word segmentation, feature selection algorithms, feature weight calculation, text representation models, and classification algorithms.

2. This thesis proposes a network information filtering model based on the genetic taboo algorithm. It discusses the basic principles and applications of the genetic algorithm in depth. Building on an analysis of the algorithm's advantages, and to address its poor hill-climbing ability and premature convergence, the thesis introduces the taboo search algorithm, which has strong hill-climbing ability, into the crossover operator. The resulting taboo crossover operator improves the search capacity of the traditional genetic algorithm. In the classification stage of the filtering model, the traditional naive Bayes classifier used in the model cannot handle single-category words, so the thesis improves the classifier to give it better robustness and adaptability.

3. This thesis proposes a text summarization method that applies vocabulary combination to sentence extraction. A text contains many sentences, but some of them do not express its theme, and these redundant sentences degrade the quality of the user template. Text summarization, as an information compression tool, compresses text content, removes redundant sentences, and extracts the most refined content. To improve the quality of the template, the thesis introduces text summarization to optimize the corpus. During extraction, the lexical analysis system has low segmentation accuracy, which causes semantic loss between features. The thesis therefore formulates correction rules based on part of speech and applies them to the segmented sentences, so that semantically related words in the same sentence are linked. The proposed summarization method extracts more refined and more accurate content.

4. This thesis designs and implements a network information filtering model based on the genetic taboo algorithm. The system first applies the improved text summarization method to preprocess the training corpus, then uses the improved genetic algorithm to train on the texts and form the best user template, and finally classifies texts with the improved classification algorithm. The result is a multi-level, multi-strategy, modular network information filtering system based on the genetic taboo algorithm. Testing shows that the system runs reliably, stably, and efficiently and filters network information effectively.
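The single-category word problem named in item 2 can be made concrete with a small example: a word that occurs in only one class during training gets zero probability in every other class, which zeroes out the whole product in an unsmoothed naive Bayes classifier. The abstract does not say how the thesis modifies the classifier, so the Python sketch below uses standard additive (Laplace) smoothing as a stand-in, not the author's method. The class name `SmoothedNaiveBayes`, the labels `"block"` and `"pass"`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class SmoothedNaiveBayes:
    """Multinomial naive Bayes with additive smoothing (illustrative stand-in)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.doc_counts = Counter()              # documents seen per class
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_posterior(self, tokens, label):
        """Unnormalized log P(label | tokens)."""
        total_docs = sum(self.doc_counts.values())
        logp = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are skipped
                # A word absent from this class still gets alpha / denom > 0.
                logp += math.log((counts[tok] + self.alpha) / denom)
        return logp

    def predict(self, tokens):
        return max(self.doc_counts, key=lambda c: self.log_posterior(tokens, c))


# Toy corpus: "prize" occurs only in "block", "meeting" and "report" only in "pass".
docs = [["free", "prize", "click"], ["prize", "win"],
        ["meeting", "report"], ["report", "schedule", "meeting"]]
labels = ["block", "block", "pass", "pass"]
model = SmoothedNaiveBayes().fit(docs, labels)
mixed = ["meeting", "prize", "report"]
decision = model.predict(mixed)
```

Without smoothing, the mixed document would score zero under both classes because each class lacks at least one of its words. With smoothing, both scores stay finite and the class supported by more of the words wins.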
