
Research and Implementation of Semantic Filtering for the XML Engine Security Gateway

【Author】 吴红娟 (Wu Hongjuan)

【Advisor】 佘堃 (She Kun)

【Author Information】 University of Electronic Science and Technology of China (电子科技大学), Software Engineering, 2009, Master's thesis

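【Abstract (Chinese original, translated)】 Amid the vast and varied information on the Internet, harmful content reaches people in many forms and through many channels, and affects them in many ways. Effective filtering of such content is therefore essential to building a healthy and secure Internet environment. Traditional text filtering algorithms, however, can only judge text at the level of surface structure matching and cannot capture its semantics, so they fall short of today's demand for intelligent information processing. Drawing on computational linguistics, this thesis proposes and implements a filtering method based on semantic analysis. Long texts that slip past keyword matching can be identified and handled through semantic analysis, which effectively curbs the spread of large volumes of harmful and junk information. The main contributions are as follows. 1. To address problems found in existing automatic word segmentation methods, the thesis refines the concept of an intelligent dictionary with a self-learning mechanism and implements a basic model of it. The model learns new words during segmentation without manual intervention, which gives the system its intelligent behavior. The segmentation algorithm combines forward and backward maximum matching, which substantially improves accuracy; paired with a word frequency library, it also resolves segmentation ambiguities effectively, further safeguarding accuracy. 2. Building on a close study of feature extraction algorithms, the thesis adopts a TFIDF-based feature extraction algorithm and, on top of the stability of TFIDF, introduces part-of-speech coefficients to improve feature set selection. Using latent semantic tagging, features of different parts of speech are multiplied by different part-of-speech coefficients, highlighting how well features of each part of speech represent the document category; this reduces the workload of the text classifier and further improves processing speed and quality. 3. After studying several major classifier algorithms, and given the high performance and low complexity of the Bayes algorithm together with the project's actual conditions (large batches, high speed requirements, and few categories), the thesis proposes a classifier model based on the naive Bayes algorithm, which uses the part-of-speech coefficients of the feature values and statistical methods to train on and classify incoming texts. Experiments show that the classifier achieves high recall and precision, which effectively guarantees the filtering quality of the whole semantic filtering module. The results have been applied to the XML Engine security gateway, a project under the National Support Program and a Guangdong Province science and technology project. Adding the semantic filtering module to XML Engine enables intelligent filtering of large volumes of harmful information and further strengthens the security of XML Engine as a whole.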
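The segmentation strategy named in the first contribution, forward and backward maximum matching with a word frequency library to settle disagreements, might look roughly like the sketch below. This is only an illustration under stated assumptions: the function names, the toy `FREQ` table, and the tie-breaking rule (fewer words first, then higher total frequency, then the backward result) are hypothetical and are not taken from the thesis. The self-learning dictionary is not modeled here.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def segment(text, freq):
    """Combine both scans; freq maps each dictionary word to a frequency."""
    max_len = max(map(len, freq), default=1)
    fwd = forward_max_match(text, freq, max_len)
    bwd = backward_max_match(text, freq, max_len)
    if fwd == bwd:
        return fwd

    def score(words):
        # Assumed heuristic: fewer words, then higher total word frequency.
        return (-len(words), sum(freq.get(w, 0) for w in words))

    # On a full tie the backward result is kept (an assumption of this sketch).
    return fwd if score(fwd) > score(bwd) else bwd


# Hypothetical toy frequency table, for illustration only.
FREQ = {"研究": 50, "研究生": 10, "生命": 40, "命": 5, "起源": 20}

print(segment("研究生命起源", FREQ))  # ['研究', '生命', '起源']
```

In this example the forward scan yields 研究生 / 命 / 起源 and the backward scan yields 研究 / 生命 / 起源; both have three words, so the frequency totals decide in favor of the backward result.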

【Abstract】 Among the vast and varied information on the Internet, harmful content affects many people in different ways and through different channels. Necessary and effective filtering of network content is therefore an important part of building a healthy and safe network environment. However, traditional text filtering methods can only judge text at the level of structural matching, not by its semantics, and so struggle to meet the demand for intelligent processing. By drawing on computational linguistics, this thesis proposes and implements a filtering method based on semantic analysis. Long text messages that cannot be filtered out by keyword matching can be better identified and processed through semantic analysis, which effectively prevents large amounts of harmful junk information from spreading. The contributions of this thesis are as follows. First, to address the problems of existing word segmentation methods, the concept of an intelligent dictionary with a self-learning mechanism is improved, and a basic model of the intelligent dictionary is implemented. The model learns new words during segmentation without human intervention, which gives the system its intelligence. The word segmentation algorithm combines forward and backward maximum matching, which improves segmentation accuracy. Together with a word frequency library, the algorithm can also resolve segmentation ambiguities, which further ensures accuracy. Second, following an in-depth study of feature extraction algorithms, a TFIDF-based feature extraction algorithm is adopted, and a part-of-speech coefficient is introduced on top of the stability of TFIDF to improve feature set selection. The algorithm uses latent semantic labeling: features of different parts of speech are multiplied by different part-of-speech coefficients. This highlights the ability of features of each part of speech to represent the document category, which reduces the workload of the text classifier and improves the speed and quality of processing. Third, after a study of several major classifier algorithms, and given that the Bayes algorithm offers high performance at low complexity while the project involves large batches, high speed requirements, and few categories, a classifier model based on the naive Bayes algorithm is proposed. It uses the part-of-speech coefficients of the feature values together with statistical methods to train on and classify incoming texts. Experiments show that the classifier achieves high recall and precision, which effectively guarantees the filtering quality of the whole semantic filtering module. The results of this research have been applied to the XML Engine security gateway, a Guangdong Province science and technology project with national support. Adding the semantic filtering module to XML Engine enables intelligent filtering of large amounts of harmful information and further ensures the security of XML Engine.
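The second contribution, TFIDF scores multiplied by a part-of-speech coefficient, can be sketched as follows. The coefficient values in `POS_WEIGHT`, the tag names, and the helper `weighted_tfidf` are assumptions made for illustration; the abstract does not give the actual coefficients or the exact TFIDF variant. The sketch uses length-normalized term frequency and an unsmoothed `log(N / df)` inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical part-of-speech coefficients (noun, verb, adjective).
POS_WEIGHT = {"n": 1.0, "v": 0.8, "a": 0.5}
DEFAULT_WEIGHT = 0.2  # assumed weight for any other tag


def weighted_tfidf(docs):
    """docs: list of documents, each a list of (word, pos_tag) pairs.

    Returns one {word: score} dict per document, where
    score = tf * idf * part-of-speech coefficient.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for doc in docs:
        df.update({word for word, _ in doc})

    results = []
    for doc in docs:
        counts = Counter(word for word, _ in doc)
        tag_of = dict(doc)  # keeps the last tag seen for each word
        scores = {}
        for word, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[word])
            coeff = POS_WEIGHT.get(tag_of[word], DEFAULT_WEIGHT)
            scores[word] = tf * idf * coeff
        results.append(scores)
    return results


# Toy tagged documents, for illustration only.
DOCS = [
    [("gateway", "n"), ("filter", "v"), ("fast", "a")],
    [("gateway", "n"), ("spam", "n"), ("block", "v")],
]
SCORES = weighted_tfidf(DOCS)
top = sorted(SCORES[0], key=SCORES[0].get, reverse=True)
print(top)
```

Without the coefficient, "filter" and "fast" would receive the same TFIDF score in the first document; the part-of-speech weighting is what ranks the verb above the adjective. "gateway" appears in every document, so its inverse document frequency, and hence its score, is zero.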

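For the third contribution, a naive Bayes classifier over a small number of categories, a minimal multinomial sketch with Laplace smoothing is shown below. The class name, the labels, the training data, and in particular the way feature weights scale each word's log-likelihood are assumptions of this sketch; the abstract states only that the classifier uses the part-of-speech coefficients of the feature values, not how.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Multinomial naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def fit(self, docs, labels):
        """docs: list of word lists; labels: one label per document."""
        for words, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words, weights=None):
        """Return the most probable label for a list of words.

        weights: optional {word: coefficient}; each word's log-likelihood
        is scaled by it (an assumed way to fold in feature weights).
        """
        weights = weights or {}
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / total_docs)  # log prior
            for word in words:
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                likelihood = (counts[word] + self.alpha) / denom
                score += weights.get(word, 1.0) * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data with two hypothetical categories.
TRAIN_DOCS = [
    ["free", "prize", "click"],
    ["win", "prize", "now"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "schedule"],
]
TRAIN_LABELS = ["bad", "bad", "ok", "ok"]

clf = NaiveBayesFilter().fit(TRAIN_DOCS, TRAIN_LABELS)
print(clf.predict(["prize", "click"]))
print(clf.predict(["meeting", "schedule"]))
```

Training is a single counting pass and prediction is one sum of logarithms per category, which fits the large-batch, few-category setting the abstract describes.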