
文本内容安全监管体系模型研究

Research on a Supervision System Model for Text Content Security

【Author】 刘慧民

【Supervisors】 刘功申; 顾梓芳

【Author Information】 Shanghai Jiao Tong University, Electronics and Communication Engineering, 2007, Master's degree

【Abstract】 The openness and ever-growing scale of the Internet give people a convenient means of exchanging information freely. At the same time, this vast open information source lets malicious and harmful content (reactionary, pornographic, and similar material) slip in, becoming a serious obstacle to users seeking useful information. To protect national security and stability, to shield network users from harmful information, and to control access to it, forceful supervisory measures are needed, as are techniques and services that allow organizations running Web services to supervise access to such content. Developing advanced text content security supervision technology is therefore an urgent and important topic. Drawing on machine learning, pattern recognition, data mining, knowledge discovery, natural language understanding, Chinese information processing, rough set theory, and artificial intelligence, this thesis analyzes the characteristics of various kinds of harmful information, surveys current progress in text information processing, examines text feature selection methods and related text processing algorithms, and studies models and key algorithms suited to filtering harmful text.

First, the current state of harmful-text filtering in China and related filtering systems are discussed. The strengths and weaknesses of content selection based on PICS (Platform for Internet Content Selection), URL-based filtering, and keyword-based filtering are analyzed; intelligent content filtering is identified as the necessary technique for deep text analysis, and the application domains of text security filtering are outlined. Preprocessing techniques for large sample sets, extraction of the main text of Web pages, and fast word-frequency counting algorithms are studied.

Second, text representation and feature selection techniques are studied, including filter-based feature selection, wrapper-based feature selection, rough-set-based feature selection, and weight computation and normalization. The strengths and weaknesses of each feature selection technique are noted, and text representation techniques are evaluated experimentally. Different filters require text representations suited to them, and correct normalization yields better results. Real sample sets are generally imbalanced, and different filters differ considerably in performance on imbalanced sets. Experiments show that the centroid vector method and support vector machines, which represent text with the vector space model, improve greatly after correct normalization. Naive Bayes, which represents text with a probability model, performs comparably to the vector-space and SVM methods on a standard (balanced) sample set; on a real (imbalanced) set its training accuracy is somewhat lower, and on unseen reactionary samples its accuracy is very poor, whereas the vector-space and SVM methods remain good. Analysis suggests two causes: reactionary samples from different reactionary sites differ in linguistic style, and the reactionary feature space is large, so a probability-based statistical method cannot capture the full feature-space distribution. The centroid method and SVM perform well on both balanced and imbalanced sample sets.

Third, the basic concepts of rough sets are discussed and their theoretical essence identified. Rough-set attribute reduction algorithms are studied; reduction based on the discernibility matrix is compared with reduction based on attribute significance, and the discernibility-matrix approach is shown to be inadequate for text. A hybrid attribute reduction algorithm is proposed, and experiments show it is very effective on text: a conventional reduction step first lowers the dimensionality of the text, and rough-set reduction then removes many redundant and noisy attributes.

Fourth, a new filtering method for topic-specific text is proposed that combines rough sets with a statistical filter. Based on attribute significance, a new rough-set attribute reduction algorithm performs forward selection over text attributes and produces several reducts that share no attributes; experiments show it has stronger classification ability on text data. The process has two stages: rough set theory serves as a front-end preprocessor that reduces the attributes of the labeled data, lowering the dimensionality with essentially no loss of useful information; a statistical method then classifies and filters the reduced data, greatly cutting computation and speeding up classification. Experiments comparing the unreduced text attribute set with the quickly reduced one show that with the number of reducts m set to 3, far fewer attributes are selected, while the vector-space and SVM methods reach the pre-reduction accuracy on both training and test sets.

Finally, the harmful-text filtering module of a content security gateway is developed and an effective harmful-text filtering architecture designed. An efficient harmful-text filtering engine based on multi-pattern matching is designed and applied in the security gateway and in an e-mail filtering system.
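The abstract reports that correct weight normalization markedly improves the vector-space filters. The thesis's exact weighting scheme is not specified in the abstract; as a minimal sketch under that caveat, a common choice is cosine-normalized TF-IDF:

```python
import math

def tf_idf_vectors(docs):
    """Cosine-normalized TF-IDF vectors for tokenized documents.

    A minimal sketch only: the thesis's actual weighting and
    normalization scheme is not given in the abstract.
    """
    n = len(docs)
    df = {}                          # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # log-scaled term frequency times inverse document frequency
        vec = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        # L2 (cosine) normalization so documents of different lengths compare fairly
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors
```

After normalization every non-zero vector has unit length, so the dot product of two document vectors is directly their cosine similarity.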

【Abstract】 With the flood of information on the Internet, harmful content such as reactionary Web pages, pornographic Web pages, and hate messages has appeared. These pages spread malicious commentary about the nation that is inconsistent with the facts and exert a bad influence on the public, especially on young people at a formative stage of learning; pornographic pages likewise harm the young. Blocking such pages is necessary, and it is important to research the models and techniques that Internet service organizations need to block these harmful Web documents.

Content security now confronts researchers. Drawing on techniques from machine learning, pattern recognition, data mining, natural language understanding, Chinese information processing, rough set theory, and artificial intelligence, this thesis proposes efficient models and techniques for blocking harmful Web pages. The author's main work is as follows.

First, existing text filtering techniques and systems in China are analyzed. Four common approaches are reviewed: the Platform for Internet Content Selection (PICS), URL blocking, keyword filtering, and intelligent content analysis; in-depth intelligent content analysis proves necessary for filtering harmful pages. As the first step of Web page processing, an algorithm for extracting the main text of a Web page is put forward, along with a fast word-frequency counting algorithm.

Second, text representation techniques, term weighting and weight normalization, and common feature selection techniques are studied. Different statistical methods need different text representations: the centroid vector method (VSM) and support vector machines (SVM) represent text in a vector space model, while the naïve Bayes model represents text through word probabilities.
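The centroid vector method mentioned above can be summarized briefly: average each class's training vectors into a centroid, then assign a new document to the class whose centroid is most cosine-similar. This is a generic sketch of the technique, with vectors as {term: weight} dicts and names chosen for illustration, not taken from the thesis:

```python
import math

def centroid_classify(train_vecs, labels, query_vec):
    """Assign query_vec to the class whose centroid is most cosine-similar.

    Generic centroid-method sketch; not the thesis's exact implementation.
    """
    centroids, counts = {}, {}
    for vec, lab in zip(train_vecs, labels):
        c = centroids.setdefault(lab, {})
        counts[lab] = counts.get(lab, 0) + 1
        for t, w in vec.items():
            c[t] = c.get(t, 0.0) + w
    for lab, c in centroids.items():
        for t in c:                      # average the summed class vectors
            c[t] /= counts[lab]

    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return max(centroids, key=lambda lab: cos(query_vec, centroids[lab]))
```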
The experimental results suggest that all three methods perform well on a balanced data set, but on an imbalanced data set the naïve Bayes model does worse than VSM and SVM, especially on unseen imbalanced test data. Since real data sets are usually imbalanced, this result matters in practice. The normalization technique is also very effective in improving precision, especially for imbalanced data.

Third, the concepts of rough set theory are discussed and its essence summarized. Attribute reduction based on the discernibility function is compared experimentally with reduction based on attribute dependency; the discernibility-function algorithm is harder to run because of its memory and time requirements. A hybrid method is proposed that first selects features with a conventional feature selection method and then refines the selection with rough-set attribute reduction. Many noisy and redundant attributes are thereby removed, leaving fewer, more accurate features. A naïve Bayes model is used to evaluate the selected features; the results show high precision and recall, so the method is both effective and efficient.

Fourth, feature selection is a crucial step in text preprocessing: a well-chosen feature subset can match the performance of the full feature set while reducing learning time. In the filter approach, feature subset selection is performed as a preprocessing step before the induction algorithm, but this approach handles feature redundancy poorly.
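The attribute-dependency reduction favored in these experiments builds on the classical rough-set dependency degree γ(C → D): the fraction of objects whose equivalence class under the condition attributes C is pure in the decision attribute D. A greedy forward selection on this measure, in the spirit of the well-known QuickReduct scheme, can be sketched as follows (illustrative names; the thesis's exact reduction algorithm may differ):

```python
from collections import defaultdict

def dependency(table, cond, decision):
    """Rough-set dependency degree gamma(cond -> decision): the fraction
    of rows whose equivalence class under the condition attributes is
    pure (contains exactly one decision value)."""
    classes = defaultdict(set)
    for row in table:
        classes[tuple(row[a] for a in cond)].add(row[decision])
    return sum(len(classes[tuple(row[a] for a in cond)]) == 1
               for row in table) / len(table)

def quick_reduct(table, attrs, decision):
    """Greedy forward selection on attribute dependency (QuickReduct-style
    sketch): add the attribute that raises gamma most until the reduct
    reaches the dependency of the full attribute set."""
    reduct = []
    target = dependency(table, attrs, decision)
    while dependency(table, reduct, decision) < target:
        best = max((a for a in attrs if a not in reduct),
                   key=lambda a: dependency(table, reduct + [a], decision))
        reduct.append(best)
    return reduct
```

Rows are {attribute: value} dicts; the loop stops as soon as the selected subset discriminates the decision as well as all attributes together, which is what makes the result a (super-)reduct.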
In the wrapper approach, feature subset selection is "wrapped around" an induction algorithm, so its running time makes it infeasible in practice, especially for text data. Based on rough set theory, a new feature selection method is proposed. It generates several reducts, with the distinctive property that the reducts share no common attributes, which gives the selected attributes stronger ability to classify new objects, especially on real application data. Two data sets are used for evaluation: a benchmark set from the UCI machine learning archive and a set captured from the Web. Classifying with statistical methods, a single reduct gives good precision on the benchmark set, while three reducts are needed on the real set, which is the one used in the system for topic-specific text filtering. The method is therefore effective in practice. In addition, with the same selected features, the VSM and SVM methods perform better while naïve Bayes performs poorly.

Finally, an efficient topic-specific Web text filtering framework is proposed, aimed at blocking topic-specific Web text content. It uses a hybrid feature selection method and a highly efficient filtering engine. In training, features are selected with the CHI statistic and rough set theory, and a filter is then built on the vector space model. Trained on large data sets, the framework proves effective for topic-specific text filtering. It runs on a server such as a gateway and is more efficient than a client-based system. A prefix e-mail filtering system is also proposed. This filtering system is separate from the original Web mail server; it dynamically controls the mailing frequency of each SMTP client and checks the content of received e-mail in standard Chinese character encodings with a DFSA algorithm. To maintain filtering accuracy, legitimate e-mails blocked in error are sent back to the mail server. In the end, a text-filtering platform is designed for the 863 Plan on information security, for training and exercising personnel in the text content security domain.
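The DFSA-style multi-pattern content check described above can be realized with an Aho-Corasick automaton, which finds every keyword occurrence in a single left-to-right pass over the text. The sketch below is one generic realization of that idea; the thesis's engine details are not given in the abstract:

```python
from collections import deque

def build_automaton(patterns):
    """Build a minimal Aho-Corasick automaton over the given keywords."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                      # 1) build the keyword trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())           # 2) BFS to fill failure links
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]            # inherit matches from suffix state
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]               # fall back on mismatch
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Because the automaton advances exactly one character per input symbol, matching time is linear in the text length regardless of how many keywords are loaded, which is what makes this class of engine suitable for gateway-speed filtering.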

  • 【CLC Number】TP393.08
  • 【Cited By】2
  • 【Downloads】214