

The Research of Web Authority Nodes Mining to Restrain the Malicious Web Page

【Author】 Luo Jiangfeng (罗江锋)

【Advisor】 Zhang Weiming (张维明)

【Author Information】 National University of Defense Technology (国防科学技术大学), Management Science and Engineering, 2008, Master's thesis

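【Abstract (from the Chinese)】 With the explosive growth of network resources, the sheer volume and complexity of web resources make them increasingly difficult to manage. Today the web is flooded with malicious pages carrying illegal advertisements, viruses, and Trojan horses. These pages exploit the limitations of search engines, use cheating techniques, and often occupy high positions in search results. Malicious pages are currently handled mainly by using virus-detection software to stop malicious code in a page from running, or by showing a security warning when a user reaches a malicious page through a search engine. These methods rely entirely on anti-virus software or page-filtering technology and have certain limitations. Existing link-analysis methods for suppressing malicious pages merely delete all links of a malicious page when blocking it, or identify, trace, and filter its links; they do not make full use of malicious-page information in the search engine's page-ranking algorithm. This thesis studies the mining of authoritative web page nodes when malicious pages are present. It first introduces the general theory of web mining and the state of research on algorithms for mining authoritative page nodes. To address the shortcomings of existing algorithms, and under reasonable assumptions, it not only filters out the malicious nodes that have been discovered but also applies prior information about those nodes in the ranking algorithm for authoritative web nodes. On the modeling side, the influence of malicious pages is taken fully into account, yielding a new random-browsing model of web resources that considers malicious nodes. Through this abstraction, mining authoritative web nodes becomes the problem of solving for the stationary distribution over the state space of a Markov chain, which lays the theoretical foundation for the algorithm. On the algorithmic side, the thesis proposes a page-ranking algorithm that introduces negative weights to penalize links pointing to malicious nodes; the penalty discourages ordinary pages from linking to malicious ones and thereby suppresses malicious pages. Theoretical analysis and experiments both show that linking to malicious pages is penalized: the more tightly a page is linked to malicious pages and the more malicious pages it links to, the more its authority value drops, while pages that do not link to malicious pages gain some authority. This reward-and-penalty mechanism effectively discourages ordinary pages from linking to malicious pages and suppresses malicious pages from the standpoint of link analysis. The thesis also improves and generalizes the algorithm and clarifies its scope of application. The simulation experiments discuss in detail the graph generation model and the process of generating the simulated data, which increases the credibility of the experimental data and results.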

【Abstract】 Because web resources are vast and structurally complex, managing web pages in an orderly way is difficult. A growing number of malicious web pages are mixed in with these resources. Worse, such pages exploit the limitations of search engines and, through illegitimate means, are often returned as authoritative resource nodes. Many anti-virus tools try to restrain malicious pages by blocking the malicious code hidden in them, or by showing a safety warning when the user is about to open one. These methods depend entirely on anti-virus software or content-identification techniques, which limits their effectiveness. Other methods approach the problem through link analysis: once malicious content has been identified, the malicious pages and their links are simply filtered out. Such methods do not distinguish links to malicious pages from other links when ranking pages. This paper studies link-based mining of authoritative web pages in the presence of malicious pages. After introducing the general theory of web mining and the current state of research, and under some reasonable assumptions, it examines how malicious web pages affect users' surfing behavior and presents a new surfing model. The new model takes prior information about malicious pages into account and, more importantly, converts the problem of mining authoritative web pages into that of solving for the steady-state distribution of a Markov chain. Under the new surfing model, we put forward a new page-ranking algorithm that uses negative link weights as a penalty: pages that link to malicious pages are punished, which restrains such links. Subsidiary nodes are introduced to ensure the correctness and effectiveness of the algorithm under different conditions. Both theoretical analysis and simulation results show that the authority values of nodes linking to malicious nodes are reduced, and that the more such links a node has and the larger their weights, the greater the reduction. The authority of pages without links to malicious pages, by contrast, increases. The algorithm thus effectively restrains links to malicious nodes from the perspective of link analysis. All simulation data are generated by a web graph model, which supports the credibility of the results. Finally, the paper gives some improved versions of the algorithm and states the conditions under which the algorithm can be used.
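
The abstract does not give the thesis's formulas, so the following is only a minimal Python sketch of the general idea, not the author's algorithm. It assumes a standard damped random-surfer chain solved by power iteration, and it replaces the negative-weight and subsidiary-node construction with a simpler stand-in: each page's stationary score is scaled down by the share of its out-links that point to known malicious pages, and the scores are then renormalized. The names `stationary_rank` and `penalized_rank` and the parameters `damping` and `penalty` are hypothetical.

```python
def stationary_rank(links, damping=0.85, iters=100):
    """Power iteration for the stationary distribution of a damped
    random-surfer chain. `links` maps every page to its list of out-links;
    every link target must itself be a key of `links`."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for t in links[v]:
                new[t] += damping * rank[v] / len(links[v])
        rank = new
    return rank


def penalized_rank(links, malicious, damping=0.85, penalty=0.5, iters=100):
    """Assumed simplification: filter out known malicious pages, rank the
    rest, then penalize pages in proportion to their malicious out-links."""
    good = [v for v in links if v not in malicious]
    # Filter step: drop malicious pages and every link that touches them.
    filtered = {v: [t for t in links[v] if t not in malicious] for v in good}
    base = stationary_rank(filtered, damping, iters)
    scored = {}
    for v in good:
        out = links[v]
        bad_share = sum(t in malicious for t in out) / len(out) if out else 0.0
        # Penalty step: more links to malicious pages means a larger cut.
        scored[v] = base[v] * (1.0 - penalty * bad_share)
    total = sum(scored.values())
    # Renormalizing hands the removed mass to the unpenalized pages.
    return {v: s / total for v, s in scored.items()}


# Toy graph: "a" and "b" are symmetric except that "a" also links to "m".
links = {"a": ["c", "m"], "b": ["c"], "c": ["a", "b"], "m": ["a"]}
scores = penalized_rank(links, malicious={"m"})
baseline = stationary_rank({"a": ["c"], "b": ["c"], "c": ["a", "b"]})
```

In this toy graph, `a` ends up ranked below `b`, and `b` scores higher than it does under plain filtering, which mirrors the reward-and-penalty behavior the abstract describes. Because the penalty here is applied after the stationary distribution is computed, it does not change the Markov chain itself, unlike the model in the thesis.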


