Combating Web Spam Based on Both Trust and Distrust Propagation (基于信任和非信任传播的搜索引擎反作弊研究)
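【Author】 Wang You (王友);
【Advisor】 Zhang Xianchao (张宪超);
【Author information】 Dalian University of Technology, Computer Application Technology, 2011, Master's thesis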
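【Abstract (translated from the Chinese original)】 With the rapid growth of the Internet, search engines have become the main way people find useful information online. The higher a Web site ranks in a search engine, the more user traffic it receives, and more traffic means more profit. This motivates some sites to manipulate search engine rankings by improper means; such manipulation is defined as search engine spamming (Web spam). Web spam not only wastes search engine resources but also degrades the user experience, so commercial search engines have to take effective measures to reduce its harm. Link-based anti-spam algorithms built on trust or distrust propagation are now widely used against Web spam. Compared with traditional anti-spam algorithms based on content or heuristic rules, they are more robust to spammers' attacks, can resist many types of spam, and perform better because they process only links. However, both trust propagation and distrust propagation algorithms have two major problems. First, trust (distrust) is propagated indiscriminately: authoritative pages and spam pages are treated alike during propagation. Second, although many researchers believe that using authoritative seeds and spam seeds together gives better results, no earlier work has proposed an effective way to do so. The TDR algorithm proposed in this thesis holds that a Web page has two sides, a valuable side and a spam side, and assigns each page two scores: T-Rank, representing its trustworthy side, and D-Rank, representing its untrustworthy (spam) side. Starting from authoritative seeds and spam seeds, TDR simultaneously propagates T-Rank along links and D-Rank along inverse links. During propagation, the propagation of a page's T-Rank (D-Rank) is weakened by that page's current D-Rank (T-Rank). In this way, both of the problems of trust and distrust propagation algorithms described above are resolved. Experimental results on the WEBSPAM-UK2007 and ClueWeb09 data sets show that TDR outperforms other traditional anti-spam algorithms under many criteria.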
【Abstract】 With the rapid development of the World Wide Web, search engines have become the dominant way for people to find useful information on the Web. A higher ranking in search results brings more traffic, and more traffic means more profit for the owners of Web sites. This drives some Web site owners to manipulate the ranking results of search engines through unethical methods. This kind of unethical manipulation is termed Web spamming. Web spam not only wastes the resources of search engines but also degrades the user experience. Commercial search engines have to take measures to reduce the negative effect of spam. Recently, anti-spam algorithms based on trust or distrust propagation have been widely used to combat Web spam. Compared with algorithms based on content or heuristic rules, they are more robust to spammers' attacks and more efficient to compute, because they deal only with page links. However, existing trust or distrust propagation algorithms all have two serious issues. On the one hand, trust/distrust is propagated in a non-differential way; that is, authorities and spam pages are treated alike in the propagation process. On the other hand, it has been noted that a combined use of good and bad seeds can lead to better results, but little work is known to have realized this insight successfully. The TDR algorithm proposed in this thesis takes the view that each Web page has both a trustworthy side and an untrustworthy side, and assigns two scores to each Web page: T-Rank, scoring its trustworthiness, and D-Rank, scoring its untrustworthiness. Starting from good and bad seeds, TDR simultaneously propagates T-Rank through links and D-Rank through inverse links. In the propagation process, the propagation of T-Rank/D-Rank is penalized by the target's current D-Rank/T-Rank. In this way, both trust and distrust are propagated with target differentiation, and the two problems mentioned above are solved. Experimental results on the WEBSPAM-UK2007 and ClueWeb09 datasets show that TDR outperforms other typical anti-spam algorithms under various criteria.
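The propagation scheme described in the abstract can be illustrated with a small sketch. This is not the thesis's exact formulation: the abstract does not give the update equations, so the penalty factor used here (a page accepts incoming trust in proportion to its own current share of trust, and incoming distrust in proportion to its share of distrust) is an assumption made for illustration. The function name `tdr_sketch` and the parameters `alpha`, `beta`, and `iterations` are likewise hypothetical.

```python
def tdr_sketch(links, good_seeds, bad_seeds, alpha=0.85, beta=0.5, iterations=20):
    """Return (t_rank, d_rank) dicts for a small link graph (illustrative only)."""
    # Collect every page that appears as a source, a target, or a seed.
    nodes = set(links) | set(good_seeds) | set(bad_seeds)
    for targets in links.values():
        nodes.update(targets)
    out_links = {p: list(links.get(p, [])) for p in nodes}
    in_links = {p: [] for p in nodes}
    for p, targets in out_links.items():
        for q in targets:
            in_links[q].append(p)

    # Seed vectors: uniform mass on the good seeds (trust) and bad seeds (distrust).
    seed_t = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in nodes}
    seed_d = {p: (1.0 / len(bad_seeds) if p in bad_seeds else 0.0) for p in nodes}
    t, d = dict(seed_t), dict(seed_d)

    for _ in range(iterations):
        new_t, new_d = {}, {}
        for p in nodes:
            # Assumed penalty: the target's current scores decide how much of the
            # incoming trust and distrust it accepts. A page with no scores yet
            # accepts both at the neutral weights beta and 1 - beta.
            weight = beta * t[p] + (1.0 - beta) * d[p]
            trust_share = beta * t[p] / weight if weight > 0 else beta
            # T-Rank flows along links: from each page q that links to p.
            trust_in = sum(t[q] / len(out_links[q]) for q in in_links[p])
            # D-Rank flows along inverse links: from each page q that p links to.
            distrust_in = sum(d[q] / len(in_links[q]) for q in out_links[p])
            new_t[p] = alpha * trust_share * trust_in + (1.0 - alpha) * seed_t[p]
            new_d[p] = alpha * (1.0 - trust_share) * distrust_in + (1.0 - alpha) * seed_d[p]
        t, d = new_t, new_d
    return t, d


# Toy graph (hypothetical page names): "good" is a trusted seed, "spam" a spam seed.
links = {
    "good": ["a", "mixed"],
    "a": ["b"],
    "mixed": ["spam"],
    "farm": ["spam"],
    "spam": ["farm"],
}
t_rank, d_rank = tdr_sketch(links, {"good"}, {"spam"})
```

In this toy graph, `mixed` is linked from the good seed but links to the spam seed, so it ends with a lower T-Rank than `a`, which receives the same incoming trust but carries no distrust. The page `farm`, which only exchanges links with the spam seed, receives D-Rank and no T-Rank.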