P2P内容监管中的关键技术研究

Research on Key Technologies in P2P Content Censorship

【Author】 Zhang Tao

【Advisor】 Shen Changxiang

【Author information】 Beijing University of Technology, Computer Application Technology, 2014, PhD

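【Summary】 In recent years, P2P network applications, represented by P2P file sharing and P2P streaming media, have developed rapidly. At the same time, illicit network resources have also spread quickly through these applications, causing many network and social problems. How to regulate P2P content and information effectively has become a key problem in P2P research that urgently needs solving. P2P content regulation comprises three key steps: collecting resource and node information, selecting regulation targets, and controlling the spread of illicit resources. Collecting resource and node information means gathering data on the resources in the target P2P system, their publication information, and node information, according to the regulatory objective; one active approach currently in use is crawling. Target selection is the process of distinguishing normal from illicit resources, based on the collected resource and node information and on the objectives and scope of content regulation. Controlling the spread of illicit resources means managing the propagation of the selected target resources by technical and non-technical means; current strategies concentrate on reducing index accuracy, implemented by having a crawler system publish incorrect index information in the P2P system being regulated. However, existing techniques have the following problems: 1) As P2P technology has developed, some earlier data collection techniques, such as traditional port-based management methods, can no longer complete the collection task; moreover, when collecting data from P2P systems built on newer architectures such as DHT, existing collection strategies show clear defects, such as poor coverage and low efficiency. 2) Relying solely on a resource's publication information to judge whether it is illicit ignores how regulation is affected by the resource's actual availability and by differences in the attention that different resources receive. 3) The current strategy of controlling propagation by reducing index accuracy works poorly, because most forged entries can be identified from content features and node features. To address these problems, this dissertation analyzes the distribution characteristics of resources in P2P systems and the state of research on content regulation, and focuses on collection strategies for P2P resource publication information, methods for judging resource availability, and the principles and mechanisms of content propagation and its control. The main results are as follows. First, for collecting resource publication information in P2P file sharing applications that use mapping-type indexes, this dissertation proposes a name collection strategy based on family resemblance among names. Exploiting the fact that names are organized by partial similarity, the strategy uses the unknown parts of known names as the initial conditions of the next iteration and controls a preset vector of search terms, which lets it capture, to a large extent, a snapshot of the resource publication information in the target system. In an experiment on a real DHT-based P2P system, starting from a single search term as the initial vector, the search returned about 10 million publication records, which indirectly supports the feasibility of the strategy. Second, to address the limitation that current P2P content regulation judges content only by its name, this dissertation proposes a statistical-inference-based method for judging content availability, which estimates overall availability from the availability of a sample. Unlike the traditional approach of checking whether content matches its published name, this dissertation uses the number of names with different meanings associated with a piece of content as the indicator of its availability: the more associated names, the lower the availability. Statistical inference is then used to judge the availability level of that class of content as a whole. Compared with the traditional practice of inferring actual content from names, the proposed method 1) effectively reduces the number of wrong targets in the regulation system, and 2) on that basis enables learning-based target selection over the name and availability dimensions. Third, to address the limitation that current strategies for managing the spread of illicit resources only change the proportion of available content among all results of a search, this dissertation draws on information theory to describe a content search as a channel through which content travels from source to sink via its publication information. Based on this channel model it gives two management strategies: 1) changing the source probability distribution through the current practice of adding versions and copies; 2) changing channel characteristics such as content and node features, to influence the decisions ordinary users make when judging whether a search result is usable. Both aim to reduce the average mutual information and thereby reduce the probability that content propagates successfully. Finally, an experiment on a real P2P system used statistical methods such as multiple linear regression and analysis of variance to analyze the key factors affecting users' decisions. This information-theoretic analysis provides a theoretical basis for content propagation control and extends existing strategies, which target only the source.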
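
The family-resemblance collection strategy described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the tokenizer, the `search` callback, the query budget, and the toy index are all assumptions made for illustration. The idea shown is only that each newly discovered name contributes its not-yet-queried parts as future search terms, so a single seed term can reach names that share no word with it.

```python
from collections import deque

def tokenize(name):
    """Split a name into lowercase word tokens (a simplifying assumption)."""
    for sep in "._-":
        name = name.replace(sep, " ")
    return [t for t in name.lower().split() if t]

def snapshot(search, seed_terms, max_queries=1000):
    """Collect names iteratively; unseen tokens of known names become new queries.

    `search` is a hypothetical callback: term -> iterable of resource names.
    """
    queried, names = set(), set()
    frontier = deque(seed_terms)
    while frontier and len(queried) < max_queries:
        term = frontier.popleft()
        if term in queried:
            continue
        queried.add(term)
        for name in search(term):
            if name in names:
                continue
            names.add(name)
            # The "unknown part" of a known name seeds the next iteration.
            frontier.extend(t for t in tokenize(name) if t not in queried)
    return names

# Toy stand-in for a keyword index; the real system would be queried over the network.
TOY_NAMES = [
    "ubuntu linux iso",
    "linux kernel source",
    "kernel debugging guide",
    "guide to python",
    "unrelated holiday photos",
]

def toy_search(term):
    return [n for n in TOY_NAMES if term in tokenize(n)]

found = snapshot(toy_search, ["ubuntu"])
print(sorted(found))
```

In this toy index, the seed "ubuntu" reaches "guide to python" through a chain of partially overlapping names, while a name that shares no token with the chain is never found, which is one reason such a snapshot is only approximately complete.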

【Abstract】 In recent years, P2P network applications, represented by P2P file sharing and P2P streaming media, have developed rapidly. At the same time, illicit network resources have also spread rapidly through P2P applications, causing many network and social problems. How to regulate P2P content and information effectively has become an urgent key issue in P2P research.

Regulating P2P content involves three key steps: collecting resource and node information, selecting management targets, and controlling the spread of illicit resources. Collection means gathering the resources of the target P2P system, their publication information, and node information according to the regulatory objectives; the main approach currently used is crawling. Target selection distinguishes normal from illicit resources, based on the objectives and scope of content management and on the collected resource and node information. Dissemination control manages the spread of the selected target resources through technical and non-technical means; current strategies focus on reducing the accuracy of the index, which is achieved by having a crawler system publish incorrect index information in the P2P system.

However, existing techniques have the following problems: 1) With the development of P2P technology, some earlier data collection techniques, such as traditional port-based management methods, can no longer complete the collection task; moreover, for P2P systems built on newer architectures such as DHT, existing collection strategies have obvious defects, such as poor coverage and low efficiency. 2) Relying only on a resource's publication information to judge whether it is illicit ignores how the regulatory effect is influenced by resource availability and by differences in the attention that different resources receive. 3) The currently adopted strategy of poisoning the index has limited effect, because ordinary users can use features of resources or nodes to distinguish usable entries from unusable ones.

To solve these problems, this dissertation studies the distribution of P2P resources and the current state of P2P censorship, and focuses on strategies for gathering P2P information, the validity of resources, and the mechanisms by which resources spread and by which that spread can be controlled.

Firstly, to improve the completeness of metadata gathering in DHT-based systems, a family-resemblance-based metadata snapshot strategy is proposed. Exploiting the partial similarity between metadata items, the strategy iterates continuously by taking the unknown part of any known metadata item as the next query. In a real DHT-based system where the strategy was deployed, about 10 million metadata items were acquired from only 1 search term, which indirectly supports the feasibility of the family-resemblance-based strategy.

Secondly, to make target selection in censorship finer-grained, a statistical-inference-based method for differentiating resource validity is proposed. The relation between a resource and its related metadata can be recast as a relation between metadata items, which is much easier to handle. A standard Wilcoxon test can then be used to tell whether a series of resources is valid, judged by the number of metadata items associated with them. With this inference, 1) a large number of invalid resources can be excluded from the censorship targets; 2) by expanding the observations, learning algorithms can be applied to the target selection procedure.

Thirdly, to overcome the current limitation of controlling propagation only by inserting invalid copies or metadata, a mechanism based on an information-theoretic channel model is proposed. Through this channel model, two ways of controlling resource spread are identified: 1) the currently adopted control strategy amounts to redistributing the information source; and 2) a set of resource and node features can strongly affect the choices of ordinary users. Both aim to decrease I(X; Y), the quantity used to measure the effect of propagation control. Finally, a multivariable regression shows that historical download count and file size are the key factors affecting users' choices in P2P file sharing systems. In addition, this information-theoretic analysis provides a theoretical basis for the current strategy and proposes a new way to implement propagation control.
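
The role of I(X; Y) in the channel model can be illustrated with a small numerical sketch. The abstract does not specify the actual source or channel distributions, so the binary source (valid vs. invalid content), the symmetric channel, and all probabilities below are assumptions chosen only to show the two levers: skewing the source distribution, and making the channel noisier so that users' judgments become less reliable.

```python
import math

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits for source p_x and channel matrix p_y_given_x[x][y]."""
    n_y = len(p_y_given_x[0])
    # Marginal distribution of the channel output Y.
    p_y = [sum(p_x[x] * p_y_given_x[x][y] for x in range(len(p_x)))
           for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            joint = px * p_y_given_x[x][y]
            if joint > 0:  # zero-probability cells contribute nothing
                total += joint * math.log2(joint / (px * p_y[y]))
    return total

def binary_channel(error_rate):
    """Hypothetical symmetric channel: the user misjudges with this probability."""
    return [[1 - error_rate, error_rate], [error_rate, 1 - error_rate]]

# Assumed baseline: half of the entries are valid, users misjudge 10% of the time.
baseline = mutual_information([0.5, 0.5], binary_channel(0.1))
# Lever 1: change the source distribution (e.g. many added invalid copies).
skewed_source = mutual_information([0.1, 0.9], binary_channel(0.1))
# Lever 2: change channel features so that users misjudge far more often.
noisy_channel = mutual_information([0.5, 0.5], binary_channel(0.4))

print(f"baseline      I(X;Y) = {baseline:.3f} bits")
print(f"skewed source I(X;Y) = {skewed_source:.3f} bits")
print(f"noisy channel I(X;Y) = {noisy_channel:.3f} bits")
```

Under these assumed numbers both changes lower I(X; Y) relative to the baseline, which matches the abstract's statement that both strategies aim to decrease it; the magnitudes are illustrative only and say nothing about the measured system.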
