节点文献

基于minwise哈希的文档复制检测的研究及应用

Reserch and Application on Document Similarity Detection Based on Minwise Hashing

【作者】 袁鑫攀

【导师】 桂卫华; 张祖平; 龙军;

【作者基本信息】 中南大学 , 计算机应用技术, 2012, 博士

【摘要】 WEB正经历着爆炸性增长,海量文档中存在大量的相似信息,这些相似性文档一方面消耗了高额的检索资源,另一方面影响了用户的使用。文档的数字化和易获性也使得非法复制、剽窃等行为越来越猖獗。为保护知识产权和提高信息检索效率,文档复制检测技术应运而生并得到迅速发展。文档复制检测就是判断给定文档是否抄袭、剽窃或者相似于一篇或多篇文档的内容。论文以某基金项目相似性检测为实际应用背景,为了在海量数据环境中快速而准确地检测出文档的相似性,主要研究相似性检测系统中涉及的关键技术,重点研究相似度估计算法、相似度检索算法和基于SIMD优化的相似度比对等关键技术,具体进行了如下的研究工作:(1)针对文档相似性检测系统中精度和存储空间只能取离散值、粒度过粗的问题,提出了分数位minwise哈希算法,验证了分数位minwise哈希算法的可行性,构造了使得估计方差最小的最优分数位。分数位minwise哈希算法将整数位minwise哈希扩展到分数位,突破了b整数位的限制,使得相似度可以使用分数位来估计,不仅完善了minwise哈希算法的理论体系,也为实际系统中的用户对于相似度估计的精度和存储空间更加细粒度可选择性需求提供支撑。(2)针对文档相似性检测效率不高的问题,提出了连接位minwise哈希算法。连接位minwise哈希算法将位连接起来进行相似性度量,证明与推导及实验结果显示算法虽然牺牲5%精度,却能成倍地减少比对的次数,大大提升算法的时间性能。一方面,连接位无需任何复杂的操作,方便构建;另一方面,亿万级文档的相似度的估计,通过损失一定的精度误差,获得了性能的成倍提升具有很强的实际应用意义。(3)针对海量文档相似性检索中相似度阂值不能设置过低,初始指纹数少等问题,提出了指纹分组合并检索算法,理论推导及实验结果表明算法能够在低相似度阈值(比如70%)下快速地从已有的文档集中检索目标文档,从而实现相似性文档查询的实时性,并且由于降低了相似度阈值,也增大了相似性检索的应用范围。(4)针对某基金对相似性证据快速采集和清晰呈现的特殊需求,提出了基于SIMD优化的相似性比对算法。通过使用SIMD指令集和OpenCL框架对相似度比对算法进行了一系列的优化,实验结果表明优化算法提升了可提升11.6%-170%的时间性能,一方面使得相似性有迹可循;另一方面也有利于人工复审工作。(5)针对某基金项目相似性检测系统中存在的项目数据难以准确快速提取、海量项目数据比对时间超长、比对结果难以清晰呈现等关键问题,论文论述了如何采用所研究的关键技术形成完整的基金项目相似性检测系统,并为基金项目形式审查提供支持。

【Abstract】 With the explosive information growth of Web, there are a large number of massive web of similar information. On the one hand, these similar documents consumed high resources of index, the other affected users. Document similarity detection technology is an important topic in the information processing field, and it is a powerful tool to protect the author’s intellectual property and to improve the efficiency of information retrieval.Duplicate Document Detection (DDD) is widely used to find out similar documents, or in other words, to detect plagiarism in documents. Plagiarism does not only include intact copy, but also close imitation of the language and thoughts of another author and the representation of them as one’s own original work. Takeing fund projects similarity detection as a research background, In order to quickly and accurately detect the similar documents in the environment of massive amounts of data, this paper focuses on the theories and methods of document similarity detection for a more in-depth study, especially on the similarity estimation algorithm, the similarity retrieval algorithm, and similarity matching techniques based on SIMD optimization. Research work has been done as following:(1)f-fractional bit minwise hash algorithm is proposed for a wider range of selectivity for accuracy and storage space requirements. This paper studied the feasibility of f-fractional bit minwise hash algorithm, and constructs the optimal fractional bit to make the minimum variances of estimator. The algorithm’s innovation is extending the b bit into f-fractional bit. It broke through the limit of b integer; the similarity could be estimated by fractional bit. It not only improves the theoretical system of minwise hash algorithm, but also provides support for the diverse needs of precision and storage space in the actual system.(2) Connected bit minwise hash algorithm is proposed to improve the efficiency of similarity estimation since the half of comparisons is greatly reduced with5%loss of accuracy. Connected bit is convenient to be built and the performance increases exponentially with a strong practical significance in the environment of massive amounts of data.(3) Fingerprint group merging retrieval algorithm is proposed in large part to address both sides of a problem:similarity threshold could not be too low and fewer fingerprints could lead to low accuracy. Fingerprint group merging retrieval algorithm could quickly find documents with higher similarity in existing documentation set with the lower similarity threshold. Due to the reduction of the similarity threshold, the application level is wider.(4) For the problem that comparison’s results are difficult to clearly show, an optimized similarity comparison algorithm is proposed and implemented by using Intel SIMD technology and GPUs, based on the analysis of the whole non-parallel algorithm and the data statistics. Experimental results demonstrate that certain performance improvements could be obtained. As evidence of similarity to be significant expressed in the system. On one hand, the similarity can be tracked, on the other hand, in favor of manual review.(5)To solve the key problems of fund projects similarity detection system that the existing project data is difficult to extract quickly and accurately; the time of mass projects data comparision is too long; comparision’s results are difficult to clearly show, this paper uses key technologies to form a complete fund projects similarity detection system for the fund projects formal review.

  • 【网络出版投稿人】 中南大学
  • 【网络出版年期】2014年 03期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络