
Research on Document Repetition Rate Detection Algorithm Based on Similarity Estimation (基于相似度估计文档重复率检测算法研究)


【Author】 Wang Yuning; Liu Xiaoxia; Zhou Shaojun

【Affiliation】 Department of Information Engineering, Sichuan Water Conservancy Vocational College

【Abstract】 In the information age, document similarity detection has been widely and successfully applied in digital libraries, search engines, plagiarism checking of papers, and many other fields. However, detection based on word-frequency statistics has low accuracy, and detection based on string comparison cannot handle complex application scenarios. To address these problems, a large number of document similarity detection techniques based on similarity estimation have emerged in recent years. Among them, the shingle algorithm and the minwise hash algorithm are relatively mature and stable. Specifically, starting from the shortcomings of the word-frequency and string-comparison methods, this paper summarizes the advantages of methods based on similarity estimation, describes in detail the ideas, advantages, and subsequent development of the shingle algorithm and the minwise hash algorithm, and highlights the open problems and future research directions of document similarity detection technology.
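
To make the two algorithms named in the abstract concrete, the following is a minimal sketch of word-level shingling combined with minwise hashing to estimate the Jaccard similarity of two documents. It is not taken from the paper: the function names, the shingle width `k`, the number of hash functions, and the use of seeded BLAKE2b as the hash family are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a text shorter than k words forms a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed, item):
    """64-bit hash of item under a given seed (stand-in for a hash family)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, s) for s in items)
            for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumps over the sleepy cat"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    exact = jaccard(set_a, set_b)
    approx = estimate_similarity(minhash_signature(set_a),
                                 minhash_signature(set_b))
    print(f"exact Jaccard: {exact:.3f}, minhash estimate: {approx:.3f}")
```

The estimate works because, for each hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity. Comparing fixed-length signatures therefore avoids comparing the full shingle sets, at the cost of an estimation error that shrinks as the number of hash functions grows.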

【Key words】 Repetition Rate; Similarity Estimation; Detection Algorithm
【Fund】 Supported by the Scientific Research Project of Sichuan Water Conservancy Vocational College (KY2020-30)
  • 【Classification code】 TP391.1