

Research on Code Conversion Technology in Program Code Similarity Detection

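【Author】 Pei Dongmei (裴冬梅)

【Advisor】 Liu Dongsheng (刘东升)

【Author information】 Inner Mongolia Normal University, Computer Application Technology, 2008, Master's degree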

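【Abstract (translated from the Chinese)】 Word segmentation and conversion of program code is a key technique in building a program code similarity detection system. A good segmentation and conversion technique can improve both the speed and the accuracy of the similarity computation, which is of real practical significance for the development of such systems. The technique is widely used in program code similarity detection systems. A program can be treated as a text string, which grammatical analysis then converts into a string of tokens describing the program's basic information. Comparing the similarity of two programs thus becomes comparing their token strings, and producing those token strings is the job of word segmentation and conversion. This study first introduces program code similarity detection: its definition and classification, the state of research in China and abroad, and existing detection systems. It then describes the algorithms used in the segmentation and conversion process, including word segmentation algorithms and string matching algorithms. An experimental system was designed, consisting of four parts. The first part preprocesses and segments the program code; preprocessing removes information that is present in a program but has no bearing on similarity detection, such as comments, runs of consecutive spaces, and blank lines, after which the preprocessed code is segmented. The second part builds the vocabulary table needed for code conversion. The third part uses a string matching algorithm to convert the preprocessed and segmented program into string marks. The fourth part presents the converted result of the source program through a user interface. Finally, the system is verified and analyzed with some simple experiments. The experimental data come from students' programming assignments. The results show that the system supports conversion of multiple programming languages and that its output can be used by string-based similarity detection algorithms, providing reliable test information for follow-up research, namely computing similarity over the converted token strings to obtain similarity measures.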

【Abstract】 Segmenting program code into words is an important technique for implementing a program code similarity detection system. A good word segmentation technique makes similarity detection both faster and more accurate, and it is widely used in such systems. A program is first treated as a text string; grammatical analysis then converts that string into tokens that describe the basic information and properties of the code. This process is word segmentation and conversion. This paper introduces program code similarity detection, including its definition and classification, and then reviews its development in China and abroad. It goes on to describe the algorithms used for segmenting and converting program code, including word segmentation algorithms and string matching algorithms. For this research we implemented an experimental system with four parts. The first part preprocesses and segments the program code; preprocessing removes content that does not affect similarity, such as comments and extra spaces. The second part creates a word dictionary for code conversion. The third part uses a string matching algorithm to convert the program code into tokens. The fourth part displays the conversion results through the system's GUI. Finally, we test and analyze the experimental system through experiments. All experimental data come from students' homework. The test results show that the system supports conversion of multiple programming languages and that its output, being a character string, can be used by similarity detection algorithms based on string comparison. The experiment provides reliable test information, which is important for further research on program code similarity detection systems.
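
The four-part pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the vocabulary table `VOCAB`, the one-character marks, the regular expressions, and all function names are hypothetical, and a plain dictionary lookup stands in for the string matching algorithm the thesis uses. Only C-style comments are handled, and comment markers inside string literals are not treated specially.

```python
import re

# Hypothetical vocabulary table (part 2): maps known lexemes to one-character marks.
VOCAB = {
    "if": "I", "else": "E", "for": "L", "while": "L", "return": "R",
    "int": "T", "float": "T", "char": "T",
    "=": "A", "+": "O", "-": "O", "*": "O", "/": "O",
    "<": "C", ">": "C", "==": "C", "!=": "C", "<=": "C", ">=": "C",
}

# Identifiers, numbers, a few two-character operators, then any other single symbol.
LEXEME_RE = re.compile(
    r"[A-Za-z_]\w*|\d+(?:\.\d+)?|==|!=|<=|>=|[^\sA-Za-z_0-9]"
)


def preprocess(source):
    """Part 1a: drop C-style comments and collapse spaces and blank lines."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    return re.sub(r"\s+", " ", source).strip()


def segment(source):
    """Part 1b: split preprocessed code into lexemes."""
    return LEXEME_RE.findall(source)


def convert(lexemes):
    """Part 3: map each lexeme to a mark and join the marks into one string."""
    marks = []
    for lexeme in lexemes:
        if lexeme in VOCAB:
            marks.append(VOCAB[lexeme])
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            marks.append("V")  # any identifier, so renaming has no effect
        elif lexeme[0].isdigit():
            marks.append("N")  # any numeric literal
        else:
            marks.append(lexeme)  # punctuation is kept as-is
    return "".join(marks)


def to_token_string(source):
    """Run preprocessing, segmentation, and conversion in sequence."""
    return convert(segment(preprocess(source)))


if __name__ == "__main__":
    # Part 4 (result output) is reduced here to printing.
    a = "int sum = 0; // total\nfor (i = 0; i < n; i++) { sum = sum + i; }"
    b = "int  total=0;\n/* loop */\nfor(k=0;k<m;k++){total=total+k;}"
    print(to_token_string(a))
    print(to_token_string(a) == to_token_string(b))
```

In this sketch the two sample programs differ only in comments, spacing, and variable names, so they convert to the same token string. That is the property a later string-based similarity computation relies on.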

  • 【Classification number】TP311.11
  • 【Citations】8
  • 【Downloads】291

