节点文献
结构语义相似的程序识别方法研究
Research on Program Recognition Approach Based on Structural Semantic Similarity
【作者】 王甜甜;
【导师】 马培军;
【作者基本信息】 哈尔滨工业大学 , 计算机应用技术, 2009, 博士
【摘要】 程序识别使用一个已知模式分析给定程序,从而识别给定程序的意图。程序识别在程序理解、软件系统分析、编译器优化、重复代码检查和软件缺陷检测等领域中有着广泛的应用。本文提出结构语义相似的程序识别方法。主要研究内容包括以下四部分:针对代码多样化给程序分析带来困难的问题,提出基于系统依赖图的程序标准化方法。改进了传统的系统依赖图表示,使其充分表示程序的语法结构与语义。并针对已有的指针分析算法不适合于程序标准化的问题,基于控制依赖树和改进的指向表示方法提出流敏感和上下文敏感的指针分析算法,提高指针分析和数据流分析的准确性,并使得指针分析结果可直接应用于程序标准化转换中。最后将改进的系统依赖图、指针分析算法与程序标准化过程有机结合,提出基于系统依赖图的程序标准化模型,根据程序标准化转换规则,对系统依赖图进行语义不变的转换,从而消除代码多样化。实验结果表明,本文提出的指针分析算法准确性高于已有的Emami指针分析算法的准确性,并且应用于程序标准化时可显著提高代码多样化消除率。本文的程序标准化方法的代码多样化消除率高于已有的Hattori与Ishi方法的代码多样化消除率,可以有效地消除代码多样化,提高程序分析的灵活性,为判定程序的结构语义相似奠定良好的基础。针对基于图提取子程序时间与空间开销巨大的问题,提出结构度量相似的候选子程序提取方法。在已有的代码相似度及编辑距离的定理和推论的基础上提出新的推论,将特征向量聚类应用于候选子程序提取中。首先,将代码表示为控制依赖树,通过基本代码标准化,消除影响度量值计算的代码多样化。然后,采用特征向量描述程序的结构信息,将复杂的图的相似度求解问题转换为简单的相似向量的聚类问题,快速提取可能与给定模式相似的候选子程序。实验结果表明,该方法能够过滤掉大部分不相似代码,并且可以识别含有代码多样化的代码,为大规模软件在语义级别上的程序识别奠定基础。针对已有的程序识别方法不能较好地在语义级别上识别程序,并且不能定量地衡量代码的语义相似度的问题,提出基于系统依赖图的语义级别的程序识别方法。在此基础上提出基于程序转换和语义分析的编程题自动评分方法,克服了传统的基于动态测试自动评分和基于软件度量分析的自动评分方法没有考虑学生程序是怎样实现编程任务的缺陷,为编程题自动评分提供新思路。实验结果表明,基于系统依赖图的语义级别的程序识别方法能够准确计算代码的语义相似度,较好地解决程序识别结果的表示和准确性问题。针对传统基于度量值的程序识别方法准确率低和基于图的程序识别方法复杂度高的问题,在上述研究的基础上,提出度量值和图相结合的程序识别方法。首先,将模式程序和用户程序表示为改进的系统依赖图。然后,以模块为单位,采用结构度量相似的候选子程序提取方法,缩小语义级别程序识别的搜索空间,提高算法的效率。最后,对可能相似的候选子程序进行高级代码多样化和精确的语义级别的程序匹配,识别语义相似的子程序。在此基础上提出语义级别的相似代码检测方法,为相似代码检测提供新思路。实验结果表明,度量值和图相结合的程序识别方法可以有效地处理大规模代码,不但能识别完全相同的代码,还可以识别包含代码多样化的相似代码,实现语义级别的程序识别,准确率和效率高于GPlag和integrated logic-domain model两种方法。综上所述,本文提出了结构语义相似的程序识别方法,解决了代码多样化的识别、识别结果的表示和准确性、识别方法的可扩展性等关键技术问题。
【Abstract】 Program recognition recognizes the intention of a given program, by analyzing this program with a known pattern. It has wide application such as program comprehension, software system analysis, compiler optimizing, duplicated code detection, and bug detection.This dissertation proposes a structural semantic similar program recognition approach. The major original works in the research are listed in details as follows.Code variations are widely believed to impede program analysis. To solve this problem, a program normalization approach based on system dependence graph is proposed. The traditional representation of system dependence graph is augmented to sufficiently represent the syntactical structure and the semantic of programs. Existing pointer analysis algorithms usually adopt a low-level intermediate representation which can not sufficiently represent the syntactical structure and the semantic of programs. This makes them difficult to be applied to program normalization. To solve this problem, a flow-sensitive and context-sensitive pointer analysis algorithm based on control dependence tree is presented. The control dependence tree is used as the intermediate representation for the source program, and an improved point-to representation is proposed to represent alias information. Based on this, data flow equations are defined to compute the point-to information by traversing the control dependence tree. The point-to information can be directly used in program normalization to improve the accuracy of data flow analysis. Finally, the augmented system dependence graph, the pointer analysis, and the program normalization process are combined to form the model of program normalization. Semantic-preserving transformations are performed on the system dependence graphs of programs according to a set of predefined rules. As a result, code variations are removed. Experimental results show that the accuracy of our pointer analysis approach is higher than that of Emami’s approach, and our pointer analysis approach can greatly improve the variation removal rate of the program normalization. The variation remove rate of our normalization approach is higher than that of the normalization approach proposed by Hattori and Ishi. Our normalization approach establishes a good framework for testing the semantic equivalence of source codes, and it can facilitate program analysis.The approach of extracting sub-programs based on graphs is high in space and time complexity. To solve this problem, a candidate sub-program extraction approach based on structural metrics similarity is proposed. Based on the existing theorems and corollaries of tree similarity and editing distance, new corollaries are proposed to apply the vector clustering to the candidate sub-programs extraction. First, source codes are represented as control dependence trees, and basic code normalization is performed to eliminate basic code variations that affect the calculation of metrics. Then, vectors are computed to describe the structural information of source codes, and the difficult graph similarity problem is reduced to a simpler vector clustering problem. Candidate sub-programs are quickly extracted. Experimental results show that most of the dissimilar sub-programs can be filtered out, and sub-programs with code variations can be recognized. This approach lays a good foundation for large-sized program recognition on the semantic level.Most of the traditional program recognition approach can not analyze code on the semantic level and can not evaluate the semantic similarity quantitatively. To solve this problem, a program recognition approach based on system dependence graph on the semantic level is proposed. Based on this approach, an automatic grading approach based on program transformation and semantic analysis is proposed. It can conquer the problem of the traditional automatic grading approaches which do not take into account how a student program answers a given programming task. Experimental results show that the program recognition approach based on system dependence graph can well solve the problem of the representation and accuracy of the recognition results.The graph-based program recognition approach can well recognize structurally and semantically similar programs, but it has a high computation complexity. On the contrary, the metrics-based program recogniton approach has a lower computation complexity but a lower precision. To take advantages of these two approaches, a metrics-based and graph-based combined approach for program recognition is proposed based on the above researches. First, source codes are represented as augmented system dependence graphs. Then, the candidate sub-program extraction based on structural metrics similarity according to a granularity of module is performed to filter out most of the dissimilar code pairs so as to lower the computational complexity. After that, advanced code normalization is performed on the candidate sub-programs to remove code variations so as to recognize similar sub-programs on the semantic level. Finally, program matching is performed on the normalized control dependence trees to output semantically similar sub-programs. A semantically similar code detection approach is proposed based on this approach, to provide a new insight into similar code detection. Experimental results show that this approach can be applied to large softwares, and it can recognize not only identical sub-programs but also sub-programs with code variations. The precision of this approach is higher than those of GPlag and integrated logic-domain model.In conclusion, this dissertation has proposed a structural semantic similar program recognition approach, and solved the following key problems: code variation recognition, the representation and precision of the results, the scalability.
【Key words】 program recognition; semantic similar; program normalization; program matching; system dependence graph;