
非连续上下文建模及其在可执行文件压缩中的应用

On Non-sequential Context Modeling with Application to Executable Data Compression

【Author】Dai Wenrui (戴文睿)

【Supervisor】Zhi Cheng (支琤)

【Author information】Shanghai Jiao Tong University, Communication and Information Systems, 2008, Master's thesis

【摘要 (Abstract, translated from the Chinese)】 Over the past two decades, data compression techniques have developed rapidly and have been widely applied to text, speech, image, video, and executable data. Data compression is generally divided strictly into two steps: modeling and coding. Now that the code lengths achieved by coding algorithms are already very close to the Shannon bound, modeling plays the decisive role in improving compression efficiency. Researchers have done extensive work on context modeling and proposed a series of classic modeling methods. Executable compression, as a branch of data compression, has grown in importance with the spread of network infrastructure and handheld terminals and the resulting demands of online program distribution and driver storage on handheld devices. Classic modeling methods, however, generally consider only sequential context models; while these perform well in applications such as text compression, their performance on executables is unsatisfactory.

This thesis first reviews the classic algorithms based on sequential context modeling and the estimation of their redundancy. On this basis, by relaxing the constraints on sequential contexts, allowing a non-sequential context to be formed from an arbitrary combination of symbols in the already-predicted subsequence rather than from its suffix, it extends the definitions of contexts and context models to the non-sequential case and discusses modeling based on non-sequential contexts. The discussion of non-sequential context modeling covers three main points. First, a model tree is introduced to build a family of context weighting trees for non-sequential contexts, from which a weighted probability estimate based on non-sequential context models can be derived. Second, the model redundancy of this weighted estimate is analyzed and compared with the corresponding results for classic sequential context modeling, demonstrating its advantage in estimating data with non-sequential correlations. Third, model selection methods for given data are established both with and without training data: when no training data exists, the method proposed here uses already-predicted data to approximate the model choice quickly; when training data is available, a greedy algorithm selects a set of optimal models under the minimum description length (MDL) criterion.

For executable compression, this thesis uses sequential and non-sequential context models simultaneously and combines their estimates into a final weighted probability estimate. After summarizing the sequential and non-sequential correlations in executables, particularly in the IA-32 instruction set, the non-sequential modeling method is applied to the IA-32 instruction set in combination with those correlations, and the aforementioned greedy algorithm selects an optimal set of context models under the MDL criterion from training data. As a concrete implementation, the thesis proposes an executable compression framework comprising preprocessing that models correlations specific to executables and parses instruction syntax, probability estimation for instructions from the selected sequential or non-sequential context models, and mixing of these estimates with a family of normalized least-mean-square algorithms based on the Lp-norm, asymptotically attaining the optimal weighted probability estimate.
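As a minimal illustration of the relaxation described above, the sketch below contrasts a sequential (suffix) context with a non-sequential one that conditions on an arbitrary subset of past positions. The class name, the binary alphabet, and the Krichevsky-Trofimov estimator are illustrative assumptions, not the thesis's actual implementation.

```python
from collections import defaultdict


def kt_estimate(counts):
    # Krichevsky-Trofimov estimate of P(next bit = 1) from [zeros, ones].
    zeros, ones = counts
    return (ones + 0.5) / (zeros + ones + 1.0)


class ContextModel:
    """Binary symbol counts conditioned on a context.

    `positions` are offsets back from the current symbol: (1, 2, ..., d)
    yields a classic sequential (suffix) context, while an arbitrary
    tuple such as (1, 4) yields a non-sequential context, i.e. the
    relaxation the thesis introduces.
    """

    def __init__(self, positions):
        self.positions = positions
        self.counts = defaultdict(lambda: [0, 0])

    def _context(self, history):
        # Caller must guarantee len(history) >= max(self.positions).
        return tuple(history[-p] for p in self.positions)

    def predict(self, history):
        return kt_estimate(self.counts[self._context(history)])

    def update(self, history, bit):
        self.counts[self._context(history)][bit] += 1
```

On a period-2 source (0, 1, 0, 1, ...), a model conditioning only on the position two symbols back captures the dependence exactly, illustrating why skipping over uninformative positions can beat a suffix context of the same size.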

【Abstract】 In the last two decades, data compression techniques have developed rapidly, with extensive applications to fields such as text, speech, image, video, and executable compression. The process of data compression can be strictly split into two steps: modeling and encoding. Nowadays, since achievable code lengths are already rather close to the Shannon bound, modeling plays the decisive role in improving compression efficiency. A large body of work has concentrated on context modeling, and several classic context modeling methods have been put forward. With the development of networks and portable devices, executable compression, as a branch of data compression, has become significant in applications such as the online distribution of program binaries and the storage of driver files on portable devices. However, although classic context modeling algorithms, which consider sequential context models only, perform efficiently in text compression, their performance in executable compression is not satisfactory.

In this dissertation, we first review sequential modeling algorithms and their redundancy estimation. We then generalize the definitions of contexts and context models to the non-sequential case by relaxing the restriction on sequential contexts from the suffix of the predicted subsequence to any combination of symbols in that subsequence. On this basis, we put forward a method for non-sequential context modeling with three main points: 1) we establish a series of context weighting trees for non-sequential contexts by introducing the model tree, from which a weighted probability estimation can be made; 2) an upper bound on the model redundancy of non-sequential context modeling is derived for this weighted estimation and compared with the corresponding bound for sequential context modeling, showing its advantage in estimating source data with non-sequential correlation; 3) model selection methods are proposed for both cases, with and without a training sequence: when no training sequence is available, a fast method selects models from the already-predicted symbols, and when a training sequence does exist, an optimal set of models is found by a greedy algorithm under the minimum description length (MDL) criterion.

For executable compression, both sequential and non-sequential models are considered, and the predictions from these models are mixed into a weighted probabilistic estimation. After summarizing the sequential and non-sequential correlations in executable files, especially in the IA-32 instruction set, we apply our non-sequential context modeling method to assembly instructions in combination with those correlations and, using a training sequence, select an optimal set of context models with the described greedy algorithm under the MDL criterion. Moreover, an executable compression framework is proposed, comprising preprocessing that handles the correlations unique to assembly instructions and parses IA-32 instruction syntax, estimation by the selected (sequential or non-sequential) models, and weighting by a family of normalized least mean square (LMS) algorithms that consider the Lp-norm of the input probability estimates, asymptotically attaining the optimal weighted probability estimation.
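The final mixing stage described in the abstract, combining per-model probability estimates with a normalized LMS update, can be sketched as follows. This is a minimal illustration under assumed details: the step size `mu`, the clipping bounds, and the exact form of the Lp-norm normalization are illustrative choices, not taken from the thesis.

```python
class NLMSMixer:
    """Mix expert estimates of P(bit = 1) with a normalized LMS weight update."""

    def __init__(self, n_experts, mu=0.1, p=2):
        self.w = [1.0 / n_experts] * n_experts  # start from uniform weights
        self.mu = mu
        self.p = p  # order of the norm used for step-size normalization

    def mix(self, preds):
        # Weighted combination, clipped so it stays a usable probability.
        q = sum(w * x for w, x in zip(self.w, preds))
        return min(max(q, 1e-6), 1.0 - 1e-6)

    def update(self, preds, bit):
        # Normalized LMS: error times input, scaled by the p-th power of the
        # Lp-norm of the prediction vector (for p = 2 this is the classic
        # NLMS normalization by the squared Euclidean norm).
        err = bit - self.mix(preds)
        norm_p = sum(abs(x) ** self.p for x in preds)
        for i, x in enumerate(preds):
            self.w[i] += self.mu * err * x / (norm_p + 1e-12)
```

Adapting the weights online in this way lets the mixer asymptotically favor whichever context models, sequential or non-sequential, predict the current region of the executable best.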

  • 【CLC classification】TP391.1
  • 【Cited by】1
  • 【Downloads】78