

The Research on Natural Language Information Hiding

【Author】 Liu Yuling

【Advisor】 Sun Xingming

【Author Information】 Hunan University, Computer Application Technology, 2008, Ph.D.

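【Abstract (translated from the Chinese)】 With the development and spread of computer and Internet technology, information hiding has become a research hot spot in information security, with broad application prospects in copyright protection, covert communication, identity authentication, and many other areas. Information hiding in video, images, and audio has already been studied extensively. Text documents are a widely used and important medium for storing and transmitting information, so using them as carriers for covert communication, protecting their copyright with digital watermarking, and authenticating their content are all of great significance. Text documents lack the human visual or auditory redundancy found in images, audio, and video, and natural language processing still lacks a solid theoretical foundation and practical automated techniques for understanding, transforming, and generating text. For these reasons, research on text information hiding is highly challenging. Early format-based text information hiding techniques cannot withstand reformatting or optical character recognition attacks and are not widely applied. Natural language information hiding is a newly emerging field that represents the direction in which text information hiding is developing.

This dissertation takes Chinese natural language text as its main research object. Following the modification granularity of embedding-based natural language information hiding, it proposes methods at the word level, the sentence level, and the paragraph level. In addition, to overcome the small capacity and implementation difficulty of existing embedding methods, it proposes a Mimic-based method for generating cover text. The main results are as follows.

(1) Based on the characteristics of Chinese, two word-level natural language information hiding methods are proposed. The first is based on the substitution of variant word forms and of synonyms. It treats physically adjacent words as the context window and then uses a lexical analysis system to perform a trial substitution, which decides whether information is embedded at that position. This method is easy to implement, offers relatively large capacity, and resists machine-analysis attacks. The second is a synonym substitution method based on semantic adjacency. A synonym lexicon is first built from Tongyici Cilin and HowNet, and the synonym sets are classified. For synonyms that are not fully interchangeable, dependency parsing is used to obtain the words semantically adjacent to the word being replaced, which serve as its context, and the synonym with the highest probability of occurring with that context is chosen for substitution. This method obtains the context effectively and largely eliminates erroneous substitutions.

(2) Existing sentence-level natural language information hiding methods concentrate mainly on English text, and existing parsing and generation techniques cannot meet the requirements of syntactic transformation. To address this, two sentence-level methods for Chinese are proposed. The first is based on the transformation of syntactic parse trees: a parser based on a BP neural network is designed and implemented, the parse trees are encoded, and information is hidden by modifying the parse-tree encoding according to syntactic transformation rules. The second is based on shift transformation: the text is first digitized using the idea of mathematical expressions of Chinese characters, and secret information is then hidden through shift transformation rules.

(3) Existing paragraph-level natural language information hiding methods have received little study and are difficult to implement, with low feasibility. To address this, a paragraph-level natural language digital watermarking method based on named entities and coreference resolution is proposed, and spread spectrum techniques are introduced to encode the watermark. Experimental results show that the method resists certain active attacks and is reasonably robust.

(4) Existing generation-based natural language information hiding methods require the communicating parties to transmit dictionaries and sentence template libraries as extra data, and the generated text easily arouses suspicion. To address this, a Mimic-based method for generating cover text is proposed. It does not require finely crafted dictionaries or sentence template libraries to be built in advance, and it improves the efficiency and security of secret information transmission. Taking Microsoft PowerPoint (PPT) documents as an example, the dissertation also describes in detail the implementation of the tool MIMIC-PPT.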
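The word-level synonym substitution idea in contribution (1) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the synonym sets below are hypothetical English stand-ins for a lexicon built from Tongyici Cilin and HowNet, the names `SYNSETS`, `embed`, and `extract` are invented for illustration, and the context check (trial substitution or co-occurrence with semantically adjacent words) is omitted entirely. Each substitutable word carries one bit, given by the index of the synonym that appears in the text.

```python
# Hypothetical two-word synonym sets (assumption, for illustration only).
# The index of a word inside its set is the bit value it encodes.
SYNSETS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
]

# Map every word to (its synonym set, its index in that set).
WORD_TO_SLOT = {w: (s, i) for s in SYNSETS for i, w in enumerate(s)}


def embed(words, bits):
    """Replace each substitutable word with the synonym whose index is the next bit."""
    out, queue = [], list(bits)
    for w in words:
        if w in WORD_TO_SLOT and queue:
            synset, _ = WORD_TO_SLOT[w]
            out.append(synset[queue.pop(0)])
        else:
            # Not substitutable, or no bits left: keep the word unchanged.
            out.append(w)
    if queue:
        raise ValueError("cover text has too few substitutable words")
    return out


def extract(words, n_bits):
    """Read back the first n_bits bits from the substitutable words, in order."""
    bits = [WORD_TO_SLOT[w][1] for w in words if w in WORD_TO_SLOT]
    return bits[:n_bits]


cover = "a big dog can start fast".split()
stego = embed(cover, [1, 0, 1])
recovered = extract(stego, 3)
```

In a realistic system the receiver needs the same synonym lexicon and the message length (or a terminator), and a substitution would be skipped whenever the context check rejects it.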

【Abstract】 With the development and popularization of computer and Internet technology, information hiding has become one of the hot spots in the field of information security, with broad applications in copyright protection, covert communication, authentication, and other areas. At present, most research has focused on information hiding in video, image, and audio documents. However, digital texts form one of the largest bodies of digital data that people encounter daily, so covert communication, copyright management, and authentication for text documents are of great importance.

Compared with other media such as image, audio, and video, text documents lack the redundancy that the human visual and auditory systems tolerate. Additionally, natural language processing offers few solid theories and practical automatic techniques for understanding, transforming, and generating text. Research on text steganography is therefore very challenging. The early methods of text steganography are based on the physical format of texts. Because those methods exploit tolerances in typesetting by making minute changes in line placement and kerning, they are vulnerable to simple reformatting and OCR (Optical Character Recognition) attacks, and their applications are limited. Natural language steganography, as a new area, represents the direction in which text steganography is developing.

This dissertation focuses on Chinese texts and proposes several methods for natural language steganography on the word level, sentence level, and paragraph level. Additionally, because methods that modify a given cover text have limited capacity and are difficult to implement, a new method based on Mimic is proposed. The main contributions are summarized as follows.

(1) According to the characteristics of Chinese texts, two methods on the word level are proposed. The first method exploits the substitution of variant forms of the same word and of synonyms. In this method, the neighboring words are treated as context words. Before a substitution is made, a Chinese lexical analyzer is used to check whether the text would still be segmented correctly. The method is easy to implement, achieves high capacity, and resists machine analysis. The second method is synonym substitution based on semantically adjacent words. First, the synonym sets are created and classified with HowNet and Tongyici Cilin. For synonym sets that are not fully interchangeable, the context words are obtained from the semantically adjacent words by analyzing dependency relationships, and the synonym with the highest probability of co-occurrence with those words is selected. The method obtains the context words effectively and avoids improper substitutions.

(2) As present work on sentence-level natural language steganography is mainly designed for English texts, this dissertation proposes two sentence-level methods for Chinese texts. The first method is based on the transformation of syntactic parse trees. First, a parser based on a BP neural network is designed and implemented. Next, the syntactic parse trees are encoded. Finally, secret information is embedded by modifying the trees according to the transformation rules. The second method is based on shift conversion. First, a method based on mathematical expressions of Chinese characters is presented to encode Chinese texts. Then, secret information is embedded according to the shift conversion rules.

(3) Presently, there is little work on natural language steganography on the paragraph level. This dissertation proposes a Chinese natural language watermarking method on the paragraph level. The method is based on named entities and coreference resolution. Additionally, the spread spectrum technique is introduced to encode the watermark. The experimental results show that the method is robust and can resist some active attacks.

(4) Existing text mimicking methods require the communicating parties to share a dictionary and sentence templates. Additionally, the generated texts easily arouse suspicion. This dissertation proposes a new method of natural language steganography based on Mimic. The method does not need sophisticated dictionaries and sentence templates to be constructed beforehand. Moreover, it can improve the efficiency and security of transmitting secret information. A tool called MIMIC-PPT is implemented by combining text mimicking techniques with the characteristics of PPT documents.
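Contribution (3) mentions encoding the watermark with a spread spectrum technique but the abstract does not say how. As a hedged illustration only, the sketch below shows generic direct-sequence spreading: each watermark bit is spread over several keyed pseudo-random chips and recovered by correlation, which tolerates some corrupted chips. The function names, the key, and the choice of 7 chips per bit are assumptions, and the mapping from chips to actual text modifications (for example at named-entity or coreference sites) is not modelled.

```python
import random


def pn_sequence(key, length):
    """Keyed pseudo-random sequence of +1/-1 chips (the key is the shared secret)."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(length)]


def spread(bits, key, chips_per_bit):
    """Spread each bit over chips_per_bit chips: bit 1 -> +pn, bit 0 -> -pn."""
    pn = pn_sequence(key, len(bits) * chips_per_bit)
    chips = []
    for i, b in enumerate(bits):
        sign = 1 if b else -1
        for j in range(chips_per_bit):
            chips.append(sign * pn[i * chips_per_bit + j])
    return chips


def despread(chips, key, chips_per_bit):
    """Recover bits by correlating each chip block with the same keyed sequence."""
    pn = pn_sequence(key, len(chips))
    bits = []
    for start in range(0, len(chips), chips_per_bit):
        block = range(start, start + chips_per_bit)
        corr = sum(chips[k] * pn[k] for k in block)
        bits.append(1 if corr > 0 else 0)
    return bits


watermark = [1, 0, 1, 1]
chips = spread(watermark, key=42, chips_per_bit=7)

# Simulate an active attack that flips one chip in each of the first two blocks.
attacked = list(chips)
attacked[0] = -attacked[0]
attacked[10] = -attacked[10]
decoded = despread(attacked, key=42, chips_per_bit=7)
```

With an odd number of chips per bit the correlation is never zero, and a bit survives as long as fewer than half of its chips are altered; more chips per bit trade capacity for robustness.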

  • 【Online Publication Contributor】 Hunan University
  • 【Online Publication Issue】 2011, Issue 03

