

Research on Text Information Hiding Methods Based on Digital Watermarking

【Author】 Wu Ge (吴戈)

【Supervisor】 Chen Dianren (陈殿仁)

【Author information】 Changchun University of Science and Technology, Physical Electronics, 2011, PhD

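【Summary】 Since the start of the 21st century, people have moved from first acquaintance with the network to familiarity with it, and Internet use has grown explosively. The Internet supports the exchange of all kinds of information and gives quick, convenient access to digital content (film and television, literary works, technical material, and so on) and to online services (online banking, shopping, and so on). Everything has two sides, however: the network makes legitimate use convenient, but it also makes piracy easier. Managing and protecting digital works has therefore become both an urgent problem for industry and a requirement for the judiciary to adjudicate copyright effectively. Text watermarking is a technique for protecting textual data. Its goal is to keep the information hidden in a text carrier intact within the data, so that the watermark can be extracted and restored whenever needed, providing strong evidence for establishing infringement of a work. Because text documents lack sufficient redundancy in both the spatial and the frequency domain, hiding algorithms used for multimedia are difficult to apply directly to text. Early text watermarks relied on tiny changes in word spacing, line spacing, and fonts that the human eye does not notice; they are severely damaged by retypesetting or optical character recognition (OCR) attacks, which limits their use. Text is a special kind of carrier because it belongs to natural language. Text information hiding based on natural language processing is increasingly the main direction of text watermarking research. Many research institutions have invested substantial manpower and resources, and many results have appeared, but no unified standard fully accepted by all parties has yet emerged, which makes research on text information hiding in natural language processing highly challenging. This thesis takes Chinese natural-language text as its object and studies several methods for hiding information at the lexical level and the sentence level. The main contributions are as follows. First, a text watermarking algorithm based on conjunction substitution is proposed. It exploits two properties: chaotic systems are extremely sensitive to initial values, and replacing a conjunction with a synonym changes the meaning of a text only very slightly. The original watermark is encrypted with a chaotic system, a substitution table of synonymous conjunctions is built, and the conjunctions in the text are replaced according to the encrypted information to embed the watermark and produce the watermarked text. Embedding is flexible, and experiments show that the watermarked text is well concealed and robust against a variety of attacks. Second, a text information hiding method based on frequently used Chinese characters is proposed. It relies on the observation that the frequencies of frequently used characters are broadly stable across different texts, and on the rules governing when the character "的" (de) may appear or be omitted. Following the secret message, "的" is added or deleted to change the parity of the number of characters between frequently used characters, which embeds the watermark. The method applies broadly and is not affected by the type of text; experiments show a large watermark capacity and fairly good concealment. Third, an information hiding method based on sentence-level analysis is proposed. Texts contain many SBV(?)ADV and SBV(?)ADV(?)POB syntactic structures that are closely tied to the head word. Because of that tie, the adverbs in the ADV structures and the prepositions in the POB structures usually carry strong meaning and play a relatively important role in the sentence. Building on this syntactic feature, the method substitutes synonyms for adverbs and for prepositions and combines this with multiple layers of encryption, including chaotic encryption and sequence mapping, to embed information in the text. In experimental comparison with similar algorithms, it offers a relatively large watermark capacity and strong robustness and concealment. Fourth, a zero-watermarking algorithm based on text keywords and frequently used Chinese characters is proposed. It relies on three things: a zero watermark does not change the original text; the frequencies of keywords and frequently used characters in a given text reflect important features of that text; and a third party performs the encryption and adds a timestamp. Together these provide information hiding and copyright protection. Because small modifications to a text can change the frequency-based ordering in practice, pinyin ordering is added to strengthen robustness; experiments show that the improved algorithm resists the various attacks better.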
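The conjunction-substitution method described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The record does not name the chaotic system, so the logistic map, its parameter `r`, the 0.5 threshold, and the XOR step are assumptions. The four-pair table `PAIRS` and the function names are hypothetical, and plain pattern matching stands in for the word segmentation a real system would need.

```python
import re

# Hypothetical substitution table: each pair is (form for bit 0, form for bit 1).
PAIRS = [("但是", "可是"), ("因为", "由于"), ("如果", "假如"), ("并且", "而且")]
BIT_OF = {word: bit for pair in PAIRS for bit, word in enumerate(pair)}
PARTNER = {pair[bit]: pair for pair in PAIRS for bit in (0, 1)}
PATTERN = re.compile("|".join(map(re.escape, BIT_OF)))


def logistic_bits(x0, n, r=3.99, burn_in=100):
    """Keystream from the logistic map x -> r*x*(1-x), thresholded at 0.5 (assumed)."""
    x = x0
    for _ in range(burn_in):          # discard the transient
        x = r * x * (1 - x)
    bits = []
    for _ in range(n):
        x = r * x * (1 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits


def crypt(bits, x0):
    """XOR with the chaotic keystream; the same call encrypts and decrypts."""
    return [b ^ k for b, k in zip(bits, logistic_bits(x0, len(bits)))]


def embed(text, watermark_bits, x0):
    """Replace each conjunction with the pair member selected by the next encrypted bit."""
    if len(PATTERN.findall(text)) < len(watermark_bits):
        raise ValueError("not enough conjunctions to carry the watermark")
    cipher = iter(crypt(watermark_bits, x0))

    def choose(match):
        bit = next(cipher, None)
        if bit is None:               # payload exhausted: leave the word unchanged
            return match.group(0)
        return PARTNER[match.group(0)][bit]

    return PATTERN.sub(choose, text)


def extract(text, n_bits, x0):
    """Read one bit per conjunction, then decrypt with the same initial value."""
    cipher = [BIT_OF[m.group(0)] for m in PATTERN.finditer(text)][:n_bits]
    return crypt(cipher, x0)
```

In this sketch the initial value `x0` acts as the secret key: extraction with a different value yields a different keystream, which is the sensitivity to initial conditions that the summary refers to.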

【Abstract】 Since the beginning of the 21st century, people have gradually become familiar with the network, and Internet use has grown explosively. People can easily exchange a variety of information and access digital content (films, literature, technical information) and online services (online banking, online shopping, and so on) through the Internet, but piracy has also become easier. The management and protection of digital works has therefore become both an urgent problem to solve and a requirement for establishing rights against the illegal use of digital works. As a technology for protecting digital text, text watermarking keeps the information hidden in a text carrier intact and retrievable from the data, so that rights to the text can be confirmed and infringements traced. Because text documents lack redundancy in the spatial and frequency domains, the hiding techniques used for multimedia cannot be applied directly to text watermarking. Early methods of text steganography relied on the fact that the human visual system cannot perceive minute changes in a text's physical format, such as word spacing, line spacing, and character font. Once the text is retypeset or passed through optical character recognition, the watermark is damaged, which limits its applications. Text is a special carrier because it belongs to the category of natural language. Text information hiding based on natural language processing has become the main research direction, and many research institutions have put considerable manpower and resources into it. A large number of results have emerged, but a unified standard accepted by all parties has not yet been formed, so the research remains challenging. This thesis is mainly concerned with watermarking Chinese text and proposes several methods for natural language steganography at the word level and the sentence level. The main contributions are summarized as follows. A method based on exchanging conjunctions is proposed. Because chaotic systems are extremely sensitive to initial values and the semantics of a text change only slightly when synonyms are exchanged, the algorithm encrypts the original watermark with a chaotic system, creates a substitution table of synonymous conjunctions, and then replaces the conjunctions according to the encrypted watermark to hide the information. Experiments show that the watermark has good invisibility and strong robustness to attacks. A method based on frequently used Chinese characters is proposed. Based on the stable frequency of frequently used characters and the rules governing the presence or omission of the character "de" (的), the algorithm follows the hidden information and changes the parity of the number of characters between two frequently used characters by adding or deleting "de", which embeds the watermark. It has a wide range of applications and is not affected by text type. Experimental results show that the watermark has large capacity and fairly good invisibility. A method at the sentence level is proposed. Texts contain many sentence structures such as SBV(?)ADV and SBV(?)ADV(?)POB. The adverbs in ADV structures and the prepositions in POB structures are closely related to the head words of the text, so they carry strong meaning and play a relatively important role. Based on this feature, and combined with chaotic encryption and sequence mapping, the information is embedded in the text by swapping synonyms of adverbs and of prepositions. Compared experimentally with other similar algorithms, the algorithm has large watermark capacity, strong robustness, and good invisibility. A zero-watermarking method based on text keywords and high-frequency Chinese characters is proposed. Because a zero watermark does not change the original text at all, while the keywords and frequently used characters reflect the important features of the text, the algorithm achieves information hiding and copyright protection through encryption and timestamping by a third-party certification authority. To enhance robustness, the algorithm adds phonetic (pinyin) ordering to reduce the effect of ordering changes caused by modifications to the text. The method has a better ability to resist many kinds of attacks.
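The zero-watermarking idea above can be sketched in Python as follows. This is a hedged illustration, not the thesis's algorithm. The character list `FREQUENT`, the `top_k` cut-off, the SHA-256 digest, and the XOR key are assumptions. Keyword extraction is omitted, Unicode code-point order stands in for the pinyin ordering, and the local clock stands in for the timestamp that a third-party authority would issue. All function names are hypothetical.

```python
import hashlib
from collections import Counter
from datetime import datetime, timezone

# Hypothetical list of frequently used characters (not taken from the thesis).
FREQUENT = "的一是不了人我在有他这中大来上国"


def feature(text, top_k=8):
    """Digest of the frequency ranking of frequent characters; the text is not modified."""
    counts = Counter(ch for ch in text if ch in FREQUENT)
    # Ties are broken by code point here; the source describes pinyin order instead.
    ranked = sorted(FREQUENT, key=lambda ch: (-counts[ch], ch))
    return hashlib.sha256("".join(ranked[:top_k]).encode("utf-8")).digest()


def register(text, watermark):
    """Build the record that a trusted third party would store with a timestamp."""
    if len(watermark) > 32:
        raise ValueError("watermark longer than the 32-byte feature digest")
    key = bytes(w ^ f for w, f in zip(watermark, feature(text)))
    return {"key": key.hex(), "timestamp": datetime.now(timezone.utc).isoformat()}


def recover(text, record):
    """Recompute the feature from a suspect text and undo the XOR."""
    key = bytes.fromhex(record["key"])
    return bytes(k ^ f for k, f in zip(key, feature(text)))
```

Because the feature depends only on character counts, reflowing or reformatting the text leaves the recovered watermark intact in this sketch. An edit that changes the ranking of the top characters breaks recovery, which is the kind of fragility the additional ordering step in the abstract is meant to reduce.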


