

Design and Research of Digital Watermarking in Natural Language Documents

【Author】 余振山 (Yu Zhenshan)

【Advisor】 黄刘生 (Huang Liusheng)

【Author information】 University of Science and Technology of China (中国科学技术大学), Information Security, 2009, PhD

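【摘要 (Chinese abstract, translated)】 Natural language is the most important, most accurate, and most efficient means of human communication. With the arrival of the digital age, people encounter large numbers of electronic documents, online news, forums, blogs, and the like every day. Digital natural language text has become the most important carrier at this new level of communication, and protecting its copyright is a pressing problem.
Digital watermarking is an important means of protecting the copyright of digital files. Research on digital watermarking began with multimedia carriers: for images, audio, and video, watermarking algorithms have appeared that exploit the characteristics of human vision or hearing. Because these media are processed in similar ways and have high redundancy, research has steadily deepened. In recent years the reverse direction, attack analysis such as the detection of watermarking algorithms, has also gradually received attention.
Text, by contrast, presents several difficulties: special processing methods, low redundancy, complex natural language rules, and the limits of computational linguistics. Research on text watermarking started late and has produced fewer results. Because text is both common and important, however, a growing number of researchers have entered the field in recent years. Novel watermarking algorithms have appeared, ranging from the formatting class to the syntactic and semantic classes, and detection analysis of text watermarking algorithms has also begun. Overall, though, the field has not yet produced a sufficiently practical scheme, results on detection analysis are extremely rare, and the field as a whole lacks a systematic theoretical foundation.
In view of this, the research work of this dissertation and the corresponding results mainly include:
1. Research on models of digital watermarking in natural language text. A communication model suited to text is established. Following the approach of the foundations of cryptography, the concepts of watermark undetectability, program adversary, human adversary, invisible attack, and robustness are defined. A method for verifying the security of watermarking algorithms with interactive proof systems is constructed and applied to the evaluation of actual watermarking systems.
2. Design of digital watermarks in natural language text. A new text watermarking algorithm, the Song Ci (宋词) watermark, is proposed and implemented. It is an appended, generated-text watermark: the algorithm generates a piece of Song Ci directly from the watermark information, and this piece conforms to a particular tune pattern (词牌) in its number of characters, number of lines, sentence forms, tonal pattern, and rhyme, which makes it highly deceptive. The generated Ci is appended to the carrier text. At verification time the Ci is extracted and the watermark information is recovered by looking it up in a lexicon. Because the generated Ci is highly deceptive, the watermark is well concealed. Experimental results show that the size ratio of watermark information to generated text reaches 16%, so the method can also serve as a high-embedding-rate text steganography algorithm. To the best of our knowledge, this is the first text watermarking algorithm that exploits a special literary genre.
3. Research on detecting digital watermarks in natural language text. For the formatting-class Snow watermark, a detection algorithm is designed and an approach to detecting formatting-class watermarking algorithms in general is pointed out. For semantic-class watermarks based on synonym substitution, a detection algorithm that uses context information is designed: by considering whether a keyword is the word in its synonym set that best fits the context, it judges whether information is embedded at that point, and the results over all keywords in the article determine whether the text is judged to carry a watermark. When comparing how well words from the same synonym set fit the same context, we use an IDF coefficient to adjust for the gap between common and rare words. Experiments show that the detection algorithm achieves 90.0% accuracy, 86.6% precision, and 82.5% recall on the T-Lex synonym watermarking system. We also design a detection method for translation-based watermarking systems.
4. The idea of treating the entire Internet as a corpus. If every web page containing natural language text is viewed as a document in the corpus, the entire Internet can be viewed as a very large-scale, influence-ordered, continuously updated corpus. Together with tools such as search engines, one can extract from it information such as natural language usage habits, which traditional corpora cannot provide effectively because of limited scale, excessive cost, and similar reasons.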

【Abstract】 Natural language is the primary, most exact, and most efficient means of human communication. With the development of digital technology, people encounter large numbers of electronic documents, online news, forums, blogs, and so on. Digital natural language documents have become the most important medium on the Internet, and protecting their copyright is an urgent problem.
Digital watermarking is an important way to protect the copyright of digital files. Research in this area first developed for multimedia. Exploiting the limitations of the human visual and auditory systems, researchers have designed watermarking algorithms for images, audio, and video. Because these multimedia carriers are processed in similar ways and have ample redundancy, research on watermark design has developed rapidly, and steganalysis of these schemes has also received growing attention.
By contrast, owing to special processing methods, low redundancy, the complexity of natural language rules, and the limitations of computational linguistics, research on watermarking in digital text started late and has achieved less. However, because text is common and important in daily life, more and more researchers have entered this area in recent years. New watermarking algorithms have emerged, ranging from formatting-based to syntactic and semantic methods, and steganalysis of text watermarking has begun. Overall, though, practical schemes for digital watermarking in natural language text have not yet been designed, steganalysis results are still rare, and a theoretical basis remains to be established. With this in mind, the main research work and corresponding contributions of this dissertation are as follows:
1. A model for digital watermarking in natural language text. We establish a communication model specifically for text and use the methodology of the foundations of cryptography to define the concepts of undetectability, program adversary, human adversary, invisible attack, and robustness. We also give an approach to proving the security of watermarking algorithms with interactive proof systems, and we use these tools to evaluate some actual watermarking systems.
2. Design of watermarking schemes for digital natural language text. We propose and implement a new digital text watermarking system, StegCi. It is an appending watermarking scheme: the encoding algorithm produces a piece of Ci from the watermark. The generated Ci conforms to a particular tune in its number of lines and words, sentence patterns, rhythm, and rhyme, so it looks innocuous. The stego Ci is then added to the carrier text. During verification, the watermark is extracted from the stego Ci by looking it up in a lexicon. Because stego Ci look innocuous, the watermark is difficult to detect. Experimental results show that the size ratio of watermark to generated text reaches 16%, which means StegCi can also serve as a high-embedding-ratio text steganography system. To the best of our knowledge, this is the first text watermarking scheme that makes use of a special literary genre.
3. Detection of watermarking schemes for digital natural language text. For the Snow algorithm, which belongs to the class of formatting methods, we design a detection algorithm and point out a general way to steganalyze formatting schemes. For synonym-substitution-based schemes, which are semantic methods, we design a detection algorithm that uses context information. By checking whether a keyword is the most suitable word for its context within its synonym set, we judge whether that keyword carries a watermark bit; the results over the whole text lead to the final judgement of whether the text is watermarked. When comparing words in a synonym set against the same context, we use IDF to balance common and rare words. Experimental results on the T-Lex watermarking system show 90.0% accuracy, 86.6% precision, and 82.5% recall. We also design a detection algorithm for translation-based watermarking systems.
4. The idea of treating the whole Internet as a corpus. If each web page containing natural language text is treated as a document in this corpus, the whole Internet can be regarded as a large-scale, influence-weighted, up-to-date corpus. With the help of tools such as search engines, people can obtain useful information about natural language usage that is very difficult to get from traditional corpora because of their limited size or unaffordable cost.
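
The lexicon-based encoding and extraction described for StegCi in contribution 2 can be pictured with a toy sketch. This is not the dissertation's algorithm: the slot template, the English word lists, and the names `SLOTS`, `encode`, and `decode` are all hypothetical, and a real Ci tune pattern also constrains tone and rhyme, which the sketch ignores. It only shows the general idea that watermark bits choose words from a lexicon and that the same lexicon recovers the bits.

```python
from typing import List

# Hypothetical lexicon: one candidate list per slot of a made-up template.
# Each list holds 2**k words, so choosing one word encodes k bits.
SLOTS: List[List[str]] = [
    ["moon", "wind", "rain", "snow"],        # 2 bits
    ["falls", "drifts"],                     # 1 bit
    ["softly", "slowly", "coldly", "far"],   # 2 bits
    ["tonight", "again"],                    # 1 bit
]


def slot_bits(words: List[str]) -> int:
    """Bits carried by one slot (list sizes are powers of two here)."""
    return len(words).bit_length() - 1


CAPACITY = sum(slot_bits(words) for words in SLOTS)  # 6 bits for this template


def encode(bits: str) -> List[str]:
    """Generate one line of text whose word choices carry the bit string."""
    if len(bits) > CAPACITY or set(bits) - {"0", "1"}:
        raise ValueError("expected at most %d characters of 0/1" % CAPACITY)
    padded = bits.ljust(CAPACITY, "0")  # pad short payloads with zeros
    line, pos = [], 0
    for words in SLOTS:
        k = slot_bits(words)
        line.append(words[int(padded[pos:pos + k], 2)])
        pos += k
    return line


def decode(line: List[str]) -> str:
    """Recover the bits by looking each word up in its slot's word list."""
    return "".join(
        format(words.index(word), "0%db" % slot_bits(words))
        for words, word in zip(SLOTS, line)
    )
```

With this template, `encode("101101")` yields the words `rain drifts coldly again`, and `decode` on that line returns `101101`. The payload per line is fixed by the template, which is why the ratio of watermark size to generated text size is a natural measure of such a scheme.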

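The context-based detection of synonym substitution in contribution 3 can likewise be sketched in miniature. The abstract does not give the scoring function, so everything below is an assumption made for illustration: the toy corpus, the single synonym set, the document co-occurrence count used as a context-fit score, the smoothed IDF formula, and the 0.5 decision threshold. The sketch flags a keyword when some other member of its synonym set fits the surrounding words better, then judges the text from the share of flagged keywords.

```python
import math
from typing import List, Set

# Hypothetical reference corpus; each document is a list of tokens.
CORPUS: List[List[str]] = [
    "the big house stood on the hill".split(),
    "a big dog ran past the house".split(),
    "the large amount was paid in full".split(),
    "a big tree fell near the house".split(),
]
# Hypothetical synonym sets used by the watermarking scheme under test.
SYNSETS: List[Set[str]] = [{"big", "large"}]


def idf(word: str) -> float:
    """Smoothed inverse document frequency: rare words get a larger weight."""
    df = sum(word in doc for doc in CORPUS)
    return math.log((1 + len(CORPUS)) / (1 + df))


def fitness(word: str, context: List[str]) -> float:
    """Context fit: co-occurrence with the context words, scaled by IDF so
    that a common synonym does not win on frequency alone."""
    cooccur = sum(
        1 for c in context for doc in CORPUS if word in doc and c in doc
    )
    return idf(word) * cooccur


def suspicious_ratio(tokens: List[str], window: int = 2) -> float:
    """Share of keywords that are not the best-fitting member of their set."""
    checked = flagged = 0
    for i, tok in enumerate(tokens):
        synset = next((s for s in SYNSETS if tok in s), None)
        if synset is None:
            continue  # not a keyword, cannot carry a bit
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        scores = {w: fitness(w, context) for w in synset}
        checked += 1
        if scores[tok] < max(scores.values()):
            flagged += 1
    return flagged / checked if checked else 0.0


def looks_watermarked(tokens: List[str], threshold: float = 0.5) -> bool:
    """Text-level decision from the keyword-level results."""
    return suspicious_ratio(tokens) > threshold
```

On this toy corpus, "the large house" is flagged because "big" co-occurs with "house" far more often than "large" does, even after the IDF weight favours the rarer word, while "the big house" is not flagged. A practical detector would need a far larger corpus and a threshold tuned on labelled texts.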