数字图书馆敏感数据匿名发布若干关键技术研究

Research on Publishing Sensitive Data Based on Anonymity in Digital Libraries

【Author】 Luo Yongcheng (骆永成)

【Supervisor】 Le Jiajin (乐嘉锦)

【Author Information】 Donghua University, Control Theory and Control Engineering, 2011, Ph.D.

【摘要】 随着信息技术的不断发展,数字图书馆的资源日益丰富和各项服务不断创新,用户隐私问题也日益突出。面向各种应用的数据共享和分析服务的数据匿名发布技术一方面具有较好的适用性、通用性和实用性等优势,另一方面又能够充分尊重用户的隐私,有利于数字图书馆应用数据的充分利用和信息共享,从而促进图书馆开展各项服务工作。然而,数字图书馆的应用数据有一定的具体领域特征,隐私保护诉求和数据形式存在多样性。本文通过对现有各种匿名模型及匿名化技术的研究和分析后,指出目前通常的数据匿名发布技术不足以解决数字图书馆敏感数据发布多种场景下的隐私保护问题。因而,本文对数字图书馆敏感数据匿名发布的若干关键技术进行了一些研究,论文的主要工作如下:(1)面向应用的敏感数据匿名发布框架的研究针对当前敏感数据隐私保护中所面临的种种挑战,创新地提出了一种适应应用需求的数据发布体系结构框架方案——基于领域知识面向应用的敏感数据匿名发布框架,并对框架模块进行了初步介绍,同时还给出了一个个性化自适应的隐私保护数据发布算法。该框架尝试使用自适应的机制,不但能满足不同的数据应用需求而且又能满足数据所有者不同的隐私保护需求。在自适应数据发布算法中,联合采用了准标识属性QI泛化和敏感属性SA泛化以获得符合匿名发布原则的匿名数据表,从而在满足隐私保护需求的同时减少了发布数据的信息损失,即尽可能地提高了发布数据的信息精度。(2)基于泛化的个性化匿名数据发布技术的研究本文结合匿名模型的最新发展,提出了一个可以应用于数字图书馆敏感数据发布的个性化敏感数据发布模型——(P,α,k)-匿名模型和基于泛化技术的数据匿名化实现算法,从面向个体和敏感属性值角度出发,充分考虑了图书馆特殊用户隐私保护诉求和大众用户的普遍性隐私保护需求。文中首先介绍了相关工作并在分析现有个性化匿名原则的基础上对个性化隐私约束参数进行了建模,并提出了(P,α,k)-匿名模型;接着提出了一个基于泛化技术的启发式TopDown—LA算法,并介绍了该算法应用的局部重编码和特化处理技术,保证了算法获取最小k-泛化,最大限度地提高匿名化表精度,而后还分析了算法复杂性和正确性。最后通过真实数据实验,验证了这种启发式的个性化匿名算法可行性。该算法能充分满足个性化隐私保护需求进行匿名发布数据,相比Basic Incognito和Mondrian算法信息损失少,算法性能良好。(3)用户身份保留的匿名数据发布技术的研究本文提出了三种具体的身份保留匿名化原则,并重点介绍了基于聚类的匿名发布和有损分解IDAnatomy两种数据发布方法的实现。数字图书馆应用数据的分析在绝大多数情况下不仅需要发布的数据保留用户身份,而且还需要考虑用户的个体隐私保护需求。针对此种情况,本文首先考虑数字图书馆领域应用数据通常存在单一个体对应多条记录的情况,特别分析了此情况下用户敏感数据的侵犯情况,并提出了三种具体的身份保留匿名化原则。接着介绍了应用加权层次距离信息损失评估方式实现数据匿名的基于聚类的(P,α,β)-clustering算法,并分析了算法复杂度;另外还介绍了有损分解IDAnatomy数据发布方法,其通过将原始关系的准标识符属性和敏感属性以两个不同的关系发布,利用它们之间的有损连接来保护隐私数据的安全,并且给出了基本的IDAnatomy算法保证发布的数据满足隐私保护和实用性要求。最后在实验环境中从多个方面比较了原有匿名方法和身份保留的匿名化方法,检验了方法的有效性。(4)敏感数据图发布相关技术的研究本文主要提出了一种新的图聚类安全分组策略和两种不同实现策略的匿名数据发布算法。文中首先分析了数字图书馆复杂个体交互关系数据发布的隐私保护问题,同时根据背景知识对图攻击问题进行了增量式知识查询建模和量化。接着在建立二分图图模型和相关定义的基础上,初步对图的数据匿名集成和数据匿名化问题进行了探讨,同时介绍了简单匿名化、列举和划分等二分图基本数据匿名发布方法。而后结合最新研究成果,提出了一种新的图聚类安全分组策略来提高二分图发布数据的可用性,并从实现策略上比较了先聚后分的CKG算法和边分边聚的KGC算法,其间还重点分析了两个关键问题——图泛化信息损失和聚类分组超顶点的描述。最后通过实验表明,基于聚类安全分组策略匿名方法能为图中的个体提供隐私保护的同时还能在一定程度上提高匿名图数据的可用性。本文研究了数字图书馆领域几个常见应用场景下的数据发布若干关键技术,给出了一些可行解决方案,并且对提出的各种算法不仅都作了详细的性能分析,而且使用数字图书馆运行的实际数据集或综合数据集对算法进行了详细实验。经实验和性能分析都表明:本文提出的算法与相关算法相比具有很好的性能和较好的适应能力。

【Abstract】 With the continuous development of information technology, the resources of digital libraries are growing richer and their services keep innovating, while the problem of user privacy becomes increasingly prominent. Anonymous data publishing techniques that support data sharing and data analysis across applications offer good applicability, generality, and practicality. They also respect user privacy, which helps digital libraries make full use of application data, share information, and thereby improve their services. However, application data in digital libraries has domain-specific characteristics, and both the privacy protection requirements and the forms of the data are diverse. After studying and analyzing existing anonymity models and anonymization techniques, the thesis points out that common anonymous data publishing techniques are insufficient to solve the privacy protection problems that arise when sensitive digital library data is published in multiple scenarios. The thesis therefore studies several key techniques for the anonymous publishing of sensitive data in digital libraries. The main work is as follows:

(1) Research on an application-oriented framework for anonymous publishing of sensitive data

To address the challenges currently faced in protecting sensitive data, the thesis proposes a data publishing architecture that adapts to application requirements: an application-oriented framework for anonymous publishing of sensitive data based on domain knowledge. The modules of the framework are introduced briefly, and a personalized, adaptive privacy-preserving data publishing algorithm is given. The framework uses an adaptive mechanism to meet the needs of different data applications as well as the different privacy protection requirements of data owners.
In the adaptive data publishing algorithm, generalization of the quasi-identifier (QI) attributes is combined with generalization of the sensitive attributes (SA) to obtain an anonymized table that satisfies the anonymity principle. This meets the privacy protection requirements while reducing information loss, that is, it keeps the published data as precise as possible.

(2) Research on personalized anonymous data publishing based on generalization

Building on recent developments in anonymity models, the thesis proposes a personalized model for publishing sensitive data in digital libraries, the (P,α,k)-anonymity model, together with a generalization-based anonymization algorithm. Working from the perspective of individuals and of sensitive attribute values, the model takes into account both the privacy demands of special library users and the general privacy needs of ordinary users. The thesis first reviews related work, analyzes existing personalized anonymity principles, models the personalized privacy constraint parameters, and proposes the (P,α,k)-anonymity model. It then proposes TopDown-LA, a heuristic algorithm based on generalization, and explains the local recoding and specialization techniques it uses, which ensure that the algorithm obtains a minimal k-generalization and maximizes the precision of the anonymized table. The complexity and correctness of the algorithm are also analyzed. Finally, experiments on real data verify the feasibility of this heuristic personalized anonymization algorithm: it publishes anonymized data that meets personalized privacy protection requirements, incurs less information loss than Basic Incognito and Mondrian, and performs well.
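The abstract does not define the (P,α,k)-anonymity model formally. As a rough illustration only, the following Python sketch checks the simpler, well-known (α,k)-anonymity condition on a table before and after generalizing one quasi-identifier: every group of records sharing the same QI values must contain at least k records, and no sensitive value may account for more than a fraction α of a group. The toy records, the `generalize_age` helper, and the parameter values are hypothetical; the personalized constraint P and the TopDown-LA algorithm itself are not modeled here.

```python
from collections import Counter, defaultdict


def generalize_age(age, width):
    """Map an exact age to a half-open range of the given width, e.g. 23 -> (20, 30)."""
    low = (age // width) * width
    return (low, low + width)


def satisfies_alpha_k(records, qi_of, sensitive_of, alpha, k):
    """Return True if every QI group has at least k records and no sensitive
    value makes up more than a fraction alpha of any group."""
    groups = defaultdict(list)
    for rec in records:
        groups[qi_of(rec)].append(sensitive_of(rec))
    for values in groups.values():
        if len(values) < k:
            return False
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) > alpha:
            return False
    return True


# Hypothetical toy loan records: (age, borrowed subject).
records = [
    (21, "medicine"), (24, "history"), (27, "law"),
    (33, "medicine"), (35, "fiction"), (38, "history"),
]

# With exact ages every record forms its own group, so the check fails.
exact = satisfies_alpha_k(
    records, lambda r: r[0], lambda r: r[1], alpha=0.5, k=3
)

# Generalizing ages to 10-year ranges yields two groups of three records.
generalized = satisfies_alpha_k(
    records, lambda r: generalize_age(r[0], 10), lambda r: r[1], alpha=0.5, k=3
)
```

In this toy table the exact ages fail the check while the 10-year ranges pass it, which is the basic trade-off that generalization makes between precision and privacy.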
(3) Research on identity-reserved anonymous data publishing

The thesis proposes three specific identity-reserved anonymity principles and focuses on the implementation of two data publishing methods: clustering-based anonymization and the lossy decomposition IDAnatomy. In most cases, analysis of digital library application data requires that the published data retain user identity while still respecting each user's individual privacy requirements. The thesis first considers the common situation in which a single individual corresponds to multiple records, analyzes how sensitive user data can be compromised in that situation, and proposes the three identity-reserved anonymity principles. It then describes the clustering-based (P,α,β)-clustering algorithm, which anonymizes data using a weighted hierarchical distance measure of information loss, and analyzes its complexity. It also introduces the lossy decomposition method IDAnatomy, which publishes the quasi-identifier attributes and the sensitive attributes of the original relation as two separate relations and relies on the lossy join between them to protect private data. A basic IDAnatomy algorithm is given that ensures the published data meets both privacy and utility requirements. Finally, experiments compare the original anonymization methods with the identity-reserved methods from several aspects and verify the effectiveness of the proposed methods.

(4) Research on publishing sensitive graph data

The thesis proposes a new clustering-based safe grouping strategy for graphs and two anonymous data publishing algorithms that implement it in different ways. First, it analyzes the privacy protection problems in publishing data about complex interactions between individuals in digital libraries, and it models and quantifies graph attacks as incremental knowledge queries based on background knowledge.
Second, after establishing a bipartite graph model and related definitions, it gives a preliminary discussion of anonymized data integration and data anonymization for graphs, and introduces basic methods for publishing anonymized bipartite graph data, such as naive anonymization, enumeration, and partitioning. Then, drawing on recent research results, it proposes a new clustering-based safe grouping strategy to improve the utility of published bipartite graph data, and compares two implementation strategies: the CKG algorithm, which clusters first and then groups, and the KGC algorithm, which clusters while grouping. Along the way it analyzes two key issues: the information loss of graph generalization and the description of the super-nodes formed by clustering and grouping. Finally, experiments show that anonymization based on the clustering-based safe grouping strategy protects the privacy of individuals in the graph and also improves the utility of the anonymized graph data to some extent.

The thesis studies several key data publishing techniques for common application scenarios in digital libraries and gives feasible solutions. Each proposed algorithm is analyzed in detail for performance and tested on real data sets from operating digital libraries or on integrated data sets. Both the experiments and the performance analysis show that, compared with related algorithms, the proposed algorithms perform well and adapt better.
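The lossy-join idea behind IDAnatomy, described under (3), can be illustrated with a minimal Python sketch. It follows the general Anatomy-style decomposition (a quasi-identifier table and a sensitive table linked only by a group id) and is not the thesis's IDAnatomy algorithm, whose details the abstract does not give. The grouping is supplied by hand, and the record layout and the names `decompose` and `lossy_join` are assumptions made for illustration.

```python
from collections import Counter


def decompose(groups):
    """Split grouped records of the form (qi_tuple, sensitive_value) into
    two tables that are linked only by a group id."""
    qi_table = []         # rows: (qi_tuple, group_id)
    sensitive_table = []  # rows: (group_id, sensitive_value, count)
    for gid, group in enumerate(groups, start=1):
        for qi, _ in group:
            qi_table.append((qi, gid))
        counts = Counter(sensitive for _, sensitive in group)
        for value, count in sorted(counts.items()):
            sensitive_table.append((gid, value, count))
    return qi_table, sensitive_table


def lossy_join(qi_table, sensitive_table):
    """Join the two tables on group id. Each QI row pairs with every
    sensitive value in its group, so the true pairing is hidden."""
    return [
        (qi, value)
        for qi, gid in qi_table
        for g, value, _ in sensitive_table
        if g == gid
    ]


# Hypothetical, hand-made groups of (QI tuple, borrowed subject) records.
groups = [
    [((21, "student"), "medicine"), ((24, "student"), "law")],
    [((33, "staff"), "history"), ((35, "staff"), "fiction")],
]

qi_table, sensitive_table = decompose(groups)
joined = lossy_join(qi_table, sensitive_table)
```

Joining the two published tables returns the true pairs mixed with spurious ones, so in this toy example an observer can link each reader to a given subject with probability at most 1/2, while the QI values themselves are published without generalization.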

  • 【Online Publication Contributor】 Donghua University
  • 【Online Publication Issue】 2012, Issue 06