
Research on the Human Protein Co-evolution Network and Construction of an Interactive Transcriptome Annotation System

【Author】 Zhao Dongyu (赵东宇)

【Advisor】 Yu Jun (于军)

【Author information】 Beijing Institute of Genomics, Chinese Academy of Sciences; Bioinformatics; 2013; PhD

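【Abstract (translated from the Chinese)】 With the successful completion of the Human Genome Project, omics research such as genomics, transcriptomics, and proteomics has entered a period of rapid development. Innovation and progress in DNA sequencing technology have led to explosive growth of biological data. A central question in current bioinformatics research is how to store, organize, and mine these omics data sensibly and use them efficiently. This thesis addresses two specific problems, one in proteomics and one in transcriptomics, using bioinformatics data-mining methods and data-management models.

The protein co-evolution network is an important direction in proteomics research and an important method for revealing protein-protein interactions. Current approaches to studying protein interactions fall into two broad classes, experimental methods and bioinformatics methods. Compared with experiments, bioinformatics methods save time and are better suited to in-depth mining of omics data. In recent years, whole-genome sequencing has been completed for many species, which provides the foundation for studying the human protein co-evolution network. On this basis, this project constructs a human protein co-evolution network. Using the evolutionary distances between genome-wide homologous genes in eukaryotes and the mirror tree method, and taking 18,283 homologous protein families from 18 eukaryotic species in the NCBI HomoloGene database as the study objects, we built between-species distance matrices for the protein families and computed the Pearson correlation coefficient and the vector count for every pair of protein families, obtaining the human protein co-evolution network. Finally, we validated the network with protein complex data, protein interaction data from the DIP and HPRD databases, and metabolic and regulatory network data. The results show that the co-evolution network can be used to reveal interactions between proteins. We further analyzed why the correlation coefficients of the co-evolution model are overly clustered by comparing evolutionary distances across species sets of different breadth. We concluded that the main causes are that few eukaryotes currently have genome-wide homology annotation, that the spread of evolutionary distances among those species is small, and that few species distant from human are available. As more species are sequenced, research on eukaryotic protein co-evolution networks is expected to improve.

As systematic studies of the proteome and of gene function progress, demand for transcriptome information keeps growing. In particular, for studying gene regulation and function under different cellular physiological and pathological states, the association between transcripts and the distribution and function of the proteins they encode is especially important. How to organize, summarize, annotate, store, and make good use of these transcriptome data is a major goal of our research. In recent years, comprehensive transcriptome databases have collected and stored transcriptome data from various sequencing technologies and are widely used. However, such databases fall short when transcriptome data require interactive annotation and deep mining. We therefore built a dedicated interactive annotation system for the human transcriptome. The system uses a directed graph of human body structure as its organizing framework, and stores and traverses that graph using adjacency-list storage and a depth-first root-path traversal algorithm. The root path found for a cell or tissue makes data easy to locate and retrieve. Most importantly, the system is built on a Web 2.0 interactive platform and has ample room for extension. Because EST sequencing was relatively mature when this work was carried out, with broad data coverage and heavy use, we chose ESTs as the system's primary data source. Using EST library information, we assigned ESTs to the corresponding cells or tissues according to their expression in healthy and pathological human cells. In addition, we mined human housekeeping genes, tissue-specific genes, gene expression information by chromosome, and GO functional classifications of genes, and combined the results of these analyses to supplement the data in the human transcriptome annotation system. The system is based on the MediaWiki engine and offers interactive services: users can search, browse, and download data, and can also upload and annotate, which allows the data to be updated in real time and lets every user act as an administrator, so that the system runs efficiently and in an orderly way. The latest database statistics show a high registration rate and high traffic over a short period, indicating that the human transcriptome annotation system is highly practical.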
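The mirror tree comparison described in the abstract can be illustrated with a minimal Python sketch. It assumes that each protein family is represented by its pairwise between-species evolutionary distances, and it reads the "vector count" as the number of species pairs shared by the two families; both are assumptions, since the abstract does not define them. The function names (`pearson`, `mirror_tree_score`) and all distance values are hypothetical and are not taken from the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / sqrt(var_x * var_y)


def mirror_tree_score(dist_a, dist_b):
    """Compare two protein families by their between-species distances.

    dist_a, dist_b: dicts mapping a sorted (species1, species2) tuple to the
    evolutionary distance between the family's members in those species.
    Returns (r, n): the Pearson correlation over the species pairs present in
    both families, and the number of such pairs (assumed "vector count").
    """
    shared = sorted(set(dist_a) & set(dist_b))
    if len(shared) < 3:
        raise ValueError("need at least three shared species pairs")
    x = [dist_a[pair] for pair in shared]
    y = [dist_b[pair] for pair in shared]
    return pearson(x, y), len(shared)


# Hypothetical toy distances for one family across four species.
FAMILY_A = {
    ("chicken", "human"): 0.30,
    ("chicken", "mouse"): 0.32,
    ("chicken", "zebrafish"): 0.45,
    ("human", "mouse"): 0.10,
    ("human", "zebrafish"): 0.50,
    ("mouse", "zebrafish"): 0.52,
}
# A second family whose distances scale linearly with FAMILY_A (co-evolving).
FAMILY_B = {pair: 2.0 * d + 0.1 for pair, d in FAMILY_A.items()}
# A third family with no zebrafish member, so only three pairs are shared.
FAMILY_C = {
    ("chicken", "human"): 0.20,
    ("chicken", "mouse"): 0.25,
    ("human", "mouse"): 0.40,
}

r_ab, n_ab = mirror_tree_score(FAMILY_A, FAMILY_B)
r_ac, n_ac = mirror_tree_score(FAMILY_A, FAMILY_C)
print(f"A vs B: r = {r_ab:.3f} over {n_ab} species pairs")
print(f"A vs C: r = {r_ac:.3f} over {n_ac} species pairs")
```

Reporting the number of shared pairs alongside the coefficient matters because a correlation computed over very few species pairs is weak evidence; with only a handful of species, coefficients for unrelated families can also cluster near one another, which is consistent with the concentration effect the abstract discusses.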

【Abstract】 With the successful completion of the Human Genome Project, research in genomics, transcriptomics, and proteomics has entered a period of rapid development. Meanwhile, innovation and progress in DNA sequencing technology have led to explosive growth of bioinformatics data. An important question in bioinformatics is how to store, manage, and process these omics data rationally. This study focuses on two specific topics in proteomics and transcriptomics: a protein co-evolution model and an interactive annotation system for the transcriptome.

The protein co-evolution network is an important method for revealing protein-protein interactions (PPIs). Currently, methods for investigating PPIs fall into two categories: biological experimental methods and bioinformatics methods. Compared with biological experiments, bioinformatics methods are more efficient and better suited to mining genomic data. In recent years, more whole genomes have been published, which facilitates the study of the human protein co-evolution network. On this basis, we constructed a human protein co-evolution model starting from 18,283 homologous protein families of 18 eukaryotic species in the NCBI HomoloGene database. We computed the evolutionary distances between genome-wide homologous genes in eukaryotes, built between-species distance matrices for the protein families using the mirror tree method, and calculated the Pearson correlation coefficient and the vector number for each pair of protein families. Finally, we assessed the validity of the protein co-evolution model with data from human protein complexes, PPIs from the DIP and HPRD databases, and proteins from human metabolic networks. The results show that the protein co-evolution network model can be used to reveal interactions between proteins. We further analyzed why the correlation coefficients are so tightly concentrated in the protein co-evolution model, comparing evolutionary distances across species sets of different evolutionary width. We found that few eukaryotic species have homologous gene annotations and that few of those species are evolutionarily distant from human, which may cause the correlation coefficients to cluster too tightly. We believe that more completed whole genomes will improve research on eukaryotic protein co-evolution networks.

With the rapid development of proteomics and the systematic study of gene function, the demand for transcriptome information is constantly increasing. The relationship between transcripts and the distribution and function of their encoded proteins is especially important for studying gene regulation and function under different physiological and pathological states. The main problem is how to collate, store, process, and annotate these transcriptome data. In recent years, well-known comprehensive transcriptome databases have collated and stored various types of transcriptomic data, but these databases cannot support interactive annotation and deep data mining. Therefore, we built Wikicell, an interactive annotation system for the human transcriptome, organized around a graph of human body structure. Wikicell stores the body structure graph as an adjacency list and traverses it by depth-first search, which makes searching and accessing data convenient. The whole system is built on a Web 2.0 interactive platform, which leaves ample room for expansion.

The major data source is EST data, because EST sequencing technology was relatively mature and EST data have wide coverage and are heavily used. We classified EST data into the appropriate cells or tissues according to their library information. We also supply tables of housekeeping and tissue-specific genes, a gene expression table organized by chromosome, and GO functional classification tables. The system is built on the MediaWiki engine and provides interactive services: users can not only search, browse, and download data, but also upload data, comment, and perform other operations that facilitate real-time updates. Every user acts as an administrator, which keeps the system running efficiently and in an orderly manner. The latest database statistics show a high registration rate and a high number of page views, which suggests that the human transcriptome annotation system is highly practical.
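The adjacency-list storage and depth-first root-path search mentioned in the abstract can be sketched as follows. This is an illustrative Python example under stated assumptions, not Wikicell's actual implementation: the graph contents, the node names, and the function name `root_paths` are hypothetical, and the abstract does not describe how the real system stores or queries its body structure graph.

```python
# Hypothetical body structure graph as an adjacency list:
# each key is a structure, each value lists its direct substructures.
BODY_GRAPH = {
    "Human body": ["Digestive system", "Endocrine system"],
    "Digestive system": ["Liver", "Pancreas"],
    "Endocrine system": ["Pancreas"],
    "Liver": ["Hepatocyte"],
    "Pancreas": ["Islet beta cell"],
}


def root_paths(graph, root, target):
    """Return every path from root to target, found by depth-first search.

    A directed graph (unlike a tree) can reach the same node through several
    parents, so a cell or tissue may have more than one root path.
    """
    paths = []
    stack = [(root, [root])]  # (current node, path taken to reach it)
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        # Push children in reverse so they are visited in listed order.
        for child in reversed(graph.get(node, [])):
            if child not in path:  # guard against cycles
                stack.append((child, path + [child]))
    return paths


for path in root_paths(BODY_GRAPH, "Human body", "Islet beta cell"):
    print(" > ".join(path))
```

In this toy graph the pancreas is listed under both the digestive and the endocrine system, so the search returns two root paths for the same cell type. An adjacency list keeps that kind of shared membership cheap to store, because each edge is recorded once under its parent.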
