高通量基因组数据的处理、分析与建模

Processing, Analysis and Modeling on High-Throughput Genomic Data

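【Author】 王丛茂 (Wang Congmao)

【Supervisors】 张大兵 (Zhang Dabing); 王秀杰 (Wang Xiujie)

【Author information】 Shanghai Jiao Tong University, Biochemistry and Molecular Biology, 2012, PhD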

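【Abstract (translated from the Chinese)】 With the continuing development of high-throughput sequencing technology, biological data are accumulating rapidly, and mining valuable knowledge and patterns from high-throughput experimental data has become one of the central topics in bioinformatics and computational biology. This thesis presents a series of studies on methods for processing and analyzing high-throughput genomic data, with the following results.

1. With the development of second-generation DNA sequencing technology, reference genome sequences for more and more species, as well as genome sequences of individual organisms, have been determined. Storing and managing the enormous volume of individual genome data has, however, become a major challenge for biologists. This thesis proposes a novel compression tool, GRS (Genome ReSequencing), for storing and analyzing genome resequencing data for which a reference genome sequence is available. Unlike earlier methods, GRS can process genome sequence data without a reference single nucleotide polymorphism map or other variation maps, and it automatically rebuilds the individual genome sequence from the reference genome sequence. In a test on the first Korean personal genome sequence, GRS achieved roughly 159-fold compression, reducing the data from 2986.8 MB to 18.8 MB. In tests on rice and Arabidopsis sequencing data, the rice genome data were compressed from 361.0 MB to 4.4 MB and the Arabidopsis genome data from 115.1 MB to 6.5 KB. The tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.

2. Chromatin immunoprecipitation followed by massively parallel high-throughput sequencing (ChIP-Seq) is an important means of studying interactions between proteins and genomic DNA. This thesis designs a new algorithm for analyzing Illumina paired-end ChIP-Seq data and develops the corresponding analysis tool, SIPeS (Site Identification from Paired-end Sequencing). We obtained ChIP-Seq data for the Arabidopsis transcription factor AMS (a gene involved in Arabidopsis pollen development); compared with the existing analysis methods CisGenome and MACS, SIPeS identified binding sites at higher resolution. Using the paired-end data, SIPeS can accurately calculate the effective genome length (mappable genome length), and its dynamic baseline method effectively separates closely adjacent binding sites, which is particularly useful for gene-dense genomes such as Arabidopsis. The tool is available at http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS; the current version is 2.0.

3. Protein interactions are involved in every aspect of an organism's life processes. More than 10 public Arabidopsis protein interaction databases currently exist, but they have shortcomings: interaction evidence types that follow no unified standard, no unified protein or gene identifiers, and other information without standard definitions. To integrate data from the different interaction databases effectively and make the fullest use of them, this thesis presents an interactive bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline). ANAP was developed for the integration of Arabidopsis protein interaction data and the study of interaction networks, and it makes protein interaction network analysis convenient. ANAP integrates 11 Arabidopsis protein interaction databases, comprising 201,699 unique protein interaction pairs, 15,208 identifiers (including 11,931 TAIR AGI codes), 89 interaction detection methods, 73 species involved in Arabidopsis protein interactions, and 6,161 references. ANAP can serve as a knowledge base for building protein interaction networks and, based on user input, supports analysis of both direct and indirect protein interactions. It has an intuitive graphical interface that eases network visualization and provides detailed evidence for each interaction. In addition, through links to the TAIR database, ANAP makes it easy to browse the functional annotation of genes or proteins in the generated network, and it links to the corresponding entries in the AtGenExpress Visualization Tool (AVT), the Arabidopsis 1001 Genomes GBrowse (1001 Genomes), the Protein Knowledgebase (UniProtKB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Ensembl Genome Browser (EnsemblGenomes) to support interaction network analysis. The tool is available at http://gmdd.shgmo.org/Computational-Biology/ANAP/ANAP_V1.0.

4. Safety assessment of genetically modified (GM) crops is a key step between research and commercialization. Molecular characterization is the most basic and important part of safety assessment and includes evaluating the insertion sites of foreign DNA, the flanking sequences, and the insertion copy number. Compared with routinely used detection methods such as Southern blotting, polymerase chain reaction, in situ hybridization, and genome walking, establishing new high-throughput methods for the molecular characterization of GM crops is both beneficial and necessary. Here, building on paired-end sequencing, we developed an accurate high-throughput method for assessing the molecular characteristics of transgenic rice at the whole-genome level. For the transgenic rice T1C-19, the method clearly revealed foreign insertion sites on chromosomes 4 and 11, a result that was well validated by conventional PCR and Sanger sequencing.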
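The compression figures quoted for GRS can be checked with simple arithmetic. The sketch below recomputes the fold reduction for each data set. The helper name `fold_reduction` is hypothetical, and the conversion factor of 1024 KB per MB is an assumption, since the abstract does not say whether decimal or binary units were used.

```python
KB_PER_MB = 1024  # assumption: binary units; the abstract does not specify


def fold_reduction(original_kb: float, compressed_kb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return original_kb / compressed_kb


# Sizes as reported in the abstract, converted to KB.
datasets = {
    "Korean personal genome": (2986.8 * KB_PER_MB, 18.8 * KB_PER_MB),
    "rice": (361.0 * KB_PER_MB, 4.4 * KB_PER_MB),
    "Arabidopsis thaliana": (115.1 * KB_PER_MB, 6.5),
}

ratios = {name: fold_reduction(*sizes) for name, sizes in datasets.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}-fold")
```

The human and rice ratios do not depend on the unit assumption. Only the Arabidopsis figure, which mixes MB and KB, would shift slightly if decimal units were intended.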

【Abstract】 With the rapid development of biological sciences, a large amount of data has been generated. How to extract valuable knowledge from these data has become a major topic in bioinformatics and computational biology research. This thesis focuses on high-throughput genomic data with regard to their processing, analysis, and modeling. The following important findings have been made.

1. With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools for compressing genome sequencing data have some limitations, such as requiring a reference single nucleotide polymorphism (SNP) map and information on deletions and insertions. Here, we present a novel compression tool, GRS, for storing and analyzing Genome ReSequencing data. GRS is able to process genome sequencing data without the reference SNPs or other sequence variation information, and it automatically rebuilds the individual genome sequencing data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequencing data set, GRS achieved 159-fold compression, reducing the size of the data from 2986.8 MB to 18.8 MB. When tested on sequencing data from rice and Arabidopsis thaliana, GRS compressed the rice genome data from 361.0 MB to 4.4 MB and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.

2. ChIP-Seq, which combines chromatin immunoprecipitation (ChIP) with high-throughput massively parallel sequencing, is increasingly being used to identify protein-DNA interactions in vivo across the genome. However, to maximize the effectiveness of data analysis of such sequences, new algorithms that can accurately predict DNA-protein binding sites need to be developed. Here, we present SIPeS (Site Identification from Paired-end Sequencing), a novel algorithm for precise identification of binding sites from short reads generated by paired-end Solexa ChIP-Seq technology. We applied this method to ChIP-Seq data from the Arabidopsis basic helix-loop-helix transcription factor ABORTED MICROSPORES (AMS), which is expressed in the anther during pollen development. Our results show that SIPeS has better resolution for binding site identification than two existing ChIP-Seq peak detection algorithms, CisGenome and MACS. Moreover, SIPeS is designed to accurately calculate the mappable genome length using the fragment length derived from the paired-end reads. Dynamic baselines are also employed to discriminate closely adjacent binding sites, which is of particular value when working on genomes with high gene density. This de novo tool is available at http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS, and the current version is 2.0.

3. Protein interactions are essential in the molecular processes occurring within an organism and are utilised in network biology to help organise and understand biological complexity. Currently, there are more than 10 publicly available Arabidopsis protein interaction databases. However, these databases have limitations, including differing types of interaction evidence, a lack of defined standards for protein identifiers, and the use of other non-standard information. To effectively integrate the different datasets and maximise access to available data, this thesis presents an interactive bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline). ANAP has been developed for Arabidopsis protein interaction integration and network-based study, to facilitate functional protein network analysis. ANAP integrates 11 Arabidopsis protein interaction databases, comprising a total of 201,699 unique protein interaction pairs, 15,208 identifiers (including 11,931 TAIR AGI codes), 89 interaction detection methods, 73 species interacting with Arabidopsis and 6,161 references. ANAP can be used as a knowledge base for constructing protein interaction networks based on user input and supports both direct and indirect interaction analysis. It has an intuitive graphical interface allowing easy network visualisation and provides detailed evidence for each interaction. In addition, ANAP displays the gene and protein annotation in the generated interactive network with links to TAIR, the AtGenExpress Visualization Tool (AVT), the Arabidopsis 1001 Genomes GBrowse (1001 Genomes), the Protein Knowledgebase (UniProtKB), the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Ensembl Genome Browser (EnsemblGenomes) to aid functional network analysis. The tool is available open access at http://gmdd.shgmo.org/Computational-Biology/ANAP/ANAP_V1.0.

4. Safety assessment of genetically modified (GM) crops is a key step from research on transgenic crops to commercialization. Molecular characterization, including analysis of the integration site, flanking sequence, and insertion copy number, provides the most basic and important data for safety assessment. High-throughput methods for molecular characterization of GM crops offer advantages over conventional methods such as Southern blotting, polymerase chain reaction (PCR), fluorescence in situ hybridization (FISH), and genomic walking. In this work, we developed an accurate, high-throughput method based on paired-end sequencing to reveal the molecular features of GM rice at the genome-wide level. One transgenic rice event, T1C-19, was selected to test the applicability of the method. The integration sites in Chr04 and Chr11 were clearly revealed for two transgenes, and the sequences surrounding the integration sites were easily identified using conventional PCR and Sanger sequencing.
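The idea behind reference-based compression in finding 1, storing only how an individual sequence differs from a reference and rebuilding it on demand, can be illustrated with a minimal sketch. This is not the GRS algorithm, whose details the abstract does not give; it is a generic diff-and-rebuild scheme built on Python's standard `difflib`, and the function names `encode` and `decode` and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def encode(reference: str, individual: str):
    """Return the edits that turn `reference` into `individual`.

    Illustrative only: a generic diff, not the GRS method.
    """
    matcher = SequenceMatcher(None, reference, individual, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Store the reference span that changes and its replacement text.
            edits.append((i1, i2, individual[j1:j2]))
    return edits


def decode(reference: str, edits):
    """Rebuild the individual sequence from the reference plus stored edits."""
    parts = []
    cursor = 0
    for start, end, replacement in edits:
        parts.append(reference[cursor:start])  # unchanged stretch
        parts.append(replacement)              # substituted or inserted bases
        cursor = end                           # skip replaced or deleted bases
    parts.append(reference[cursor:])
    return "".join(parts)


# Toy sequences (hypothetical): one substitution and one insertion.
reference = "ACGTACGTACGTACGT"
individual = "ACGTTCGTACGTAACGT"

edits = encode(reference, individual)
rebuilt = decode(reference, edits)
print(edits)
print(rebuilt == individual)
```

When two genomes are nearly identical, the list of edits is far smaller than the sequence itself, which is why no separate SNP or indel map is needed to rebuild the individual sequence. `difflib` is practical only for short strings, so a real tool would need a method that scales to whole chromosomes.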
