Bioinformatics Study on Transcriptome of Human Fetal Liver Aged 22 Weeks of Gestation and Genome of SARS-CoV(BJ-01)

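【Author】 Chen Tinggui (陈廷贵)

【Advisors】 He Fuchu (贺福初); Zhu Yunping (朱云平)

【Author information】 Academy of Military Medical Sciences of the Chinese People's Liberation Army, Cell Biology, 2005, Ph.D.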

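【Summary】 Objective: The liver performs important physiological functions in the human body, and the human fetal liver at 4-6 months of gestation is also the main source of stem/progenitor cells for the hematopoietic and immune systems, expressing many genes related to cell engraftment, homing and migration. This study uses bioinformatics tools to analyze the human fetal liver EST data generated in our laboratory together with microarray data from public databases, in order to characterize the human fetal liver transcriptome and to lay a foundation for proteome research and gene function studies. In addition, because of the SARS outbreak, a second line of work was undertaken to understand the kinds of proteins expressed by SARS-CoV and their functions and to support proteome identification: gene prediction for SARS-CoV(BJ-01) and functional inference for the predicted proteins.

Research content: First, obtain valid human fetal liver EST sequences through EST preprocessing; second, cluster the ESTs correctly to obtain EST abundance information, and identify the ESTs; third, classify the known genes by GO and KEGG, overcoming the drawbacks of manual functional classification, and establish a standardized human fetal liver expression profile; fourth, compare the known-gene data of human fetal liver with microarray data to identify the characteristics of human fetal liver and to analyze the relationships among related tissues; fifth, electronically assemble and validate the human fetal liver ESTs of unknown function to obtain full-length cDNAs or complete ORFs; sixth, infer the functions of the unknown genes as a basis for gene function research; seventh, establish a human fetal liver transcriptome database and a system for identifying peptides from proteomic mass spectrometry; eighth, perform gene prediction and functional inference for SARS-CoV(BJ-01).

Methods: First, the human fetal liver ESTs were preprocessed by excluding repeatedly sequenced sequences, foreign sequences and sequences shorter than 100 bp, so that only valid sequences entered the subsequent analysis; vector sequences were removed with VecScreen and repetitive sequences with a locally installed repeat_masker program. Second, the ESTs were searched against the NT database and, based on a score of no less than 200 and on whether the function of the hit is known, divided into two classes: function known and function unknown. Third, BLAST searches against the UniGene, DoTs and MGC databases and the human transcriptome database predicted by Twinscan were used to obtain more accurate EST abundance information. Fourth, the function-known genes were classified by GO using the DAVID software, and also classified by KEGG. Fifth, five related tissues were selected from the microarray data and analyzed with DAVID to identify the characteristics of the human fetal liver transcriptome, and the relationships among the five tissues were analyzed by hierarchical clustering. Sixth, the unknown ESTs were electronically assembled with Phrap and validated against the original ESTs and the four transcriptome databases; ATGpr was used to check completeness and to detect ORFs, and a corresponding protein database was built. Seventh, the resulting genes of unknown function were analyzed with Prosite, Pfam, PSORT and SOSUI, and by in silico gene mapping. Eighth, for the SARS-CoV(BJ-01) genome, 12 gene prediction methods were first compared, and then Heuristic models, Gene identification,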
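The first two method steps above (EST preprocessing and score-based classification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis pipeline: the 100 bp length cutoff and the score threshold of 200 come from the abstract, while the `Hit` record, its `function_known` flag, the function names, and the treatment of "repeatedly sequenced" reads as exact duplicates are hypothetical. Vector and repeat masking (VecScreen, repeat_masker) and the BLAST search itself are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

MIN_LENGTH_BP = 100  # from the abstract: sequences shorter than 100 bp are excluded
MIN_SCORE = 200      # from the abstract: a score of no less than 200 is required


@dataclass(frozen=True)
class Hit:
    """Hypothetical summary of an EST's best database hit."""
    score: float
    function_known: bool  # assumption: set by the annotation of the hit


def preprocess(ests: Dict[str, str]) -> Dict[str, str]:
    """Drop short sequences and exact duplicates, keeping first occurrences."""
    seen = set()
    kept: Dict[str, str] = {}
    for est_id, seq in ests.items():
        seq = seq.upper()
        if len(seq) < MIN_LENGTH_BP:
            continue  # too short to be a valid EST
        if seq in seen:
            continue  # assumption: duplicates are exact sequence matches
        seen.add(seq)
        kept[est_id] = seq
    return kept


def classify(
    est_ids: Iterable[str], best_hits: Dict[str, Hit]
) -> Tuple[List[str], List[str]]:
    """Split ESTs into function-known and function-unknown lists."""
    known: List[str] = []
    unknown: List[str] = []
    for est_id in est_ids:
        hit: Optional[Hit] = best_hits.get(est_id)
        if hit is not None and hit.score >= MIN_SCORE and hit.function_known:
            known.append(est_id)
        else:
            unknown.append(est_id)
    return known, unknown
```

In this sketch an EST with no hit, a hit scoring below 200, or a hit to a sequence of unknown function is placed in the function-unknown class; the abstract does not specify how such cases were separated, so that choice is an assumption.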

【Abstract】 Bioinformatics Study on Transcriptome of Human Fetal Liver Aged 22 Weeks of Gestation

Background: Human fetal liver at 22 weeks of gestation (HFL22w) consists of hepatic parenchymal cells and hematopoietic stem/progenitor cells, and corresponds to the turning point between immigration and emigration of the hematopoietic system. We had studied HFL22w before, but the available data sources have since improved: (1) the rapid growth of expressed sequence tags (ESTs) in dbEST; (2) the renewal of the GenBank non-redundant database; (3) the establishment of Gene Ontology (GO); (4) the increase in tissue expression profiling data from microarrays; (5) the continued refinement of UniGene, DoTs, MGC and the Twinscan program. We therefore studied HFL22w again, in support of protein identification in proteomics, protein-protein interaction network research, and the study of new gene functions.

Aims: (1) Cluster the ESTs to obtain EST frequency information and identify genes; (2) classify the known genes by GO to build a standard expression profile of HFL22w and compare it with those of other tissues; (3) validate the results to obtain predicted proteins and their functional information from the unknown ESTs.

Methods: The ESTs were first searched against the GenBank non-redundant database and the UniGene, DoTs, MGC and Twinscan databases to identify genes and to cluster the ESTs more accurately. The known ESTs were classified using GO; the unknown ESTs were assembled using PHRAP and then validated, yielding a full-length ORF database of HFL22w. The encoded proteins were studied to obtain functional information. Finally, the profile of the known ESTs was compared with the expression profiles of five tissues from the microarray data.

Results: A total of 16674 ESTs were sequenced from a 3'-directed cDNA library of HFL22w. Among them, 8097 (48.6%) (Group I) matched known genes or had partial homology to known genes; 4271 (25.6%) (Group II) exhibited no significant homology to known genes; and the remaining 4306 (25.8%) (Group III) were genomic sequences of unknown function, mitochondrial genomic sequences, bacterial DNA and repetitive sequences. The 2483 genes corresponding to Group I can be divided into 425 gene categories by GO classification. Some of these genes are related to metabolism, biosynthesis, development, cell proliferation, defense response, cell migration, hemopoiesis and endocytosis. The correlation coefficient (0.994) between Group I and the fetal liver microarray data indicates that they are highly similar. Comparison of the microarray data of five tissues (fetal liver, bone marrow, liver, thymus and lymph node) indicates that fetal liver has more genes than the other four tissues related to reproduction, coagulation, homeostasis, regulation of gene expression (epigenetic), biosynthesis, energy pathways, cell migration, response to pathogenic bacteria, and natural killer cell mediated cytolysis. Hierarchical clustering of these tissues shows that thymus and lymph node are the most closely related, followed by thymus and bone marrow and by liver and fetal liver, and lastly by fetal liver and bone marrow. The 2416 genes corresponding to Group II were assembled, and their average length increased from 342 bp to 1682 bp. Of these, 2098 genes (86.84%) derived from unknown ESTs were extended, and 1037 of the 2098 (49.43%) were validated against the UniGene, DoTs, MGC and Twinscan databases. We then predicted the characteristics of the proteins (1921 genes) with a length of at least 30 aa and obtained 277 profiles or patterns. More than 10 types were discussed.

Conclusions: (1) The EST clustering results show that the number of highly expressed genes is small, but each of these genes includes more ESTs than the others; (2) we obtained 1379 new genes; (3) GO analysis showed that human fetal liver displays some typical gene expression patterns related to its special physiological functions; (4) comparison of gene expression across the five tissues showed that human fetal liver and liver are more closely related to each other than to the other tissues; (5) 1037 full-length cDNAs and ORFs were obtained by assembling and validating the unknown ESTs.

Bioinformatics Study on Genome of SARS-CoV(BJ-01)

Significance: SARS, an atypical pneumonia of unknown aetiology, was recognized at the end of February 2003. To understand and treat the disease, its gene information must be obtained.
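The EST counts and percentages reported in the HFL22w Results can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in the abstract; the variable and function names are illustrative, not taken from the thesis.

```python
# Counts quoted in the abstract (HFL22w 3'-directed cDNA library).
TOTAL_ESTS = 16674
GROUPS = {"Group I": 8097, "Group II": 4271, "Group III": 4306}

ASSEMBLED_GENES = 2416  # genes assembled from Group II
EXTENDED_GENES = 2098   # assembled genes that were extended
VALIDATED_GENES = 1037  # extended genes validated against the four databases


def percent(part: int, whole: int, digits: int) -> float:
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100.0 * part / whole, digits)


# The three groups should partition the full EST set.
groups_sum = sum(GROUPS.values())

# Shares of each group, to one decimal place as in the abstract.
group_shares = {name: percent(n, TOTAL_ESTS, 1) for name, n in GROUPS.items()}

# Extension and validation rates, to two decimal places as in the abstract.
extended_share = percent(EXTENDED_GENES, ASSEMBLED_GENES, 2)
validated_share = percent(VALIDATED_GENES, EXTENDED_GENES, 2)

print(groups_sum == TOTAL_ESTS)
print(group_shares)
print(extended_share, validated_share)
```

The three groups sum to the 16674 sequenced ESTs, and the computed shares agree with the 48.6%, 25.6%, 25.8%, 86.84% and 49.43% given in the text.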
