乳腺癌组织学分级特征基因提取及基因集富集分析

Feature Genes Selection and Gene Sets Enrichment Analysis for Histologic Grading of Breast Cancer

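【Author】 Ye Yun (叶云)

【Supervisors】 Lu Kunping (卢坤平); Ma Wenli (马文丽)

【Author information】 Southern Medical University (南方医科大学), Biochemistry and Molecular Biology, 2010, PhD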

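【Abstract (Chinese version)】 Breast cancer is one of the most common malignant tumors in women and one of the leading causes of cancer death among women. It occurs most often in Western countries in Europe and North America, where mortality has gradually been brought under control and has declined, but incidence remains persistently high. In recent years, incidence has also been rising year by year in Asian countries, which were formerly low-incidence regions. Breast cancer seriously threatens women's health, yet its etiology is complex and involves hereditary factors, hormones, immunity, and various environmental factors (physicochemical and biological factors, lifestyle, and so on). Many factors affect the prognosis of breast cancer; from a pathological standpoint, the histopathological type and the histologic grade of the tumor are important prognostic factors. Because histologic grade provides important prognostic information, it has long been accepted in clinical practice. The most widely used grading method is B-R grading, also known as the Nottingham Grading System. It uses the morphologic and cytologic features of tumor cells as evaluation criteria and combines scores for three aspects (degree of tubule formation, nuclear pleomorphism, and mitotic count) to classify breast cancers as grade Ⅰ (G1, well differentiated, slow growing), grade Ⅱ (G2, moderately differentiated), or grade Ⅲ (G3, poorly differentiated, highly proliferative) malignancies. Multivariate analyses of large numbers of patients show that untreated G1 patients have a 5-year survival rate of 95%, whereas the rates for G2 and G3 are only 75% and 50%, respectively. Histologic grade can therefore serve, independently of lymph node status and tumor size, as an important predictor of breast cancer recurrence and death. Although the prognostic importance of histologic grade is increasingly recognized, grading involves subjective elements and is laborious, so its reproducibility is still not ideal: different observers disagree, and reproducibility is typically only 60%-85%.
The genome-wide expression pattern of a tumor reflects its biological characteristics. Gene expression profiles can distinguish tumor types that cannot be separated by pathological methods, offering an entirely new approach to breast cancer biology and prognosis. Classification features related to histologic grade can be obtained from microarray expression data, allowing correct histologic classification and providing a reliable basis for diagnosis and prognosis. Researchers have already used microarray analysis to obtain marker genes for breast cancer prognosis; this approach judges prognosis more accurately than traditional prognostic markers, and its reliability was further confirmed in subsequent experiments. However, these studies have shortcomings: prediction and validation used the same data set, and further validation did not use other data sets. In addition, many of the genes measured in a microarray expression profile have little to do with distinguishing the samples. Introducing these unnecessary genes into a classification problem increases the dimensionality of the samples, raises computational complexity, and may introduce unnecessary noise. If a smaller gene subset exists that can separate two classes, it will help biomedical researchers study the functions of these genes specifically, understand their biological significance, and develop inexpensive cancer diagnostic chips based on them. Feature extraction is therefore a very important part of DNA microarray research, and it is necessary to use it to find a sufficiently small gene subset that still classifies effectively.
Different grades correspond to different degrees of cell differentiation, and poorly differentiated tumors usually have a worse prognosis. The degree of tumor cell differentiation is classified by histologic grading in pathology; although poorly differentiated tumors have a worse prognosis, the molecular mechanisms involved remain unclear. Tumor cells can proliferate without limit and so sustain clonal tumor growth. This is strikingly similar to one of the most important properties of stem cells, the capacity for self-renewal, and suggests that tumors may originate from normal stem cells or their progenitors. Many oncogenes have been found to interfere with normal cell differentiation, and these genes can likewise affect tumor cell differentiation. Certain regulatory networks that control stem cell function may therefore also operate in certain tumors. Through gene set enrichment analysis of the expression profiles of breast cancers with different degrees of differentiation, we aimed to find expression differences among them that could be used to improve histologic grading, and thereby to better understand the molecular mechanisms of tumor cell differentiation and whether they are linked to normal embryonic stem cells.
The research consists of three main parts. Part 1: Quality control of microarray data. Breast cancer microarray data were downloaded from the NCBI shared database GEO (http://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE2109, GSE5460, GSE1456 and GSE3494. The data were preprocessed with dChip: all arrays were normalized against the array with the median total fluorescence intensity as the baseline, and the expression levels of all genes on each array were normalized in PM/MM mode. At the same time, contaminated arrays were corrected, the original array scan images were restored, and array quality reports were generated. Array quality was judged by the probe contamination rate and the probe cross-hybridization rate; samples whose probe cross-hybridization or contamination still exceeded 5% after correction, and samples with missing clinical data, were excluded from further analysis. In total, 676 breast cancer array samples met the quality-control criteria and could be used in later analysis, with 186, 109, 147 and 234 samples from GSE2109, GSE5460, GSE1456 and GSE3494, respectively. Expression values were log2 transformed, the expression levels of all genes on each array were derived in PM-only mode, and genes were then filtered by the following criteria: 0.5 < standard deviation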

【Abstract】 Breast cancer is the most common female cancer in the world and the leading cause of cancer death among women. Although the mortality rate has now stabilized or is decreasing, breast cancer incidence is still rising throughout Western countries. Even in Asia, incidence has been gradually increasing in recent years. The etiology of breast cancer involves hereditary, hormonal, immune, and environmental factors, the last including physicochemical and biological factors as well as lifestyle. The prognostic value of histologic grade in breast cancer has been recognized for a long time. The most studied and most widely used method of breast tumor grading is the Bloom-Richardson grading system, also known as the Nottingham Grading System. The Nottingham Grading System is based on a microscopic evaluation of the morphologic and cytologic features of tumor cells, including the degree of tubule formation, nuclear pleomorphism, and mitotic count. The sum of these scores stratifies breast tumors into grade 1 (G1; well differentiated, slow growing), grade 2 (G2; moderately differentiated), and grade 3 (G3; poorly differentiated, highly proliferative) malignancies. Multiple studies have shown that the grade of invasive breast cancer is a powerful indicator of disease recurrence and patient death, independent of lymph node status and tumor size. Untreated patients with G1 disease have a 95% 5-year survival rate, whereas those with G2 and G3 malignancies have 5-year survival rates of 75% and 50%, respectively. The histologic grade of breast carcinomas has long provided clinically important prognostic information. However, there are persistent inconsistencies in histologic grading between institutions and pathologists.
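The score-summing step of the Nottingham system can be illustrated with a minimal sketch. The abstract does not state the cut-offs; the ones used below (total 3-5 for G1, 6-7 for G2, 8-9 for G3, with each component scored 1 to 3) are the commonly cited Nottingham thresholds and are an assumption here, as is the function name `nottingham_grade`.

```python
def nottingham_grade(tubule: int, pleomorphism: int, mitosis: int) -> str:
    """Map three component scores (each 1-3) to a histologic grade.

    Cut-offs are the commonly cited Nottingham thresholds, assumed here
    because the abstract only says the scores are summed.
    """
    scores = (tubule, pleomorphism, mitosis)
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each component score must be 1, 2, or 3")
    total = sum(scores)  # ranges from 3 to 9
    if total <= 5:
        return "G1"  # well differentiated
    if total <= 7:
        return "G2"  # moderately differentiated
    return "G3"      # poorly differentiated
```

For example, `nottingham_grade(2, 2, 2)` sums to 6 and falls in the G2 band. The subjective part of grading lies in assigning the three component scores, not in this arithmetic.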
With the advent of new unified methods, such as the Elston and Ellis modification of the Bloom and Richardson method, the reproducibility of histologic grading has been investigated and found to range from 60% to 85%. The genome-wide expression patterns of tumors represent the biology of the tumors; the diversity in patterns reflects biological diversity. Gene-expression profiling has been used to develop genomic tests that may predict clinical outcome better than the traditional clinical and pathological standards. It has brought new insights into breast cancer biology and prognosis and has shown promise in refining clinical decision making. Feature genes can be obtained from gene expression profiles to predict histologic grade in breast cancer. Some researchers have identified gene-expression signatures that predict outcome more accurately than conventional prognostic indicators. The signatures were validated in a follow-up study. However, this validation was imperfect, as the training and validation cohorts had overlapping patients and external validation using independent data sets was not performed. Furthermore, most genes in a gene expression profile are not related to sample discrimination. Such genes increase the dimensionality of the discrimination problem and its computational complexity, and including them in classification may introduce noise. A small gene subset that classifies samples correctly would help biomedical researchers explore the functions of these genes and develop an inexpensive microarray for cancer diagnosis. Accordingly, feature extraction is essential for obtaining a small, accurate gene subset in microarray data analysis. The differentiation level (or grade) of human tumors is assessed routinely in the clinic, with poorly differentiated tumors generally having the worst prognoses.
However, this classification is based on histopathological criteria, and the underlying molecular pathways controlling tumor differentiation are poorly described. The hallmark traits of stem cells (self-renewal and differentiation capacity) are mirrored by the high proliferative capacity and phenotypic plasticity of tumor cells. Moreover, tumor cells often lack the terminal differentiation traits possessed by their normal counterparts. These parallels have given rise to the hypothesis that tumors often arise from undifferentiated stem or progenitor cells. A number of oncogenes are known to interfere with normal cell differentiation, and such oncogenes could also affect tumor cell differentiation, suggesting that the regulatory networks controlling the function of stem cells may also be active in certain tumors. We examined whether histologic grade is associated with the gene expression profiles of breast cancers and whether such profiles could be used to improve histologic grading. We used the recently developed gene set enrichment analysis (GSEA) method to assess whether the expression signatures and regulatory networks that define human ES cell identity are also active in human tumors. This thesis can be divided into three parts. Part Ⅰ: Quality control of microarray data. Gene expression data for these studies were obtained from published datasets in the National Center for Biotechnology Information (NCBI) GEO database (http://www.ncbi.nlm.nih.gov/geo/; accession numbers GSE2109, GSE5460, GSE1456 and GSE3494). Two cohorts of patients included in this study were based on platform GPL570. Data preprocessing and normalization were done with the dChip package. Expression values were generated in dChip using a model-based expression algorithm and the perfect match/mismatch (PM/MM) model. We used a two-step filtration strategy to remove noise while retaining true biological information. The first step was to obtain the scanned images of the chips and the array summary report.
Samples without clinical data and those with more than 5% array outliers or single outliers were excluded. A total of 676 breast cancer microarray samples were retained for further analysis: 186 from GSE2109, 109 from GSE5460, 147 from GSE1456, and 234 from GSE3494. The second step was to remove batch effects with an empirical Bayes method, because the samples came from different laboratories. After that, all expression values were log2 transformed. Genes were then filtered by the following criterion: variation across samples, 0.5 < standard deviation/mean
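The log2 transform and the variation filter described above can be sketched as follows. This is only an illustration under stated assumptions, not the dChip implementation: the abstract is cut off after the lower bound of 0.5, so any upper bound is left as an optional parameter with no default, and the function names and gene labels are hypothetical. The abstract lists the filter after the log2 transform; the sketch simply applies the ratio to whichever values are passed in.

```python
import math
import statistics


def log2_transform(values):
    """Log2-transform a list of positive expression values."""
    return [math.log2(v) for v in values]


def passes_variation_filter(values, lower=0.5, upper=None):
    """Keep a gene if lower < SD/mean across samples.

    `upper` is a placeholder for the bound truncated in the abstract;
    it is ignored unless the caller supplies a value.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        return False  # ratio undefined, treat as not passing
    ratio = statistics.stdev(values) / mean  # sample standard deviation
    if ratio <= lower:
        return False
    return upper is None or ratio < upper


def filter_genes(expression, lower=0.5, upper=None):
    """Return only the genes whose variation passes the filter."""
    return {
        gene: values
        for gene, values in expression.items()
        if passes_variation_filter(values, lower, upper)
    }


# Hypothetical expression values for two genes across four samples.
expression = {
    "GENE_A": [10.0, 200.0, 15.0, 180.0],   # highly variable
    "GENE_B": [100.0, 105.0, 98.0, 102.0],  # nearly constant
}
kept = filter_genes(expression)
```

Here `kept` retains only `GENE_A`, whose standard deviation is about as large as its mean, while the nearly constant `GENE_B` is dropped.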
