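Data Analysis for Genome-Scale High-Content RNA Interference Screening: A Study of Several Pattern Recognition Problems in a Class of Systems Biology Applications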

Data Analysis for High Content RNA Interference Screening: Pattern Recognition Approaches for Certain Systems Biology Application

【Author】 Yin Zheng (尹征)

【Advisor】 Sun Youxian (孙优贤)

【Author Information】 Zhejiang University, Control Science and Engineering, 2009, PhD

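【Abstract (translated from the Chinese)】 Ideas from cybernetics and systems theory, together with pattern recognition methods, are widely used in interdisciplinary research. Cybernetics and systems theory guide how practical problems are understood, while pattern recognition methods supply the concrete solutions. This thesis applies cybernetics, systems theory and pattern recognition methods to a class of systems biology research. Specifically, taking the regulatory role of genes in cell shape change as the main research subject, and large-scale high-content RNA interference screening (RNAi HCS) in cultured Drosophila cell lines as the application background, we analyzed and solved a series of problems in RNAi HCS data analysis. These include online discovery of cell morphology phenotypes, online phenotype modeling and validation, feature selection and cell classification for different phenotypes, and modeling of gene function by integrating single-cell classification results. We combined the methods designed in this thesis into a complete data analysis workflow, assisted biologists in a comprehensive analysis of nearly 2 million single-cell images, and proposed the biological hypothesis that cell morphology phenotypes are canalized.

This thesis proposes modeling phenotypes with Gaussian mixture models, improves the method of estimating the number of clusters with the gap statistic, designs an iterative phenotype merging procedure that compares a new dataset with known phenotypes, and uses the minimum classification error method to update phenotype models online. Together these form an online phenotype discovery algorithm. As new data are generated, the method identifies novel phenotypes, then models and validates them. Most current RNAi HCS data analysis workflows use manually selected typical phenotypes and representative cells as the training set, but as datasets grow, manual analysis can no longer reflect the full picture of the dataset. Our method addresses this problem effectively.

To examine how similar each cell in the dataset is to the typical phenotypes, we designed a joint feature selection method, "support vector machine recursive feature elimination plus genetic algorithm", which describes phenotype morphology with a reduced feature set, and we classified cells with a support vector machine using a Gaussian radial basis function kernel. Based on the support vector machine's analysis of each cell's morphology, we performed a series of quality control, statistical analysis, data filtering and consolidation steps to select, for the RNAi experiment targeting each gene, a cell population with stable morphological features. From the morphological features of these reproducible cell populations we generated a quantitative morphology score for each gene, and we used cluster analysis to distinguish genes and gene families that play different roles in cell shape change.

Guided by cybernetics and systems theory, the data analysis workflow combines multiple pattern recognition and statistical analysis techniques into a complete and efficient RNAi HCS data analysis workflow. Its design emphasizes the unity of opposites between dynamic and static analysis, which yields online discovery and online modeling of typical phenotypes; it emphasizes using statistical methods to uncover links between the micro and macro levels, systematically treating single-cell morphology as the basis for analyzing gene function; and it emphasizes generalizing from single-level results, seeking general laws through the results of a specific application, which led us to propose and preliminarily verify the hypothesis that cell morphology phenotypes are canalized.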
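The abstract above mentions estimating the number of clusters with the gap statistic. The following is a minimal sketch of the standard gap statistic, not the improved variant developed in the thesis, whose details are not given in this record. It assumes 2-D feature vectors, a plain k-means with a deterministic farthest-point initialisation, and a uniform reference distribution over the bounding box of the data. All function names, parameters, and the toy data are illustrative assumptions.

```python
import math
import random


def kmeans_dispersion(points, k, iters=50):
    """Run Lloyd's k-means; return the within-cluster sum of squared distances."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the cluster mean; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (estimated k, list of Gap(k) for k = 1..k_max).

    Assumes k_max is smaller than the number of distinct points,
    so that every dispersion is strictly positive.
    """
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_dispersion(points, k))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the real data.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_dispersion(ref, k)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


# Toy data: three well-separated blobs standing in for three phenotypes.
toy_rng = random.Random(1)
blobs = [
    (cx + toy_rng.gauss(0, 0.3), cy + toy_rng.gauss(0, 0.3))
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for _ in range(30)
]
k_hat, gap_curve = gap_statistic(blobs, k_max=5)
print(k_hat, [round(g, 2) for g in gap_curve])
```

On data like these, the gap curve should rise sharply once k reaches the true number of blobs. The selection rule returns the smallest k whose gap is at least the next gap minus its standard error, which favours the simpler clustering when further splits add little.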

【Abstract】 Cybernetics, systems theory, and pattern recognition theory and methodologies are broadly applied in interdisciplinary research. Across applications, cybernetics and systems theory help dissect research topics, while pattern recognition technologies form the workflow for solving specific problems. In this thesis, cybernetics, systems theory, and pattern recognition theory and methodologies were applied to systems biology research. Specifically, in the context of large-scale high-content RNAi screening (RNAi HCS) aimed at constructing a local regulatory network for Drosophila cell shape change, a series of challenges in RNAi HCS data analysis were analyzed and solved. We proposed original solutions for online phenotype discovery, online modeling and validation of novel phenotypes, feature selection, cell classification, and modeling of gene function based on single-cell morphology profiles. The proposed methods were combined into a complete data analysis workflow that handled a dataset of more than 2 million single cells. Based on the analysis results from a real dataset, we helped biologists propose a biological hypothesis regarding the canalization of cell morphology.

At present, most RNAi HCS data analysis workflows use typical phenotypes and cells identified from expert ground-truth labeling as the basis of gene function research. However, as datasets grow, a manually picked training set can no longer cover the properties of the whole dataset.
We improved the gap statistic, a method for estimating the number of clusters and validating clusters; designed iterative phenotype merging, a strategy for comparing a newly generated dataset with existing phenotypes; used a Gaussian mixture model to describe each phenotype; and applied the minimum classification error method to update the models online. We combined these components into an original online phenotype discovery workflow that discovers, models and validates novel morphological phenotypes as the dataset grows.

To compare cell morphology with typical phenotypes, we combined "Support Vector Machine-Recursive Feature Elimination (SVM-RFE)" and a "Genetic Algorithm based on SVM" into a feature selection scheme. Using the informative feature subsets and an SVM with a Gaussian radial basis function kernel, we quantified the morphological similarity between single cells and typical phenotypes. Based on the cell classification results, we carried out a series of quality control, statistical analysis, data filtering and consolidation steps; selected a significantly repeatable cell population to represent the result of the RNAi treatment targeting each gene; generated a quantitative morphology signature for each gene from those cell populations; and applied cluster analysis to those signatures to identify gene families with different functions in regulating cell shape change.

Guided by cybernetics and systems theory, the whole data analysis workflow implemented various state-of-the-art pattern recognition and statistical analysis techniques and demonstrated automated data analysis for large-scale RNAi HCS.
We combined dynamic and static analysis to realize online phenotype discovery, modeling and validation. We examined the relationships between micro- and macro-level phenomena and used single-cell morphology profiles to model gene function. Finally, the results from this specific project contributed to understanding the general law underlying cell morphology change: based on our analysis of a real RNAi HCS dataset, we proposed and validated the hypothesis that cell morphology is canalized.
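The step from single-cell results to gene function can be illustrated with a small sketch. It assumes each cell already carries a vector of phenotype similarity scores (for example, classifier outputs), averages those vectors per gene after a toy minimum-cell-count filter, and groups genes by single-linkage clustering with a distance threshold. The gene names, scores, filter, and clustering rule are all hypothetical; the actual quality control and clustering procedures of the thesis are not specified in this abstract.

```python
import math
from collections import defaultdict


def gene_signatures(cell_scores, min_cells=3):
    """Average per-cell phenotype scores into one signature per gene.

    cell_scores: iterable of (gene, score_vector) pairs, one per cell.
    Genes with fewer than min_cells cells are dropped (toy quality control).
    """
    by_gene = defaultdict(list)
    for gene, scores in cell_scores:
        by_gene[gene].append(scores)
    return {
        gene: tuple(sum(col) / len(rows) for col in zip(*rows))
        for gene, rows in by_gene.items()
        if len(rows) >= min_cells
    }


def cluster_genes(signatures, threshold):
    """Single-linkage grouping: link genes whose signatures lie within threshold."""
    genes = sorted(signatures)
    parent = {g: g for g in genes}

    def find(g):
        # Union-find root lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if math.dist(signatures[a], signatures[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for g in genes:
        groups[find(g)].append(g)
    return sorted(groups.values())


# Hypothetical per-cell similarity scores to two phenotypes.
cells = [
    ("gene_a", (0.90, 0.10)), ("gene_a", (0.80, 0.20)), ("gene_a", (0.85, 0.15)),
    ("gene_b", (0.80, 0.20)), ("gene_b", (0.90, 0.10)), ("gene_b", (0.70, 0.30)),
    ("gene_c", (0.10, 0.90)), ("gene_c", (0.20, 0.80)), ("gene_c", (0.15, 0.85)),
    ("gene_d", (0.50, 0.50)),  # only one cell, so it fails the toy filter
]
signatures = gene_signatures(cells)
print(cluster_genes(signatures, threshold=0.2))
# prints [['gene_a', 'gene_b'], ['gene_c']]
```

Averaging is the simplest way to summarise a cell population; a pipeline working on real screens would likely need replicate-aware statistics and a more careful linkage rule than a fixed threshold.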

  • 【Online Publication Submitted By】 Zhejiang University
  • 【Online Publication Year/Issue】 2010, Issue 12