High-throughput Genome Sequencing Image Processing and Data Analysis (高通量基因测序图像处理与数据分析)

【Author】 Ye Binggang (叶丙刚)

【Supervisor】 Wu Xiaoming (吴效明)

【Author information】 South China University of Technology, Biomedical Engineering, 2010, PhD

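【Abstract (translated from the Chinese)】 Research on high-throughput genome sequencing technology has only just begun in China and is urgently needed. Foreign companies are using their first-mover advantage in sequencing technology and equipment, together with the uniqueness of genetic resources, to file gene patents early and to seek a monopoly over the future global gene industry. As the saying goes, "a craftsman who wishes to do his work well must first sharpen his tools": without modern sequencing technology there is no modern biotechnology. Future industries such as biopharmaceuticals, bioenergy, and personalized medicine will all be built on modern sequencing technology, in particular personalized medicine characterized by gene diagnosis and gene therapy.

In high-throughput sequencing, the raw images consist of fluorescent spots that carry base information, and image processing and data analysis recover the bases of the sequenced DNA. This thesis therefore has two main parts: processing of the sequencing images and analysis of the resulting data. Image processing denoises and sharpens the sequencing images, segments the fluorescent spots of the base clusters, and builds a fluorescence intensity data file containing base information together with a noise data file. Data analysis decouples the signals in the fluorescence intensity data, corrects base phasing, and then combines the result with the noise data to call bases and assess their quality. The main research content and results are:

1) Using wavelet analysis, an image denoising algorithm based on a correlation threshold on wavelet coefficients is proposed. The algorithm relies on the fact that the wavelet coefficients of the signal are strongly correlated, whereas those of noise are weakly correlated or uncorrelated. It constructs a correlation function of the wavelet coefficients and determines a correlation threshold to denoise the image.

2) Building on image information entropy and level set segmentation, a level set C-V (Chan-Vese) model segmentation algorithm combined with image information entropy is proposed. Entropy is used to establish statistical features of the target region of the image to be segmented. This gives the target search a direction, improves the robustness to interference and the adaptivity of the C-V model, and makes segmentation more accurate and more efficient.

3) A signal decoupling algorithm for the base fluorophores, based on correlation analysis, is proposed. From the measured fluorescence intensity data, correlation analysis is used to construct a cross-talk matrix. The construction rests not only on analysis of the longitudinal, one-dimensional time-series signal but also on the transverse spatial signal. The cross-talk matrix is then corrected, with the correction factors obtained from a one-sample Kolmogorov-Smirnov test.

4) Building on regression analysis and Markov process theory, a correction algorithm for the base phasing problem is proposed. In the sequencing-by-synthesis reaction, whether a strand being sequenced runs "ahead" or "behind" in phase, its fluorescence intensity is at its strongest, that is, the maximum fluorescence intensity appears in the same cycle. Based on this observation, the algorithm uses regression analysis combined with Markov process theory to obtain a probability matrix that corrects phasing.

5) A base-calling algorithm based on maximum a posteriori probability is proposed. Base calling identifies the most credible base from the processed fluorescence intensity signal and assembles the bases, in synthesis order, into a sequence fragment. The core of the algorithm is an integration over a three-dimensional Gaussian probability hypersphere surface, obtained by reducing the dimension by one.

6) Combined with a study of noise, a base quality assessment method is proposed. The method assesses the quality of base-calling results. Based on the study of base noise, Monte Carlo sampling is used to determine the probability of bases with a low signal-to-noise ratio, and a definition of base quality is given.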
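The signal-decoupling step in item 3 can be illustrated with a small sketch. The abstract does not give the thesis's cross-talk matrix or its correlation-based construction, so the matrix values below are made up, and the names `CROSSTALK`, `solve`, `mix`, and `decouple` are hypothetical. The sketch only shows the final step that any cross-talk model implies: treating the observed four-channel intensities as a linear mixture of the per-dye signals and inverting that mixture.

```python
BASES = "ACGT"

# Hypothetical cross-talk matrix (values invented for illustration):
# column j is the response of the four detection channels to one unit of dye j.
CROSSTALK = [
    [1.00, 0.45, 0.05, 0.00],
    [0.30, 1.00, 0.02, 0.00],
    [0.00, 0.04, 1.00, 0.55],
    [0.00, 0.00, 0.25, 1.00],
]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def mix(signal):
    """Forward model: observed channel intensities = CROSSTALK @ signal."""
    return [sum(m * s for m, s in zip(row, signal)) for row in CROSSTALK]


def decouple(observed):
    """Recover the per-dye signal from observed channel intensities."""
    return solve(CROSSTALK, observed)
```

For instance, `decouple(mix([0.0, 800.0, 0.0, 0.0]))` recovers a signal concentrated in the second dye, up to floating-point error. In practice the matrix would be estimated from the data, which is the part the thesis addresses with correlation analysis and the Kolmogorov-Smirnov-based correction.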

【Abstract】 Research on high-throughput genome sequencing has only just started in China and is both important and urgent. Foreign corporations are exploiting their lead in sequencing technology and equipment, and the uniqueness of genetic resources, to apply for gene patents pre-emptively and to seek a monopoly over the future global gene industry. "Sharpen the knife before cutting the wood": without modern genome sequencing techniques there is no modern biotechnology. Tomorrow's biomedicine, bioenergy, and personalized medicine will all be built on modern genome sequencing, especially personalized medicine characterized by gene diagnosis and gene therapy.

In high-throughput genome sequencing, the original image consists of fluorescent spots that carry base information, and the bases of the genome are obtained by image processing and data analysis. The thesis has two main parts: high-throughput genome sequencing image processing and analysis of the related data. Image processing consists mainly of denoising and sharpening the sequencing images, segmenting the fluorescent spots, and establishing the fluorescence intensity data file, which carries the base information, and its noise data file. Data analysis consists mainly of decoupling the fluorescence intensity signal, phasing correction, base calling, and quality evaluation. The main content of the study is as follows.

1) Using the wavelet method, an image denoising algorithm based on a correlation threshold of the wavelet coefficients is put forward. The algorithm relies on the signal coefficients being strongly correlated and the noise coefficients being weakly correlated or uncorrelated. It constructs a correlation function of the wavelet coefficients and derives a correlation threshold with which to denoise the image.

2) Based on research into image entropy and level set segmentation, a C-V model segmentation algorithm with image entropy is put forward. The algorithm builds on the C-V model of the level set segmentation method and introduces an image entropy algorithm. The entropy establishes the statistical character of the region to be segmented, which provides a direction for searching for the target, improves the anti-interference ability and adaptivity of the C-V model, and makes the segmentation result more accurate and the procedure more efficient.

3) A decoupling algorithm for the base fluorescent signal, based on correlation analysis, is put forward. Working from the fluorescence intensity data, the algorithm uses correlation analysis to construct the cross-talk matrix. The construction draws not only on analysis of the one-dimensional time-series signal but also on the spatial signal. The matrix then needs further correction, with the correction factors obtained from a one-sample Kolmogorov-Smirnov test.

4) A base phasing correction algorithm based on regression analysis and the Markov process is put forward. In the high-throughput sequencing synthesis reaction, when a base fragment being sequenced runs ahead or lags behind in phase, its fluorescence intensity is still highest, that is, the highest intensity occurs in the same cycle. On this basis, the correction algorithm uses regression analysis and the Markov process to find the probability matrix for phasing correction.

5) A base-calling algorithm based on maximum a posteriori estimation is put forward. Base calling identifies the base with the highest reliability from the processed fluorescence intensity signal and forms the genome fragment in synthesis order. The algorithm is mainly an integration over a three-dimensional Gaussian probability hypersphere, obtained by reducing the dimension by one.

6) Based on research into the noise, a base quality evaluation method is put forward. The method judges the result of base calling. Drawing on the noise research, the thesis uses the Monte Carlo method to obtain the probability of bases with a low signal-to-noise ratio, and gives a definition of base quality.
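Items 5 and 6 can be made concrete with a deliberately simplified sketch. It is not the thesis's method: the abstract describes an integration over a Gaussian hypersphere and a Monte Carlo estimate for low signal-to-noise bases, neither of which is specified in enough detail to reproduce. The sketch instead assumes isotropic Gaussian noise with a known standard deviation `sigma`, an ideal signal that lies entirely in one channel, and the widely used Phred convention Q = -10 log10(P_error) as the quality score. The function name `call_base` and its parameters are hypothetical.

```python
import math

BASES = "ACGT"


def call_base(intensities, sigma, priors=None):
    """Return (base, posterior, phred) for one cluster in one cycle.

    Assumption: the decoupled, phase-corrected signal equals an ideal
    single-channel signal plus isotropic Gaussian noise of std `sigma`.
    """
    if priors is None:
        priors = [0.25] * 4  # uniform prior over A, C, G, T
    amplitude = max(intensities)
    log_post = []
    for b in range(4):
        # Squared distance to the ideal signal with all amplitude in channel b.
        dist2 = sum((x - (amplitude if c == b else 0.0)) ** 2
                    for c, x in enumerate(intensities))
        log_post.append(math.log(priors[b]) - dist2 / (2.0 * sigma ** 2))
    # Normalise in log space to avoid underflow.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    best = max(range(4), key=posteriors.__getitem__)
    # Error probability is the posterior mass on the other three bases,
    # floored so that the quality score stays finite.
    error = sum(p for i, p in enumerate(posteriors) if i != best)
    error = max(error, 1e-10)
    phred = -10.0 * math.log10(error)
    return BASES[best], posteriors[best], phred
```

A cluster with one dominant channel, such as `call_base([900, 120, 40, 10], sigma=150)`, is called as `A` with a high score, while a near-tie such as `call_base([500, 480, 0, 0], sigma=150)` is still called as `A` but with a much lower score. The choice of `sigma` drives the scores, which is where a noise model like the one studied in the thesis would enter.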
