
Issues in Data Processing for Next-Generation Gene Sequencing

The Next Generation Sequencing Data Processing

【Author】 Zhang Jun (张骏)

【Advisor】 Yuan Bo (苑波)

【Author Information】 Shanghai Jiao Tong University, Computer Software and Theory, 2011, Master's degree

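【Abstract (from the Chinese)】 With the development of next-generation sequencing (NGS) technology, laboratory equipment and workflows have matured, more and more companies have launched their own sequencing platforms, and gene sequencing has gradually moved beyond specialized genomics laboratories, allowing more research groups and researchers to enter the field. As a result, NGS data processing faces increasingly high demands and challenges: researchers are no longer satisfied with the basic data processing programs supplied by sequencer manufacturers and are turning to more open and flexible third-party software. In this thesis, we re-examine the NGS data processing workflow and implement a complete set of processing algorithms, from raw image processing to base calling. In some existing NGS data processing tools, the image processing stage generally relies on level set segmentation or simply applies the Laplacian operator. After carefully analyzing their results, we found that these approaches cannot accurately locate and identify gene clusters, so we redesigned the processing algorithm (NRDPT, NGS Raw Data Processing Tool). Unlike existing methods, it uses a cluster-locating algorithm based on edges and the Hough transform, which effectively improves positioning accuracy. Building on accurate cluster positions, we designed a two-step registration strategy that greatly improves efficiency (roughly 9 times faster than the conventional algorithm). These algorithms are discussed in detail in this thesis. For base calling, existing studies are all based on data from the Illumina sequencing platform and mainly try to correct the phasing errors that frequently occur on that instrument, which generally stem from imperfections in the underlying biochemical reactions. In some newer sequencing methods (such as SOLiD and HYK), the sequencing workflow has changed and these problems do not arise. We discuss the problems that arise in different sequencing methods and their effect on base calling; after carefully considering several base-calling strategies, we implemented a base-calling method for ligation-based sequencing and obtained good results. Gene sequencing technology is developing quickly. Our research was carried out on the P-STARII gene sequencer from the Huayinkang (华因康) company, whose intellectual property is held entirely in China. Throughout the work, the instrument and sequencing workflow were continually upgraded; this uncertainty often made the research harder, but it also shows how rapidly the field is advancing. We look forward to NGS technology reaching true maturity and eventually entering clinical use.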

【Abstract】 In recent years, thanks to significant advances in next-generation sequencing (NGS) technology, more and more companies have launched their own sequencing platforms and instruments, such as the Genome Analyzer (Illumina, San Diego, USA), the 454-FLX (Roche, Basel, Switzerland) and SOLiD (Applied Biosystems, California, USA). As a result, gene sequencing has moved beyond the specialized laboratory. Many research groups and researchers are entering the field, and NGS data processing faces growing demands and challenges. Researchers are no longer satisfied with the basic pipelines provided by instrument manufacturers, and a number of open, flexible NGS data processing pipelines have been developed in recent years, such as BING (Kriseman, 2010) and Swift, but all of them are based on Illumina data. In this thesis, we review the NGS data processing workflow and design a complete pipeline and its algorithms, from gene cluster locating and image registration to base calling. We found that the raw data processing stage of existing NGS pipelines is simplistic or even absent: they locate clusters with general-purpose algorithms such as level set segmentation or a simple Laplace operator. Careful analysis showed that these algorithms cannot accurately locate each cluster in the fluorescence image. We therefore redesigned the processing algorithm and present it here as NRDPT (NGS Raw Data Processing Tool). Unlike existing methods, it uses an edge-based Hough transform for cluster positioning, which effectively improves positioning accuracy. A two-step registration algorithm designed in this work also greatly reduces processing time (about 9 times faster than the conventional algorithm). For base calling, existing studies are based on data produced by the Illumina sequencing platform. These methods are mainly designed to correct phasing problems, which are caused by the biochemical process. In some newer sequencing methods (such as SOLiD), these problems do not exist. We discuss these problems and carefully consider several strategies, then describe a base-calling method based on the reactions used in the PSTAR-II, which gave good results.
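
The edge-based Hough positioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the NRDPT implementation: it assumes roughly circular clusters of a single known radius, uses a plain gradient-magnitude threshold as the edge detector, and every name in it (`edge_map`, `hough_circle_centers`, `synthetic_spot`) is hypothetical.

```python
import numpy as np

def edge_map(img, thresh):
    """Binary edge map from the gradient magnitude (assumed edge detector)."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy) > thresh

def hough_circle_centers(edges, radius, n_angles=72):
    """Vote for circle centers at one fixed, assumed radius.

    Each edge pixel votes for every point lying `radius` away from it;
    true centers collect votes from the whole circumference.
    """
    h, w = edges.shape
    acc = np.zeros((h, w), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        cy = np.rint(ys - radius * np.sin(theta)).astype(int)
        cx = np.rint(xs - radius * np.cos(theta)).astype(int)
        ok = (cy >= 0) & (cy < h) & (cx >= 0) & (cx < w)
        # np.add.at accumulates correctly when indices repeat
        np.add.at(acc, (cy[ok], cx[ok]), 1)
    return acc

def synthetic_spot(shape, center, radius):
    """Toy stand-in for one fluorescent cluster: a filled disc."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    inside = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    return inside.astype(float)

# Toy data only: one disc centered at row 30, column 25
img = synthetic_spot((64, 64), (30, 25), 6)
acc = hough_circle_centers(edge_map(img, 0.25), radius=6)
peak = np.unravel_index(np.argmax(acc), acc.shape)
print("estimated center (row, col):", peak)
```

On a real fluorescence image the accumulator would contain many local maxima, one per cluster, so a peak-picking step with a vote threshold would replace the single `argmax` used here.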
