
Research on Non-coding RNA Mining Based on Secondary Structure (基于二级结构的非编码RNA挖掘方法研究)

【Author】 Zou Quan (邹权)

【Advisor】 Guo Maozu (郭茂祖)

【Author information】 Harbin Institute of Technology, Artificial Intelligence and Information Processing, 2009, Ph.D.

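【Abstract (translated from the Chinese)】 Research on non-coding RNA is one of the most important topics in bioinformatics today. Since the start of the 21st century, non-coding RNA research has repeatedly been named among Science's top ten scientific breakthroughs of the year, and in 2006 it was recognized with the Nobel Prize in Physiology or Medicine. A growing number of bioinformatics researchers are working to mine non-coding RNAs from existing sequencing data and to analyze their functions. Current mining methods, however, still suffer from low efficiency, high false-positive rates, and an inability to discover new families. This thesis therefore starts from an analysis of RNA structure, combines and improves classification learning methods, and studies several key problems in non-coding RNA mining in depth. The main contributions are as follows.

(1) A classification method is proposed for the imbalanced training samples that are pervasive in bioinformatics. Bioinformatics contains many learning problems with imbalanced positive and negative examples, partly because of how the data are distributed in reality and partly because obtaining positive examples costs far more than obtaining negative ones. This thesis proposes a classification method for handling this imbalance, aimed at problems such as snoRNA identification, microRNA precursor discrimination, and distinguishing true from false SNP sites. The method follows the idea of ensemble learning: the negative set is split evenly and each part is combined in turn with the positive set, giving a group of class-balanced training sets; a classifier built on a different principle is then trained on each training set; finally, the classifiers vote on each test sample. To keep weak classifiers from degrading the vote, the method borrows from AdaBoost: the samples misclassified during the training of each classifier are added to the training sets of the next two classifiers. This avoids AdaBoost's repeated training while using the voting mechanism to curb the influence of weak classifiers. Experiments on five UCI data sets and three bioinformatics tasks demonstrate the advantage of the method on class-imbalanced classification problems. A software package based on the method, libID, has also been developed for use by other researchers.

(2) A "centroid" representation of RNA secondary structure is proposed, together with secondary structure prediction algorithms based on it. None of the current representations of RNA secondary structure can quickly measure how similar the secondary structures of two RNA molecules are. To address this, the thesis proposes the concept of a "centroid" to describe the position of each stem region in an RNA molecule, and derives concepts such as the "centroid distance" and the "D function" to characterize the similarity between stems and between secondary structures. Based on this fast similarity measure, the thesis improves both comparative sequence analysis and the minimum free energy method. For comparative sequence analysis, it proposes a prediction algorithm that is independent of multiple sequence alignment; for the minimum free energy method, it incorporates RNA class information to further improve prediction.

(3) The two current approaches to mining microRNA are studied, and some of their key problems are analyzed and discussed in depth. Homology search and ab initio prediction are the two current approaches to mining microRNA. Homology search is the main method at present; this thesis proposes an alignment search algorithm based on a keyword tree that improves search precision while reducing the expected time cost. Applied to soybean and to silkworm, the method gave good results. Ab initio prediction is based on machine learning and is the direction of future development; it helps in discovering new families, but locating the mature sequence has long been its bottleneck. The thesis examines this problem from two angles and obtains fairly accurate results. Although the bottleneck is not fully resolved, the work lays a foundation for further study of the problem.

(4) snoRNAs are mined by combining the secondary structure prediction algorithm and the class-imbalance classification algorithm proposed in this thesis. Most current snoRNA mining methods rely on target information. With the discovery of new functional snoRNAs such as "orphan" snoRNAs, mining methods that are independent of target information are receiving growing attention. Compared with current mining methods, this thesis introduces exon sequences into the training set, extracts more discriminative secondary structure features, and applies the proposed classifier for class-imbalanced data, yielding a more effective and accurate snoRNA mining method. In particular, the thesis proposes an effective secondary structure prediction algorithm tailored to the special secondary structure of snoRNAs and applies it to feature extraction for mining, which the author describes as the first such work internationally. Cross-validation and mining experiments on genome fragments demonstrate the effectiveness of the method.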
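The ensemble scheme in contribution (1) can be illustrated with a minimal sketch. This is not the libID implementation: the three base learners (`nearest_centroid`, `knn`, `stump`), the function names `train_ensemble` and `vote`, and the use of one negative subset per learner are all assumptions made for illustration. Only the overall flow follows the abstract: split the negatives evenly, pair each part with the positives, train classifiers of different kinds, pass each classifier's training errors to the next two training sets, and take a majority vote.

```python
import random
from collections import Counter

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_centroid(train):
    """Assumed base learner 1: predict the label of the closest class mean."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    cents = {y: [sum(c) / len(xs) for c in zip(*xs)] for y, xs in groups.items()}
    return lambda x: min(cents, key=lambda y: sq_dist(x, cents[y]))

def knn(train, k=3):
    """Assumed base learner 2: majority label among the k nearest samples."""
    def predict(x):
        near = sorted(train, key=lambda t: sq_dist(x, t[0]))[:k]
        return Counter(y for _, y in near).most_common(1)[0][0]
    return predict

def stump(train):
    """Assumed base learner 3: one-feature threshold rule with best training accuracy."""
    best = None
    for f in range(len(train[0][0])):
        for x, _ in train:
            for hi in (0, 1):  # label predicted when the feature exceeds the threshold
                acc = sum((hi if xi[f] > x[f] else 1 - hi) == yi for xi, yi in train)
                if best is None or acc > best[0]:
                    best = (acc, f, x[f], hi)
    _, f, t, hi = best
    return lambda x: hi if x[f] > t else 1 - hi

def train_ensemble(pos, neg, learners, seed=0):
    """Train one learner per negative subset; errors feed the next two learners."""
    neg = list(neg)
    random.Random(seed).shuffle(neg)
    k = len(learners)
    parts = [neg[i::k] for i in range(k)]      # even split of the negative set
    extra = [[] for _ in range(k)]             # misclassified samples passed forward
    models = []
    for i, learn in enumerate(learners):
        train = [(x, 1) for x in pos] + [(x, 0) for x in parts[i]] + extra[i]
        model = learn(train)
        models.append(model)
        wrong = [(x, y) for x, y in train if model(x) != y]
        for j in (i + 1, i + 2):               # the next two classifiers, if they exist
            if j < k:
                extra[j].extend(wrong)
    return models

def vote(models, x):
    """Majority vote over all trained classifiers."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

With 10 positives and 30 negatives, `train_ensemble(pos, neg, [nearest_centroid, knn, stump])` trains each learner on 10 positives and 10 negatives (plus any forwarded errors). An odd number of learners avoids ties in the binary vote.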

【Abstract】 Non-coding RNA is one of the most important topics in bioinformatics. In recent years, research on non-coding RNA has repeatedly been named among Science's top ten scientific breakthroughs of the year, and it was recognized with the Nobel Prize in 2006. More and more bioinformatics researchers devote themselves to mining non-coding RNA and analyzing its function. However, current mining methods have low efficiency and a high false-positive rate. In this thesis, I therefore develop secondary structure prediction algorithms, improve machine learning methods for imbalanced data, and study non-coding RNA mining in depth. The contributions of the dissertation are as follows.

(1) Three strategies are proposed for class imbalance learning problems in bioinformatics. There are many class imbalance learning problems in bioinformatics, both because of the native distribution of the data and because positive samples usually cost much more to obtain than negative ones. A novel classification method is proposed for training on class-imbalanced data, for tasks such as identifying snoRNA, distinguishing microRNA precursors from pseudo ones, and mining SNPs from EST sequences. The method is based on the main idea of ensemble learning. First, the negative set (the large class) is randomly divided into several subsets of equal size. Each subset, together with the positive set, forms a class-balanced training set. Several different classifiers are then selected and trained on these balanced training sets. Once the classifiers are built, they vote on the final prediction for each new sample. In the training phase, a strategy similar to AdaBoost is used: the samples misclassified by each classifier are added to the training sets of the next two classifiers. Combined with voting, this strategy limits the influence of weak classifiers. Experiments on five UCI data sets and three bioinformatics tasks demonstrate the performance of the method. Furthermore, a software program named libID has been developed.

(2) The "centroid of helix" is proposed for the first time as a novel concept in this thesis, and two novel algorithms are developed based on it. Current representations do not allow RNA secondary structures to be compared quickly. In this thesis, the novel concept of the "centroid of stem" is proposed for describing the position of a stem, and further concepts, such as the "distance between centroids" and the "D function", are derived for measuring the difference between secondary structures. The comparative sequence analysis method and the minimum free energy method are both improved based on these concepts. For comparative sequence analysis, a novel prediction algorithm is proposed that is independent of multiple sequence alignment; for the minimum free energy method, prediction performance is improved by incorporating class information.

(3) Approaches to mining microRNA and their key problems are discussed in depth. Homology search and ab initio prediction are the two approaches to mining microRNA. Homology search is currently the main method. In this thesis, a novel search method based on a keyword tree is proposed, which reduces the time cost while maintaining sensitivity. Its application to soybean and silkworm demonstrates the performance of the method. Ab initio prediction is based on machine learning and will be the main mining method in the future. It can find new microRNA families; however, locating the mature sequence is the bottleneck. In this thesis, I discuss this problem from two points of view. Although I have not solved the problem completely, my work provides a basis for further research.

(4) An algorithm for mining snoRNA is developed based on the secondary structure prediction and class imbalance learning methods described above. Currently, snoRNAs are mostly mined using target information. As more snoRNA functions are uncovered, especially with the discovery of "orphan snoRNAs", ab initio mining methods are attracting attention because they do not depend on target information. In this thesis, we propose a novel ab initio snoRNA gene mining algorithm based on ensemble learning and a specialized secondary structure prediction algorithm. Three contributions improve on current mining methods: enriching the negative training set, using ensemble classifiers for class-imbalanced data, and developing a specialized secondary structure prediction algorithm for extracting high-quality features, which is the first such use to our knowledge. The performance of the learning method is demonstrated by cross-validation, and that of the mining method by experiments on genome data.
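The abstract does not spell out the keyword-tree search of contribution (3), so the following is only a sketch of the general idea, using the textbook Aho-Corasick keyword tree rather than the thesis's own algorithm. It assumes that short k-mers from known microRNAs serve as the keywords and that exact hits in a genome sequence mark candidate regions for closer alignment; the function names `kmers`, `build_keyword_tree`, and `search` are hypothetical.

```python
from collections import deque

def kmers(seq, k):
    """All overlapping substrings of length k (hypothetical seed generator)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_keyword_tree(patterns):
    """Build an Aho-Corasick automaton: trie edges, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: a node's failure link points to the longest proper
    # suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, tree):
    """Scan text once; return (start_index, pattern) for every exact hit."""
    goto, fail, out = tree
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

The scan visits each genome position once regardless of how many keywords are loaded, which is the usual reason for preferring a keyword tree over matching each known sequence separately.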
