
MicroRNA识别及其与疾病关联的预测算法研究

Research on MicroRNA Identification Algorithm and Disease Related MicroRNA Prediction Algorithm

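【Author】 Xuan Ping (玄萍)

【Supervisor】 Guo Maozu (郭茂祖)

【Author Information】 Harbin Institute of Technology, Artificial Intelligence and Information Processing, 2012, PhD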

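【Abstract (translated from the Chinese)】 MicroRNAs (miRNAs) are endogenous non-coding RNAs about 22 nt (nucleotides) long. They play key regulatory roles in many important life processes of animals and plants and are closely related to the onset and progression of tumors and many other diseases. Bioinformatics has played an important role in miRNA research and has greatly accelerated the development of the field. This thesis studies computational prediction methods for miRNA-related problems, namely pre-miRNA classification, mature miRNA position prediction, and disease-related miRNA prediction. The work covers the following four areas.

(1) An efficient pre-miRNA classification method based on the support vector machine (SVM). Studying miRNA function first requires finding the miRNAs. Identifying miRNAs by biological experiment is time-consuming and expensive, and it is hard to discover miRNAs that are expressed at low levels or only in specific tissues or developmental stages. Screening candidate miRNAs computationally can therefore guide biological experiments and is important for advancing miRNA identification. Drawing on the characteristics of pre-miRNAs, this thesis proposes an SVM-based pre-miRNA classification method. Good features and good positive and negative (real/pseudo pre-miRNA) datasets are the foundation of an efficient classification model. Sequence-related, structure-related, and energy-related features are therefore extracted from real and pseudo pre-miRNAs, and a feature selection method based on a genetic algorithm is proposed to select a representative feature subset. Because negative datasets for plant pre-miRNAs are scarce, this thesis is the first to extract stem-loop-like sequences from the protein-coding sequences of Arabidopsis thaliana, rice, and soybean as pseudo pre-miRNA sequences and to build a negative dataset from them. To address the class imbalance between real and pseudo plant pre-miRNAs, the ensemble classifier PlantMiRNAPred is built by combining ensemble learning with the idea of AdaBoost. PlantMiRNAPred achieves accuracy above 90% on each of eight species (Arabidopsis thaliana, rice, Populus trichocarpa, Physcomitrella patens, Medicago, sorghum, maize, and soybean), which makes it valuable for research on identifying plant pre-miRNAs. In addition, a classification model built from human pre-miRNA data, HumanMiRNAPred, also achieves higher prediction performance and helps advance the identification of human pre-miRNAs.

(2) An accurate method for predicting the position of the mature miRNA within newly predicted pre-miRNA candidates. Machine-learning-based pre-miRNA classification methods can usually only classify new pre-miRNAs; they cannot predict where the mature miRNA lies within them. Before follow-up experimental validation, however, the position of the mature miRNA usually has to be given, so this thesis proposes an SVM-based method for predicting it. First, the miRNA:miRNA* duplex is treated as a whole, to better reflect how a miRNA and its miRNA* bind to each other. Second, features are extracted from real and pseudo miRNA:miRNA* duplexes and a representative feature subset is selected. Third, because real and pseudo miRNA:miRNA* duplexes differ enormously in number, a two-stage sample selection method is proposed that selects representative negative samples (pseudo miRNA:miRNA* duplexes) according to their distribution density and their prediction error, and the mature miRNA position prediction model MaturePred is built on them. Compared with existing methods, MaturePred predicts more accurately and can supply more reliable animal and plant mature miRNA candidates for follow-up biological experiments.

(3) A disease-related miRNA prediction algorithm based on the k most similar miRNA nodes, combined with an accurate measure of miRNA functional similarity. Abnormal miRNA regulation is an important cause of tumors and many other diseases, so studying the association between miRNAs and diseases is very important for understanding pathogenesis. Studies show that functionally similar miRNAs usually take part in the processes of similar diseases, that is, they are associated with similar diseases, and vice versa. The functional similarity of two miRNAs can therefore be assessed by measuring the semantic similarity between the two sets of diseases associated with them. This thesis further improves the measure of miRNA functional similarity by taking into account the information content of each disease term. It proposes HDMP, a prediction algorithm based on the k most similar neighboring miRNA nodes, which can systematically predict miRNA candidates associated with a specific disease. In addition, because miRNAs in the same miRNA family or miRNA cluster are functionally more similar, family and cluster information is further taken into account during prediction, giving the algorithm HDMPW. Experiments on 18 common human diseases confirm that HDMP and HDMPW can effectively predict disease-related miRNA candidates. As miRNA-disease association data grow rapidly, HDMP can be extended to the prediction of other human diseases.

(4) A disease-related miRNA prediction algorithm based on random walk over a miRNA functional similarity graph. The miRNA functional similarity graph is built from the computed functional similarities between miRNAs. The prediction of disease-related miRNAs is then recast as a random walk problem, giving the random-walk-based prediction algorithm HDMPR. Unlike HDMP and HDMPW, HDMPR does not consider the k most similar neighboring nodes during prediction but the global structure of the miRNA functional similarity graph. The effectiveness of HDMPR is verified on association data between 18 common human diseases and miRNAs. The experimental results show that, for most of the diseases, HDMPR achieves better prediction performance than HDMP and HDMPW. Overall, HDMP, HDMPW, and HDMPR can all provide reliable miRNA candidates associated with a specific disease for follow-up biological experiments and can guide biologists in validating possible disease-related miRNAs.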
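The neighbor-based idea in part (3) can be illustrated with a minimal Python sketch. It is not the HDMP algorithm from the thesis: the scoring rule (sum the similarities of those top-k neighbors already known to be disease-related), the function name `knn_disease_scores`, the miRNA names, and all similarity values are assumptions made for illustration, and the family and cluster weighting of HDMPW is not modelled.

```python
def knn_disease_scores(sim, known, k=2):
    """Score each unlabelled miRNA by its k most similar neighbors.

    sim   : dict mapping miRNA name -> {other miRNA name: functional similarity}
    known : set of miRNAs already associated with the disease
    k     : number of most similar neighbors to consult (hypothetical default)
    """
    scores = {}
    for mirna, neighbors in sim.items():
        if mirna in known:
            continue  # only unlabelled miRNAs are candidates
        # Keep the k neighbors with the highest similarity to this miRNA.
        top_k = sorted(neighbors.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # Assumed scoring rule: add up the similarities of the top-k
        # neighbors that are already known to be disease-related.
        scores[mirna] = sum(s for other, s in top_k if other in known)
    return scores


# Toy, made-up similarity values between four hypothetical miRNAs.
toy_sim = {
    "miR-a": {"miR-b": 0.8, "miR-c": 0.3, "miR-d": 0.1},
    "miR-b": {"miR-a": 0.8, "miR-c": 0.4, "miR-d": 0.2},
    "miR-c": {"miR-a": 0.3, "miR-b": 0.4, "miR-d": 0.7},
    "miR-d": {"miR-a": 0.1, "miR-b": 0.2, "miR-c": 0.7},
}
toy_scores = knn_disease_scores(toy_sim, known={"miR-a"}, k=2)
# Candidates sorted from most to least likely under the assumed rule.
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy example only miR-b has the known disease miRNA among its two most similar neighbors, so it is the only candidate with a non-zero score. This shows the local character of a k-nearest-neighbor scheme: miRNAs outside the immediate neighborhood of known disease miRNAs receive no evidence at all.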

【Abstract】 MicroRNAs (miRNAs) are a class of short (about 22 nucleotides) non-coding RNAs that play significant regulatory roles in various biological processes of animals and plants. Furthermore, accumulating evidence indicates that miRNAs are associated with various human diseases. The application of bioinformatics to miRNA research has greatly promoted the development of this cutting-edge area of biology. In this thesis, we study pre-miRNA classification, mature miRNA position prediction, and disease-related miRNA identification. The original contributions consist of the following four parts.

(1) A novel classification method based on the support vector machine (SVM) is proposed specifically for predicting plant pre-miRNAs. Identification of miRNAs is the first step in miRNA functional studies. Detecting miRNAs by experimental techniques is expensive and time-consuming, and it is difficult to identify miRNAs that are expressed at low levels or only in specific tissues or developmental stages. Computational prediction methods can therefore provide biologists with potential pre-miRNA candidates. Considering the characteristics of pre-miRNAs, we propose a classification method based on SVM. Good features and good positive/negative (real/pseudo pre-miRNA) datasets are the basis for constructing an efficient classification model. Therefore, sequence-related, structure-related, and energy-related features are extracted from real and pseudo plant pre-miRNAs, and a set of informative features is selected by our feature selection method based on a genetic algorithm. Because pseudo plant pre-miRNAs are lacking, we extract pseudo hairpin sequences from the protein-coding sequences of Arabidopsis thaliana, Oryza sativa, and Glycine max. These pseudo hairpin sequences are used as negative samples. Considering the class imbalance between real and pseudo pre-miRNAs, the classification model (PlantMiRNAPred) is constructed by combining ensemble learning and the AdaBoost method. PlantMiRNAPred achieves more than 90% accuracy on the datasets from 8 plant species: Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Zea mays, and Glycine max. PlantMiRNAPred is therefore valuable for identifying plant pre-miRNAs. In addition, we construct a classification model, HumanMiRNAPred, from human pre-miRNA data. HumanMiRNAPred achieves higher prediction performance, which helps facilitate the identification of human pre-miRNAs.

(2) A machine learning method based on the support vector machine is proposed to predict the positions of miRNAs within new pre-miRNA candidates. Most machine-learning-based pre-miRNA classification methods can distinguish real pre-miRNAs from pseudo pre-miRNAs, but few can predict the positions of the miRNAs. However, to identify the actual miRNAs efficiently, their positions usually have to be given for the subsequent biological experiments. Therefore, a position prediction method is proposed. First, a miRNA:miRNA* duplex is regarded as a whole to capture the binding characteristics between a miRNA and its corresponding miRNA*. Second, we extract features from real and pseudo miRNA:miRNA* duplexes and select the informative features to improve the prediction accuracy. Third, a two-stage sample selection algorithm is proposed to combat the serious imbalance between real and pseudo miRNA:miRNA* duplexes. The representative negative training samples (pseudo miRNA:miRNA* duplexes) are selected according to their distribution density in the high-dimensional sample space and their prediction deviations. The resulting prediction method, MaturePred, achieves higher prediction accuracy than the existing methods. MaturePred can provide more reliable animal and plant miRNA candidates for subsequent experiments.

(3) On the basis of accurately measuring the functional similarity of two miRNAs, a method based on the k most similar neighboring miRNAs is proposed for predicting disease-related miRNAs. The abnormal expression of miRNAs is one of the important causes of various diseases. Therefore, the identification of human disease-related miRNAs is important for investigating their involvement in the pathogenesis of diseases. It is known that miRNAs with similar functions are often associated with similar diseases, and vice versa. The functional similarity of two miRNAs can therefore be inferred by measuring the semantic similarity of their associated diseases. We achieve a more accurate measurement of miRNA functional similarity by considering the information content of disease terms. A new prediction algorithm, HDMP, based on the k most similar neighboring miRNAs is presented for predicting disease-related miRNAs. In addition, miRNAs that belong to the same miRNA family or are located in the same cluster are more similar to each other. We therefore also propose a prediction algorithm that uses miRNA family and cluster information, referred to as HDMPW. HDMP and HDMPW are shown to be effective in predicting potential disease-related miRNA candidates for 18 human diseases. HDMP can be easily extended to other diseases as miRNA-disease association data for specific diseases increase rapidly.

(4) On the basis of a miRNA functional similarity graph, a method based on random walk is proposed for predicting disease-related miRNAs. The miRNA functional similarity graph is constructed by calculating the functional similarity of each pair of miRNAs. A prediction algorithm based on random walk with restart, HDMPR, is proposed for predicting disease-related miRNAs. Unlike HDMP and HDMPW, HDMPR does not consider the k most similar neighboring miRNAs but rather the global structure of the miRNA functional similarity graph. The effectiveness of HDMPR is validated on the association data of 18 human diseases. The experimental results indicate that HDMPR achieves higher prediction performance than HDMP and HDMPW for most of the 18 diseases. Overall, HDMP, HDMPW, and HDMPR are useful in providing reliable disease-related miRNA candidates for subsequent biological testing.
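Random walk with restart, the technique named in part (4), can be sketched in a few lines of Python. This is a generic textbook formulation, not the HDMPR implementation from the thesis: the function name `random_walk_with_restart`, the restart probability of 0.2, the convergence tolerance, and the toy similarity matrix are all assumptions made for illustration.

```python
def random_walk_with_restart(sim, seeds, restart=0.2, tol=1e-10, max_iter=1000):
    """Rank the nodes of a weighted similarity graph by random walk with restart.

    sim     : square matrix (list of lists) of non-negative similarities
    seeds   : set of node indices already known to be disease-related
    restart : probability of jumping back to a seed at each step (assumed value)
    """
    n = len(sim)
    # Column-normalize: column j holds the transition probabilities out of node j.
    col_sums = [sum(sim[i][j] for i in range(n)) for j in range(n)]
    trans = [[sim[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0
              for j in range(n)] for i in range(n)]
    # Initial distribution: uniform over the seed nodes.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # One step: follow an edge with probability 1 - restart,
        # otherwise jump back to the seed distribution.
        nxt = [(1 - restart) * sum(trans[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p


# Toy, made-up similarity graph over four hypothetical miRNAs; node 0 is the seed.
toy_graph = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]
seed_nodes = {0}
rwr_scores = random_walk_with_restart(toy_graph, seed_nodes)
# Candidate nodes (non-seeds) ordered from highest to lowest stationary probability.
rwr_ranking = sorted((i for i in range(len(toy_graph)) if i not in seed_nodes),
                     key=lambda i: rwr_scores[i], reverse=True)
```

Node 3 has no direct edge to the seed, yet it still receives a non-zero score through paths via nodes 1 and 2. That is the sense in which a random walk uses the global structure of the graph rather than only a node's nearest neighbors.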
