生物数据特征提取方法及应用研究

Research on Method of Feature Extraction for Biological Data and Its Application

【Author】 杨昂 (Yang Ang)

【Supervisor】 李仁发 (Li Renfa)

【Author information】 Hunan University, Computer Application Technology, 2012, PhD

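【Summary】 With the rapid development of high-throughput technologies, research has produced massive volumes of biomedical data. Discovering biologically meaningful knowledge and regularities in these data is one of the most challenging biological problems of the post-genomic era. Sequence data are growing rapidly, yet the functions of many genes and proteins involved in important life processes remain unknown. Because biological data are inherently complex and different research fields apply different evaluation criteria, it is hard to uncover the functional information of genes and proteins from the raw data alone, so researchers have turned to feature extraction to mine the regularities in bioinformatics data. Feature extraction is among the most basic problems in bioinformatics, and the quality of a feature extraction algorithm directly affects the accuracy of information extraction and analysis. This thesis focuses on gene data and protein data and studies feature extraction for both in depth. Based on the characteristics of each data type and its application background, it proposes three feature extraction algorithms and verifies and analyses their accuracy and reliability on standard data sets. The main work is summarized as follows:

(1) Protein feature extraction underlies protein-related applications, and incomplete feature extraction is one of the main factors limiting the effective extraction of protein features. To address this, the thesis proposes a sequence feature extraction method based on hybrid features. The method builds a vector from several kinds of protein sequence feature information and uses it as the protein's feature vector, which is then fed to an SVM or KNN classifier to predict the protein's subcellular localization. Compared with other sequence-based subcellular localization methods, it predicts localization automatically without prior knowledge of protein structure. The experimental results and timing analysis show that the proposed method is more accurate than several other methods, which supports its correctness and effectiveness.

(2) Most protein feature extraction methods emphasize local information, so the resulting features are still incomplete. To address this, the thesis proposes a numerical sequence feature extraction method that ignores protein structure and interaction information and builds the feature vector from properties such as hydrophobicity, polarity, and charge. The resulting features capture both global and local information about the protein sequence. Feature vectors extracted this way are combined with the k-nearest-neighbor (KNN) classification algorithm to predict protein functional classes, targeting proteins with little or no interaction information. To examine whether subcellular localization affects protein function prediction, subcellular location information is also merged into the proposed features and used for function prediction. Experiments show that the result is better than other methods in some respects, which further supports the effectiveness of the proposed method.

(3) Gene expression data are high-throughput, high-dimensional, nonlinear, noisy, and unevenly distributed, which directly hampers effective extraction of the information they contain. The thesis proposes a new feature gene selection algorithm tailored to these characteristics. The method combines the filter and wrapper approaches to feature selection: after filtering the raw data, it applies KNN to cluster each gene and then introduces a cluster compactness index to further reduce the dimensionality of the feature genes. To account for gene-gene interactions, it also introduces a new search strategy for feature genes. The selected feature genes achieve high recognition accuracy with low redundancy. The method is applied to tumor subtype recognition experiments and to experiments on selecting key SNPs, and the results show that it performs well.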

【Abstract】 With the rapid development of high-throughput technologies, a flood of biomedical data has come into being. One of the most challenging biological problems of the post-genome era is how to extract biologically significant knowledge and regularities from these massive data. Although sequence data are booming, the functions of many genes and proteins involved in important life activities remain unknown. Because biological data are complex and evaluation criteria differ across research areas, it is difficult to discover the functional information of genes and proteins from the data alone, so researchers have begun to mine the regularities in bioinformatics data by means of feature extraction. Feature extraction is one of the most fundamental problems in bioinformatics, and the quality of a feature extraction algorithm directly affects the accuracy of information extraction and analysis of biological data. This thesis explores feature extraction from gene and protein data in depth. According to the characteristics of the data and their application background, we propose three feature extraction algorithms and verify the accuracy and reliability of the methods. The work is summarized as follows:

(1) Protein feature extraction is the basis of protein-related application problems, and incomplete feature extraction is one of the main factors limiting the effective extraction of protein features. To address this problem, this thesis presents a sequence feature extraction method based on mixed features. The method constructs a vector from several kinds of protein sequence feature information and uses it as the feature vector of the protein. This feature vector serves as the input of an SVM or KNN classifier to predict the subcellular localization of the protein. Compared with other sequence-based protein subcellular localization methods, the method predicts localization automatically without prior knowledge of protein structure. The experimental results show that the proposed method is superior to other methods in accuracy, which supports the correctness and effectiveness of the approach.

(2) Most protein feature extraction methods emphasize local information, so the constructed features are still incomplete. To address this problem, a numerical sequence feature extraction method is proposed. The method ignores the protein's structure and interaction information and constructs the feature vector of the protein from hydrophobicity, polarity, charge, and other properties. The features obtained in this way contain both global and local information about the protein sequence. We extract protein sequence feature vectors and combine them with the k-nearest-neighbor (KNN) classification algorithm to predict protein functional classes, addressing function prediction for proteins with little or no interaction information. To examine whether subcellular localization affects protein function prediction, subcellular location information is merged into the proposed features and used for function prediction. Experiments show that the result is superior to other methods in some respects, which also confirms the effectiveness of the proposed method.

(3) This thesis presents a new feature gene selection algorithm and applies it to tumor subtype recognition. The method takes into account both the filter and wrapper approaches to feature selection. A filter method is first used to reduce the dimensionality of the gene set, and a cluster compactness index is then introduced to further reduce the number of feature genes. Interactions between genes are also taken into account, so that the selected gene subset carries highly discriminative classification information with low redundancy and achieves high recognition accuracy. SVM is used as the classifier, and the experiments are based on four gene expression data sets in common international use. The results show that the proposed method is superior to some other methods.
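The filter-then-wrapper idea in item (3) can be sketched generically as follows. This is a minimal sketch under stated assumptions, not the thesis's algorithm: the filter stage here ranks genes by a two-class signal-to-noise score, and the wrapper stage is a greedy forward search scored by leave-one-out 1-nearest-neighbor accuracy. The thesis's cluster compactness index and its gene search strategy are not specified in this record and are not reproduced. The function names and the two-class labels 0 and 1 are hypothetical.

```python
from statistics import mean, pstdev


def snr_score(values, labels):
    """Signal-to-noise score of one gene for a two-class problem (labels 0/1)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(a) + pstdev(b)
    return abs(mean(a) - mean(b)) / (spread + 1e-12)  # avoid division by zero


def filter_genes(X, labels, keep):
    """Filter stage. X is samples (rows) by genes (columns); keep top genes."""
    n_genes = len(X[0])
    scores = [snr_score([row[g] for row in X], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:keep]


def loo_accuracy(X, labels, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `genes`."""
    correct = 0
    for i, row in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((row[g] - X[j][g]) ** 2 for g in genes),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(X)


def wrapper_select(X, labels, candidates, max_genes):
    """Wrapper stage: add the gene that most improves accuracy, stop when none does."""
    selected, best_acc = [], 0.0
    while len(selected) < max_genes:
        trials = [
            (loo_accuracy(X, labels, selected + [g]), g)
            for g in candidates
            if g not in selected
        ]
        if not trials:
            break
        acc, gene = max(trials, key=lambda t: t[0])
        if acc <= best_acc:
            break  # no remaining gene improves the classifier
        selected.append(gene)
        best_acc = acc
    return selected, best_acc
```

A typical call would be `wrapper_select(X, labels, filter_genes(X, labels, keep=50), max_genes=10)`, where `keep` and `max_genes` are arbitrary illustrative values. Stopping as soon as accuracy no longer improves is one simple way to discourage redundant genes from entering the subset.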

  • 【Online publication contributor】 Hunan University
  • 【Online publication year/issue】 2014, Issue 07