多分类器系统在蛋白质功能预测方面的应用

Multiple Classifier Systems for Protein Function Prediction

【Author】 Huang Danmei (黄丹梅)

【Advisor】 Liang Yanchun (梁艳春)

【Author information】 Jilin University, Bioinformatics, 2010, Master's thesis

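【Abstract (translated from the Chinese)】 As an important branch of data mining, classification is widely applied. After years of research and development, many classical classification methods have become familiar to researchers, such as k-nearest neighbors, Bayesian methods, decision trees, support vector machines, and neural networks. These traditional methods have certain limitations, so researchers proposed multiple classifier systems; at the same time, research on multiple classifier systems itself faces several important problems. Protein function prediction is one of the main challenges of the post-genomic era, and many machine learning algorithms have gradually been developed for it. G-protein coupled receptors (GPCRs) are a very important class of signaling-molecule receptors, named for their ability to bind to G proteins and regulate their activity. Because of their structural features and their important role in signal transduction, GPCRs can serve as drug targets: 20% of current best-selling drugs are GPCR-related, and roughly one third of the small-molecule drugs on the world market are GPCR agonists or antagonists. In addition, GPCR dysfunction leads to a variety of diseases. Studying data on GPCR function therefore has very important practical value. Using data mining techniques and building on earlier theoretical and practical results, this thesis proposes improvements and strategies for the main research problems in implementing multiple classifier systems, implements the system on the weka data mining platform, and applies it to GPCR function data. The experimental results show that the classification performance of the system improves to a certain degree.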

【Abstract】 With the rapid development of information technology, data mining techniques have emerged to extract important hidden information from large stores of data. Within data mining, classification is an important data analysis technique: it analyzes input data by training on a data set with known characteristics, finds an accurate description or model, and then predicts the class of unknown samples. Classification has been studied extensively in artificial intelligence, machine learning, pattern recognition, and other fields, and there are many traditional classification algorithms. However, these algorithms, which train on a data set of known classes to obtain a single classifier, are weak in scalability and efficiency, and they have great difficulty handling classification tasks on large, complex data. Multiple classifier systems were therefore proposed. They use a combination of member classifiers, related test information, and an ensemble method to obtain a combined classification prediction, thereby improving classification accuracy and reliability. How to obtain more useful information from the different members of such an integrated system in order to improve classification performance has become an important research question in data mining. Classification usually requires predicting the class label of the data to be predicted. In a sample set, each sample belongs to one of a set of discrete, unordered classes. A classification algorithm learns from the training set, analyzes it, and builds a classification model; the next phase uses this model to classify data of unknown class.
Here we describe traditional classification techniques, including commonly used classifier models such as k-nearest neighbors, decision trees, support vector machines, Bayesian methods, and neural networks, and then introduce methods for evaluating classifier performance, such as the hold-out method and cross-validation. For a multiple classifier system to perform well, a necessary and sufficient condition is that the base classifiers be both accurate and diverse. In other words, a multiple classifier system has to address the following issues: the strategy for generating base classifiers, the selection of base classifiers, the method for fusing base classifiers, and the assessment of the system. The “overproduce and choose” strategy is adopted. Base classifiers can be generated by manipulating the data set, the classes, or the attributes, by changing the structure of the classification model, or by improving the classification algorithm. The author studies the structure of multiple classifier systems and the integration strategies at each level, investigates diversity evaluation, and summarizes combination methods. The thesis then proposes a classifier generation strategy that trains on data sets from different sources, a data set manipulation method that extracts the most representative samples. It takes classification performance into account when selecting the representative data set and can generate candidate classifiers with better performance. From these candidate classifiers an optimal subset must be selected. To account for the performance of the system as a whole, we use a selection method based on diversity and accuracy. This method considers not only the diversity addressed by conventional classifier selection but also the accuracy of the individual classifiers and the performance of the ensemble, which helps improve overall classification accuracy.
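The “overproduce and choose” selection step described above can be illustrated with a minimal Python sketch. The abstract does not give the actual selection criterion, so the scoring rule used here (a weighted sum of mean individual accuracy and mean pairwise disagreement), the `weight` parameter, and all function and classifier names are assumptions made for illustration only.

```python
from itertools import combinations


def accuracy(preds, labels):
    """Fraction of validation samples predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def disagreement(preds_a, preds_b):
    """Pairwise diversity: fraction of samples on which two classifiers differ."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def select_subset(candidates, labels, size, weight=0.7):
    """Choose `size` classifiers from an overproduced candidate pool.

    `candidates` maps a (hypothetical) classifier name to its predictions on a
    validation set. The score, weight * mean accuracy + (1 - weight) * mean
    pairwise disagreement, is an assumed criterion, not the thesis's formula.
    """
    best_names, best_score = None, float("-inf")
    for names in combinations(sorted(candidates), size):
        preds = [candidates[n] for n in names]
        mean_acc = sum(accuracy(p, labels) for p in preds) / size
        pairs = list(combinations(preds, 2))
        mean_div = (
            sum(disagreement(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        )
        score = weight * mean_acc + (1 - weight) * mean_div
        if score > best_score:  # strict '>' keeps the first subset on ties
            best_names, best_score = names, score
    return best_names
```

Exhaustive search over subsets grows exponentially with the pool size, so a sketch like this is only practical for a small candidate pool; a greedy or heuristic search would be a natural substitute for larger ones.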
In the final phase, the outputs of the member classifiers are combined using the maximum principle to determine the final output. On protein function prediction, this thesis introduces the commonly used protein databases and, from the perspective of machine learning, divides protein function prediction methods into three categories: supervised, semi-supervised, and unsupervised methods. In this thesis, multiple classifier systems show good results in both theoretical and technical respects. However, many problems need deeper study. For example, there are the topology of multiple classifier systems and decision-level integration; the selection of an optimal subset from the candidate classifier set, which needs to consider independence among classifiers, diversity, locality, and other conditions; and how to integrate the outputs of multiple member classifiers to obtain better classification performance, which involves building a fusion system. The various factors that affect the classification system should therefore all be considered. When selecting member classifiers, attention should be paid to their mutual independence, to whether a sounder theoretical analysis can yield a better measure of the correlation among member classifiers, and to considering classifier generation and combination together. In addition, optimizing system design is a research priority and has produced some meaningful results, but the best multiple classifier system architecture still cannot be chosen dynamically for a given classification task; this remains an unresolved issue. Finally, research on multiple classifier systems has tended to stay within this conventional pattern, and other ways of improving them may be worth exploring.
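The fusion step that the abstract calls the maximum principle is not defined there. A minimal Python sketch follows, assuming it corresponds to the standard maximum rule for classifier fusion: for each class, take the highest support any member assigns to it, then output the class with the largest such value. The function name, the dictionary-based input format, and the class labels are hypothetical.

```python
def max_rule(member_outputs):
    """Fuse per-class supports from several member classifiers.

    `member_outputs` is a list with one dict per member classifier, each
    mapping a class label to that member's support (e.g. an estimated
    probability) for the sample being classified. This assumes the
    abstract's "maximum principle" means the standard maximum rule.
    """
    classes = member_outputs[0].keys()
    # Per-class support of the ensemble: the largest value over all members.
    support = {c: max(out[c] for out in member_outputs) for c in classes}
    # Final decision: the class with the largest ensemble support.
    return max(support, key=support.get)
```

Note that this rule can differ from majority voting: a single member that is highly confident in one class can outweigh several members that weakly prefer another.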

  • 【Online publication contributor】 Jilin University
  • 【Online publication issue】 2010, Issue 09
  • 【Classification code】 Q51; TP311.13
  • 【Downloads】 135