
Research on Protein Purification Methods Based on an Improved ID3 Algorithm

【Author】 Zhao Tongxin (赵桐锌)

【Advisor】 Liu Wenqi (刘文琦)

【Author information】 Dalian University of Technology, Measuring and Testing Technology and Instruments, 2011, Master's thesis

【Abstract】 Biotechnology is developing rapidly, and determining protein production processes is an active and important research topic in the field. Purification is a key step in protein production, and protein separation and purification techniques are widely used in production and related research. Traditionally, a purification method is settled through repeated trials guided by operator experience, which is costly and slow. Because a protein's properties are related to the purification method that suits it, this thesis introduces data mining into the selection of purification methods. Decision trees reflect the characteristics of the data directly, are easy to interpret, classify and predict well, make decision rules easy to extract, and handle non-numeric data well. The thesis applies the ID3 decision tree algorithm to classify historical protein data sets and uncover hidden relationships between protein properties and purification methods. ID3 is grounded in information theory and uses information entropy and information gain as its criteria for inductive classification. However, ID3 cannot handle continuous-valued data and is biased toward attributes with many values, so it cannot be applied directly to purification method selection. The thesis therefore proposes an improved ID3 algorithm (RS-ID3), which uses rough set theory to discretize the data and the information gain ratio to measure attribute importance, overcoming these limitations of traditional ID3. RS-ID3 is compared with C4.5, another improved version of ID3, on data sets from the UCI machine learning repository, and the comparison shows that the proposed method classifies better. Finally, RS-ID3 is applied to the exploration of protein purification processes, where a case study also shows good results; the method thus supports the determination of purification methods.
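The criteria named in the abstract (information entropy, information gain, and information gain ratio) can be illustrated with a minimal Python sketch. It is not the thesis's RS-ID3 implementation: the function names and the toy protein records are hypothetical, and the rough-set discretization step is not shown. It only illustrates why gain ratio counters ID3's bias toward attributes with many values.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the attribute value of each record."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return groups


def information_gain(values, labels):
    """ID3 criterion: entropy reduction obtained by splitting on the attribute."""
    n = len(labels)
    remainder = sum(
        len(g) / n * entropy(g) for g in partition(values, labels).values()
    )
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(values)
    if split_info == 0:  # attribute has a single value, so it cannot split
        return 0.0
    return information_gain(values, labels) / split_info


# Hypothetical toy records (illustration only, not data from the thesis).
methods = ["ion_exchange", "ion_exchange", "affinity", "affinity"]
sample_id = [1, 2, 3, 4]               # unique per record: many-valued attribute
charge = ["pos", "pos", "neg", "neg"]  # two-valued attribute

for name, attr in [("sample_id", sample_id), ("charge", charge)]:
    print(name, information_gain(attr, methods), gain_ratio(attr, methods))
```

On this toy data both attributes reach the same information gain, so plain ID3 cannot tell the uninformative record identifier apart from the meaningful attribute, while the gain ratio penalizes the many-valued `sample_id`.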

【Key words】 Data mining; ID3; Protein purification; Discretization; Decision tree