
Research on Chemical Process Optimization and QSAR/QSPR of Organic Compounds Using Data Mining

【Author】 Yang Shansheng

【Supervisor】 Lu Wencong

【Author information】 Shanghai University, Materials Science, 2008, PhD

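【Abstract, translated from the Chinese】 Data mining is the computational process of discovering previously unknown information and knowledge in large volumes of data by combining multiple algorithms. As an interdisciplinary technology, it has become a key research topic in database systems and machine learning, and its broad application prospects have attracted wide attention from academia and industry. This thesis applies data mining to the optimization of several chemical processes and to structure-activity/property relationship studies of compounds. The main work and results are as follows:

1. Building on a systematic study of data-mining-based production optimization for ammonia synthesis units, a proprietary data mining optimization system, the DMOS ammonia synthesis optimization system, was developed to optimize operating parameters in industrial ammonia production. It consists of off-line and on-line optimization software. Its notable features include the fusion of different data mining methods, automatic modeling, model updating, a multi-model optimization strategy, on-line monitoring and optimization, and a user-friendly interface, which make it powerful, easy to operate, and adaptable. The DMOS system was also used to mine production data from synthesis towers No. 1, No. 2, and No. 3 of the ammonia synthesis unit at the Zhanhua Branch of Yunwei Group Co., Ltd. For each tower, the main process parameters affecting the fresh gas flow into the tower were identified, and a mathematical model relating the fresh gas flow to those parameters was built. The results show that the models are reliable and can guide production optimization.

2. From a technical or economic standpoint, chemical process optimization is a necessary means of improving an enterprise's competitiveness and economic returns. Data mining was applied to production optimization of the 1,2,4-trimethylbenzene unit of a refinery and the aromatics extraction unit of a petrochemical company. In particular, the support vector machine (SVM) method, which suits modeling with small samples, was applied to these two processes for the first time. The main process parameters affecting each unit's optimization target were identified, and qualitative and quantitative models relating the target to those parameters were built. The results show: (a) Higher bottom temperatures of tower C01 (T01-01) and tower C02 (T02-01) (both controlled at 211±0.5℃) and a larger tray temperature difference of tower C01 (dT01) (30.5±0.5℃) favor a higher 1,2,4-trimethylbenzene yield. The support vector classification (SVC) model for the yield has classification and prediction accuracies of 100% and 96.2%, respectively; the support vector regression (SVR) model for the yield has fitting and prediction root mean square errors (RMSE) of 0.028 and 0.034, respectively. (b) A higher bottom temperature of tower T4504 (T04-01) (203.5±0.5℃), a lower sensitive temperature of tower T4503 (T03-02) (126±0.5℃), and a lower reflux ratio (R) (0.27±0.2) favor a lower aromatic content in the raffinate of the aromatics extraction unit. The SVC model for the aromatic content of the raffinate has classification and prediction accuracies of 100% in both cases; the SVR model has fitting and prediction RMSE values of 0.072 and 0.060, respectively. Finally, with the participation of the plant's production technicians, production optimization schemes based on the unit models were drawn up and successfully applied in practice, playing an important role in stabilizing production and improving economic returns. According to preliminary statistics, the two optimization projects have generated direct economic benefits of nearly 6 million yuan.

3. Eight quantum chemical structural parameters of 139 polycyclic aromatic hydrocarbons (PAHs) were calculated by density functional theory (DFT). A genetic algorithm (GA)-SVR feature selection method was used to obtain the best sets of quantum chemical parameters for correlating the boiling point (bp), n-octanol/water partition coefficient (logKow), and chromatographic retention index (RI) of PAHs, and the SVR model parameters were optimized by leave-one-out cross-validation on the training sets. For the SVR models of bp, logKow, and RI, the R² values for fitting the training sets (45, 52, and 90 samples, respectively) are 0.997, 0.964, and 0.950, and the q² values for predicting the test sets (12, 13, and 23 samples, respectively) are 0.999, 0.897, and 0.931. The results show that SVR combined with DFT-calculated quantum chemical parameters can yield good QSPR models for several physical properties of PAHs, with good predictive performance.

4. A QSPR model was developed to predict the n-octanol/water partition coefficient (logKow) of structurally diverse aromatic compounds. First, 68 molecular structural parameters of 350 aromatic compounds were calculated with different chemistry software packages. A minimum redundancy maximum relevance (mRMR)-GA-SVR feature selection method then gave a set of 7 preferred molecular structural parameters, and the SVR model parameters were optimized by five-fold cross-validation. Finally, the SVR algorithm was used to build a QSPR model of logKow from a training set of 300 aromatic compounds, and the model was used to predict logKow for a test set of 50 aromatic compounds. The SVR fitting and prediction results were also compared with those of artificial neural network (ANN), multiple linear regression (MLR), and partial least squares (PLS) models. The SVR model gives R² and q² values of 0.85 and 0.84 for fitting and prediction, respectively, clearly better than ANN (0.82 and 0.80), MLR (0.77 and 0.77), and PLS (0.77 and 0.77).

5. A QSAR model was built for the toxicity of 581 aromatic compounds bearing different substituents to Tetrahymena pyriformis. The mRMR-GA-SVR feature selection method was used to select, from 68 calculated molecular descriptors, the 6 descriptors that best correlate with toxicity. The SVR model parameters were optimized by five-fold cross-validation on the training set, a QSAR model was built by SVR from a training set of 500 aromatic compounds, and the model was then used to predict the toxicity of a test set of 81 aromatic compounds. The predictive performance of the SVR model was compared with that of a PLS model. The SVR model gives R² and q² values of 0.77 and 0.67 for fitting and prediction, respectively, clearly better than the PLS model (R² and q² of 0.69 and 0.58).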
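The RMSE, R², and q² figures quoted above can be computed along the following lines. This is a minimal sketch, not the thesis's code: the function names are hypothetical, and the external q² shown here is referenced to the training-set mean, which is one common convention. The abstract does not say which definition was used.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination for a fitted (training) set."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def q_squared_external(y_test, y_pred, y_train_mean):
    """External predictive q2, referenced to the training-set mean.

    Assumed definition; other conventions use the test-set mean instead.
    """
    ss_res = sum((t - p) ** 2 for t, p in zip(y_test, y_pred))
    ss_tot = sum((t - y_train_mean) ** 2 for t in y_test)
    return 1.0 - ss_res / ss_tot
```

With these definitions, a perfect model gives an RMSE of 0 and R² and q² of 1, while a model that always predicts the reference mean gives R² or q² of 0.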

【Abstract】 Data mining (DM), a multidisciplinary research area, is a technology for finding unknown, hidden, and interesting knowledge in massive data. It has been recognized as a key research topic in databases and machine learning, and its large application potential has aroused wide interest in scientific and industrial circles. Carrying out experiments, finding regularities in the data obtained, and making predictions about unknown phenomena are the chief mode of research in chemistry and related disciplines, including chemical engineering, materials science, and environmental science. With the progress of computer science and technology, computerized data processing, or so-called machine learning, has been widely used in chemical research and in optimal control in the chemical industry. Up to now, the statistical methods used in chemistry have almost all been based on classical statistical theory. One of the basic principles of classical statistics is the law of large numbers: as the number of observations tends to infinity, the empirical distribution function converges to the actual distribution function. In other words, a reliable mathematical model obtained by machine learning would in principle require a training set with an infinite number of samples. In any practical problem, including machine learning tasks in chemistry and chemical engineering, it is impossible to have so many samples for training and model building; on the contrary, in most chemical data processing work the number of training samples is quite small. In recent years, a widely recognized theory of statistical science, statistical learning theory (SLT), has been proposed to address this small-sample problem.
A new machine learning method, the support vector machine (SVM), has been developed in the spirit of statistical learning theory. SVM has been used in many application fields, including image recognition, text categorization, and DNA research, with rather good results. These powerful data processing techniques are now also used in chemistry and related disciplines, where SVM has a bright future as a tool.

This thesis focuses on the application of data mining to chemical process optimization and to quantitative structure-activity/property relationships (QSAR/QSPR) of compounds. Over the last decades, process optimization and monitoring have developed into an important branch of research and development in chemical engineering. Large volumes of chemical process operation and control data are collected by modern distributed control and automatic data logging systems. With data mining, the useful information hidden in these data can be extracted not only for fault diagnosis but also for optimal control aimed at saving energy, increasing yield, reducing pollution, and lowering production cost. The study of QSAR/QSPR is a topic in chemistry and one of the most important steps in molecular design. In QSAR/QSPR work, known data for similar compounds are used as training samples, and the number of training samples is usually no more than several hundred. Its flexibility in classification and its ability to approximate continuous functions make SVM very suitable for QSAR/QSPR studies. The work and contributions of this thesis are as follows:

1. A comprehensive graphical software package, the Data Mining Optimization System (DMOS), has been developed for ammonia synthesis optimization and monitoring.
DMOS integrates most modern optimization methods, including database search, pattern recognition, artificial intelligence, statistical learning, and domain knowledge. Some novel computational techniques developed in our lab are also implemented in it. DMOS has two versions, off-line and on-line. Its features include method fusion, feature selection, automatic modeling, model validation, model updating, multi-model building, and on-line monitoring, which help solve the optimization and monitoring problems of the complex ammonia synthesis process. DMOS has been successfully applied to the ammonia synthesis process: the main technical parameters affecting the flow of fresh synthesis gas were found, and qualitative and quantitative models relating the flow of fresh synthesis gas to those parameters were built. DMOS can be expected to have great potential in the optimization and monitoring of ammonia synthesis and other chemical processes.

2. From both technical and economic viewpoints, chemical process optimization is an indispensable means of increasing the competitiveness and profit of chemical enterprises. In this work, two data-mining-based chemical process optimizations are studied: a 1,2,4-trimethylbenzene unit and an aromatic hydrocarbon extraction unit. The SVM method, which is especially appropriate for modeling small data sets, was applied to these two processes for the first time. Moreover, traditional methods, including the Fisher and PCA methods, are used as complementary methods, since they have their own advantages compared with SVM: they give many linear projection figures rich in information, from which domain experts, including chemists and chemical engineers, can draw useful inspiration. From these models, the main technical parameters affecting the objective function are found.
Qualitative and quantitative models relating the objective function to these technical parameters are then built. The optimization results are as follows: (a) A higher bottom temperature (about 211±0.5℃) of tower C01 (T01-01) and tower C02 (T02-01) and a larger tray temperature difference (about 30.5±0.5℃) of tower C01 (dT01) favor a higher 1,2,4-trimethylbenzene yield. The classification accuracy of the SVC model for the 1,2,4-trimethylbenzene yield is 100% on the training set and 96.2% on the prediction set. The root mean square errors (RMSE) of the SVR model for the 1,2,4-trimethylbenzene yield are 0.028 and 0.034 on the training and prediction sets, respectively. (b) A higher bottom temperature (about 203.5±0.5℃) of tower T4504 (T04-01), a lower sensitivity temperature (126±0.5℃) of tower T4503 (T03-02), and a lower reflux ratio (0.27±0.2) of tower T4503 (R) favor a lower aromatic content in the raffinate. The classification accuracy of the SVC model for the aromatic content of the raffinate is 100% on both the training and prediction sets. The RMSE values of the SVR model for the aromatic content of the raffinate are 0.072 and 0.060 on the training and prediction sets, respectively.

3. Quantitative structure-property relationship (QSPR) models were developed to correlate the structures of polycyclic aromatic hydrocarbons (PAHs) with their boiling point (bp), n-octanol/water partition coefficient (logKow), and retention index (RI) in reversed-phase liquid chromatography. The quantum chemical descriptors of 139 PAHs were calculated from fully optimized geometries at the B3LYP/6-311G** level of theory. The descriptors were first screened by the genetic algorithm (GA)-support vector regression (SVR) method, and the parameters of the SVR models were then optimized by leave-one-out cross-validation.
The SVR models for bp, logKow, and RI were developed from training sets of 45, 52, and 90 compounds, respectively, and were then tested on external test sets of 12, 13, and 23 compounds, respectively. The good coefficients of determination (R²=0.997, 0.964, and 0.950, respectively) and satisfactory external predictive ability (q²=0.999, 0.897, and 0.931, respectively) for bp, logKow, and RI show that the SVR method with DFT-based descriptors can model bp, logKow, and RI for a diverse set of PAHs.

4. A QSPR model was developed to correlate the structures of aromatic compounds with their n-octanol/water partition coefficient (logKow). The 68 molecular descriptors, derived solely from the structures of the aromatic compounds, were calculated using Gaussian 03, HyperChem 7.5, and TSAR V3.3. The descriptors were screened by the minimum Redundancy Maximum Relevance (mRMR)-genetic algorithm (GA)-support vector regression (SVR) method. The parameters of the SVR model were optimized by five-fold cross-validation. The QSPR model was developed with the SVR method from a training set of 300 compounds, with a good coefficient of determination (R²=0.85), and was then tested on an external test set of 50 compounds, with satisfactory external predictive ability (q²=0.84).

5. A quantitative structure-activity relationship (QSAR) study was performed to develop a model correlating the structures of 581 aromatic compounds with their aquatic toxicity to Tetrahymena pyriformis. A set of 68 molecular descriptors, derived solely from the structures of the aromatic compounds, was calculated with Gaussian 03, HyperChem 7.5, and TSAR V3.3. A comprehensive feature selection method, mRMR-GA-SVR, was applied to select the best descriptor subset for the QSAR analysis. The SVR method was used to model toxic potency from a training set of 500 compounds.
Five-fold cross-validation was used to optimize the parameters of the SVR model, which was then tested on an external test set of 81 compounds. A good coefficient of determination (R²=0.77) and external predictive ability (q²=0.67) were obtained, indicating the potential of SVR for predicting toxicity.
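The cross-validated parameter selection described above can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's procedure: a one-dimensional nearest-neighbour regressor stands in for SVR so that the example needs only the Python standard library, the data are synthetic, and every function name is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in regressor (NOT SVR): mean target of the nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    chosen = nearest[:n_neighbors]
    return sum(train_y[i] for i in chosen) / len(chosen)

def cv_rmse(xs, ys, n_neighbors, k=5, seed=0):
    """Pooled RMSE over the held-out folds for one hyperparameter value."""
    sq_err = 0.0
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        for i in fold:
            sq_err += (knn_predict(tr_x, tr_y, xs[i], n_neighbors) - ys[i]) ** 2
    return (sq_err / len(xs)) ** 0.5

def select_hyperparameter(xs, ys, candidates, k=5, seed=0):
    """Return the candidate with the lowest cross-validated RMSE."""
    return min(candidates, key=lambda c: cv_rmse(xs, ys, c, k, seed))

# Synthetic, noise-free data for illustration only (not from the thesis).
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
best = select_hyperparameter(xs, ys, candidates=[1, 3, 5, 9])
```

Setting `k` equal to the number of samples puts one sample in each fold, which gives the leave-one-out scheme mentioned for the PAH models. In a real study the stand-in regressor would be replaced by an SVR implementation and the candidate list by a grid over its parameters.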

  • 【Online publication contributor】 Shanghai University
  • 【Online publication issue】 2009, Issue 01