
Structural Interpretation of Infrared Spectra of Organic Compounds Based on Support Vector Machines

The Interpretation of Organic Compounds Based on Support Vector Machine

【Author】 Feng Xiaoyu (冯晓瑜)

【Supervisor】 Li Menglong (李梦龙)

【Author information】 Sichuan University, Analytical Chemistry, 2007, Master's thesis

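【Abstract, translated from the Chinese】 The informatization of the natural and technical sciences is an important trend in scientific and technological development. The accumulation of large amounts of scientific data often leads to the discovery of major scientific laws, which gives chemometrics an opportunity for data-mining research. For decades, researchers have explored how to extract as much information as possible from infrared spectra and how to codify interpretation experience. With the computerization of commercial infrared spectrometers, many computer-assisted methods for identifying infrared spectra have appeared. They fall roughly into three categories: expert systems, spectral library search systems, and pattern recognition methods. The most commonly used pattern recognition methods are artificial neural networks and partial least squares. Most published work uses them to identify substructures or particular classes of compounds; the infrared spectra of organic compounds as a whole have not been studied in depth, nor have the characteristic absorption bands of the compounds been discussed in depth. Moreover, even the most widely applied method, the artificial neural network, does not predict structural fragments very accurately when identifying substructures, and neural networks suffer from instability, a tendency to become trapped in local minima, and slow convergence.

This thesis attempts to use the support vector machine algorithm to explore the regularities in the infrared spectra of organic compounds. Based on the differences in infrared absorption among classes of organic compounds, a hierarchical system was designed to classify 6352 organic compounds in the OMNIC database. The system first divides the organic compounds into four major classes: aromatic compounds, hydrocarbons, oxygen-containing compounds, and nitrogen-containing compounds. It then subdivides each class according to its infrared spectral features: aromatic compounds into four classes by substitution type and neighboring functional groups; hydrocarbons into saturated and unsaturated hydrocarbons; oxygen-containing compounds into four classes by the functional group attached to the oxygen atom (hydroxyl compounds, carbonyl compounds, ethers, and acids); and nitrogen-containing compounds, likewise by their infrared features, into hydrazines, amides, aromatic amines, and aliphatic amines. A still finer classification follows, again based on the infrared absorption features of each class. The results from the support vector machine were compared with those from an artificial neural network; for most organic compounds, the support vector machine outperformed the neural network.

On this basis, the support vector machine was used to study the identification of aromatic compounds in detail. Aromatic compounds have five characteristic frequency regions: the stretching vibration of the ring =C-H bond, the overtone and combination bands of the ring =C-H out-of-plane vibrations, the ring skeletal vibrations, the in-plane bending vibration of the ring =C-H bond, and the out-of-plane bending vibration of the ring =C-H bond. The thesis discusses how using spectral segments from these five regions, and combinations of them, as support vector machine input affects recognition ability, and compares the results. The results show that the support vector machine performs better than the artificial neural network in identifying the structures of organic compounds, which indicates that it is well suited to studying spectrum-structure relationships in infrared spectroscopy. In the discussion of aromatic compounds, the C-H and C-C out-of-plane bending vibrations prove to be the most informative of the five vibration modes of benzene for distinguishing the substitution types of benzene derivatives, which is consistent with classical infrared theory. When predictions from segmental spectra and from the full spectrum are compared, the best results do not always come from the full spectrum. This conclusion offers a new approach to the in-depth mining of infrared spectral information. The support vector machine shows good performance in the field of infrared spectroscopy and is a good tool for computer-assisted interpretation of infrared spectra. Using spectral segments that contain characteristic peaks for spectral recognition offers a new approach to the computer interpretation of infrared spectra and lays a foundation for extracting the maximum information from them and, ultimately, for fully computerized interpretation.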

【Abstract】 The informatization of science and technology is an important trend in their development. Historically, the accumulation of large bodies of scientific data has often led to the discovery of important scientific laws, which gives chemometrics an opportunity for data mining. As infrared spectral databases grow and infrared and computer technology advance, there is a pressing need for ways to exploit infrared spectra more fully and to broaden their application. With the computerization of commercial infrared spectrometers, many methods for computer-assisted interpretation of infrared spectra have emerged. Automatic structure elucidation from infrared spectra generally falls into three groups: library search, knowledge-based systems, and pattern recognition. In the last group, artificial neural networks (ANNs) and partial least squares (PLS) are the most frequently used methods. Automatic interpretation with pattern recognition techniques such as ANNs has focused mainly on predicting specific substructures, while the classification of organic compounds as a whole and the absorption bands of the compounds have been neglected. Furthermore, ANNs have several major drawbacks: instability, local minima, and very slow convergence. This thesis attempts to explore the regularities in the infrared spectra of organic compounds. A learning algorithm that has recently come into active use, the support vector machine (SVM), is introduced to build classifiers for a hierarchical classification structure of 6352 compounds.
In this system, the organic compounds were first separated into four classes: aromatic compounds, hydrocarbons, oxygen-containing compounds, and nitrogen-containing compounds. Each class was then subdivided according to the characteristics of its infrared spectra: aromatic compounds into four kinds on the basis of substitution type and the functional groups adjacent to the benzene ring; hydrocarbons into saturated and unsaturated hydrocarbons; oxygen-containing compounds into four classes (hydroxyl compounds, carbonyl compounds, ethers, and carboxylic acids); and nitrogen-containing compounds into aliphatic amines, aromatic amines, amides, and hydrazones. Next, a still finer separation was made within each class according to its characteristic infrared absorptions. The results from SVM compared favorably with those obtained with ANNs: SVM showed the better performance. In addition, aromatic compounds were studied in more detail with SVM. Aromatic compounds show five characteristic infrared absorptions: the C-H stretching vibration, the overtone and combination bands of the benzene ring, the C=C stretching vibration, the in-plane C-H bending vibration, and the out-of-plane C-H bending vibration.
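The hierarchical scheme described above can be pictured as a tree in which each internal node holds its own classifier and a spectrum is passed down from the root until it reaches a class with no further model. The following is only a minimal sketch of that routing idea, not the thesis's implementation. The class names come from the abstract, but the function names, the four-feature "spectrum", and all weights and biases are hypothetical values set by hand for illustration; in practice each node's model would be an SVM trained on labelled spectra.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Class tree named in the abstract. The four aromatic subclasses and the
# finer third level are not named there, so they are left out.
HIERARCHY: Dict[str, List[str]] = {
    "organic compound": ["aromatic", "hydrocarbon",
                         "oxygen-containing", "nitrogen-containing"],
    "hydrocarbon": ["saturated", "unsaturated"],
    "oxygen-containing": ["hydroxyl", "carbonyl", "ether", "carboxylic acid"],
    "nitrogen-containing": ["aliphatic amine", "aromatic amine",
                            "amide", "hydrazone"],
}

# A node model returns one decision value per child class.
NodeModel = Callable[[Sequence[float]], Dict[str, float]]


def linear_one_vs_rest(
    params: Dict[str, Tuple[Sequence[float], float]]
) -> NodeModel:
    """Build a node model from per-class (weights, bias) pairs.

    Each score is w.x + b, the form of a linear SVM decision function.
    """
    def model(x: Sequence[float]) -> Dict[str, float]:
        return {label: sum(wi * xi for wi, xi in zip(w, x)) + b
                for label, (w, b) in params.items()}
    return model


def classify(x: Sequence[float], models: Dict[str, NodeModel],
             root: str = "organic compound") -> List[str]:
    """Walk down the tree, following the highest-scoring child at each node."""
    path = [root]
    node = root
    while node in models:
        scores = models[node](x)
        unknown = set(scores) - set(HIERARCHY.get(node, []))
        if unknown:
            raise ValueError(f"unexpected classes under {node!r}: {unknown}")
        node = max(scores, key=scores.get)
        path.append(node)
    return path


# Hypothetical hand-set parameters over four made-up band-intensity features:
# [aromatic ring band, C=C band, O-H band, N-H band]. Not trained values.
MODELS: Dict[str, NodeModel] = {
    "organic compound": linear_one_vs_rest({
        "aromatic": ([1.0, 0.0, 0.0, 0.0], -0.5),
        "hydrocarbon": ([-1.0, 0.0, -1.0, -1.0], 0.5),
        "oxygen-containing": ([0.0, 0.0, 1.0, 0.0], -0.5),
        "nitrogen-containing": ([0.0, 0.0, 0.0, 1.0], -0.5),
    }),
    "hydrocarbon": linear_one_vs_rest({
        "saturated": ([0.0, -1.0, 0.0, 0.0], 0.5),
        "unsaturated": ([0.0, 1.0, 0.0, 0.0], -0.5),
    }),
}

if __name__ == "__main__":
    print(classify([0.0, 1.0, 0.0, 0.0], MODELS))
```

Keeping one model per node means a node can be retrained or swapped for another classifier, such as an ANN for comparison, without touching the rest of the tree.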
The five segmental spectra of aromatic compounds, and various combinations of them, were each fed to SVM to build classifiers. The results showed that SVM distinguished the organic compounds appreciably better than ANN, which suggests that SVM can be an efficient tool for extracting information from infrared spectra. The analysis of the individual segmental spectra indicated that, of the five absorptions of aromatic compounds, the C-H and C-C out-of-plane bending vibrations are the most important for judging the substitution type of ordinary benzene derivatives, which agrees with established results. When the results from segmental and entire spectra were compared, some compounds were recognized well with only one or two segmental spectra. This means that some segmental spectra may carry the most significant structural information contained in the entire spectrum. In other words, in computer-assisted interpretation of infrared spectra the best results are not always obtained from the entire spectrum. SVM performs very well in the field of infrared spectroscopy and is a good tool for spectral interpretation. This thesis provides quantitative methods and introduces a new strategy for building intelligent systems for infrared spectral interpretation.
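The comparison of segmental and entire spectra described above amounts to building one input vector per combination of the five aromatic regions and training a classifier on each. Below is a minimal sketch of the input-building step only. The wavenumber windows are approximate textbook ranges assumed for illustration, not values taken from the thesis, and the names `REGIONS`, `segment`, and `candidate_inputs` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# Approximate textbook windows in cm^-1 for the five aromatic regions.
# These bounds are assumptions, not the windows used in the thesis.
REGIONS: Dict[str, Tuple[float, float]] = {
    "CH_stretch": (3000.0, 3100.0),
    "overtone_combination": (1650.0, 2000.0),
    "ring_skeleton": (1450.0, 1600.0),
    "CH_in_plane_bend": (1000.0, 1300.0),
    "CH_out_of_plane_bend": (650.0, 900.0),
}


def segment(wavenumbers: Sequence[float], absorbances: Sequence[float],
            window: Tuple[float, float]) -> List[float]:
    """Return the absorbances whose wavenumber lies inside the window."""
    lo, hi = window
    return [a for v, a in zip(wavenumbers, absorbances) if lo <= v <= hi]


def candidate_inputs(wavenumbers: Sequence[float],
                     absorbances: Sequence[float]
                     ) -> Dict[Tuple[str, ...], List[float]]:
    """Build the full spectrum plus every non-empty combination of regions.

    Each value is a feature vector that could be given to a classifier.
    """
    inputs: Dict[Tuple[str, ...], List[float]] = {
        ("full_spectrum",): list(absorbances)
    }
    names = list(REGIONS)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            features: List[float] = []
            for name in combo:
                features.extend(segment(wavenumbers, absorbances,
                                        REGIONS[name]))
            inputs[combo] = features
    return inputs


if __name__ == "__main__":
    # Synthetic flat spectrum sampled every 50 cm^-1 from 400 to 4000 cm^-1.
    grid = list(range(400, 4001, 50))
    flat = [0.0] * len(grid)
    for combo, features in candidate_inputs(grid, flat).items():
        print(combo, len(features))
```

Five regions give 31 non-empty combinations, so 32 candidate inputs result once the full spectrum is included; comparing classifiers trained on each of them is one way to see which segments carry the most structural information.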

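  • 【Online publication contributor】 Sichuan University
  • 【Online publication year and issue】 2008, Issue 04
  • 【Classification number】 O621.1
  • 【Times cited】 4
  • 【Downloads】 455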