
生物命名实体识别及生物文本分类

Biomedical Named Entity Recognition and Classification of Biomedical Literature

【Author】 Dou Zengfa (豆增发)

【Advisor】 Gao Lin (高琳)

【Author information】 Xidian University (西安电子科技大学), Computer Application Technology, 2013, PhD

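【Abstract (Chinese version)】 In recent years, with the large-scale emergence of biomedical text, text mining techniques that process text automatically have become increasingly important. Examples include automatically classifying massive biomedical text collections, mining biomedical named entities of interest from text, and studying the relationships among those entities. Recognizing biomedical named entities in biomedical text is the most basic part of all biomedical data mining, and it is also a key step in converting unstructured data into structured data. This dissertation studies the key technologies for named entity recognition in biomedical text and for automatic classification of biomedical text. The main results are as follows:

1. A feature selection algorithm based on improved binary particle swarm optimization is studied. Binary particle swarm optimization is a variant of discrete particle swarm optimization. Unlike traditional real-valued particle swarm optimization, every variable in binary particle swarm optimization takes the value 0 or 1. The feature selection algorithm based on improved binary particle swarm optimization uses a flip angle to control the evolution of the swarm. It searches the multi-dimensional space for the optimal binary solution of the objective function and so obtains the best feature weight vector. Features with weight 0 are redundant, and features with weight 1 are effective.

2. A feature selection algorithm based on membrane particle swarm optimization is studied. Membrane particle swarm optimization exploits the hierarchical structure and message passing mechanism of a membrane system, and deploys particle swarm optimization in each region as a regional sub-algorithm. Unlike traditional particle swarm optimization, this dissertation decomposes the search velocity into a local search velocity and a global search velocity. All outer regions of the membrane system use the local search velocity to search for local optima, and the innermost region uses the global search velocity to search for the global optimum. Every outer region passes its best solution to the adjacent inner region, every inner region passes its worst solution to the adjacent outer region, and the innermost region passes its worst solution to its adjacent outer region. The algorithm converges when solution passing between regions has stopped for a period of time or when the number of iterations reaches the limit, and the best solution in the innermost region is taken as the final solution. Membrane particle swarm optimization is used to search the multi-dimensional space for the optimal solution of the objective function and to obtain the best feature weight vector. Features whose weight is above a threshold are selected and features whose weight is below the threshold are removed, which eliminates redundant features.

3. Parameter estimation for the conditional random field model is studied. To address overfitting in traditional parameter estimation algorithms for conditional random fields, an improved particle swarm optimization algorithm is proposed and applied to conditional random field parameter estimation. The improved algorithm introduces the aggregation degree of the swarm to prevent premature convergence to a local optimum, uses the relative change rate of the log-likelihood between iterations to control convergence, and uses a linearly varying inertia factor and learning factors to control the search range. The algorithm has good global search ability in the early stage of the search and good local search ability in the later stage. It terminates when the relative change rate of the log-likelihood between iterations falls below a threshold or when the number of iterations reaches the limit. The log-likelihood of the conditional random field model is used as the objective function, and the improved particle swarm optimization algorithm trains the conditional random field by finding the parameter vector that maximizes this objective.

4. A method for recognizing biomedical named entities in biomedical text with conditional random fields is studied. To address the label bias problem of Markov and similar models in named entity recognition, a method that recognizes biomedical named entities with feature-rich conditional random fields is proposed. First, the improved binary particle swarm optimization method selects the features of the conditional random field. Next, the improved particle swarm optimization algorithm trains the conditional random field model. Finally, based on various auxiliary feature sets, the trained model recognizes biomedical named entities, labeling the nouns and phrases in biomedical text that denote such entities.

5. A biomedical text classification method based on the extenics classifier is studied. To classify massive biomedical text collections automatically, this dissertation proposes a new text classification method based on the extenics classifier. The extenics classifier represents each biomedical text with the vector space model and represents each category template with an extenics matrix. It computes the extenics correlation degree between the text and each category template to judge how similar the text is to each category, and assigns the text to the category with the largest extenics correlation degree. To keep the classification performance of the extenics matrix at its best, the improved particle swarm optimization algorithm is used to train the weight coefficients of the text features of the different categories so that the sum of the distances between text categories is maximized.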
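As an illustration of item 1, the following is a minimal sketch of feature selection with binary particle swarm optimization. It uses the standard sigmoid-transfer binary PSO, not the flip-angle improvement proposed in the dissertation, whose details the abstract does not give. The function names, the parameter values, and the toy fitness function (with its set of "useful" features) are all hypothetical and serve only to make the example self-contained.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Standard sigmoid-transfer binary PSO; maximizes `fitness` over 0/1 masks.

    Assumption: this is the textbook variant, not the dissertation's
    flip-angle method.
    """
    rng = random.Random(seed)
    # Each position is a 0/1 mask: 1 keeps the feature, 0 drops it.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                vel[i][d] = v
                # The sigmoid turns the velocity into P(bit == 1).
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical toy objective: features 0, 2 and 5 help, all others hurt.
USEFUL = {0, 2, 5}


def toy_fitness(mask):
    return sum(1.0 if d in USEFUL else -0.5
               for d, bit in enumerate(mask) if bit)


best_mask, best_fit = binary_pso(toy_fitness, n_features=8)
selected = [d for d, bit in enumerate(best_mask) if bit]
print("selected features:", selected, "fitness:", best_fit)
```

In a real setting the fitness function would score a mask by, for example, the validation accuracy of a model trained on the selected features; the toy function above only stands in for that evaluation.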

【Abstract】 In recent years, with the rapid growth of biomedical literature, automatic text mining tools have become increasingly important, for example for classifying large volumes of biomedical literature, recognizing named entities of interest in text, and extracting the relationships between those entities. Biomedical named entity recognition is the most basic part of all biomedical text mining and a key step in transforming unstructured data into structured data. This dissertation focuses on the key technologies for biomedical named entity recognition and the classification of biomedical literature. The main contributions are as follows:

1. A feature selection method based on an improved binary particle swarm optimizer is studied. The binary particle swarm optimizer is a variant of the discrete particle swarm optimizer. Unlike the traditional real-valued particle swarm optimizer, each variable of a solution takes the value 1 or 0 rather than a real number. The feature selection algorithm based on the improved binary particle swarm optimizer uses a flip angle to control the evolution of the swarm and searches the multi-dimensional space for the best binary solution of the fitness function, which yields the best feature weight vector. Features with weight 1 are selected and features with weight 0 are removed.

2. A feature selection method based on a membrane particle swarm optimizer is studied. Using the hierarchical structure and message passing mechanism of a membrane system, the membrane particle swarm optimizer assigns a particle swarm optimizer to every sub-region. Unlike the traditional particle swarm optimizer, this dissertation introduces a local velocity and a global velocity. The particle swarms in the external regions search for local best solutions with the local velocity, and the particle swarm in the innermost region searches for the global best solution with the global velocity. The best solution in each external region is passed to the adjacent inner region, and the worst solution in each inner region is passed to the adjacent external region. The worst solution in the innermost region is passed to its adjacent external region. Once solution passing stops or the number of iterations reaches the limit, the algorithm stops and the best solution in the innermost region is taken as the output. The membrane particle swarm optimizer is used to search for the best solution of the fitness function and to obtain the best feature weight vector. Features with weight below a threshold are removed and features with weight above the threshold are selected, which eliminates redundant features.

3. Parameter estimation for the conditional random field model is studied. To address the overfitting of traditional parameter estimation for conditional random fields, we propose an improved particle swarm optimizer and apply it to estimate the parameters of conditional random fields. The improved particle swarm optimizer uses the aggregation degree of the swarm to prevent premature local convergence, uses the relative change ratio of the log-likelihood between iterations to end the iterations, and sets the inertia factor and learning factors as linearly varying quantities to control the search scope. Compared with the traditional particle swarm optimizer, the algorithm has better global search ability in the early stage and better local search ability in the later stage. Once the relative change ratio of the log-likelihood between iterations falls below a threshold or the number of iterations reaches the limit, the iteration stops. We take the log-likelihood of the conditional random field as the objective function, train the conditional random field with the improved particle swarm optimizer, and search for the parameters that maximize the objective function.

4. Biomedical named entity recognition in biomedical literature based on conditional random fields is studied. To address the label bias problem of Markov models, we use conditional random fields with rich features to recognize biomedical named entities. We first select features with the improved binary particle swarm optimizer, then train the conditional random field with the improved particle swarm optimizer, and finally use the trained conditional random field with rich feature sets to recognize and label all biomedical named entities in the literature.

5. Classification of biomedical literature based on the extenics classifier is studied. To classify large volumes of biomedical literature automatically, we propose a novel classification method named the extenics classifier. In the extenics classifier, a single document is represented by the vector space model and each category model is represented by an extenics matrix. The extenics similarities between the document and all category models are calculated, and the document is assigned to the category with the maximum extenics similarity. To maximize the distance between the category models, the extenics matrix is trained with the improved particle swarm optimizer.
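The convergence control described in item 3 can be sketched as follows: a particle swarm optimizer with linearly varying inertia and learning factors that stops when the relative change of the objective between iterations stays below a threshold, or when the iteration limit is reached. This is a hedged illustration, not the dissertation's algorithm: the aggregation-degree mechanism is omitted because the abstract does not define it, the schedules (0.9 to 0.4 for inertia, 2.5 to 0.5 and 0.5 to 2.5 for the learning factors), the `patience` parameter, and all names are assumptions, and the toy objective merely stands in for a conditional random field log-likelihood.

```python
import random


def linear(start, end, t, t_max):
    """Linearly interpolate from `start` (t = 0) towards `end` (t = t_max)."""
    return start + (end - start) * t / t_max


def pso_maximize(objective, dim, n_particles=30, max_iters=200,
                 tol=1e-8, patience=5, bound=5.0, seed=0):
    """PSO with linear schedules and a relative-change stopping rule.

    Assumption: `patience` consecutive small changes are required before
    stopping, so a single stagnant iteration does not end the search.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    calm, iters = 0, 0
    for t in range(max_iters):
        iters = t + 1
        w = linear(0.9, 0.4, t, max_iters)   # inertia: global -> local search
        c1 = linear(2.5, 0.5, t, max_iters)  # cognitive factor decreases
        c2 = linear(0.5, 2.5, t, max_iters)  # social factor increases
        prev = gbest_val
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
        # Relative change of the best objective value between iterations.
        rel_change = abs(gbest_val - prev) / max(abs(prev), 1e-12)
        calm = calm + 1 if rel_change < tol else 0
        if calm >= patience:
            break
    return gbest, gbest_val, iters


# Hypothetical stand-in for a log-likelihood: concave, maximum -10 at (1, 1, 1).
def toy_log_likelihood(theta):
    return -10.0 - sum((x - 1.0) ** 2 for x in theta)


theta, best_ll, iters = pso_maximize(toy_log_likelihood, dim=3)
print("theta:", theta, "objective:", best_ll, "iterations:", iters)
```

The large early inertia favours exploration and the small late inertia favours refinement, which matches the early-global, late-local behaviour the abstract describes. For an actual conditional random field, `objective` would evaluate the training log-likelihood for a given parameter vector.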
