面向生物医学文本的疾病关系挖掘模型及算法研究
Research on Model and Algorithms for Mining Disease-centric Relationships in Biomedicine Literatures
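【Author】 Yang Li (杨黎);
【Advisor】 Zhou Yanhong (周艳红);
【Author Information】 Huazhong University of Science and Technology, Computer Application Technology, 2013, Ph.D.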
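【Summary】 The literature of the biomedical field records a large body of results and experimental findings. Biomedical text mining, an active research area, makes it possible to extract relevant knowledge from this massive literature quickly and effectively. Its techniques include information retrieval, text classification, named entity recognition, relationship extraction, and hypothesis generation. With the rapid development of gene technology, there is an urgent need to understand the mechanisms of disease at the molecular level. Mining disease-centered relationships from the biomedical literature, building disease networks, and uncovering hidden information related to diseases gives biomedical scientists a basis for hypothesis generation, and is of great significance for human development, disease prevention, and the development of new drugs. Building on well-performing biomedical named entity recognition, this thesis first presents an ontology annotation method for diseases and other entities. Texts are then classified and annotated, followed by relationship extraction and hypothesis generation, so that relationships between diseases and other entities can be predicted.

Existing biomedical named entity recognition methods perform entity boundary detection and semantic labeling within a single model. In addition, biomedical named entities are often long, so entity-level features are more natural for the recognition task than word-level features. The thesis therefore proposes an entity recognition method based on a two-layer semi-Markov conditional random field, which splits the labeling task into two stages. In the first stage, named entities and non-entities are detected and labeled C and O, respectively. In the second stage, named entities are labeled with specific entity classes such as protein, DNA, RNA, Cell_line, and Cell_type. New and useful features are explored for each stage. Because some features apply to only one stage, the two-layer model greatly reduces the feature dimensionality. Experiments validate the algorithm: compared with existing algorithms, the method reaches an F-score of 74.64% on the JNLPBA2004 corpus.

To address the unclear typing and low precision of disease named entity recognition in biomedical literature, an annotation method based on the Disease Ontology is proposed, which uses a standard vocabulary to annotate and normalize disease concepts. A two-layer semi-Markov conditional random field model recognizes disease entities, including their positions in the text and their labels. The recognized diseases are then annotated by computing the similarity between disease entities and concepts in the Disease Ontology. Finally, according to this similarity, disease entities are identified as either disease concepts or disease instances. The experiments use the Arizona Disease Corpus and achieve very good results.

The thesis also studies the mining of semantic relationships between diseases based on text discovery. First, texts are annotated with the Disease Ontology and the Gene Ontology, and a text-based semantic network describing the relationships between diseases and gene functions is built. Second, similar subgraphs are extracted from the network, and relationships between diseases are inferred from subgraph similarity. An initial corpus was randomly selected from MEDLINE; the experiment performs well and can discover potential relationships between diseases.

Finally, the thesis studies hypothesis generation for diseases. By exploring the semantic networks among diseases, gene functions, and drug entities, disease-related sub-networks are extracted from text and the semantic relationships between diseases and other entities are obtained. A topic model is used to semantically expand the related entities, and articles are classified into four topic classes: disease and disease, disease and gene function, drug and gene function, and disease and drug. On the basis of this classification, implicit relationships between entities are identified from sentence-level concept co-occurrence and the semantic association between entities. The disease network built by these methods is of practical use: it can predict hypotheses between diseases, between diseases and genes, between drugs and genes, and between diseases and drugs, giving researchers a basis for clinical validation.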
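The two-stage labeling described above (C/O boundary detection first, entity typing second) can be illustrated with a minimal sketch. It is not the thesis's model: the hand-written scores stand in for learned semi-CRF feature weights, the decoder is a zero-order semi-Markov Viterbi search with no transition scores between segments, and the names `ENTITY_WORDS`, `HEAD_WORD_TYPE`, `boundary_score`, `label_entity_type`, and the maximum segment length of 4 are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
ScoreFn = Callable[[List[str], int, int, str], float]

# Hypothetical toy resources standing in for learned feature weights.
ENTITY_WORDS = {"il-2", "gene", "t", "cells"}
HEAD_WORD_TYPE = {"gene": "DNA", "cells": "cell_type"}


def semi_markov_decode(tokens: List[str], labels: List[str],
                       score: ScoreFn, max_len: int = 4) -> List[Segment]:
    """Find the best-scoring segmentation; each segment gets one label."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for tokens[:i]
    back = [None] * (n + 1)           # back[i]: (start, label) of last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                s = best[start] + score(tokens, start, end, label)
                if s > best[end]:
                    best[end] = s
                    back[end] = (start, label)
    segments = []
    end = n
    while end > 0:  # follow back-pointers from the end of the sentence
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    return segments[::-1]


def boundary_score(tokens: List[str], start: int, end: int, label: str) -> float:
    """Phase 1 toy score: reward long spans made only of entity-like words."""
    words = [t.lower() for t in tokens[start:end]]
    all_entity = all(w in ENTITY_WORDS for w in words)
    if label == "C":
        return len(words) - 0.5 if all_entity else -10.0
    if len(words) > 1:  # assumption: "O" segments cover a single token
        return -10.0
    return -1.0 if all_entity else 0.0


def label_entity_type(tokens: List[str], segment: Segment) -> str:
    """Phase 2 toy rule: choose the type from the segment's last word."""
    _, end, _ = segment
    return HEAD_WORD_TYPE.get(tokens[end - 1].lower(), "protein")


def two_phase_ner(tokens: List[str]) -> List[Segment]:
    phase1 = semi_markov_decode(tokens, ["C", "O"], boundary_score)
    return [(s, e, label_entity_type(tokens, (s, e, lab)))
            for s, e, lab in phase1 if lab == "C"]


tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
result = two_phase_ner(tokens)
print(result)  # [(0, 2, 'DNA'), (4, 6, 'cell_type')]
```

Because the second phase only sees segments already marked C, its features (here a single head-word rule) never have to be evaluated for non-entity text, which is one way to read the summary's remark that the two-layer design reduces feature dimensionality.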
【Abstract】 The rapidly growing volume of biomedical literature motivates the application of text mining. As an active research topic, biomedical text mining can extract useful knowledge from large bodies of literature quickly and efficiently. Its techniques include information retrieval, text classification, named entity recognition, relationship extraction, and hypothesis generation. With the rapid development of gene technology, understanding pathogenic mechanisms at the molecular level has become very important. Mining disease relationships and building disease-centric networks from the biomedical literature can provide scientists with evidence for hypothesis generation, and mining hidden information about diseases is valuable for disease prevention and the development of new drugs. Building on well-performing biomedical named entity recognition, ontology annotation is carried out on literature that has first been classified. Relationships between diseases and other entities are then predicted.

Most methods for biomedical named entity recognition are single-phase; that is, they treat term boundary detection and semantic labeling as one task. The semi-Markov conditional random fields model (semi-CRFs) assigns a label to a segment rather than to a single word, which is more natural than other machine learning methods. We present a two-phase approach based on semi-CRFs and explore novel feature sets for classifying the entities in text into 5 types: protein, DNA, RNA, cell_line, and cell_type. Our approach divides the biomedical named entity recognition (NER) task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and assigns them all to a single type, C. In the second phase, the semantic labeling sub-task labels the entities detected in the first phase with the correct entity type. We explore novel feature sets in both phases to improve performance. Our experiments, based on semi-CRFs without deep domain knowledge or post-processing algorithms, achieve an F-score of 74.64% on the JNLPBA2004 corpus, which outperforms most state-of-the-art systems.

Up to now, biomedical text mining for diseases has been limited to the recognition of disease names. Little work focuses on the types of diseases or the relations between diseases. Recognizing biomedical concepts in the literature is not enough on its own; annotating and normalizing the concepts with a normalized Metathesaurus is even more important. We propose a system to annotate the literature with a normalized Metathesaurus. First, a two-phase semi-Markov conditional random fields (semi-CRFs) model is used to recognize disease mentions, including their location and identification. Then we adapt the Disease Ontology (DO) to annotate the recognized diseases for normalization by computing the similarity between disease mentions and concepts. According to the similarities, the disease mentions are marked as either disease concepts or disease instances. Experiments on the Arizona Disease Corpus show that our system performs well and outperforms other work.

A great deal of knowledge is hidden in the biomedical literature, and as that literature keeps growing, mining relations automatically has become urgent. The relations between diseases and gene functions remain to be mined. We propose a method to mine relations between diseases that share gene functions in literature annotated with a normalized Metathesaurus. First, a two-phase semi-CRFs model is used to recognize disease mentions and gene function mentions, including their location and identification. Second, we adapt the Disease Ontology (DO) and the Gene Ontology (GO) to annotate the recognized diseases and gene functions for normalization by computing the similarity between mentions and concepts. According to the similarities, the mentions are marked as either concepts or instances. Third, we build a network and measure relations between diseases by computing similarities between common sub-graphs. The experiments were carried out on a corpus randomly selected with GoPubMed, covering diseases and the three domains of GO. The results reveal many hidden relations between diseases and provide an explanation for them.

Finally, we address hypothesis generation for diseases. We build semantic networks among diseases, gene functions, and drug entities, extract disease-related sub-networks, and obtain semantic relationships between diseases and other entities from text. We semantically expand the entities using a topic model. The documents are classified into four topics: diseases and diseases, diseases and gene functions, drugs and gene functions, and diseases and drugs. We mine hidden relationships among diseases according to sentence-level co-occurrence and the semantic association of entities.

The disease network built by the above methods is therefore of practical use. It can predict hypotheses among diseases, drugs, and gene functions, and thus provides evidence for researchers to test.
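The idea of relating diseases through the gene functions they share can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not specify how sub-graph similarity is computed, so the sketch swaps in plain Jaccard similarity over each disease's set of co-occurring gene-function concepts. The function names and the placeholder identifiers (`disease_A`, `function_X`, and so on) are hypothetical and carry no biomedical meaning.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

# One annotated sentence: (disease concepts, gene-function concepts) found in it.
AnnotatedSentence = Tuple[Set[str], Set[str]]


def build_disease_function_graph(
        sentences: Iterable[AnnotatedSentence]) -> Dict[str, Set[str]]:
    """Link each disease to the gene functions co-occurring in the same sentence."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for diseases, functions in sentences:
        for disease in diseases:
            graph[disease].update(functions)
    return graph


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_disease_pairs(
        graph: Dict[str, Set[str]]) -> List[Tuple[str, str, float, List[str]]]:
    """Score disease pairs that share at least one gene function, best first."""
    pairs = []
    for d1, d2 in combinations(sorted(graph), 2):
        shared = graph[d1] & graph[d2]
        if shared:
            # Keep the shared functions as the explanation for the link.
            pairs.append((d1, d2, jaccard(graph[d1], graph[d2]), sorted(shared)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Placeholder annotations; real input would come from DO/GO annotation of text.
sentences = [
    ({"disease_A"}, {"function_X", "function_Y"}),
    ({"disease_B"}, {"function_X"}),
    ({"disease_C"}, {"function_Z"}),
    ({"disease_B"}, {"function_Y", "function_W"}),
]
ranked = rank_disease_pairs(build_disease_function_graph(sentences))
for d1, d2, score, shared in ranked:
    print(d1, d2, round(score, 3), shared)
```

With the placeholder data, only `disease_A` and `disease_B` are linked, with a Jaccard score of 2/3, and the shared functions returned alongside the score play the role of the explanation the abstract mentions.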