面向医疗文本的复杂实体识别
Complicated Named Entity Recognition for Biomedical Texts
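【Author】 廖澍锴;
【Advisor】 董远;
【Author information】 Beijing University of Posts and Telecommunications, Information and Communication Engineering, 2022, Master's degree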
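【Abstract (translated from the Chinese)】 With the continuing development of the medical field and of medical technology, and with the digitization of healthcare systems, the volume of medical data, including medical literature and electronic medical records, is growing rapidly. Extracting medical entities from large amounts of unstructured medical text has therefore become an active research topic. Named entity recognition is a fairly mature task in natural language processing, but medical text differs from text in other domains in two ways. First, entity structures are complex and varied: nested, discontinuous, and partially overlapping entities occur frequently, and traditional sequence labeling models cannot handle them. Second, the knowledge barrier is high: annotators need both medical domain knowledge and familiarity with annotation for machine learning, so annotation is error-prone and of low quality, while named entity recognition datasets built by distant supervision perform poorly and contain a large amount of noise. This thesis therefore studies complex entity recognition and noise-robust training methods for named entity recognition. To handle the complex structure of named entities in the medical domain, it proposes a path-aware method for complex named entity recognition that represents every entity configuration in a sentence without ambiguity and unifies the recognition of complex entities in a single framework. The method is evaluated on the CADEC and DDI datasets, where it improves F1 on discontinuous entities by 2.3% and 0.6% respectively and also achieves good F1 over all entities. The thesis also proposes an epoch-based method for learning from noisy datasets, aimed at medical named entity recognition datasets that contain many noisy samples. We jointly optimize the current model with a checkpoint saved earlier in training, which avoids training multiple models at the same time. A consistency loss encourages the model to make predictions consistent with the earlier checkpoint, which prevents overfitting to noise. We also design a Gaussian noise that increases with the training epoch, so that the model fits correct samples early in training and is kept from fitting noise later in training. Experimental results show that the method reduces computational cost while achieving performance close to that of the comparison models.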
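The entity structures named in the abstract can be made concrete with a small sketch. It is an illustration only, not the thesis's path-aware scheme: the example phrase, the tuple-of-token-indices representation, and all function names are assumptions introduced here. It shows why a one-tag-per-token BIO encoding loses information when two mentions share a token and one of them is discontinuous.

```python
from typing import List, Tuple

# Hypothetical representation: a mention is the sorted tuple of its token indices.
Entity = Tuple[int, ...]

def is_discontinuous(entity: Entity) -> bool:
    """True when the mention skips at least one token."""
    return any(b - a > 1 for a, b in zip(entity, entity[1:]))

def is_nested(inner: Entity, outer: Entity) -> bool:
    """True when `inner` is strictly contained in `outer`."""
    return set(inner) < set(outer)

def partially_overlap(a: Entity, b: Entity) -> bool:
    """True when two mentions share tokens but neither contains the other."""
    sa, sb = set(a), set(b)
    return bool(sa & sb) and not sa <= sb and not sb <= sa

def bio_tags(n_tokens: int, entities: List[Entity]) -> List[str]:
    """Naive BIO encoding: one tag per token, later mentions overwrite earlier ones."""
    tags = ["O"] * n_tokens
    for ent in entities:
        for k, idx in enumerate(ent):
            tags[idx] = "B" if k == 0 else "I"
    return tags

# Made-up example phrase with two mentions: "muscle pain" and "muscle ... fatigue".
tokens = ["muscle", "pain", "and", "fatigue"]
muscle_pain: Entity = (0, 1)
muscle_fatigue: Entity = (0, 3)

tags = bio_tags(len(tokens), [muscle_pain, muscle_fatigue])
```

The single tag sequence produced here cannot be decoded back into the two original mentions, which is the limitation of sequence labeling that the abstract refers to.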
【Abstract】 With the continuous development of the biomedical field and biomedical technology, and with the digitization of healthcare systems, the amount of biomedical data, including medical literature and electronic medical records, is growing rapidly. How to extract relevant biomedical entities from large volumes of unstructured biomedical text has become a research hotspot. Although named entity recognition is a mature task in natural language processing, medical texts differ from texts in other fields in two respects. On the one hand, entity structures are complex and diverse: nested entities, discontinuous entities, and partially overlapping entities are common, and the traditional sequence tagging schema cannot handle them. On the other hand, the knowledge threshold is high: annotators need both biomedical knowledge and familiarity with annotation for machine learning, which makes text annotation error-prone and of low quality. In addition, named entity recognition datasets constructed by distant supervision perform poorly and contain a large amount of noise. This thesis therefore studies complex entity recognition and noise-robust training methods for named entity recognition. To handle the complex structure of named entities in the biomedical field, we propose a route-aware model for recognizing entities with diverse structures. Its schema represents all entities in a sentence without ambiguity and unifies the recognition of complex entities in a single framework. The method is evaluated on the CADEC and DDI datasets: F1 on discontinuous entities improves by 2.3% and 0.6% respectively, and F1 over all entities is also good. We also propose an epoch-based method for learning from noisy datasets, which addresses the many noisy samples found in named entity recognition datasets for biomedical text. We jointly optimize the current model with a checkpoint retained from past epochs, which avoids training multiple models at the same time. A consistency loss encourages the model to make predictions consistent with the previous checkpoint, which prevents overfitting to noise. In addition, a Gaussian noise that increases with the training epochs makes the model fit the correct samples in the early stage and prevents it from fitting noisy samples in the later stage. The experimental results show that our method achieves performance close to that of the comparison models while reducing the computational cost.
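The noise-robust training idea described above can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the thesis's actual formulation: the abstract does not say where the Gaussian noise is applied, what its schedule is, or which consistency measure is used. Here the noise is assumed to be added to the current model's logits with a linearly growing standard deviation, and consistency is assumed to be a KL divergence from the checkpoint's prediction. All names (`noise_sigma`, `token_loss`, `consistency_weight`, `max_sigma`) are hypothetical.

```python
import math
import random
from typing import List, Optional

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-likelihood of the (possibly noisy) label."""
    return -math.log(max(probs[label], 1e-12))

def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q), clamped to avoid log(0)."""
    return sum(
        pi * math.log(max(pi, 1e-12) / max(qi, 1e-12))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def noise_sigma(epoch: int, total_epochs: int, max_sigma: float = 1.0) -> float:
    """Assumed schedule: standard deviation grows linearly from 0 to max_sigma."""
    return max_sigma * epoch / max(total_epochs - 1, 1)

def token_loss(
    current_logits: List[float],
    checkpoint_logits: List[float],
    noisy_label: int,
    epoch: int,
    total_epochs: int,
    consistency_weight: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Supervised loss on the noisy label plus consistency with an earlier checkpoint."""
    if rng is None:
        rng = random.Random(0)
    sigma = noise_sigma(epoch, total_epochs)
    # Assumption: the epoch-dependent Gaussian noise perturbs the current logits.
    perturbed = [x + rng.gauss(0.0, sigma) for x in current_logits]
    p_current = softmax(perturbed)
    p_checkpoint = softmax(checkpoint_logits)
    supervised = cross_entropy(p_current, noisy_label)
    consistency = kl_divergence(p_checkpoint, p_current)
    return supervised + consistency_weight * consistency

# At epoch 0 there is no noise; disagreeing with the checkpoint raises the loss.
agree = token_loss([2.0, 0.0], [2.0, 0.0], noisy_label=0, epoch=0, total_epochs=10)
disagree = token_loss([2.0, 0.0], [0.0, 2.0], noisy_label=0, epoch=0, total_epochs=10)
```

Because the checkpoint's logits come from a model already saved during training, only one model is updated per step, which matches the abstract's stated goal of avoiding the cost of training several models at once.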
【Key words】 biomedicine; complex named entity; noise robust; named entity recognition;
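- 【Online publication contributor】 Beijing University of Posts and Telecommunications 【Online publication year and issue】 2024, Issue 01
- 【Classification codes】 R319; TP391.1
- 【Downloads】 52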