Content Analysis Based Patent Mining Research (基于内容分析的专利挖掘技术研究)

【Author】 Cao Feifei (曹菲菲)

【Advisor】 Zhu Jingbo (朱靖波)

【Author Information】 Northeastern University (东北大学), Computer Software and Theory, 2008, Master's

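【Abstract, translated from the Chinese】 Over the past decade or so, patent mining research has received growing attention. Early patent research relied mainly on patent databases; in recent years it has shifted toward techniques from natural language processing (NLP) and information retrieval (IR). Two main factors have driven the development of patent mining. First, the continued development and refinement of statistical machine learning has provided powerful methodology for patent mining and natural language processing. Second, advances in NLP and IR techniques have fostered patent text mining. In addition, patent mining evaluation campaigns have provided a platform for technical exchange, advanced the research, and pointed out directions for patent text processing. This thesis studies the characteristics of patent text, computes statistics over different training corpora, and analyzes the difficult problems in the patent mining task. NLP-based patent mining faces several major problems: (1) patent mining is a large-scale text analysis task; (2) patent text covers every field of technology, and the fields overlap heavily, which hinders text classification; (3) patent documents are unevenly distributed across fields, and many categories have insufficient training data; (4) the patent classification scheme differs from traditional classification schemes, and the International Patent Classification (IPC) in particular has a very large, multi-level category space; (5) patents carry multiple IPC labels, so patent classification is a multi-label classification problem. These problems distinguish patent text processing from traditional text processing. This thesis addresses them and studies, from several angles, how to improve the performance of a patent mining system. Building on prior work and combining techniques from several fields, the author proposes a set of processing techniques for patent mining. The main techniques are: (1) A classification-system framework based on NLP is used to handle the patent mining task. (2) For classification over large-scale data, the inverted index, a standard retrieval technique in IR, is applied to the classification model to speed up its computation. (3) A category merging method is proposed to address the unbalanced data distribution. Under the IPC, many categories contain very few samples, so several merging methods are used to aggregate small categories into larger ones. (4) Computing the similarity between texts is an important step in patent mining. Several similarity measures are compared; on tasks where the data come from different sources, BM25 performs well and is relatively stable. (5) Several decision methods for ranking categories are proposed. The similarities between samples produced by the classifier must be mapped, through some conversion mechanism, into a ranking of category labels. The thesis proposes a similarity summation method that uses category information and a linear summation method based on a log-linear model to rank the categories; experiments show that both methods perform well. Using the platform of the NTCIR-7 patent mining evaluation task, and data consisting of U.S. patents and English translations of Japanese patents, the thesis implements a patent mining classification system, runs extensive experiments on the main problems and core techniques, and analyzes the results in detail. Finally, the most reliable system for the patent mining task is identified.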

【Abstract】 Over the past decade, patent mining has flourished. Earlier work on patent search and retrieval came mainly from the database community, but in recent years it has come from the Natural Language Processing (NLP) and Information Retrieval (IR) communities. The progress of patent mining can be attributed to two factors: the boom in statistical machine learning approaches, which provided new methodology for patent mining and NLP tasks, and the improvement of NLP and IR technology. International patent evaluation campaigns and workshops provide a forum in which researchers and practitioners from the relevant communities can share ideas, approaches, perspectives, and experience from their work in progress. In this thesis, we study the content characteristics of patent text and compute statistics over different patent corpora. We then analyse the difficult problems of the patent mining task. NLP-based patent mining faces several problems: (1) The patent corpus is huge, containing several million patent documents; (2) Patent text covers all technology domains, and overlap between domains is common, which hinders text classification; (3) The distribution of patent documents over the International Patent Classification (IPC) system is imbalanced, and the training data in many classes is insufficient; (4) The patent classification system differs from those of traditional text classification; the IPC system in particular has a very large number of classes organized in a hierarchy; (5) Patent documents carry multiple class labels, so classification is multi-label. This dissertation focuses on how to resolve the main problems of the patent mining task and studies techniques to improve the performance of a patent mining system.
Building on previous work, we propose several models and methods for the patent mining task. We focus on the following issues: (1) Using an NLP-based text classification framework to process the patent mining task. (2) Using inverted indexing, a common technique in the Information Retrieval community, to speed up text classification. (3) Proposing a class clustering method to mitigate the data imbalance problem. (4) Using several similarity calculation methods for the patent mining task. (5) Proposing several ranking methods for the class decision process, in particular a method based on a log-linear model and a system combination method based on a Rank-SVM model. All of the research in this thesis is based on the Patent Mining evaluation task of NTCIR-7; we build a reliable system for the patent mining task using U.S. patents and English translations of Japanese patents.
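
The abstract names its building blocks (an inverted index, BM25 similarity, and class ranking by summing similarities per class) without giving formulas or parameters. The following is only a minimal sketch of how such pieces could fit together, not the thesis's implementation. The function names, the standard BM25 form with `k1=1.2` and `b=0.75`, the neighbour count `k`, and the toy documents and labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(train_docs):
    """Build an inverted index: term -> list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    doc_len = []
    for doc_id, (tokens, _labels) in enumerate(train_docs):
        doc_len.append(len(tokens))
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
    return index, doc_len


def bm25_scores(query_tokens, index, doc_len, k1=1.2, b=0.75):
    """Score only training documents that share at least one term with the query."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len) / n_docs
    scores = defaultdict(float)
    for term in set(query_tokens):
        postings = index.get(term, [])
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings:
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return scores


def rank_classes(query_tokens, train_docs, index, doc_len, k=10):
    """Rank class labels by summing the BM25 scores of the top-k neighbours.

    Every label of a multi-label neighbour receives that neighbour's score.
    """
    scores = bm25_scores(query_tokens, index, doc_len)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    class_score = defaultdict(float)
    for doc_id, score in top:
        for label in train_docs[doc_id][1]:
            class_score[label] += score
    return sorted(class_score, key=class_score.get, reverse=True)


# Toy training set (invented): each item is (tokens, list of class labels).
train_docs = [
    (["engine", "fuel", "piston"], ["F02"]),
    (["engine", "valve", "fuel"], ["F02", "F01"]),
    (["protein", "cell", "antibody"], ["C07"]),
    (["cell", "battery", "electrode"], ["H01"]),
]
index, doc_len = build_index(train_docs)
ranked = rank_classes(["fuel", "engine", "injection"], train_docs, index, doc_len)
print(ranked)
```

Because scoring walks only the posting lists of the query's terms, training documents with no shared term are never touched, which is the kind of speed-up an inverted index offers on a corpus of millions of patents.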

  • 【Online Publication Contributor】 Northeastern University (东北大学)
  • 【Online Publication Year/Issue】 2012, Issue 03