Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

Development of an information retrieval tool for biomedical patents

Computer methods and programs in biomedicine, 2018

The volume of biomedical literature has been increasing in recent years. Patent documents have also followed this trend, being important sources of biomedical knowledge, technical details and curated data, which are put together during the granting process. The field of Biomedical text mining (BioTM) has been creating solutions for the problems posed by the unstructured nature of natural language, which makes searching for information a challenging task. Several BioTM techniques can be applied to patents. Among those, Information Retrieval (IR) includes processes where relevant data are obtained from collections of documents. In this work, the main goal was to build a patent pipeline addressing IR tasks over patent repositories to make these documents amenable to BioTM tasks. The pipeline was developed within @Note2, an open-source computational framework for BioTM, adding a number of modules to the core libraries, including patent metadata and full text retrieval, PDF to text conve...

Developing Semantic Search for the Patent Domain

The patent domain is a very important source of scientific information that is currently not used to its full potential. Issues such as high numbers of patents, complicated language style and inconsistently used vocabulary make the task of searching for relevant patents extremely complex. While this is already a problem for patent professionals who have to invest a lot of time and effort into their search, it is even more problematic for academic scientists with little experience in this domain. Semantic search functionality has been demonstrated to provide large advantages for document search in other domains. As an example, the search engine GoPubMed offers advanced search functionality for the biomedical domain based on annotating documents with relevant concepts from various ontologies. In this paper, we report on our efforts to provide comparable advances for the patent domain. We introduce the patent search prototype GoPatents, and we describe the experiments that we performed during its development in the areas of term extraction, term and IPC class co-occurrence analysis, automated patent categorization, and automated annotation with ontology concepts.

A Survey of Automated Hierarchical Classification of Patents

In this era of "big data", hundreds or even thousands of patent applications arrive every day at patent offices around the world. One of the first tasks of the professional analysts in patent offices is to assign classification codes to those patents based on their content. Such classification codes are usually organized in hierarchical structures of concepts. Traditionally, the classification task has been done manually by professional experts. However, given the large number of documents, patent professionals are becoming overwhelmed. Because the hierarchical classification structures are also very complex (containing thousands of categories), reliable, fast and scalable methods and algorithms are needed to help the experts in patent classification tasks. This chapter describes, analyzes and reviews systems that, based on the textual content of patents, automatically classify such patents into a hierarchy of categories. The chapter focuses especially on the patent classification task applied to the International Patent Classification (IPC) hierarchy. The IPC is the most widely used classification structure for organizing patents; it is recognized worldwide, and several other structures use it or are based on it to ensure interoperability between offices.

Automated categorization in the international patent classification

ACM SIGIR Forum, 2003

A new reference collection of patent documents for training and testing automated categorization systems is established and described in detail. This collection is tailored for automating the attribution of international patent classification codes to patent applications and is made publicly available for future research work. We report the results of applying a variety of machine learning algorithms to the automated categorization of English-language patent documents. This procedure involves a complex hierarchical taxonomy, within which we classify documents into 114 classes and 451 subclasses. Several measures of categorization success are described and evaluated. We investigate how best to resolve the training problems related to the attribution of multiple classification codes to each patent document.
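The class and subclass levels mentioned in this abstract can be read directly off an IPC symbol. The following is a minimal sketch, not code from the paper: it assumes the usual symbol layout (a section letter A to H, a two-digit class, a subclass letter, and an optional group part such as "31/12"), and the names `IpcSymbol` and `parse_ipc` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class IpcSymbol(NamedTuple):
    """Hypothetical container for the hierarchy levels of one IPC symbol."""
    section: str               # e.g. "A"
    ipc_class: str             # e.g. "A61"
    subclass: str              # e.g. "A61K"
    main_group: Optional[str]  # e.g. "A61K 31/00", or None if no group part


# Section letter, two-digit class, subclass letter, optional "group/subgroup".
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])(?:\s*(\d{1,4})/(\d{2,6}))?$")


def parse_ipc(symbol: str) -> IpcSymbol:
    """Split an IPC symbol into section, class, subclass and main group."""
    match = _IPC_RE.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not an IPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, group, _subgroup = match.groups()
    ipc_class = section + class_digits
    subclass = ipc_class + subclass_letter
    # A main group is written as "<subclass> <group>/00".
    main_group = f"{subclass} {int(group)}/00" if group else None
    return IpcSymbol(section, ipc_class, subclass, main_group)
```

Truncating every label to `ipc_class` or `subclass` in this way is one simple way to turn full symbols into targets for class-level or subclass-level categorization.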

Developing a Comprehensive Patent Related Information Retrieval Tool

Journal of theoretical and applied electronic commerce research, 2011

In recent years, there has been a massive growth of regulatory and related information available online. This information is distributed across many different domains, creating a problem for accessing and managing this data. This paper proposes a framework to access information across two such domains: patents and court cases. The framework is designed to boost the value of a set of patents based on information available in court cases by identifying and cross-referencing mutual information in the two domains. We test our framework by constructing a use case involving the hormone erythropoietin. A corpus of 1150 patents (including 135 closely related patents) and 30 court cases is gathered. Challenges associated with such integration and future plans are briefly discussed.

A Patent Retrieval Method using Semantic Annotations

Automatic annotation of key phrases with their semantic categories can help improve the effectiveness of a variety of text-based systems, including information retrieval, summarization and question answering. In this paper, we exploit semantic annotations for patent retrieval (i.e., patent invalidity search). We first annotate key phrases in a patent document with two semantic categories, PROBLEM (e.g. "pattern matching") and SOLUTION (e.g. "dynamic programming"), which together constitute a particular technology. Semantic clusters are formed by grouping patent documents with the same PROBLEM or SOLUTION tag. A language modelling approach to information retrieval is extended to consider the semantically oriented clusters as well as document models. Our retrieval evaluation of the proposed approach using a collection of United States patent documents shows a 22% improvement over the baseline, a smoothed language modelling approach without the semantic annotations.
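One common way to let clusters inform a language-model ranker is to interpolate document, cluster and collection term estimates. The sketch below illustrates that general idea only; the abstract does not give the paper's actual formula, so the linear interpolation, the weights, and the function name `cluster_smoothed_score` are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def cluster_smoothed_score(
    query: Sequence[str],
    doc: Sequence[str],
    cluster_docs: List[Sequence[str]],
    collection_docs: List[Sequence[str]],
    w_doc: float = 0.6,      # assumed weight, not taken from the paper
    w_cluster: float = 0.3,  # assumed weight, not taken from the paper
    w_coll: float = 0.1,     # assumed weight, not taken from the paper
) -> float:
    """Log query likelihood under a document/cluster/collection mixture."""
    doc_counts = Counter(doc)
    cluster_counts = Counter(t for d in cluster_docs for t in d)
    coll_counts = Counter(t for d in collection_docs for t in d)
    doc_len = sum(doc_counts.values())
    cluster_len = sum(cluster_counts.values())
    coll_len = sum(coll_counts.values())

    def mle(counts: Counter, total: int, term: str) -> float:
        # Maximum-likelihood estimate; 0.0 for an empty model.
        return counts[term] / total if total else 0.0

    log_score = 0.0
    for term in query:
        p = (
            w_doc * mle(doc_counts, doc_len, term)
            + w_cluster * mle(cluster_counts, cluster_len, term)
            + w_coll * mle(coll_counts, coll_len, term)
        )
        if p == 0.0:
            # Term unseen everywhere: the query has zero likelihood.
            return float("-inf")
        log_score += math.log(p)
    return log_score
```

With this mixture, a document that never mentions a query term can still outrank unrelated documents when other members of its PROBLEM or SOLUTION cluster do mention it.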

Development of Text Mining Tools for Information Retrieval from Patents

Advances in Intelligent Systems and Computing, 2017

Biomedical literature is composed of an ever-increasing number of publications in natural language. Patents are a relevant fraction of those, being important sources of information due to all the curated data from the granting process. However, their unstructured data makes searching for information a challenging task. To overcome this, Biomedical text mining (BioTM) creates methodologies to search and structure that data. Several BioTM techniques can be applied to patents. Among those, Information Retrieval is the process where relevant data is obtained from collections of documents. In this work, a patent pipeline was developed and integrated into @Note2, an open-source computational framework for BioTM. This integration makes it possible to run further BioTM tools over the patent documents, including Information Extraction processes such as Named Entity Recognition or Relation Extraction.

Enhancing patent expertise through automatic matching with scientific papers

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012

This paper focuses on a subtask of the QUAERO research program, a major innovating research project related to the automatic processing of multimedia and multilingual content. The objective discussed in this article is to propose a new method for the classification of scientific papers, developed in the context of an international patents classification plan related to the same field. The practical purpose of this work is to provide an assistance tool to experts in their task of evaluating the originality and novelty of a patent, by offering them the most relevant scientific citations. This issue raises new challenges in categorization research as the patent classification plan is not directly adapted to the structure of scientific documents, classes have high citation or cited topic, and there is not always a balanced distribution of the available examples within the different learning classes. We propose, as a solution to this problem, to apply an improved K-nearest-neighbors (KNN) algorithm based on the exploitation of association rules occurring between the index terms of the documents and those of the patent classes. By using a reference dataset of patents belonging to the field of pharmacology, on the one hand, and a bibliographic dataset of the same field issued from the Medline collection, on the other hand, we show that this new approach, which combines the advantages of numerical and symbolic approaches, considerably improves categorization performance compared to the usual categorization methods.

Computer-Assisted Categorization of Patent Documents in the International Patent Classification

2003

The World Intellectual Property Organization is currently developing a system for assisting users in categorizing patent documents in the International Patent Classification (IPC). The system should support the classification of documents in several languages and aims to assist users in locating relevant IPC symbols by providing them with a convenient web-based service. The approach taken for developing such a system relies on powerful machine learning algorithms that are trained on manually classified documents to recognize IPC topics. We detail in-house results of applying a custom-built state-of-the-art computer-assisted categorizer to English-, French-, Russian- and German-language patent documents. We find that reliable computer-assisted categorization at IPC subclass level is an achievable goal for the statistical methods employed here. A categorization system suggesting three IPC symbols for each document can predict the main IPC class correctly for around 90% of documents, and the main IPC subclass for about 85% of documents. The accuracy of the system at main group level is enhanced if the user first validates the correct IPC class.
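The "three suggestions" figures quoted in this abstract correspond to a top-k style measure: a document counts as correct if its main IPC code appears among the first k suggested codes. The sketch below shows one way to compute such a measure; the function name `top_k_accuracy` and the sample predictions are hypothetical and are not data from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    main_labels: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of documents whose main code is among the first k suggestions."""
    if len(ranked_predictions) != len(main_labels):
        raise ValueError("predictions and labels must have the same length")
    if not main_labels:
        return 0.0
    hits = sum(
        1
        for suggestions, gold in zip(ranked_predictions, main_labels)
        if gold in suggestions[:k]
    )
    return hits / len(main_labels)


# Made-up example: ranked subclass suggestions for four documents.
predictions = [
    ["A61K", "C07D", "A61P"],
    ["G06F", "H04L", "G06N"],
    ["C12N", "A61K", "C07K"],
    ["H01L", "G02F", "H05K"],
]
gold_subclasses = ["A61P", "G06Q", "C12N", "H01L"]

print(top_k_accuracy(predictions, gold_subclasses, k=3))  # 0.75
print(top_k_accuracy(predictions, gold_subclasses, k=1))  # 0.5
```

Scoring the same suggestions at class level only requires truncating each code to its first three characters before the comparison.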

Patent Retrieval in Chemistry based on semantically tagged Named Entities

The Eighteenth Text Retrieval Conference Proceedings, 2009

This paper reports on the work that has been conducted by Fraunhofer SCAI for Trec Chemistry (Trec-Chem) track 2009. The team of Fraunhofer SCAI participated in two tasks, namely Technology Survey and Prior Art Search. The core of the framework is an index of 1.2 million chemical patents provided as a data set by Trec. For the technology survey, three runs were submitted based on semantic dictionaries and noun phrases. For the prior art search task, several fields were introduced into the index that contained normalized noun phrases, biomedical as well as chemical entities. Altogether, 36 runs were submitted for this task that were based on automatic querying with tokens, noun phrases and entities along with different search strategies.