Junichi Tsujii - Academia.edu

Papers by Junichi Tsujii

Towards a sublanguage-based semantic clustering algorithm

Current Issues in Linguistic Theory, 1997

Monolingual Phrase Alignment on Parse Forests

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017

Use of a Full Parser for Information Extraction in Molecular Biology Domain

There is an increasing need for automatic information extraction (IE) to support database building and to intelligently find novel knowledge of biological events from online journal collections. Many of the previous researchers (e.g., [3]) extracted such information by using hand-tailored patterns in regular expressions on some pre-defined set of verbs representing a certain type of reaction. However, as a fact can be represented in various forms in natural language text, many patterns of surface expressions need to be prepared for one event. We propose an alternative information extraction method based on full parsing with a large-scale, general-purpose grammar. In our system, a parser converts the variety of sentences that describe the same event into a canonical structure (argument structure) regarding the verb representing the event and its arguments such as (semantic) subject and object. Information extraction itself is done using pattern matching on the canonical structure. Si...

Off-line Raising, Dependency Analysis and Partial Unification

A compilation method for HPSG is presented. The compiler generates skeletal parts of possible structures prior to parsing and converts them to Finite State Automata (FSA) augmented with feature structures and definite clause programs. The amount of unification required for parsing is reduced. This is due to three techniques, namely, 1) off-line raising, 2) dependency analysis on feature structures and 3) partial unification. This paper focuses on these three techniques and considers their consequences on linguistic consideration in HPSG. Finally, we show their effectiveness with a series of experiments on Japanese newspapers.

Compositional Phrase Alignment and Beyond

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

Transfer Fine-Tuning: A BERT Case Study

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019

Machine translation

Current Issues in Linguistic Theory, 1997

A Re-Evaluation of Biomedical Named Entity–Term Relations

Journal of Bioinformatics and Computational Biology, 2010

Text mining can support the interpretation of the enormous quantity of textual data produced in the biomedical field. Recent developments in biomedical text mining include advances in the reliability of the recognition of named entities (NEs) such as specific genes and proteins, as well as movement toward richer representations of the associations of NEs. We argue that this shift in representation should be accompanied by the adoption of a more detailed model of the relations holding between NEs and other relevant domain terms. As a step toward this goal, we study NE–term relations with the aim of defining a detailed, broadly applicable set of relation types based on accepted domain standard concepts for use in corpus annotation and domain information extraction approaches.

Improving the Inter-Corpora Compatibility for Protein Annotations

Journal of Bioinformatics and Computational Biology, 2010

Although there are several corpora with protein annotation, incompatibility between the annotations in different corpora remains a problem that hinders the progress of automatic recognition of protein names in biomedical literature. Here, we report on our efforts to find a solution to the incompatibility issue, and to improve the compatibility between two representative protein-annotated corpora: the GENIA corpus and the GENETAG corpus. In a comparative study, we improve our insight into the two corpora, and a series of experimental results show that most of the incompatibility can be removed.

Text mining and its potential applications in systems biology

Trends in Biotechnology, 2006

With biomedical literature increasing at a rate of several thousand papers per week, it is impossible to keep abreast of all developments; therefore, automated means to manage the information overload are required. Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this. By adding meaning to text, these techniques produce a more structured analysis of textual knowledge than simple word searches, and can provide powerful tools for the production and analysis of systems biology models.

Patrick Olivier

Poster: Analysis of gene ranking algorithms with extraction of relevant biomedical concepts from PubMed publications

2011 IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2011

Machine translation from Japanese into English

Proceedings of the IEEE, 1986

Mining metabolites: extracting the yeast metabolome from the literature

Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010

Currently, relation extraction (RE) and event extraction (EE) are the two main streams of biological information extraction. In 2009, the majority of these RE and EE research efforts were centered around the BioCreative II.5 Protein-Protein Interaction (PPI) challenge and the "BioNLP event extraction shared task." Although these challenges took somewhat different approaches, they share the same ultimate goal of extracting bio-knowledge from the literature. This paper compares the two challenge task definitions, and presents a unified system that was successfully applied in both these and several other PPI extraction task settings. The AkaneRE system has three parts: a core engine for RE, a pool of modules for specific solutions, and a configuration language to adapt the system to different tasks. The core engine is based on machine learning, using either Support Vector Machines or Statistical Classifiers and features extracted from given training data. The specific modules solve tasks like sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of machine learning features for each relation, and finally, assignment of confidence scores and ranking of candidate relations. With these components, the AkaneRE system produces state-of-the-art results, and the system is freely available for academic purposes at http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/.

Proximity-Based Frameworks for Generating Embeddings from Multi-Output Data

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012

U-Compare: A modular NLP workflow construction and evaluation system

IBM Journal of Research and Development, 2011

Accomplishments and challenges in literature data mining for biology

AGRA: analysis of gene ranking algorithms

Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, 2009

@Book{BioNLP-ST:2009, editor = {Jun'ichi Tsujii}, title = {Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task}, month = {June}, year = {2009}, address = {Boulder, Colorado}, publisher = {Association for Computational Linguistics}, url = {http://www.aclweb.org/anthology/W09-14} } @InProceedings{kim-EtAl:2009:BioNLP-ST, author = {Kim, Jin-Dong and Ohta, Tomoko and Pyysalo, Sampo and Kano, Yoshinobu and Tsujii, Jun'ichi}, title = {Overview of BioNLP'09 Shared Task on Event Extraction}, booktitle = {Proceedings ...
