Tadashi Nomoto - Academia.edu

Papers by Tadashi Nomoto

The Fewer Splits are Better: Deconstructing Readability in Sentence Splitting

Analysis of the Drug Monitoring Information by Using the CYP-Database for Predicting Drug-Drug Interactions

Does splitting make sentence easier?

Frontiers in Artificial Intelligence, Sep 17, 2023

Two-Tier Similarity Model in Story Link Detection

IEICE Technical Report, May 21, 2010

Extended Multilingual Protest News Detection - Shared Task 1, CASE 2021 and 2022

Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 and consists of four subtasks: i) document classification, ii) sentence classification, iii) event sentence coreference identification, and iv) event extraction. The CASE 2022 extension expands the test data with more data in previously available languages, namely English, Hindi, Portuguese, and Spanish, and adds new test data in Mandarin, Turkish, and Urdu for Subtask 1, document classification. The training data from CASE 2021 in English, Portuguese, and Spanish were utilized; therefore, predicting document labels in Hindi, Mandarin, Turkish, and Urdu occurs in a zero-shot setting. The CASE 2022 workshop also accepts reports on systems developed for predicting the CASE 2021 test data. We observe that the best systems submitted by CASE 2022 participants achieve between 79.71 and 84.06 F1-macro for new languages in the zero-shot setting. The winning approaches mainly ensemble models and merge data in multiple languages. The best two submissions on CASE 2021 data outperform last year's submissions for Subtask 1 and Subtask 2 in all languages. Only two scenarios were not outperformed by new submissions on CASE 2021: Subtask 3 Portuguese and Subtask 4 English.
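
The headline numbers above are macro-averaged F1 scores. As a reminder of what that metric computes, here is a minimal stdlib Python sketch; it is illustrative only, not the shared task's official scorer:

```python
def f1_macro(gold, pred, labels):
    # Macro-averaged F1: compute F1 per label, then take the unweighted mean,
    # so rare labels count as much as frequent ones.
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Because each label's F1 is weighted equally, a system that ignores a minority class is penalized more heavily than under micro-averaging.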

Extended Multilingual Protest News Detection - Shared Task 1, CASE 2021 and 2022

arXiv (Cornell University), Nov 21, 2022

We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 and consists of four subtasks: i) document classification, ii) sentence classification, iii) event sentence coreference identification, and iv) event extraction. The CASE 2022 extension expands the test data with more data in previously available languages, namely English, Hindi, Portuguese, and Spanish, and adds new test data in Mandarin, Turkish, and Urdu for Subtask 1, document classification. The training data from CASE 2021 in English, Portuguese, and Spanish were utilized; therefore, predicting document labels in Hindi, Mandarin, Turkish, and Urdu occurs in a zero-shot setting. The CASE 2022 workshop also accepts reports on systems developed for predicting the CASE 2021 test data. We observe that the best systems submitted by CASE 2022 participants achieve between 79.71 and 84.06 F1-macro for new languages in the zero-shot setting. The winning approaches mainly ensemble models and merge data in multiple languages. The best two submissions on CASE 2021 data outperform last year's submissions for Subtask 1 and Subtask 2 in all languages. Only two scenarios were not outperformed by new submissions on CASE 2021: Subtask 3 Portuguese and Subtask 4 English.

National Institute of Japanese Literature, 10-3 Midori-cho, Tachikawa, Japan

This paper introduces MediaMeter, an application that detects and tracks emergent topics in the US online news media. What makes MediaMeter unique is its reliance on a labeling algorithm we call WikiLabel, whose primary goal is to identify what news stories are about by looking them up in Wikipedia. We discuss some of the major news events that were successfully detected and how the approach compares to prior work.

The Fewer Splits are Better: Deconstructing Readability in Sentence Splitting

arXiv (Cornell University), Feb 2, 2023

Two-tier similarity model for story link detection

Proceedings of the 19th ACM international conference on Information and knowledge management

The paper presents a novel approach to story link detection, where the goal is to determine whether a pair of news stories are linked, i.e., talk about the same event. The present work marks a departure from prior work in that we measure similarity at two distinct levels of textual organization, the document and its collection, and combine the scores
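
The two-tier idea can be sketched as mixing a document-level and a collection-level cosine similarity. The sketch below, with its hypothetical mixing weight `alpha`, is only an illustration of that structure under stated assumptions, not the model in the paper:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_tier_score(doc1, doc2, coll1, coll2, alpha=0.5):
    # Tier 1: similarity of the two stories themselves.
    # Tier 2: similarity of the collections (e.g. related stories) they sit in.
    d = cosine(Counter(doc1), Counter(doc2))
    c = cosine(Counter(coll1), Counter(coll2))
    return alpha * d + (1 - alpha) * c
```

A pair would then be declared linked when `two_tier_score` exceeds a tuned threshold.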

Keyword Extraction: A Modern Perspective

SN Computer Science

The goal of keyword extraction is to extract from a text words or phrases indicative of what it is talking about. In this work, we look at keyword extraction from a number of different perspectives: Statistics, Automatic Term Indexing, Information Retrieval (IR), Natural Language Processing (NLP), and the emerging Neural paradigm. The 1990s saw some early attempts to tackle the issue, based primarily on text statistics [13, 17]. Meanwhile, in IR, efforts were largely led by DARPA's Topic Detection and Tracking (TDT) project [2]. In this contribution, we discuss how past innovations paved the way for more recent developments, such as LDA, PageRank, and Neural Networks. We walk through the history of keyword extraction over the last 50 years, noting differences and similarities among methods that emerged during that time. We conduct a large meta-analysis of the past literature, using datasets ranging from news media, science, and medicine to business and bureaucracy, to draw a general pict...
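
The statistical end of this spectrum is typified by TF-IDF ranking. Here is a minimal stdlib Python sketch of that baseline; the smoothing constants and function name are illustrative choices, not taken from any of the surveyed papers:

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, k=3):
    # Rank words in one document by TF-IDF against a background corpus.
    # corpus is a list of tokenized documents used to estimate document frequency.
    n = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))                     # count each word once per document
    tf = Counter(doc_tokens)
    # Add-one smoothing so unseen words do not divide by zero.
    scores = {w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

Words frequent in the document but rare across the corpus float to the top, which is exactly the intuition the early statistical methods formalized.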

Learning to Simplify with Data Hopelessly Out of Alignment

arXiv (Cornell University), Apr 1, 2022

We consider whether it is possible to do text simplification without relying on a 'parallel' corpus, one made up of sentence-by-sentence alignments of complex and ground-truth simple sentences. To this end, we introduce a number of concepts, some new and some not, including what we call Conjoined Twin Networks, Flip-Flop Auto-Encoders (FFA), and Adversarial Networks (GAN). A comparison is made between Jensen-Shannon GAN (JS-GAN) and Wasserstein GAN to see how they impact performance, with stronger results for the former. An experiment we conducted with a large dataset derived from Wikipedia found a solid superiority of Twin Networks equipped with FFA and JS-GAN over the current best-performing system. Furthermore, we discuss where we stand in relation to fully supervised methods in the past literature, and highlight with examples qualitative differences that exist among simplified sentences generated by supervision-free systems.
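
The JS-GAN versus Wasserstein GAN contrast rests on the divergences their critics approximate. A toy stdlib sketch on discrete distributions (not the paper's training code) shows the qualitative difference:

```python
import math

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions
    # given as aligned probability lists.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(p, q):
    # Wasserstein-1 distance on an ordered 1-D support,
    # computed as the L1 distance between the two CDFs.
    dist, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        dist += abs(cp - cq)
    return dist
```

For distributions with disjoint support, JS saturates at log 2 no matter how far apart the masses sit, whereas Wasserstein-1 still reflects the distance; this is the usual motivation for comparing the two in GAN training.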

Machine Learning Approaches to Rhetorical Parsing and Open-Domain Text Summarization

A comparison of model free versus model intensive approaches to sentence compression

Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 1 - EMNLP '09, 2009

This work introduces a model-free approach to sentence compression, which grew out of ideas from Nomoto (2008), and examines how it compares to a state-of-the-art model-intensive approach known as the Tree-to-Tree Transducer, or T3 (Cohn and Lapata, 2008). It is found that the model-free approach significantly outperforms T3 on the particular data we created from the Internet. We also discuss what might have caused T3's poor performance.

Effects of grammatical annotation on a topic identification task

Recent Advances in Natural Language Processing, 1997

Model Free versus Model Intensive Approaches to Sentence Compression

NEAL: A Neurally Enhanced Approach to Linking Citation and Reference

As a way to tackle Task 1A in CL-SciSumm 2016, we introduce a composite model consisting of TFIDF and a Neural Network (NN), the latter being an adaptation of the embedding model originally proposed for the Q/A domain [2, 7]. We discuss an experiment using development data, the results thereof, and some remaining issues.

ModDBS-XM: A Diversity Based Summarizer for DUC 2001

ModDBS-X is a clustering-based single-document summarizer. It is an open-domain extractive summarizer, demanding of the input nothing more than the availability of basic IR statistics such as term and document frequency; therefore it could be adapted to any language and domain without much effort. The system goes through three major stages to generate a summary: data preparation, summarization, and post-summarization. An input text is first examined for its conformity to the XML syntax; some portions of it are extracted for use in summarization and passed on to the sentence selection step, which in turn builds diverse topical clusters over the input and chooses representative sentences from them. The selected sentences are then put through a post-summarization process, where parenthetical expressions are identified and removed.
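
The "representative sentence per cluster" step can be sketched as picking, within each topical cluster, the sentence closest to the cluster's word-frequency centroid. This is an illustrative reconstruction of that idea, not the ModDBS-X code:

```python
import math
from collections import Counter

def representative(cluster_sents):
    # Pick the sentence whose bag-of-words vector has the highest
    # cosine similarity to the cluster's summed word-frequency centroid.
    vecs = [Counter(s.lower().split()) for s in cluster_sents]
    centroid = Counter()
    for v in vecs:
        centroid.update(v)

    def cos(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    return max(cluster_sents,
               key=lambda s: cos(Counter(s.lower().split()), centroid))
```

Running one such selection per cluster yields an extract that covers the input's diverse topics, which is the point of building the clusters in the first place.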

Grounding NBA Matchup Summaries

The present paper summarizes an attempt we made to meet a shared-task challenge on grounding machine-generated summaries of NBA matchups (https://github.com/ehudreiter/accuracySharedTask.git). In the first half, we discuss methods; in the second, we report results, together with a discussion of what features may have had an effect on performance.

Meeting the 2020 Duolingo Challenge on a Shoestring

Proceedings of the Fourth Workshop on Neural Generation and Translation, 2020

What is given below is a brief description of two systems, called gFCONV and c-VAE, which we built in response to the 2020 Duolingo Challenge. Both are neural models that aim at disrupting the sentence representation the encoder generates, with an eye on increasing the diversity of sentences that emerge from the process. Importantly, we decided not to turn to external sources for extra ammunition, curious to know how far we could go while confining ourselves to the data released by Duolingo (Mayhew et al., 2020). gFCONV works by taking over a pre-trained sequence model, intercepting the output its encoder produces on its way to the decoder. c-VAE is a conditional variational auto-encoder, seeking diversity by blurring the representation that the encoder derives. Experiments on a corpus constructed out of the public dataset from Duolingo, containing some 4 million pairs of sentences, found that gFCONV is a consistent winner over c-VAE, though both suffered heavily from low recall.

Resolving Citation Links With Neural Networks

Frontiers in Research Metrics and Analytics, 2018

This work demonstrates how neural network models (NNs) can be exploited to resolve citation links in the scientific literature, which involves locating the passages in the source paper that the author had intended when citing it. We look at two kinds of models: triplet and binary. The triplet network model works by ranking potential candidates using what is generally known as the triplet loss, while the binary model turns the issue into a binary decision problem, i.e., labeling a candidate as true or false depending on how likely a target it is. Experiments are conducted using three datasets developed by the CL-SciSumm project from a large repository of scientific papers in the Association for Computational Linguistics (ACL) repository. The results show that NNs are extremely susceptible to how the input is represented: they perform better on inputs expressed in binary format than on those encoded using the TFIDF metric or neural embeddings of specific kinds. Furthermore, in response to a difficulty NNs and baselines faced in predicting the exact location of a target, we introduce the idea of approximately correct targets (ACTs), where the goal is to find a region which likely contains a true target rather than its exact location. We show that with ACTs, NNs consistently outperform Ranking SVM and TFIDF on the aforementioned datasets.
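
The ranking objective behind the triplet model is the standard triplet loss. A minimal sketch on plain vectors (the paper's networks, of course, operate on learned representations rather than raw tuples):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style triplet loss: push the true target passage (positive)
    # closer to the citing sentence (anchor) than a distractor (negative),
    # by at least `margin`; zero loss once the margin is satisfied.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, sqdist(anchor, positive) - sqdist(anchor, negative) + margin)
```

At inference time no loss is needed: candidates are simply ranked by their distance to the anchor, and the closest passage wins.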
