Keh-Yih Su - Academia.edu

Papers by Keh-Yih Su

2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI)

This paper proposes two novel approaches to supplement BERT for making negation-word (NW) entailment judgments on a Chinese Social Studies QA dataset. Recently, BERT has shown excellent performance (and even outperforms humans) on several natural language inference tasks. However, BERT is found to achieve these remarkable results mainly by exploiting surface features (such as lexical distribution bias) that go unnoticed by humans. In our test, BERT’s performance degrades significantly when it is tested on an NW-involved dataset from which the influence of lexical distribution bias has been removed. Since a single unmatched NW could toggle the overall judgment, we propose a Negation-Word-Aggregative-Pattern (NWAP) for reflecting the NW matching status, and use it to supplement BERT. Our first approach supplements BERT with an NW toggling module, which decides whether BERT’s answer should be toggled according to the NWAP. The second approach concatenates the embedding of the above NWAP with BERT’s output vectors, and then feeds them into a feedforward (FF) neural network (NN) to make the final judgment. Experiments show that our toggling module outperforms BERT by 59% accuracy on the lexical-bias-removed NW dataset. With adversarial training data added to both BERT and our FF NN, our model still outperforms BERT by 23% accuracy.
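The toggling idea in the abstract above can be illustrated with a toy sketch. The abstract does not define the NWAP, so everything here is an assumption made for illustration: the negation-word list, the function names, and the parity rule (flip the base judgment when the number of unmatched negation words is odd) are hypothetical and are not the paper's actual module.

```python
from collections import Counter
from typing import List

# Hypothetical negation-word list (English stand-ins; the paper's dataset is Chinese).
NEGATION_WORDS = {"not", "no", "never"}


def unmatched_negations(premise: List[str], hypothesis: List[str]) -> int:
    """Count negation words that appear on one side but not the other."""
    p = Counter(t for t in premise if t in NEGATION_WORDS)
    h = Counter(t for t in hypothesis if t in NEGATION_WORDS)
    # Counter subtraction keeps only positive counts, so this is the
    # multiset symmetric difference.
    return sum(((p - h) + (h - p)).values())


def toggle_if_needed(base_label: bool, premise: List[str], hypothesis: List[str]) -> bool:
    """Assumed rule: flip the base model's judgment on an odd number of unmatched NWs."""
    if unmatched_negations(premise, hypothesis) % 2 == 1:
        return not base_label
    return base_label


premise = "taiwan is an island".split()
negated = "taiwan is not an island".split()

# Suppose a surface-feature-driven base model answers "entailed" (True) for both pairs.
flipped = toggle_if_needed(True, premise, negated)
kept = toggle_if_needed(True, negated, negated)
print(flipped, kept)
```

In this sketch the base model's label stands in for BERT's answer; a real system would obtain it from the model and would need proper word segmentation for Chinese text.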

In this paper, the major problems of the current machine translation systems are first outlined. A new direction, highlighting the system capability to be customizable and self-learnable, is then proposed for attacking the described problems, which mainly result from the very complicated characteristics of natural languages. The proposed solution adopts an unsupervised two-way training mechanism and a parameterized architecture to acquire the required statistical knowledge, such that the system can be easily adapted to different domains and various preferences of individual users.

An improved statistical model is proposed in this paper for extracting compound words from a text corpus. Traditional terminology extraction methods rely heavily on simple filtering-and-thresholding methods, which are unable to minimize the error counts objectively. Therefore, a method for minimizing the error counts is very desirable. In this paper, an improved statistical model is developed to integrate parts of speech information as well as other frequently used word association metrics to jointly optimize the extraction tasks. The features are modelled with a multivariate Gaussian mixture for handling the inter-feature correlations properly. With a training (resp. testing) corpus of 20715 (resp. 2301) sentences, the weighted precision & recall (WPR) can achieve about 84% for bigram compounds, and 86% for trigram compounds. The F-measure performances are about 82% for bigrams and 84% for trigrams. 1. Compound Word Extraction Problems 1.1 Motivation Compound words are very common ...

Since statistical machine translation (SMT) and translation memory (TM) complement each other in matched and unmatched regions, integrated models are proposed in this paper to incorporate TM information into phrase-based SMT. Unlike previous multi-stage pipeline approaches, which directly merge TM results into the final output, the proposed models refer to the corresponding TM information associated with each phrase at SMT decoding. On a Chinese–English TM database, our experiments show that the proposed integrated Model-III is significantly better than either the SMT or the TM systems when the fuzzy match score is above 0.4. Furthermore, integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points in comparison with the pure SMT system. Besides, the proposed models also outperform previous approaches significantly.

In a natural language processing system, a large amount of ambiguity and a large branching factor are hindering factors in obtaining the desired analysis for a given sentence in a short time. In this paper, we propose a sequential truncation parsing algorithm to reduce the search space and thus lower the parsing time. The algorithm is based on a score function which takes advantage of the probabilistic characteristics of syntactic information in the sentences. A preliminary test on this algorithm was conducted with a special version of our machine translation system, the ARCHTRAN, and an encouraging result was observed.

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

With the recent advancements in deep learning, neural solvers have gained promising results in solving math word problems. However, these SOTA solvers only generate binary expression trees that contain basic arithmetic operators and do not explicitly use the math formulas. As a result, the expression trees they produce are lengthy and uninterpretable because they need to use multiple operators and constants to represent one single formula. In this paper, we propose sequence-to-general tree (S2G) that learns to generate interpretable and executable operation trees where the nodes can be formulas with an arbitrary number of arguments. With nodes now allowed to be formulas, S2G can learn to incorporate mathematical domain knowledge into problem-solving, making the results more interpretable. Experiments show that S2G can achieve better performance than strong baselines on problems that require domain knowledge.
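To make the contrast between binary expression trees and general operation trees concrete, here is a minimal sketch of an evaluator for trees whose nodes may be formulas with any number of arguments. The operation table, the `Node` class, and the helper names are assumptions for illustration only; they are not taken from the S2G paper and say nothing about how the model generates such trees.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Hypothetical operation table: basic operators plus formula nodes.
OPS: Dict[str, Callable[..., float]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "rectangle_perimeter": lambda length, width: 2 * (length + width),
    "average": lambda *xs: sum(xs) / len(xs),  # arbitrary number of arguments
}


@dataclass
class Node:
    op: str
    args: List[Union["Node", float]]


Tree = Union[Node, float]


def evaluate(tree: Tree) -> float:
    """Recursively execute an operation tree."""
    if isinstance(tree, Node):
        return OPS[tree.op](*(evaluate(a) for a in tree.args))
    return float(tree)


def size(tree: Tree) -> int:
    """Count operator nodes and leaves."""
    if isinstance(tree, Node):
        return 1 + sum(size(a) for a in tree.args)
    return 1


# Perimeter of a 5-by-3 rectangle, written two ways.
binary_tree = Node("mul", [2.0, Node("add", [5.0, 3.0])])
general_tree = Node("rectangle_perimeter", [5.0, 3.0])

print(evaluate(binary_tree), size(binary_tree))
print(evaluate(general_tree), size(general_tree))
```

Both trees evaluate to the same value, but the formula node names the domain knowledge directly and needs fewer nodes, which is the interpretability argument the abstract makes.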

ArXiv, 2021

We present a novel approach to answering Chinese elementary school Social Study multiple-choice questions. Although BERT has demonstrated excellent performance on Reading Comprehension tasks, it is found to handle some specific types of questions poorly, such as Negation, All-of-the-above, and None-of-the-above. We thus propose a novel framework that cascades BERT with a Pre-Processor module and an Answer-Selector module to tackle the above challenges. Experimental results show that the proposed approach effectively improves the performance of BERT, and thus demonstrate the feasibility of supplementing BERT with additional modules.

This study presents a novel QA-based sequence labeling (QASL) approach to naturally tackle both flat and nested Named Entity Recognition (NER) tasks on a Chinese Electronic Health Records (CEHRs) dataset. The proposed QASL approach asks, in parallel, a corresponding natural language question for each specific named entity type, and then identifies the associated NEs of that type with the BIO tagging scheme. The associated nested NEs are then formed by overlapping the results of various types. In comparison with pure sequence-labeling (SL) approaches, since the given question includes significant prior knowledge about the specified entity type and the capability of extracting NEs with different types, the performance for the nested NER task is improved, obtaining an F1-score of 90.70%. Besides, in comparison with the pure QA-based approach, our proposed approach retains the SL features, which could extract multiple NEs with the same types without knowing the exact nu...

In this paper, an unsupervised approach for constructing a large-scale Chinese electronic dictionary is surveyed. The main purpose is to enable cheap and quick acquisition of a large-scale dictionary from a large untagged text corpus with the aid of the information in a small tagged seed corpus. The basic model is based on a Viterbi reestimation technique. During the dictionary construction process, it tries to optimize the automatic segmentation and tagging process by repeatedly refining the set of parameters of the underlying language model. The refined parameters are then used to get a further improved tagging result. In addition, a two-class classifier, which is capable of classifying an n-gram either as a word or a non-word, is used in combination with the Viterbi training module to improve the system performance. Two different system configurations have been developed to construct the dictionary. The configurations include (1) a Viterbi word identification module followed by a Vit...

IEEE Transactions on Medical Imaging, 1992

Phosphorus-containing compounds of the formula wherein R is phenyl or alkyl of 1 to 4 carbon atoms, m and p are each integers from 2 to 6, n and q are each integers from 1 to 10, y is an integer of at least 2, and flame retardant polymers containing such phosphorus-containing compounds.

The character-based tagging approach is a dominant technique for Chinese word segmentation, and both discriminative and generative models can be adopted in that framework. However, generative and discriminative character-based approaches are significantly different and complement each other. A simple joint model combining the character-based generative model and the discriminative one is thus proposed in this paper to take advantage of both approaches. Experiments on the Second SIGHAN Bakeoff show that this joint approach achieves 21% relative error reduction over the discriminative model and 14% over the generative one. In addition, closed tests also show that the proposed joint model outperforms all the existing approaches reported in the literature and achieves the best F-score in four out of five corpora.
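For readers unfamiliar with the "relative error reduction" figure quoted above, the following sketch shows how such a number is typically computed. It assumes the error is defined as one minus the F-score, which the abstract does not state, and the F-scores in the example are made-up values chosen only to reproduce a 21% reduction; they are not results from the paper.

```python
def relative_error_reduction(baseline_f: float, new_f: float) -> float:
    """Fraction of the baseline's error (assumed to be 1 - F) removed by the new model."""
    baseline_error = 1.0 - baseline_f
    new_error = 1.0 - new_f
    return (baseline_error - new_error) / baseline_error


# Hypothetical scores: error drops from 0.0500 to 0.0395.
reduction = relative_error_reduction(0.950, 0.9605)
print(f"{reduction:.0%}")
```

The point of the measure is that a small absolute gain in F-score can still correspond to a large share of the remaining errors being eliminated.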

Current neural math solvers learn to incorporate commonsense or domain knowledge by utilizing pre-specified constants or formulas. However, as these constants and formulas are mainly human-specified, the generalizability of the solvers is limited. In this paper, we propose to explicitly retrieve the required knowledge from math problem datasets. In this way, we can determinedly characterize the required knowledge and improve the explainability of solvers. Our two algorithms take the problem text and the solution equations as input. Then, they try to deduce the required commonsense and domain knowledge by integrating information from both parts. We construct two math datasets and show that our algorithms can effectively retrieve the required knowledge for problem-solving.

We investigate whether suffix-related features can significantly improve the performance of character-based approaches for Chinese word segmentation (CWS). Since suffixes are quite productive in forming new words, and OOV is the main error source for CWS, many researchers expect that suffix information can further improve the performance. With this belief, we tried several suffix-related features in both generative and discriminative approaches. However, our experimental results show that significant improvement can hardly be achieved by incorporating suffix-related features into the widely adopted surface features, which runs against the common supposition. Error analysis reveals that the main problem behind this surprising finding is the conflict between the degree of reliability and the coverage rate of suffix-related features.

Comput. Linguistics, 1995

Statistical approaches to natural language processing generally obtain the parameters by using the maximum likelihood estimation (MLE) method. The MLE approaches, however, may fail to achieve good performance in difficult tasks, because the discrimination and robustness issues are not taken into consideration in the estimation processes. Motivated by that concern, a discrimination- and robustness-oriented learning algorithm is proposed in this paper for minimizing the error rate. In evaluating the robust learning procedure on a corpus of 1,000 sentences, 64.3% of the sentences are assigned their correct syntactic structures, while only a 53.1% accuracy rate is obtained with the MLE approach. In addition, parameters are usually estimated poorly when the training data is sparse. Smoothing the parameters is thus important in the estimation process. Accordingly, we use a hybrid approach combining the robust learning procedure with the smoothing method. The accuracy rate of 69.8% is attained...

Goal: select the best set of d features which optimizes a criterion function from a large set of features.
• select the most discriminative features for processing
• reduce the dimension of the feature space and the size of the parameter space
• reduce redundant information without degrading system performance
• eliminate irrelevant or noisy features to reduce their effects on performance

Procedures:
• Initially the feature set contains no feature.
• Add one feature to the current feature set to form an enlarged feature set. The one being selected is the one that maximizes some criterion function (e.g., accuracy rate) when used jointly with the current feature set.
• Repeat until the feature set contains d features.
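The procedure above is commonly known as sequential forward selection. A minimal sketch follows; the function name, the feature names, and the toy criterion (per-feature scores with a redundancy penalty) are assumptions for illustration, standing in for a real criterion such as accuracy on held-out data.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    features: Sequence[str],
    d: int,
    criterion: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedily grow a feature set from empty until it holds d features."""
    if not 0 <= d <= len(features):
        raise ValueError("d must be between 0 and the number of features")
    selected: List[str] = []      # initially the feature set contains no feature
    remaining = list(features)
    while len(selected) < d:
        # Choose the feature that maximizes the criterion when used
        # jointly with the features already selected.
        best = max(remaining, key=lambda f: criterion(frozenset(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy criterion (hypothetical): "b" is largely redundant with "a".
SCORES = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.1}


def toy_criterion(subset: FrozenSet[str]) -> float:
    total = sum(SCORES[f] for f in subset)
    if {"a", "b"} <= subset:
        total -= 0.35             # penalty for redundant information
    return total


chosen = forward_select(list(SCORES), 2, toy_criterion)
print(chosen)  # ['a', 'c']
```

Note that "b" has the second-highest individual score but is skipped, because the criterion is evaluated jointly with the current set. The greedy search does not guarantee the globally best subset of d features; it trades optimality for far fewer criterion evaluations than an exhaustive search.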
