Benjamin Tsou - Academia.edu

Papers by Benjamin Tsou

There is increasing practical need for bilingual lexicons in wide-ranging applications such as CAT systems and cross-lingual information retrieval. The need for more sophisticated and more extensive resources involving bilingual multi-word expressions (MWEs) has increased even more for the bilingual processing of complex texts such as legal documents, technical names, contracts and patents. We report here a preliminary but successful effort to mine bilingual MWEs from 300+k Chinese-English patents.

6th International Conference on Spoken Language Processing (ICSLP 2000)

A COMPLEMENTARY APPROACH TO COMPUTER-AIDED TRANSCRIPTION: SYNERGY OF STATISTICAL-BASED AND KNOWLEDGE DISCOVERY PARADIGMS Benjamin K. T'sou and Tom BY Lai Language Information ...

2019 International Conference on Asian Language Processing (IALP)

This paper presents the first data-driven model for selecting carrier sentences with word and context embeddings. In computer-assisted language learning systems, fill-in-the-blank items help users review or learn new vocabulary. A crucial step in automatic generation of fill-in-the-blank items is the selection of carrier sentences that illustrate the usage and meaning of the target word. Previous approaches for carrier sentence selection have mostly relied on features related to sentence length, vocabulary difficulty and word association strength. We train a statistical classifier on a large-scale, automatically constructed corpus of sample carrier sentences for learning Chinese as a foreign language, and use it to predict the suitability of a candidate carrier sentence for a target word. Human evaluation shows that our approach leads to substantial improvement over a word co-occurrence heuristic, and that context embeddings further enhance selection performance.

Int. J. Comput. Linguistics Chin. Lang. Process., 1997

In Chinese text, discourse connectives constitute a major linguistic device available for a writer to explicitly indicate the structure of a discourse. This set of discourse connectives, consisting of a few hundred entries in modern Chinese, is relatively stable and domain independent. In a recently published paper [T'sou 1996], a computational procedure was introduced to generate the abstract of an input text using mainly the discourse connectives appearing in the text. This paper attempts to demonstrate the validity of this approach to full-text abstraction by means of an evaluation method that compares human efforts in text abstraction with the performance of an experimental system called ACFAS. Specifically, our concern is the relationship between the perceived importance of each individual sentence as judged by human beings and the sentences containing discourse connectives within an argumentative discourse.

Following the reversion of sovereignty from Britain to China in 1997, newly introduced legal bilingualism in Hong Kong has brought on an urgent need to create a Computer-Aided Transcription (CAT) system for Chinese. The production and retention of verbatim records of court proceedings is vital for the retention of the Common Law system. The existing monolingual English CAT has to be adapted in order to produce the legally tenable court proceedings in Cantonese, the predominant Chinese dialect in Hong Kong. There are two major challenges in the design of a Chinese CAT system. First, linguistic differences in phonology and orthography mandate the adoption of a new conversion mechanism of stenograph code for Chinese. The key issue lies in the resolution of ambiguity arising from problematical homonymy in the Chinese language. With the support of a 0.85 million-character corpus, the bigram statistical model has been adopted to compute the most likely Chinese character string for each se...
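The bigram disambiguation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a Viterbi search over homophonous candidates; the abstract does not say which search procedure the authors used. The function name `decode`, the romanised codes, the candidate table and all probabilities below are hypothetical toy values, not data from the paper.

```python
import math

def decode(codes, candidates, bigram, unigram, floor=1e-6):
    """Return the most likely character string for a sequence of codes.

    codes      -- list of stenograph-like codes (hypothetical romanisations)
    candidates -- dict: code -> list of homophonous characters
    bigram     -- dict: (previous_char, char) -> P(char | previous_char)
    unigram    -- dict: char -> P(char), used for the first position only
    floor      -- small probability assigned to unseen events
    """
    # best[c] holds (log-probability, path) of the best path ending in c.
    best = {c: (math.log(unigram.get(c, floor)), [c])
            for c in candidates[codes[0]]}
    for code in codes[1:]:
        step = {}
        for cur in candidates[code]:
            # Pick the predecessor that maximises the path score.
            score, path = max(
                ((s + math.log(bigram.get((prev, cur), floor)), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    return "".join(max(best.values(), key=lambda item: item[0])[1])

# Toy data (invented for illustration): two codes, each with two homophones.
candidates = {"hoeng1": ["香", "鄉"], "gong2": ["港", "講"]}
unigram = {"香": 0.5, "鄉": 0.5}
bigram = {("香", "港"): 0.6, ("香", "講"): 0.01,
          ("鄉", "港"): 0.01, ("鄉", "講"): 0.02}

result = decode(["hoeng1", "gong2"], candidates, bigram, unigram)
```

With these toy values the path 香 followed by 港 scores highest, so `result` is the string 香港. Working in log space avoids numerical underflow on longer code sequences.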

Chinese Studies, 2019

Like many dynamic systems, language undergoes change over time. For Chinese, the changes have come about in different ways, which could qualitatively affect the system underlying the language, and appropriate new classes or entities have to be recognized and given new labels or names. They could involve neologism, or there could be, for example, the emergence of tones in the archaic Chinese language (or more recently in the non-Sinitic Huihui language of Hainan Island), or development toward disyllabicity, or attrition of tones in the Dungan language, which is found in Kyrgyzstan and Kazakhstan and which has its origin in Shaanxi Province, China. The changes could give rise to new subsystems or even new alternate parallel systems (e.g., sub-dialects, pidgins and creoles, new languages, and new scripts). The impetus for such changes can be due to internal dynamics, or may have external origin as a result of contact. Very often the mutual influence of these linguistic traits can also ...

1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227)

Discourse analysis has become a major focus of research from various disciplines, including computer science, linguistics, and psychology, in recent decades. The increasing recognition of discourse structure in the field of textual information retrieval makes the development of a computational method necessary. The article attempts to describe a quantitative system of discourse analysis based on the study of cohesion. What distinguishes it from previous studies is that attention is not primarily focused on itemizing cohesive features between lexical items but on observing how they combine to organize texts. We present a connectionist tool for selecting the most representative segments from a text on the basis of repeated lexical features. This follows the work on lexical cohesion, which has been identified as one of the key factors contributing to textual continuity. A methodology is developed for the production of a readable summary of a text, which is capable of some degree of automation.

Lecture Notes in Computer Science

This paper presents a novel approach to Chinese disyllabic word extraction based on semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist with 63,738 two-character words, together with the character thesauri, is explored to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) the proper combination of semantic-tag-based and character-based methods could benefit word extraction.

Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy, 2009

One of the popular techniques of active learning for data annotations is uncertainty sampling, which, however, often presents problems when outliers are selected. To solve this problem, this paper proposes a density-based re-ranking technique, in which a density measure is adopted to determine whether an unlabeled example is an outlier. The motivation of this study is to prefer not only the most informative example in terms of uncertainty measure, but also the most representative example in terms of density measure. Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets show that our proposed density-based re-ranking technique can improve uncertainty sampling.
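The idea of combining uncertainty with density can be sketched as follows. This is one plausible reading, not the paper's method: it assumes entropy as the uncertainty measure, average cosine similarity to the k nearest neighbours as the density measure, and a shortlist-then-re-rank scheme. None of these choices is specified in the abstract, and every function name and data value here is hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (uncertainty)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_density(i, vectors, k):
    """Average similarity of example i to its k most similar neighbours."""
    sims = sorted((cosine(vectors[i], v)
                   for j, v in enumerate(vectors) if j != i), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0

def select_example(vectors, probs, shortlist_size=3, k=2):
    """Shortlist the most uncertain examples, then pick the densest one."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: entropy(probs[i]), reverse=True)
    shortlist = ranked[:shortlist_size]
    return max(shortlist, key=lambda i: knn_density(i, vectors, k))

# Toy pool (invented): example 3 is the most uncertain but is an outlier.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [-1.0, 0.2]]
probs = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.5, 0.5]]

plain_choice = max(range(len(vectors)), key=lambda i: entropy(probs[i]))
reranked_choice = select_example(vectors, probs)
```

In this toy pool, plain uncertainty sampling picks the outlier (index 3), while the re-ranked selection picks index 2, which is nearly as uncertain but lies in a dense region of the pool.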

Proceedings of the 18th ACM conference on Information and knowledge management, 2009

This paper presents an unsupervised approach to aspect-based opinion polling from raw textual reviews without explicit ratings. The key contribution of this paper is three-fold. First, a multi-aspect bootstrapping algorithm is proposed to learn from unlabeled data aspect-related terms of each aspect to be used for aspect identification. Second, an unsupervised segmentation model is proposed to address the challenge of

Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion, 2009

Aspect-based sentiment summarization systems generally use sentences associated with relevant aspects extracted from the reviews as the basis for summarization. However, in real reviews, a single sentence often exhibits several aspects for opinions. This paper proposes a two-stage segmentation model to address the challenge of identifying multiple single-aspect and single-polarity units in one sentence, namely aspect-based sentence segmentation. Our model

Proceedings of the 19th international conference on Computational linguistics, 2002

Lecture Notes in Computer Science, 2005

Comparing Entropies Within the Chinese Language Benjamin K. Tsou, Tom BY Lai, and Ka-po Chow Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Hong Kong {rlbtsou,cttomlai,kapo.chow}@cityu.edu.hk Abstract. ...

Lecture Notes in Computer Science, 2006

This paper reports on a first step toward the construction of a Pan-Chinese lexical resource. We investigated the plausibility of extending and enhancing an existing Chinese synonym dictionary, the Tongyici Cilin, with lexical items from the financial news domain obtained from a synchronous Chinese corpus, LIVAC. Results showed that 23-40% of the words from various subcorpora are unique to the

International Journal of Computer Processing of Languages, 2000

Human beings are unique not only because they use language, but also because they can provide summaries or condensations of salient information. This article shows that there is an aspect of consistency in human summarisation of texts. We report on an experiment which examined 150 Chinese subjects of similar background, except for social, cultural and, of course, geographical variation. The evaluation was based on their performance on an identical task to single out salient information. They were asked to mark, under similarly controlled conditions, textual segments (sentences/clauses) in given texts which they considered to be significant. Analysis of results using several non-parametric tests indicates that human subjects of similar educational and cultural background, considered as a group vis-a-vis other groups, show a fairly high degree of consistency. This implies that if a large group is used, human judgement in determining textual saliency is reliable. Our experimental results also indicate that subjects in Mainland China and those in Taiwan behave as two distinct groups. A closer examination of the individual textual segments that are considered important by these two groups indicates that subjects in the two groups adopt different approaches in determining saliency in textual summarisation.

International Journal of Computer Processing of Languages, 2005


Software: Practice and Experience, 2002

SOFTWARE: PRACTICE AND EXPERIENCE Softw. Pract. Exper. 2003; 33:41-59 (DOI: 10.1002/spe.494) Bilingual legal document retrieval and management using XML RWP Luk, BKY T'sou, TBY Lai, OOY Kwong, FCY Chik and LYL Cheung ...

IEEE Transactions on Audio, Speech, and Language Processing, 2010

... Density for Data Annotations JINGBO ZHU* AND HUIZHEN WANG ... language processing applications such as word sense disambiguation (Chen et al. 2006; Chan and Ng 2007), text classification (TC) (Lewis and Gale 1994; Zhu et al. ...