A Study of the Effectiveness of Suffixes for Chinese Word Segmentation (original) (raw)

Integrating Generative and Discriminative Character-Based Models for Chinese Word Segmentation

ACM Transactions on Asian Language Information Processing, 2012

Among statistical approaches to Chinese word segmentation, the word-based n-gram ( generative ) model and the character-based tagging ( discriminative ) model are two dominant approaches in the literature. The former gives excellent performance for the in-vocabulary (IV) words; however, it handles out-of-vocabulary (OOV) words poorly. On the other hand, though the latter is more robust for OOV words, it fails to deliver satisfactory performance for IV words. These two approaches behave differently due to the unit they use (word vs. character) and the model form they adopt (generative vs. discriminative). In general, character-based approaches are more robust than word-based ones, as the vocabulary of characters is a closed set; and discriminative models are more robust than generative ones, since they can flexibly include all kinds of available information, such as future context. This article first proposes a character-based n -gram model to enhance the robustness of the generative...

A Unified Model for Solving the OOV Problem of Chinese Word Segmentation

ACM Transactions on Asian and Low-Resource Language Information Processing

This article proposes a unified, character-based, generative model to incorporate additional resources for solving the out-of-vocabulary (OOV) problem of Chinese word segmentation, within which different types of additional information can be utilized independently in corresponding submodels. This article mainly addresses the following three types of OOV: unseen dictionary words, named entities, and suffix-derived words, none of which are handled well by current approaches. The results show that our approach can effectively improve the performance of the first two types with positive interaction in F-score. Additionally, we also analyze reason that suffix information is not helpful. After integrating the proposed generative model with the corresponding discriminative approach, our evaluation on various corpora---including SIGHAN-2005, CIPS-SIGHAN-2010, and the Chinese Treebank (CTB)---shows that our integrated approach achieves the best performance reported in the literature on all ...

A Study of Chinese Word Segmentation Based on Chinese Characteristics

This paper introduces the research on Chinese word segmentation (CWS). The word segmentation of Chinese expressions is difficult due to the fact that there is no word boundary in Chinese expressions and that there are some kinds of ambiguities that could result in different segmentations. To distinguish itself from the conventional research that usually emphasizes more on the algorithms employed and the workflow designed with less contribution to the discussion of the fundamental problems of CWS, this paper firstly makes effort on the analysis of the characteristics of Chinese and several categories of ambiguities in Chinese to explore potential solutions. The selected conditional random field models are trained with a quasi-Newton algorithm to perform the sequence labeling. To consider as much of the contextual information as possible, an augmented and optimized set of features is developed. The experiments show promising evaluation scores as compared to some related works.

Chinese Word Segmentation by Classification of Characters

Int. J. Comput. Linguistics Chin. Lang. Process., 2004

During the process of Chinese word segmentation, two main problems occur: segmentation ambiguities and unknown word occurrences. This paper describes a method to solve the segmentation problem. First, we use a dictionary-based approach to segment the text. We apply the Maximum Matching algorithm to segment the text forwards (FMM) and backwards (BMM). Based on the difference between FMM and BMM, and the context, we apply a classification method based on Support Vector Machines to re-assign the word boundaries. In so doing, we use the output of a dictionary-based approach, and then apply a machine-learning-based approach to solve the segmentation problem. Experimental results show that our model can achieve an F-measure of 99.0 for overall segmentation, given the condition that there are no unknown words in the text, and an F-measure of 95.1 if unknown words exist.

A Character-Based Joint Model for Chinese Word Segmentation

2010

The character-based tagging approach is a dominant technique for Chinese word segmentation, and both discriminative and generative models can be adopted in that framework. However, generative and discriminative character-based approaches are significantly different and complement each other. A simple joint model combining the character-based generative model and the discriminative one is thus proposed in this paper to take advantage of both approaches. Experiments on the Second SIGHAN Bakeoff show that this joint approach achieves 21% relative error reduction over the discriminative model and 14% over the generative one. In addition, closed tests also show that the proposed joint model outperforms all the existing approaches reported in the literature and achieves the best F-score in four out of five corpora.

A Study of Chinese Word Segmentation Based on the Characteristics of Chinese

This paper introduces the research on Chinese word segmentation (CWS). The word segmentation of Chinese expressions is difficult due to the fact that there is no word boundary in Chinese expressions and that there are some kinds of ambiguities that could result in different segmentations. To distinguish itself from the conventional research that usually emphasizes more on the algorithms employed and the workflow designed with less contribution to the discussion of the fundamental problems of CWS, this paper firstly makes effort on the analysis of the characteristics of Chinese and several categories of ambiguities in Chinese to explore potential solutions. The selected conditional random field models are trained with a quasi-Newton algorithm to perform the sequence labeling. To consider as much of the contextual information as possible, an augmented and optimized set of features is developed. The experiments show promising evaluation scores as compared to some related works. 1

A Realistic and Robust Model for Chinese Word Segmentation

Proceedings of ROCLING …, 2008

A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet robust enough to yield reliable segmentation result for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. , achieve high f-scores yet require massive training for any variation. Text-driven approach, e.g. , can be easily adapted for domain and genre changes yet has difficulty matching the high f-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in . The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick training data preparation converting characters as contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and 1 million vectors achieve comparable and reliable results. In addition, when applied to SigHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation as it can be easily adapted for new variants with robust result. In conclusion, we will discuss linguistic ramifications as well as future implications for the WBD approach.

Rethinking Chinese word segmentation

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition it to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB's) into either word-boundaries (WB's) and non-word-boundaries. In Chinese, CB's are delimited and distributed in between two characters. Hence we can use the distributional properties of CB among the background character strings to predict which CB's are WB's.

Enhancing Chinese Word Segmentation with Character Clustering

Lecture Notes in Computer Science, 2013

In semi-supervised learning framework, clustering has been proved a helpful feature to improve system performance in NER and other NLP tasks. However, there hasn't been any work that employs clustering in word segmentation. In this paper, we proposed a new approach to compute clusters of characters and use these results to assist a character based Chinese word segmentation system. Contextual information is considered when we perform character clustering algorithm to address character ambiguity. Experiments show our character clusters result in performance improvement. Also, we compare our clusters features with widely used mutual information (MI). When two features integrated, further improvement is achieved.

Integrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation

Current character-based approaches are not robust for cross domain Chinese word segmentation. In this paper, we alleviate this problem by deriving a novel enhanced character-based generative model with a new abstract aggregate candidate-feature, which indicates if the given candidate prefers the corresponding position-tag of the longest dictionary matching word. Since the distribution of the proposed feature is invariant across domains, our model thus possesses better generalization ability. Open tests on CIPS-SIGHAN-2010 show that the enhanced generative model achieves robust cross-domain performance for various OOV coverage rates and obtains the best performance on three out of four domains. The enhanced generative model is then further integrated with a discriminative model which also utilizes dictionary information. This integrated model is shown to be either superior or comparable to all other models reported in the literature on every domain of this task.