Keh-jiann Chen - Profile on Academia.edu
Papers by Keh-jiann Chen
International Conference on Acoustics, Speech, and Signal Processing
In this paper, an efficient algorithm is developed to handle the difficulties in parsing noisy word lattices (sets of word hypotheses obtained in continuous speech recognition), which include problems such as word boundary overlapping, homonyms, lexical ambiguities, and recognition uncertainty and errors. An augmented chart is first proposed, and the new algorithm is then derived on this chart. This algorithm properly integrates the global structural synthesis capabilities of the unification grammar and the local relation estimation capabilities of the Markov language model. The parsing algorithm is island-driven and best-first. In this way, not only can the features of the grammatical and statistical approaches be combined, but the effects of the two different approaches are reflected in a single algorithm so that the overall selectivity can be appropriately …
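The best-first search over a word lattice described above can be illustrated in much simplified form: a priority-queue search over a toy lattice of word hypotheses, scored by acoustic log-probabilities plus a bigram language model. The lattice, scores, and `best_first_decode` function below are hypothetical illustrations, not the paper's augmented-chart algorithm (which additionally integrates a unification grammar and island-driven expansion):

```python
import heapq

# Hypothetical toy word lattice: edges (start_node, end_node, word, acoustic_score).
# Scores are log-probabilities, so higher (closer to 0) is better.
lattice = [
    (0, 1, "I", -0.1),
    (1, 2, "scream", -1.2),
    (1, 2, "see", -0.4),
    (2, 3, "cream", -0.5),
]
# Hypothetical bigram language-model log-probabilities.
bigram = {("<s>", "I"): -0.2, ("I", "see"): -0.5, ("I", "scream"): -1.0,
          ("see", "cream"): -0.9, ("scream", "cream"): -1.1}

def best_first_decode(lattice, bigram, start=0, goal=3):
    """Expand the highest-scoring partial path first (Dijkstra-like)."""
    edges_from = {}
    for s, e, w, sc in lattice:
        edges_from.setdefault(s, []).append((e, w, sc))
    # Heap holds (negated score, node, previous word, words so far).
    heap = [(0.0, start, "<s>", [])]
    while heap:
        neg, node, prev, words = heapq.heappop(heap)
        if node == goal:           # first goal pop is optimal: edge costs >= 0
            return words, -neg
        for e, w, sc in edges_from.get(node, []):
            lm = bigram.get((prev, w), -5.0)   # crude back-off penalty
            heapq.heappush(heap, (neg - sc - lm, e, w, words + [w]))
    return None, float("-inf")

words, score = best_first_decode(lattice, bigram)
# "I see cream" wins over "I scream cream" once acoustic and LM scores combine.
```

Because every combined edge cost is non-negative in negated log space, the first path popped at the goal node is guaranteed optimal, which is what makes a best-first strategy attractive for noisy lattices.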
A Mathematical Model for Chinese Input
Temporal relations, including duration, aspect, frequency, time point, and sequence, describe the relationships between time elements and events in complex knowledge networks. Logical compatibility between temporal elements and event types strongly influences the semantic interpretation and grammaticality of sentences. It is one of the most complicated, frequently used, and least well understood topics in linguistics. In this paper, we focus our attention on duration only. We make fine-grained distinctions among time intervals and provide explanatory reasons for their common functionalities and idiosyncrasies. We point out that the types of collocated events and the semantics of time intervals are the main factors controlling the usage of time-interval words. Furthermore, we also show that the morpho-syntactic structure of time-interval words reduces the flexibility of their usage. We list four different types of morpho-syntactic structures for duration expressions and provide the constraints on their usage.
Natural languages are means to denote concepts. However, word sense ambiguities make natural language processing and conceptual processing almost impossible. To bridge the gap between natural language representations and conceptual representations, we propose a universal concept representation mechanism, called Extended-HowNet, which evolved from HowNet. It extends the word sense definition mechanism of HowNet and uses WordNet synsets as the vocabulary for describing concepts. Each word sense (or concept) is defined by simpler concepts. The simple concepts used in the definitions can be further decomposed into even simpler concepts, until primitive or basic concepts are reached. Therefore, the definition of a concept can be dynamically decomposed and unified into Extended-HowNet at different levels of representation. Extended-HowNet is language independent: any word sense of any language can be defined and given a near-canonical representation. For any two concepts, not only their semantic distance but also their sense similarity and difference can be determined by checking their definitions. In addition to taxonomic links, concepts are also associated through their shared conceptual features. Fine-grained differences among near-synonyms can be differentiated by adding new features.
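The recursive decomposition of word senses into primitive concepts can be sketched as follows. The `definitions` table, the concepts in it, and the Jaccard-style `similarity` measure are invented toy stand-ins, not Extended-HowNet's actual definition formalism or distance metric:

```python
# Hypothetical toy definitions: each concept is defined by simpler concepts,
# bottoming out at primitives (marked here by an empty definition).
definitions = {
    "restaurant": ["place", "sell", "food"],
    "food": ["edible", "object"],
    "sell": ["exchange", "money"],
    # primitives
    "place": [], "edible": [], "object": [], "exchange": [], "money": [],
}

def expand(concept, defs):
    """Recursively decompose a concept into its set of primitive concepts."""
    parts = defs.get(concept, [])
    if not parts:
        return {concept}            # already primitive
    prims = set()
    for part in parts:
        prims |= expand(part, defs)
    return prims

def similarity(a, b, defs):
    """Toy overlap measure (Jaccard) over shared primitive features."""
    fa, fb = expand(a, defs), expand(b, defs)
    return len(fa & fb) / len(fa | fb)
```

For example, `expand("restaurant", definitions)` bottoms out at the primitives place, edible, object, exchange, and money, so "restaurant" and "food" share two of five primitives. This is the sense in which decomposable definitions let two concepts be compared by checking their definitions alone.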
A model for Lexical Analysis and Parsing of Chinese Sentences
In order to train machines to 'understand' natural language, we propose a meaning representation mechanism called E-HowNet to encode lexical senses. In this paper, we take interrogatives as examples to demonstrate the mechanisms of semantic representation and composition of interrogative constructions under the framework of E-HowNet. We classify interrogative words into five classes according to their query types and represent each type of interrogative with fine-grained features and operators. The process of semantic composition and the difficulties of representation, such as word sense disambiguation, are addressed. Finally, machine understanding is tested by showing how machines derive the same deep semantic structure for synonymous sentences with different surface structures.
This paper describes a universal concept representation mechanism called E-HowNet to handle difficulties caused by unknown words in natural language processing. The semantic structures and sense disambiguation of unknown words are discovered by analogy. Our goal is that any concept can be defined in E-HowNet with a near-canonical representation. The design for easy semantic composition and decomposition makes it possible to automate semantic processing for unknown words, phrases, and even sentences.
The Identification of Thematic Roles in Parsing Chinese
Journal of Information Science and Engineering
There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example-based approaches. Usually, the similarity between two words is defined according to the distance between their semantic classes in a semantic taxonomy. Such taxonomy-based approaches are essentially semantic and do not consider syntactic similarities. However, in real applications, both semantic and syntactic similarities are required and are weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarities. In this paper, we propose using only syntactically related co-occurrences as context vectors and adopt information-theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word, and all context features are weighted by their IDF (inverse document frequency) values. An agglomerative clustering algorithm is then applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together.
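The core of the approach above, IDF-weighted syntactic co-occurrence vectors compared pairwise, can be sketched with toy data. The words, feature names, counts, and cosine measure here are hypothetical illustrations (the paper derives features by parsing a corpus and then applies agglomerative clustering, which this sketch omits):

```python
import math
from collections import Counter

# Hypothetical toy data: counts of syntactic co-occurrence features per word,
# e.g. "eat_obj" = appeared as the object of "eat".
cooc = {
    "apple":  Counter({"eat_obj": 8, "red_mod": 5, "buy_obj": 3}),
    "orange": Counter({"eat_obj": 7, "buy_obj": 4, "peel_obj": 2}),
    "run":    Counter({"fast_mod": 6, "subj_person": 9}),
}

def idf_weights(cooc):
    """IDF: features shared by fewer words are more characteristic."""
    n = len(cooc)
    df = Counter()
    for feats in cooc.values():
        df.update(feats.keys())
    return {f: math.log(n / df[f]) for f in df}

def vector(word, idf):
    """IDF-weighted context vector for a word."""
    return {f: c * idf[f] for f, c in cooc[word].items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

idf = idf_weights(cooc)
sim = cosine(vector("apple", idf), vector("orange", idf))
```

Here "apple" and "orange" overlap on the food-like features and score well above zero, while "apple" and "run" share no features and score exactly zero; a full system would feed such a pairwise similarity matrix into agglomerative clustering.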
A syllable-based very-large-vocabulary voice retrieval system for Chinese databases with textual attributes
Unconstrained speech retrieval for Chinese document databases with very large vocabulary and unlimited domains
Chinese language model adaptation based on document classification and multiple domain-specific language models
Mandarin Chinese Character Frequency List Based on National Phonetic Alphabets
Design Criteria and Content of the 'Segmentation Standard for Chinese Information Processing' (《资讯处理用中文分词规范》)
Intelligent retrieval of very large Chinese dictionaries with speech queries
Mandarin Chinese Word Frequency Dictionary
[Abstract garbled by PDF font extraction; recoverable content: a Chinese-language description of the Sinica Treebank, its derivation from the Sinica Corpus, and the Information-based Case Grammar (ICG) framework used for its annotation.]
Academia Sinica balanced corpus (Version 3)