Amba Kulkarni | University of Hyderabad
Sanskrit Computational Linguistics by Amba Kulkarni
These are the presentation slides of my talk at NIAS, Bangalore
Sanskrit being inflectionally rich, the conventional wisdom about Sanskrit word order is that it is free. The concept of sannidhi (proximity), one of the necessary factors in the process of verbal cognition, provides a constraint on the word order of Sanskrit. We study the free word order of Sanskrit in the light of the dependency framework. The weak non-projectivity condition on dependency graphs captures the sannidhi constraint. Gillon worked within the framework of phrase-structure syntax and noted that the freeness is constrained by clause boundaries. In an examination of the cases of dislocation observed by Gillon and all verses of the Bhagavadgītā, we notice that two relations, viz. adjectival and genitive, are more frequently involved in sannidhi violation. We conclude that the relations involved in sannidhi violation correspond to utthāpya-ākāṅkṣā (expectancy which is to be raised), barring a few exceptional cases.
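As a rough illustration of the kind of condition on dependency graphs mentioned in this abstract, the sketch below checks a dependency analysis for crossing arcs. The abstract does not define weak non-projectivity; the sketch assumes it can be read as planarity (no two arcs cross when drawn above the sentence), and the function names and the head-list encoding are hypothetical choices made for this example, not the authors' implementation.

```python
from itertools import combinations


def crossing_arcs(heads):
    """Return all pairs of arcs that cross.

    heads[i] is the 1-based position of the head of word i+1;
    0 marks the root (an assumed encoding for this sketch).
    """
    # Store each arc as (left position, right position), ignoring direction.
    arcs = [(min(dep, head), max(dep, head))
            for dep, head in enumerate(heads, start=1) if head != 0]
    # Two arcs cross when exactly one endpoint of one lies strictly inside the other.
    return [(a, b) for a, b in combinations(arcs, 2)
            if a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]]


def is_planar(heads):
    """True when no two arcs cross (the assumed reading of the constraint)."""
    return not crossing_arcs(heads)


# Word 2 is the root; words 1 and 3 depend on it: no arcs cross.
print(is_planar([2, 0, 2]))
# Word 1 depends on word 3 and word 2 on word 4: the two arcs cross.
print(crossing_arcs([3, 4, 0, 3]))
```

Under this reading, a verse whose analysis yields a non-empty list of crossing arcs would be a candidate case of sannidhi violation to inspect by hand.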
Mahābhāṣya is an important commentary on Pāṇini's grammar for Sanskrit and is highly structured. Traditional scholars have tagged it manually, showing its underlying discourse structure. The traditional grammar also discusses clues for discourse-level annotation. Taking these clues into account, we have developed an automatic tagger for the Mahābhāṣya. This paper describes the tagger along with its performance evaluation. We have also extended this tag-set to another important text, the Śābarabhāṣya.
Pāṇini's Aṣṭādhyāyī is often compared to a computer program for its rigour and coverage of the then prevalent Sanskrit language. The emergence of computer science has given a new dimension to Pāṇinian studies, as is evident from the recent efforts by Mishra [?], Hyman [?] and Scharf [?]. Ours is an attempt to discover the programming concepts, techniques and paradigms employed by Pāṇini. We discuss how three sūtras, pūrvatrāsiddham 8.2.1, asiddhavad atrābhāt 6.4.22, and ṣatvatukor asiddhaḥ 6.1.86, play a major role in the ordering of the sūtras, and we provide a model which can best be described in terms of privacy of data spaces. For conflict resolution, we use two criteria: the utsarga-apavāda relation between sūtras, and the word integrity principle. However, this needs further revision. The implementation is still in progress. The current implementation of inflectional morphology to derive a speech form is discussed in detail.
The knowledge of how a language codes information, how much information it codes and where it codes the information is crucial for a computational linguist working in the area of Natural Language Processing, and in particular Machine Translation.
Sanskrit has a rich source of lexical resources in the form of various kinds of dictionaries, and a thesaurus in the form of Amarakośa.
The Sanskrit kośas such as Amarakośa, Vaijayantīkośa etc. have a built-in knowledge structure of their own which, apart from revealing the ontological classification, provides a holistic view of various concepts. Knowledge in these kośas concerns many non-observational, culture-specific facts. In this paper we present a few representative examples of the concept clusters from two Sanskrit kośas: Amarakośa and Vaijayantīkośa. There is a need to make these valuable resources available in a suitable e-form so that the NLP community working in Indian languages can benefit.
Ādidevādhyāyaḥ (supreme deity), Lokapālādhyāyaḥ (guardian deities), Yakṣādhyāyaḥ (semi-divine beings)
• Antarikṣakāṇḍaḥ (sky): Jyotiradhyāyaḥ (light), Meghādhyāyaḥ (cloud), Khagādhyāyaḥ (bird), Śabdādhyāyaḥ (sound)
• Bhūmikāṇḍaḥ (earth): Deśādhyāyaḥ (place), Śailādhyāyaḥ (hill), Vanādhyāyaḥ (forest), Paśusaṅgrahādhyāyaḥ (animals), Manuṣyādhyāyaḥ (mankind), Brāhmaṇādhyāyaḥ (priest tribe), Kṣatriyādhyāyaḥ (military tribe), Vaiśyādhyāyaḥ (business tribe), Śūdrādhyāyaḥ (mixed class)
Sanskrit Computational Linguistics, Jan 1, 2010
Amarakośa is the most celebrated and authoritative ancient thesaurus of Sanskrit. It is one of the books which an Indian child learning through the traditional Indian educational system memorizes as early as the first year of formal learning. Though it appears to be a linear list of words, close inspection shows a rich organisation of words expressing the various relations a word bears to other words. Thus when a child studies Amarakośa further, the linear list of words unfolds into a knowledge web. In this paper we describe our effort to make the implicit knowledge in Amarakośa explicit. A model for storing such structure is discussed, and a web tool is described that answers queries by dynamically reconstructing the links among words from the structured tables.
… of National Seminar …, Jan 1, 2009
Sanskrit has a rich source of lexical resources in the form of various kinds of dictionaries, and a thesaurus in the form of Amarakośa.
Abstract. In this paper we note the importance of positing a canonical form for a verbal root and its meaning to facilitate the comparison of various Dhātuvṛttis. We also provide some quantitative measure of the differences among the Dhātuvṛttis after correlating four Dhātuvṛttis using canonical forms of roots and meanings. Keywords: Pāṇinīya Dhātupāṭha, canonical form, quantitative analysis.
Bh. K Festschrift volume by LSI, Jan 1, 2009
For an inflectionally rich language like Sanskrit, any NLP application demands a good morphological analyzer. Though Sanskrit is the best-analyzed language in the world, a morphological analyzer with good coverage is still not available for it. This paper points out the complexity involved in building a wide-coverage analyzer for Sanskrit and then describes a morphological analyzer that has been built using the available e-resources, based on ad-hoc principles. The coverage of this analyzer is around 95%. Though this is not an acceptable figure for practical applications, the analyzer can be used as a stepping-stone to develop other modules such as a sandhi splitter, a search engine, etc. At a later stage, it may be replaced by a module based on the classic Aṣṭādhyāyī.
Abstract. The Aṣṭādhyāyī has a section of rules which provide conditions for compound formation. These rules are presented from the generation point of view. We study these conditions from the point of view of compound type identification. A rule-based classifier built on these rules is developed, whose performance on some of the compound types is encouraging. These conditions also suggest the type of information lexical databases should contain for automatic language analysis, including a compound classifier.
Sanskrit is very rich in compound formation, unlike modern Indian languages. Compound formation being productive, compounds form an open set, and so it is not possible to list all of them in a dictionary. Compound formation involves a mandatory sandhi. But mere sandhi splitting does not help a reader identify the meaning of a compound, since a compound typically does not code the relation between its components explicitly. To understand the meaning of a compound, it is necessary to identify its components and discover the relation between them. An expression providing the meaning of a compound is called a paraphrase.
Sanskrit Computational Linguistics, Jan 1, 2010
Sanskrit is very rich in compound formation. Typically a compound does not code the relation between its components explicitly. To understand the meaning of a compound, it is necessary to identify its components, identify the way the components group together, discover the relations between them and finally generate a paraphrase of the compound. In this paper, we discuss our efforts in building a constituency parser for Sanskrit compounds. The average performance of this parser is 85%.
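The step of identifying "the way the components group together" can be pictured as choosing one binary bracketing among all possible ones. The sketch below only enumerates the candidate groupings for an already segmented compound; it is an illustration under that assumption, not the parser described in the abstract, and the function name and the placeholder component labels are hypothetical.

```python
def bracketings(parts):
    """Return every binary grouping of `parts` as nested tuples."""
    if len(parts) == 1:
        # A single component is its own (trivial) constituent.
        return [parts[0]]
    trees = []
    # Try every split point, then combine each left grouping with each right one.
    for split in range(1, len(parts)):
        for left in bracketings(parts[:split]):
            for right in bracketings(parts[split:]):
                trees.append((left, right))
    return trees


# Placeholder labels stand in for the components of a segmented compound.
for tree in bracketings(["c1", "c2", "c3"]):
    print(tree)
```

The number of candidates grows quickly with the number of components (2 for three components, 5 for four), which is why some selection criterion is needed to pick the intended grouping; the abstract does not say which criterion the parser uses.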
Compounds occur very frequently in Indian languages. There are no strict orthographic conventions for compounds in modern Indian languages. In this paper, the Sanskrit compounding system is examined thoroughly, and the insight gained from Sanskrit grammar is applied to the analysis of compounds in Hindi and Marathi. It is interesting to note that compounding in Hindi deviates from that in Sanskrit in two respects. First, the data analysed for Hindi does not contain any instance of a Bahuvrīhi (exocentric) compound. Second, many compounds in the Hindi data require a verb as well as a vibhakti (case marker) for their paraphrasing. Compounds requiring a verb for paraphrasing are termed madhyama-pada-lopī in Sanskrit, and they are found to be rare in Sanskrit.
The conjunct verbs in Hindi pose a problem with respect to agreement. Shapiro has observed that when the nominal element of a conjunct verb functions as a direct object of the conjunct verb, the verb agrees with its nominal element. In this paper we give a syntactico-semantic criterion to decide whether the nominal element of a verb is an argument of a conjunct verb or not, and give rules for agreement decisions in such cases.
Arxiv preprint cs/ …, Jan 1, 2003
The paper reports on efforts taken to create lexical resources pertaining to Indian languages, using the collaborative model. The lexical resources being developed are: (1) Transfer lexicon and grammar from English to several Indian languages.
Abstract. This paper describes a dependency-based tagging scheme for creating treebanks for Indian languages. The scheme has been designed to be comprehensive, easy to use with linear notation, and economical in typing effort. It is based on the Pāṇinian grammatical model.
In the Proceedings of workshop on …, Jan 1, 2002
Abstract. In this paper we discuss the problems in Urdu-Hindi-Urdu Machine Translation at various levels. Because of the large common vocabulary, it may seem that transliteration alone can overcome the language barrier between Urdu and Hindi. However, the tendency of Urdu to use words of Persian and Arabic origin, and the tendency of Hindi to use words of Sanskrit origin, call for the use of a proper Machine Translation system.
Abstract. India is a multilingual, linguistically dense and diverse country with rich resources of information. Parallel corpora play a major role in multilingual natural language processing, computational linguistics, speech and information retrieval. This paper describes a system for aligning English-Hindi texts in the Gyan-Nidhi corpus at the sentence level. The alignment criteria combine linguistic and statistical information with simple heuristics.
Pāṇini in his Aṣṭādhyāyī provides not only a grammar for Sanskrit but also a grammar formalism that can be applied to other languages. There is a tradition of grammars for various Indian languages written in this formalism. The use of computers as information processing devices demands a sound theory for processing the information in a language string. The Pāṇinian way of analysing a language provides such a theory. In order to use Pāṇinian theory for the analysis of other languages, it is necessary to model these languages in terms of Pāṇinian primitives such as pada, sup, tiṅ, kṛt, vibhakti, etc. This paper presents an attempt at modelling English in the Pāṇinian framework. In an earlier effort (Bharati, forthcoming) it was shown that the notion of subject in English corresponds to the notion of an abhihita, with a few systematic exceptions.
On the one hand, with the World Wide Web spreading all over the world, information is now available at the click of a mouse. However, most of the information is in English. In India hardly 5-10% of the population can understand English. Hence, if India is to take real advantage of the new technology, it is necessary to make this information available to Indians in Indian languages. On the other hand, it is well known that Fully Automatic High Quality Machine Translation is impossible in the near future.
Computing Research Repository, 2003
Fully-automatic general-purpose high-quality machine translation systems (FGH-MT) are extremely difficult to build. In fact, there is no system in the world for any pair of languages which qualifies to be called FGH-MT. The reasons are not far to seek. Translation is a creative process which involves interpretation of the given text by the translator. Translation would also vary depending on the audience and the purpose for which it is meant. This would explain the difficulty of building a machine translation system. Since the machine is not at present capable of automatically interpreting a general text with sufficient accuracy, let alone re-expressing it for a given audience, it fails to perform as FGH-MT. (Footnote: The major difficulty that the machine faces in interpreting a given text is the lack of general world knowledge or common sense knowledge.) To understand the nature of the difficulty, let us consider the following sentence in Hindi:
chAvala rAma khAtA hE
rice(m.) Ram(m.) eats(m.)
'Ram eats rice.'
Arxiv preprint cs/ …, Jan 1, 2003
The anusaaraka system makes text in one Indian language accessible in another Indian language. In the anusaaraka approach, the load is divided between man and computer so that the language load is taken by the machine, and the interpretation of the text is left to the man. The machine presents an image of the source text in a language close to the target language. In the image, some constructions of the source language (which do not have equivalents) spill over to the output. Some special notation is also devised. After some training, the user learns to read and understand the output. Because the Indian languages are close, the learning time of the output language is short, and is expected to be around 2 weeks.
Satyam Technical Review, Jan 1, 2003
Most research in Machine Translation is about having computers completely bear the load of translating one human language into another. This paper looks at the machine translation problem afresh and observes that there is a need to share the load between man and machine, distinguish 'reliable' knowledge from 'heuristics', provide a spectrum of outputs to serve different strata of people, and finally make use of existing resources instead of reinventing the wheel. This paper describes the architecture and design of 'Anusaaraka' based on the fundamental premise of sharing the load, resulting in "good enough" results according to the needs of the reader. The architecture differs from the conventional in three major ways:
Arxiv preprint cs/0308019, Jan 1, 2003
The anusaaraka system (a kind of machine translation system) makes text in one Indian language accessible through another Indian language. The machine presents an image of the source text in a language close to the target language. In the image, some constructions of the source language (which do not have equivalents in the target language) spill over to the output. Some special notation is also devised.
Word Sense Disambiguation (WSD) is a major problem in Machine Translation (MT). There have been several attempts to handle WSD. Developing WSD rules manually is laborious and time-consuming. Moreover, if the rules are developed for bilingual WSD, they may be language-pair specific. If the rules are monolingual, it is difficult to decide the granularity of the different senses. While the statistical method may be helpful in handling large volumes of language data, it does not give any linguistic insight into the languages involved in MT. There have been attempts to semi-automate the task of WSD. However, the rules which the machine learns run into the hundreds, and it again becomes difficult to gain any linguistic insight from these methods. Rule-based WSD, on the other hand, helps in linguistic analysis, but mostly works with the syntax of a sentence and is hence dependent on surface structure. Moreover, in this method, rules are written to suit the systems running on available technology and may have to be changed if a better technology comes about. Of course, we cannot do without writing WSD rules for running an MT system, but a deeper approach to language analysis, one that deals with the language semantically and tells us where the information about a language phenomenon is available, would be of greater advantage. The results of such an analysis can always be used along with further developments in technology, as the analysis is independent of technological constraints. Such a method of WSD may be called the information-based approach. We illustrate information-based WSD with the example of sense disambiguation of the English to-infinitive into Hindi. First we look at a few English sentences with to-infinitives and their Hindi
The last decade has seen the introduction of several parsers for English, ranging from rule-based to statistical. In recent years there has also been a growing trend towards producing dependency output in addition to constituency trees. The dependency format is preferred over the constituency format not only from the evaluation point of view but also because of its suitability for a wide range of NLP tasks. However, there is no consensus among dependency parser developers on the number of dependency relations or the names of these relations.
1) Introduction: In the Indian śāstras, particularly vyākaraṇa (grammar), nyāya (logic) and mīmāṃsā (exegesis), language has been the subject of deep reflection. Recognizing that the main use of language is the exchange of information, śāstras such as vyākaraṇa carry out a rigorous study of the various conventions in language, namely padaśakti (the power of the word) and vākyaśakti (the power of the sentence). Questions such as how language works, or how a speaker can convey the thoughts in his mind to a listener through the medium of language, are considered in these texts. The use of these śāstras was limited to philosophical discussion. This tradition of nearly 2000 years is today on the verge of disappearing. But now, with the availability of computers, we have a unique opportunity to put these śāstras to use for Natural Language Processing. Computers are used as information processors. The job of an information processor is to extract information available in one form and make it available in another form. When a computer processes the information available in language, that processing is called Natural Language Processing. Naturally, śāstras such as vyākaraṇa can be used for this task, and that is why we should take advantage of this one-of-a-kind opportunity. Today an enormous amount of information is available on the Internet in English. The prosperity of a society depends on the useful information available to its members. Hence making this information available to ordinary people in their mother tongue is the need of the hour. Here we explain, with the help of the English-Marathi anusaaraka, how śāstras such as vyākaraṇa can be used to make English-language information available to Marathi speakers with the aid of computers. A further benefit of this undertaking is that the study of these śāstras will be revitalized, and enthusiasm for the study of language will be generated among people.
2) Use of some concepts from Indian śāstras such as vyākaraṇa for NLP: A speaker communicates with listeners through the medium of words. The listener, however, needs to go 'beyond the words' when interpreting them. For example: māñjar māsaḷī khāte ('the cat eats fish'). Nowhere do the words of this sentence express that the cat is the kartā (agent) and the fish is the karma (object). The listener decides which is the kartā and which is the karma on the basis of general knowledge. Going 'beyond the words' to arrive at the meaning is not possible for a computer, at least at present. If it is clearly understood how much information language expresses through words, word groups, sentence structure and so on, and when and how general knowledge is used in interpreting a sentence, it becomes easier to know which tasks can be handed over to a computer today and which cannot. The following are some examples of how concepts from Indian grammar proved useful to us in understanding how much information language expresses through words, sentence structure, etc.
(a) Padaśakti and vākyaśakti: To express information, every language makes use of certain conventions
The asymmetry in the translation of natural language sentences involving existential and universal quantifiers is well known. It is possible to get rid of this asymmetry by postulating 'quantified typed' variables. In this presentation, we define a 'quantified typed' variable and the algebra associated with these variables to prove the deductions using the method of reductio ad absurdum.
A Spell Checker is an application that handles spelling errors and Spelling Variations (SV). All misspelt words are marked and offered for correction. The system can also be used as an editor in which the text is checked for spelling errors and suggestions for correction are provided. Telugu is an agglutinating language with a very complex morphology, coupled with prolific sandhi or morphophonemics. Sandhi in Telugu is not limited to internal sandhi; external sandhi occurs as well. Both consonantal and vocalic sandhi are common and well studied in Telugu [Krishnamurti, 1957, 1985]. Identifying the specific sandhi type and splitting it appropriately is a very challenging task. External sandhi is a linguistic phenomenon that refers to a set of changes occurring at word boundaries. These changes are similar to phonological processes such as substitution (modification by various means), deletion, and insertion. External sandhi is often reflected orthographically in Telugu. In such cases it produces forms that are morphologically unanalyzable, thus posing a problem for all kinds of NLP applications. In this paper, we discuss in detail the processes of external sandhi in Telugu and the computational tool, the Spell Checker.
A Morphological Analyzer (MA) is a program that compiles and analyses words of a natural language into their roots and their constituent morpho-syntactic elements, along with their attributes. The present paper demonstrates a computational implementation of a Morphological Analyzer for Telugu. The algorithm used to build this MA is theoretically justified and has been implemented for the Modern Standard Written variety of Telugu. The proposal demonstrates an optimal organization of the linguistic database and its performance in a computational environment, ensuring high precision and coverage in the parsing of wordforms. The current MA engine's coverage is in the range of 95-97% on a variety of corpora (a corpus of 3 million words).
Abstract: The contribution of Indian mathematics since the Vedic period has been recognised by historians. Pingala (200 BC), in his book 'Chandashaastra', a text on the description and analysis of meters in poetic work, describes algorithms that deal with combinatorial mathematics. These algorithms essentially deal with the conversion of binary numbers to decimal numbers and vice versa, finding the value of 'n choose r', evaluating 2^n, etc. All these algorithms are recursive in nature.
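The recursive character of the computations named in this abstract can be illustrated with a short Python sketch. This is a modern rendering under our own assumptions, not Pingala's formulation: the function names, the most-significant-bit-first string encoding, and the use of 0/1 digits are all illustrative choices that do not come from the source.

```python
def power_of_two(n):
    """Compute 2**n recursively: halve even exponents, peel one off odd ones."""
    if n == 0:
        return 1
    if n % 2 == 0:
        half = power_of_two(n // 2)
        return half * half
    return 2 * power_of_two(n - 1)


def choose(n, r):
    """Compute 'n choose r' with the additive recurrence of Pascal's triangle."""
    if r < 0 or r > n:
        return 0
    if r == 0 or r == n:
        return 1
    return choose(n - 1, r - 1) + choose(n - 1, r)


def binary_to_decimal(bits):
    """Convert a string of 0/1 digits (most significant first) to an integer."""
    if not bits:
        return 0
    return 2 * binary_to_decimal(bits[:-1]) + int(bits[-1])


def decimal_to_binary(n):
    """Convert a non-negative integer to a string of 0/1 digits."""
    if n < 2:
        return str(n)
    return decimal_to_binary(n // 2) + str(n % 2)
```

Each function reduces its problem to a smaller instance of the same problem, which is the sense in which the abstract calls the algorithms recursive.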
There is chaos as far as Indian languages in electronic form are concerned. One can neither exchange notes in Indian languages as conveniently as in English, nor search texts in Indian languages available over the web. This is because the texts are stored in font-dependent glyph codes.
Indian languages belong to four different families. However, as far as scripts are concerned, all of them (except for the Perso-Arabic script) are derived from the Brahmi script. Indian scripts are compositionally syllabic. They have a scientific phonetic base, and all the syllables are derived compositionally from the phonemes. They are flexible and can be used as alphabetic or as syllabic scripts, as the need demands. Whereas the syllabic version is suitable for writing concisely, the alphabetic version is suitable for performing linguistic operations such as sandhi, morphological analysis, sorting, and searching.
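To make the syllabic/alphabetic distinction concrete, here is a minimal Python sketch that expands Devanagari syllables into an explicit consonant-plus-vowel sequence. It assumes Unicode-encoded Devanagari input, which the abstract does not mention. The function name `to_alphabetic` and the small vowel-sign table are hypothetical, and the sketch ignores anusvara, visarga, nukta, and other signs.

```python
VIRAMA = "\u094D"  # sign that suppresses the inherent vowel
INHERENT_A = "\u0905"  # independent vowel 'a'

# Dependent vowel signs mapped to the corresponding independent vowels.
MATRA_TO_VOWEL = {
    "\u093E": "\u0906",  # aa
    "\u093F": "\u0907",  # i
    "\u0940": "\u0908",  # ii
    "\u0941": "\u0909",  # u
    "\u0942": "\u090A",  # uu
    "\u0943": "\u090B",  # vocalic r
    "\u0947": "\u090F",  # e
    "\u0948": "\u0910",  # ai
    "\u094B": "\u0913",  # o
    "\u094C": "\u0914",  # au
}


def is_consonant(ch):
    """True for the Devanagari consonants ka (U+0915) through ha (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def to_alphabetic(text):
    """Expand syllabic writing into a list of bare consonants and vowels."""
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_consonant(ch):
            units.append(ch + VIRAMA)  # consonant without its inherent vowel
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == VIRAMA:
                i += 2  # explicit virama: no vowel follows
            elif nxt in MATRA_TO_VOWEL:
                units.append(MATRA_TO_VOWEL[nxt])
                i += 2  # vowel sign made explicit
            else:
                units.append(INHERENT_A)  # inherent 'a' made explicit
                i += 1
        else:
            units.append(ch)  # independent vowels and other characters pass through
            i += 1
    return units
```

In the expanded form every phoneme is a separate unit, so an operation such as sandhi at a word boundary can inspect the final vowel or consonant directly instead of unpacking a syllable first.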
Mahābhāṣya is an important commentary on Pāṇini's grammar for Sanskrit and is highly structured. Traditional scholars have tagged it manually, showing its underlying discourse structure. The traditional grammar also discusses clues for discourse-level annotations. Taking these clues into account, we have developed an automatic tagger for tagging the Mahābhāṣya. This tagger is described in this paper, along with its performance evaluation. We have also extended this tag-set to another important text, the Śābarabhāṣya.
Series Editors: Randy Goebel, University of Alberta, Edmonton, Canada; Jörg Siekmann, University of Saarland, Saarbrücken, Germany; Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany. Volume Editors: Amba Kulkarni, University of Hyderabad, ...
Two annotation schemes for presenting parsed structures are prevalent, viz. the constituency structure and the dependency structure. While constituency trees mark the relations due to positions, dependency relations mark the semantic dependencies. Free word order languages like Sanskrit pose more problems for constituency parses, since the elements within a phrase are dislocated. In this work, we show how a constituency tree enriched with displacement information can help construct the unlabelled dependency tree automatically.
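As a rough illustration of how an unlabelled dependency tree can be read off a constituency tree, the following Python sketch uses the generic head-percolation technique. It is not the method of the paper: it assumes each phrase already records which child is its head, and it does not model the displacement information the abstract relies on. The tuple encoding and the name `to_dependencies` are hypothetical.

```python
def to_dependencies(tree):
    """Return (head_word, arcs) for a head-marked constituency tree.

    A tree is either a word (str) or a tuple (label, head_index, children),
    where head_index selects the child that heads the phrase.
    Each arc is an unlabelled (head_word, dependent_word) pair.
    """
    if isinstance(tree, str):
        return tree, []
    _label, head_index, children = tree
    results = [to_dependencies(child) for child in children]
    head_word = results[head_index][0]
    arcs = []
    for i, (child_head, child_arcs) in enumerate(results):
        arcs.extend(child_arcs)  # keep arcs found inside the child
        if i != head_index:
            arcs.append((head_word, child_head))  # non-head child depends on head
    return head_word, arcs


# Toy example: [S [NP white horse] runs], with 'horse' heading NP and 'runs' heading S.
example = ("S", 1, [("NP", 1, ["white", "horse"]), "runs"])
root, arcs = to_dependencies(example)
```

For the toy sentence, `root` is `"runs"` and `arcs` is `[("horse", "white"), ("runs", "horse")]`. Handling dislocated elements would need extra information linking a displaced word to its phrase, which this sketch leaves out.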
For an inflectionally rich language like Sanskrit, any NLP application demands a good morphological analyzer. Though Sanskrit is the best-analyzed language in the world, a wide-coverage morphological analyzer for it is still not available. This paper points out the complexity involved in building a wide-coverage analyzer for Sanskrit and then describes a morphological analyzer that has been built using the available e-resources, based on ad hoc principles. The coverage of this analyzer is around 95%. Though this is not an acceptable figure for practical applications, the analyzer can be used as a stepping stone to develop other modules such as a sandhi splitter, a search engine, etc. At a later stage, it may be replaced by a module based on the classic Aṣṭādhyāyī.
In this paper we present a semi-automatic computational tool to represent Navya Nyāya (NN) expressions through the Conceptual Graphs of Sowa. This tool consists of a domain-specific segmenter, a semi-automatic constituency parser, and a context-free parser that translates an NN expression into a Conceptual Graph.
Navya-Nyaaya (NN), a school of Indian logic and philosophy, has evolved a sophisticated language to deal with verbal cognition, logic and epistemology. This language is known for its use of long compounds, productive use of secondary derivational suffixes, and a special technical vocabulary. In this paper we present a specially designed domain-specific splitter to split NN compounds into their components.
The importance of gold standard data in the field of NLP is well established. In this paper we describe the development of one such gold standard for Sanskrit, annotated at various levels of linguistic analysis. We describe how such domain-specific gold standard data, in addition to being useful for training and evaluation, is also useful for teaching. With the help of a suitable anusāraka interface, we demonstrate its usability for a linguist and also for a learner.
Proceedings of the 3rd workshop on Asian language resources and international standardization - COLING '02, 2002
This paper describes a dependency-based tagging scheme for creating treebanks for Indian languages. The scheme has been designed to be comprehensive, easy to use with linear notation, and economical in typing effort. It is based on the Paninian grammatical model.