Challenges in Developing a Rule based Urdu Stemmer

Rule Based Urdu Stemmer

This paper presents a rule-based Urdu stemmer. In this technique, rules are applied to remove suffixes and prefixes from inflected words. Urdu is widely spoken around the world, but little work has been done on Urdu stemming. A stemmer finds the root of an inflected word. Various inflectional endings, such as وں (vao + noon-gunna), ے (badi-ye), and یاں (choti-ye + alif + noon-gunna), have been identified, and appropriate rules have been developed for them.
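The suffix-removal idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `strip_suffix`, the longest-match-first ordering, and the minimum stem length are assumptions made for illustration. Only the three endings come from the abstract. Prefix rules are omitted because the abstract does not list any.

```python
# Suffixes named in the abstract, written as Unicode escapes so the exact
# code points are unambiguous. Ordering (longest first) is an assumption.
SUFFIXES = [
    "\u06cc\u0627\u06ba",  # choti-ye + alif + noon-gunna
    "\u0648\u06ba",        # vao + noon-gunna
    "\u06d2",              # badi-ye
]

# Assumed guard: never leave a stem shorter than this many characters.
MIN_STEM_LENGTH = 2


def strip_suffix(word):
    """Remove the first matching suffix, or return the word unchanged."""
    for suffix in SUFFIXES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return word[:remaining]
    return word


# Example: "kitabon" (books, oblique plural) loses the vao + noon-gunna ending.
kitabon = "\u06a9\u062a\u0627\u0628\u0648\u06ba"
print(strip_suffix(kitabon))
```

A real rule set would also need exception handling for words that merely happen to end in one of these letter sequences. The length guard above is only a crude stand-in for that.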

Challenges in Urdu Stemming (A Progress Report)

2007

This paper explains the challenges pertaining to Urdu stemming and presents a rule-based prototype, with a few rules implemented for Urdu, to illustrate the intricacies. It shows that Urdu stemming is quite challenging because of Urdu's diverse nature and because Arabic and Farsi stemmers cannot be used for Urdu. Dictionary-based error-correcting schemes used by other stemmers cannot be applied to Urdu because of the lack of machine-readable resources. No work on Urdu stemming or morphological analysis has been published in the IR community, even though interest in Urdu is growing. The goal of this paper is to show the challenges in writing an Urdu stemmer, not to present a stemmer.

Design and Development of a Stemmer for Punjabi

International Journal of Computer Applications, 2010

Stemming is the process of removing affixes from inflected words without performing a complete morphological analysis. A stemming algorithm is a procedure that reduces all words with the same stem to a common form [20]. It is useful in many areas of computational linguistics and information retrieval. Various search engines use this technique to improve the matching of queries to documents. The algorithm is the basic building block of a stemmer, and a stemmer is mainly used in information retrieval systems to improve performance. This paper presents a stemmer for Punjabi that uses a brute-force algorithm together with a suffix-stripping technique. Similar techniques can be used to build stemmers for other languages such as Hindi, Bengali, and Marathi. The stemmer gives good results and can be effective in information retrieval systems. It also reduces the problems of over-stemming and under-stemming.
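One common reading of combining a brute-force algorithm with suffix stripping is to consult a table of known inflected forms first and fall back to suffix rules only when the word is not in the table. The sketch below assumes that reading. The abstract does not describe how the two techniques are combined, so the function `stem`, its parameters, and the romanized toy entries are all hypothetical placeholders and not data from the paper.

```python
def stem(word, lookup, suffixes, min_stem_length=2):
    """Return a stem using table lookup first, then suffix stripping.

    lookup   -- dict mapping known inflected forms to their roots (brute force)
    suffixes -- iterable of suffix strings tried longest first (assumed order)
    """
    # Brute-force step: an exact match in the table wins outright.
    if word in lookup:
        return lookup[word]

    # Fallback step: strip the longest suffix that leaves a long enough stem.
    for suffix in sorted(suffixes, key=len, reverse=True):
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_length:
            return word[:remaining]

    # No table entry and no rule applies: leave the word as it is.
    return word


# Romanized toy data for illustration only (hypothetical, not from the paper).
TOY_LOOKUP = {"munde": "munda"}
TOY_SUFFIXES = ["aan", "an"]

print(stem("munde", TOY_LOOKUP, TOY_SUFFIXES))
print(stem("kitaabaan", TOY_LOOKUP, TOY_SUFFIXES))
```

Checking the table first lets irregular forms bypass the rules, which is one way a design like this could limit over-stemming. The minimum stem length plays the same role for the rule-based fallback.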
