A word segmentation system for handling space omission problem in Urdu script

2010, 23rd International Conference on Computational Linguistics

Abstract

Word segmentation is the first obligatory task in almost all NLP applications, where the initial phase requires tokenizing the input into words. Like other Asian languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. The Urdu word segmentation problem is not as severe as in those languages, since space is used for word delimitation. However, space is not used consistently, which gives rise to both space omission and space insertion errors in Urdu. In this ...

Key takeaways

The proposed word segmentation system achieves 99.15% accuracy in segmenting Urdu text.
It effectively addresses space omission issues common in Urdu script.
The system utilizes bilingual corpora to enhance segmentation accuracy compared to monolingual methods.
Statistical analysis from Hindi contributes to segmenting Urdu words more accurately.
The approach leverages both unigram and bigram frequency analysis for optimal word combination.

FAQs

What findings support the necessity of word segmentation in Urdu script?

The study highlights that inconsistent use of spaces in Urdu leaves words run together in unsegmented clusters, which makes automated processing such as translation difficult. Segmentation systems are therefore needed.

How does the proposed segmentation system utilize Hindi corpora?

The system leverages Hindi bilingual corpora for improved accuracy in recognizing Urdu words and segments them according to consistent Hindi rules.

What statistical techniques enhance the Urdu word segmentation model?

By employing a combination of unigram and bigram frequency analysis, the model effectively determines the correct segmentation from multiple possibilities.
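
The idea of choosing among several possible segmentations by unigram and bigram frequency can be illustrated with a minimal sketch. It is not the authors' implementation: the romanized words, the counts, the add-one smoothing and the function names (`candidates`, `score`, `segment`) are all assumptions made for illustration only.

```python
import math

# Hypothetical toy counts (romanized for readability); not taken from the paper.
UNIGRAMS = {"us": 50, "ne": 80, "kaha": 30, "usne": 5, "ka": 60, "ha": 2}
BIGRAMS = {("us", "ne"): 40, ("ne", "kaha"): 20, ("usne", "kaha"): 1}
TOTAL = sum(UNIGRAMS.values())
VOCAB = len(UNIGRAMS)


def candidates(text):
    """Yield every way to split text into words found in UNIGRAMS."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in UNIGRAMS:
            for rest in candidates(text[end:]):
                yield [head] + rest


def score(words):
    """Log-probability: unigram for the first word, smoothed bigrams after it."""
    logp = math.log(UNIGRAMS[words[0]] / TOTAL)
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothing (an assumption) so unseen bigrams keep a small probability.
        num = BIGRAMS.get((prev, cur), 0) + 1
        den = UNIGRAMS[prev] + VOCAB
        logp += math.log(num / den)
    return logp


def segment(text):
    """Return the highest-scoring split, or the text unchanged if none exists."""
    if not text:
        return []
    options = list(candidates(text))
    if not options:
        return [text]  # unknown material is left unsegmented
    return max(options, key=score)


print(segment("usnekaha"))
```

With these toy counts the merged string `usnekaha` has four dictionary-consistent splits, and the frequent bigrams `us ne` and `ne kaha` make the three-word split score highest. A string with no dictionary-consistent split is returned unchanged, which mirrors the unknown-word difficulty discussed below.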

What challenges arise from segmentation of unknown words in Urdu?

Unknown words without corresponding entries in the lexicon may be incorrectly segmented, particularly foreign or compound words that don't follow typical morphological rules.

What is the achieved accuracy of the proposed Urdu word segmentation system?

The proposed system achieves a segmentation accuracy of 99.15% against a test corpus of over 1.6 million Urdu words.