Mutee U Rahman | Isra University

Papers by Mutee U Rahman

Developing a Sindhi Computational Resource Grammar in Lexical Functional Grammar Framework

Isra University, Hyderabad, 2017

Towards Transliteration between Sindhi Scripts Using Roman Script

Social Science Research Network, Oct 30, 2015

Partial Word Order Syntax of Urdu/Sindhi and Linear Specification Language

JISR management and social sciences & economics, 2007

Like most South Asian languages, Urdu and Sindhi are partial word order languages. Conventional syntax representation models such as Context-Free Grammars (CFGs) cannot adequately handle partial word order syntax. Linear Specification Language (LSL) is an extension of CFGs that allows an arbitrary partial order (free word order) on the right-hand side of a grammar rule. Partial word order in LSL is handled by different types of linear precedence (LP) constraints, which make LSL capable of representing the syntax of partial word order sentences. Issues in representing Urdu/Sindhi sentences and their constituent parts in LSL are discussed, and LSL versions of different types of Urdu and Sindhi sentences are presented.
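As a rough illustration of how linear precedence constraints restrict an otherwise free ordering, the sketch below enumerates the orderings of a rule's right-hand side that satisfy a set of "A precedes B" pairs. It is a generic illustration, not LSL syntax; the function name `allowed_orders`, the category names, and the verb-final example rule are all assumptions made for this example.

```python
from itertools import permutations

def allowed_orders(constituents, lp_constraints):
    """Return every ordering of `constituents` that satisfies all (before, after) pairs."""
    orders = []
    for perm in permutations(constituents):
        position = {name: i for i, name in enumerate(perm)}
        # Keep the ordering only if every "a precedes b" constraint holds.
        if all(position[a] < position[b] for a, b in lp_constraints):
            orders.append(perm)
    return orders

# Hypothetical rule: the right-hand side {SUBJ, OBJ, VERB} is unordered,
# except that both arguments must precede the verb.
orders = allowed_orders(
    ["SUBJ", "OBJ", "VERB"],
    [("SUBJ", "VERB"), ("OBJ", "VERB")],
)
for order in orders:
    print(" ".join(order))
```

With these two constraints, only the two verb-final orderings survive out of the six possible permutations, which is the sense in which the order is "partial".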

Universal Dependencies for Urdu Noisy Text

International Journal of Advanced Trends in Computer Science and Engineering, Jun 7, 2021

Adverb agreement in Urdu, Sindhi and Punjabi

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar, 2016

We discuss agreeing adverbs in Urdu, Sindhi and Punjabi. We adduce crosslinguistic evidence that is based mainly on similar patterns in Romance and posit that there is a close connection between resultatives and so-called pseudo-resultatives, which the agreeing adverbs appear to instantiate. We propose a diachronic relationship by which the originally predicative part of a resultative is reinterpreted as an adjunct that modifies the overall event predication, not just the result.

A Multilayered Urdu Treebank

The paper presents the design and construction of a multilayered phrase structure treebank. The treebank consists of three layers: phrases, grammatical functions, and semantic roles. A small phrase tagset of 12 tags provides the primary label of each phrase. The phrase label is followed by a grammatical function (mainly inspired by Lexical Functional Grammar) and then by a semantic role label using PropBank roles. 1,300 sentences from the CLE Urdu Digest Corpus are annotated using the treebank guideline.

Sindhi Stemmer using Affix Removal Method

International Journal of Advanced Trends in Computer Science and Engineering, 2021

Developing a POS Tagged Corpus of Urdu Tweets

Computers, 2020

Processing social media text such as tweets is challenging for traditional Natural Language Processing (NLP) tools developed for well-edited text, due to the noisy nature of such text. However, demand for tools and resources to correctly process such noisy text has increased in recent years due to its usefulness in various applications. The literature reports various efforts to develop tools and resources for processing such noisy text in various languages, notably for part-of-speech (POS) tagging, an NLP task that directly affects the performance of subsequent text processing activities. Still, no such attempt has been made to develop a POS tagger for Urdu social media content. Thus, the focus of this paper is on POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with the POS-tagged Urdu tweets corpus. We also investigated bootstrapping as a potential solution for overcoming the shortage of manually annotated data and present...

Towards Transliteration between Sindhi Scripts Using Roman Script

Linguistics and Literature Review, 2015

Towards Sindhi Corpus Construction

Linguistics and Literature Review, 2015

Analysis of Sindhi Spelling Error Patterns for Spelling Error Detection and Correction

Statistical analysis of spelling error trends in a language plays an important role in automatic spelling error detection and correction. Comprehensive statistical analysis of spelling error trends for Sindhi is still a subject of research. This study identifies and analyses the spelling error trends in Sindhi. The statistical analysis of error trends is based on a real-time corpus collected from different sources. A corpus-based dictionary is developed and used to identify errors in a test corpus. Both traditional (insertion, deletion, substitution and transposition) and language-specific error trends are identified and analyzed. Errors are categorized into different types along with their statistical results. Reasons for various corpus-specific and language-specific error trends are also discussed. Keywords: spelling error; spelling error patterns; Sindhi spelling errors; error pattern analysis
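The four traditional error types named above can be told apart mechanically when a misspelling is a single edit away from the intended word. The following is a minimal sketch of such a classifier, not the procedure used in the paper; the function name `classify_error` is an assumption, the examples use English words purely for readability, and the language-specific categories mentioned in the abstract are not covered.

```python
def classify_error(typed, correct):
    """Classify a misspelling that is one edit away from `correct`.

    Returns 'insertion', 'deletion', 'substitution', 'transposition',
    or None when the words are identical or more than one edit apart.
    """
    if typed == correct:
        return None
    lt, lc = len(typed), len(correct)
    # Find the first position at which the two strings differ.
    i = 0
    while i < min(lt, lc) and typed[i] == correct[i]:
        i += 1
    if lt == lc + 1 and typed[i + 1:] == correct[i:]:
        return "insertion"       # an extra character was typed
    if lt + 1 == lc and typed[i:] == correct[i + 1:]:
        return "deletion"        # a character was left out
    if lt == lc:
        if typed[i + 1:] == correct[i + 1:]:
            return "substitution"    # one character was replaced
        if (i + 1 < lt
                and typed[i] == correct[i + 1]
                and typed[i + 1] == correct[i]
                and typed[i + 2:] == correct[i + 2:]):
            return "transposition"   # two adjacent characters were swapped
    return None

for wrong in ["speell", "spel", "spall", "sepll"]:
    print(wrong, "->", classify_error(wrong, "spell"))
```

Because the comparison works on Python string positions, it applies unchanged to words in Perso-Arabic script, although combining marks would need separate handling.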

Finite State Morphology and Sindhi Noun Inflections

Learning from Peer Mistakes: Collaborative UML-Based ITS with Peer Feedback Evaluation

Computers, 2022

Collaborative Intelligent Tutoring Systems (ITSs) use peer tutor assessment to give feedback to students in solving problems. Through this feedback, the students reflect on their thinking and try to improve it when they get similar questions. The accuracy of the feedback given by the peers is important because this helps students to improve their learning skills. If the student acting as a peer tutor is unclear about the topic, then they will probably provide incorrect feedback. Very few attempts in the literature address the accuracy and relevance of peer feedback, and those provide only limited support. This paper presents a collaborative ITS to teach Unified Modeling Language (UML), which is designed in such a way that it can detect erroneous feedback before it is delivered to the student. The evaluations conducted in this study indicate that receiving and sending incorrect feedback has a negative impact on students’ learning skills. Furthermore, the results also show that...

Performance Comparison of Bootstrapped Statistical Taggers on Urdu Tweets

International Journal of Scientific and Research Publications (IJSRP), 2021

Twitter, a social media platform, has experienced substantial growth over the last few years. Thus, a huge number of tweets from various communities are available and used for various NLP applications such as opinion mining, information extraction and sentiment analysis. One of the key pre-processing steps in such NLP applications is Part-of-Speech (POS) tagging. POS tagging of Twitter data (also called noisy text) differs from conventional POS tagging due to its informal nature and the presence of Twitter-specific elements. Resources for POS tagging of tweet-specific data are mostly available for English. However, the availability of tagset-independent and language-independent statistical taggers does provide an opportunity for resource-poor languages such as Urdu to expand coverage of NLP tools to this new domain of POS tagging, for which little effort has been reported. The aim of this study is twofold. The first is to investigate how well the statistical taggers developed for POS tagging of structured text far...

Towards Silver Standard Dependency Treebank of Urdu Tweets

International Journal of Advanced Trends in Computer Science and Engineering

A manually annotated corpus is a prerequisite for several natural language processing applications, including parsing. Nevertheless, an annotated corpus is not always available for resource-poor languages, especially when the domain under consideration is noisy user-generated data found on social media platforms such as Twitter. To overcome this lack of hand-annotated corpora, researchers have focused their attention on semi-automatic corpus annotation methods. This paper describes experiments carried out using semi-automatic methods, namely self-training and co-training, in an attempt to create a silver-standard dependency treebank of Urdu tweets. Six iterations of each approach were performed under the same experimental conditions using MaltParser and Parsito, both statistical data-driven parsers. For the self-training experiments, the best performing MaltParser model was trained on 1250 Urdu tweets, with an accuracy of 70.2% LA, 74.4% UAS, 63% LAS, whereas the best performing Parsito model was also trained on 1250 Urdu tweets, with an accuracy of 70.8% LA, 74.8% UAS, 63.4% LAS. For the co-training experiments, the best performing MaltParser model was trained on 1500 Urdu tweets, with an accuracy of 70.5% LA, 74.4% UAS, 63.2% LAS. The best performing Parsito model was also trained on 1500 Urdu tweets, with an accuracy of 70.5% LA, 74.3% UAS, 63% LAS. Although there was not much difference between the results of the two approaches, co-training results were slightly better for both parsers, and co-training is used for generating a silver-standard dependency treebank of 4500 Urdu tweets.

Bootstrapping Dependency Treebank of Urdu Noisy Text

International Journal of Emerging Trends in Engineering Research, 2021

This paper describes how bootstrapping was used to extend the Urdu Noisy Text dependency treebank. To overcome the bottleneck of manually annotating a corpus for a new domain of user-generated text, MaltParser, an open-source, data-driven dependency parser, is trained on the 500-tweet Urdu Noisy Text Dependency Treebank and then used to bootstrap the treebank semi-automatically. Four bootstrapping iterations were performed. In each iteration, 300 Urdu tweets were automatically tagged and the performance of the parser model was evaluated against the development set. 75 of the 300 automatically tagged tweets were randomly selected for manual correction and then added to the training set for parser retraining. At the end of the last iteration, parser performance was evaluated against the test set. The final supervised bootstrapping model obtains an LA of 72.1%, a UAS of 75.7% and an LAS of 64.9%, a significant improvement over the baseline scores of 69.8% LA, 74% UAS and 62.9% LAS.
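The iteration described in this abstract can be outlined as a short loop. This is only a sketch of the data flow under stated assumptions: `train`, `parse` and `correct` are hypothetical callables standing in for parser training, automatic parsing and manual correction, the per-iteration evaluation on the development set is omitted, and corrected tweets are assumed to accumulate in the training set across iterations.

```python
import random

def bootstrap(seed_treebank, unlabeled_batches, train, parse, correct,
              sample_size=75, rng_seed=0):
    """Grow a treebank by parsing each batch and hand-correcting a random sample.

    Hypothetical callables supplied by the caller:
      train(treebank) -> model
      parse(model, tweet) -> tree
      correct(tree) -> tree   (stands in for manual correction)
    """
    rng = random.Random(rng_seed)
    treebank = list(seed_treebank)
    model = train(treebank)
    for batch in unlabeled_batches:
        parsed = [parse(model, tweet) for tweet in batch]            # automatic annotation
        chosen = rng.sample(parsed, min(sample_size, len(parsed)))   # random subset
        treebank.extend(correct(tree) for tree in chosen)            # add corrected trees
        model = train(treebank)                                      # retrain the parser
    return model, treebank

# Stub components that only show the data flow; no real parser is involved.
gold = [f"gold-{i}" for i in range(500)]
batches = [[f"tweet-{it}-{i}" for i in range(300)] for it in range(4)]
model, treebank = bootstrap(
    gold, batches,
    train=lambda tb: len(tb),
    parse=lambda m, tweet: f"auto({tweet})",
    correct=lambda tree: tree.replace("auto", "fixed"),
)
print(len(treebank))
```

Under these assumptions, four iterations of 75 corrected tweets on top of a 500-tweet seed give a training set of 800 tweets.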

Universal Dependencies for Urdu Noisy Text

International Journal of Advanced Trends in Computer Science and Engineering, 2021

This paper describes the process of creating a dependency treebank for tweets in Urdu, a morphologically rich and less-resourced language. The 500-tweet treebank is created by manually annotating tweets with lemmas, POS tags, and morphological and syntactic relations using the Universal Dependencies annotation scheme, adapted to the peculiarities of Urdu social media text. The annotation process is evaluated through inter-annotator agreement for dependency relations; a total agreement of 94.5% and a weighted kappa of 0.876 were observed. The treebank is evaluated through 10-fold cross-validation using MaltParser with various feature settings. Results show an average UAS of 74%, LAS of 62.9% and LA of 69.8%.
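For readers unfamiliar with the three scores, the sketch below computes them for a toy sentence. It assumes the usual definitions: UAS counts tokens with the correct head, LAS counts tokens with both the correct head and the correct relation label, and LA is taken here to mean label accuracy (correct label regardless of head). The function name `attachment_scores` and the example data are illustrative, not taken from the paper.

```python
def attachment_scores(gold, predicted):
    """Compute UAS, LAS and LA over parallel lists of (head, label) pairs, one per token."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    uas = sum(g[0] == p[0] for g, p in pairs) / n   # correct head
    las = sum(g == p for g, p in pairs) / n         # correct head and label
    la = sum(g[1] == p[1] for g, p in pairs) / n    # correct label
    return {"UAS": uas, "LAS": las, "LA": la}

# Toy four-token sentence; heads are 1-based token indices, 0 marks the root.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
scores = attachment_scores(gold, pred)
print(scores)
```

In this toy case the third token has the right head but the wrong label, and the fourth has the right label but the wrong head, so UAS and LA are both higher than LAS, the same pattern as in the reported figures.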
