Sara Renjit - Academia.edu
Papers by sara renjit
Forum for Information Retrieval Evaluation, 2020
With the popularity of social media, communication through blogs, Facebook, Twitter, and other platforms has increased. Initially, English was the only medium of communication. Now people can communicate in any language, which has led to them mixing English with their native language. Some comments are written in transliterated English form, while others use the script of the intended language. Identifying sentiment and offensive content in such code-mixed tweets is therefore a necessary task. We present a working model submitted for Task2 of the sub-track HASOC Offensive Language Identification-DravidianCodeMix at the Forum for Information Retrieval Evaluation, 2020. It is a message-level classification task. In our approach, an embedding-model-based classifier labels comments as offensive or not offensive. We applied this method to the Manglish dataset provided with the sub-track.
Forum for Information Retrieval Evaluation, 2019
Text retrieval plays a role in almost all domains of knowledge understanding. It has applications in the legal field, where there are extensive collections of structured and unstructured texts. Artificial intelligence is now applied in this area to understand and retrieve legal documents. This paper describes a working model developed for the track Artificial Intelligence for Legal Assistance at the Forum for Information Retrieval Evaluation, 2019 (AILA-FIRE2019). We use an embedding model to represent the legal texts in a semantic vector space, and we measure the similarity between the resulting document embeddings with cosine similarity. The corpus used to build the embedding models is the dataset provided in AILA-FIRE2019.
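As an illustration of the retrieval step described in this abstract, the following is a minimal sketch of ranking documents by cosine similarity between embedding vectors. The function names, document identifiers, and toy vectors are hypothetical; the abstract does not say how the embeddings are produced or stored, so they are taken here as plain lists of floats.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar.

    doc_vecs is a hypothetical mapping from document id to embedding.
    """
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings", purely illustrative.
    docs = {"case_a": [0.0, 1.0], "case_b": [2.0, 0.0], "case_c": [1.0, 1.0]}
    for doc_id, score in rank_documents([1.0, 0.0], docs):
        print(doc_id, round(score, 3))
```

Cosine similarity ignores vector length, so a long and a short document pointing in the same direction of the embedding space score as equally similar to the query.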
International Journal of Advanced Computer Science and Applications
Textual entailment is a relationship between two text fragments, namely the text (or premise) and the hypothesis. It has applications in question answering systems, multi-document summarization, information retrieval systems, and social network analysis. In the digital era, recognizing semantic variability is important for understanding inferences in texts. The texts may take the form of sentences, posts, tweets, or user experiences; understanding inferences from customer experiences, for example, helps companies with customer segmentation. The availability of digital information is ever-growing, with textual data in almost all languages, including low-resource languages. This work deals with various machine learning approaches applied to textual entailment recognition, or natural language inference, for Malayalam, a South Indian low-resource language. A performance-based analysis using the machine learning classification techniques Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, AdaBoost, and Naive Bayes is carried out on the MaNLI (Malayalam Natural Language Inference) dataset. Different lexical and surface-level features are used for both binary and multiclass classification. As the size of the dataset increases, the performance of feature-based classification drops, and a comparison of feature-based models with deep learning approaches highlights this finding. The main focus here is the feature-based analysis with 14 different features and its comparison, which is essential to any NLP classification problem.
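To make the idea of lexical and surface-level features for a premise and hypothesis pair concrete, here is a small sketch. The three features below (Jaccard word overlap, hypothesis coverage, and length difference) are common examples chosen for illustration only; the abstract does not list the 14 features used in the paper, and whitespace tokenization is a simplifying assumption.

```python
def lexical_features(premise, hypothesis):
    """Compute a few illustrative surface-level features for a text pair.

    These are hypothetical examples, not the feature set from the paper.
    """
    premise_tokens = premise.lower().split()
    hypothesis_tokens = hypothesis.lower().split()
    p = set(premise_tokens)
    h = set(hypothesis_tokens)
    overlap = p & h
    union = p | h
    return {
        # Shared vocabulary relative to the combined vocabulary.
        "jaccard": len(overlap) / len(union) if union else 0.0,
        # Fraction of hypothesis words that also occur in the premise.
        "hypothesis_coverage": len(overlap) / len(h) if h else 0.0,
        # Absolute difference in token counts.
        "length_difference": abs(len(premise_tokens) - len(hypothesis_tokens)),
    }


if __name__ == "__main__":
    print(lexical_features("the court dismissed the appeal", "the appeal was dismissed"))
```

A feature dictionary like this can be turned into a fixed-order vector and passed to any of the classifiers named in the abstract.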
Identifying offensive information in tweets is a vital language processing task. Work on this task has so far concentrated mostly on English and other foreign languages. The shared task on Offensive Language Identification in Dravidian Languages, held at the First Workshop on Speech and Language Technologies for Dravidian Languages at EACL 2021, aims to identify offensive content in the code-mixed Dravidian languages Kannada, Malayalam, and Tamil. Our team used language-agnostic BERT (Bidirectional Encoder Representation from Transformers) for sentence embedding, followed by a Softmax classifier. This language-agnostic representation-based classification obtained good performance for all three languages, and the results for Malayalam placed third among the participating teams.
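The classification stage mentioned in this abstract, a Softmax classifier on top of a sentence embedding, can be sketched as follows. This is only an illustration under stated assumptions: the embedding is taken as a ready-made list of floats (the language-agnostic BERT encoder is not reproduced here), and the weights, biases, and label names are hypothetical placeholders rather than trained values.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights, biases, labels):
    """Apply one linear layer followed by softmax to a sentence embedding.

    weights has one row per label; all parameter values are hypothetical.
    """
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    # Toy 2-dimensional embedding and hand-picked parameters.
    label, probs = classify(
        [1.0, 0.0],
        weights=[[2.0, 0.0], [0.0, 2.0]],
        biases=[0.0, 0.0],
        labels=["offensive", "not_offensive"],
    )
    print(label, [round(p, 3) for p in probs])
```

In practice the weights would be learned from labelled comments; the sketch only shows how an embedding is mapped to class probabilities.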
Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications
PeerJ Computer Science
Natural language inference (NLI) is an essential subtask in many natural language processing applications. It is a directional relationship from premise to hypothesis: a pair of texts is defined as entailed if the meaning of one text can be inferred from the other. NLI is also known as textual entailment recognition, and it is used to recognize entailed and contradictory sentences in various NLP systems such as question answering, summarization, and information retrieval systems. This paper describes the NLI problem attempted for Malayalam, a low-resource Indian language and the regional language of Kerala, spoken by more than 30 million people. The paper presents the Malayalam NLI dataset, named MaNLI, and applies NLI to Malayalam using different models, namely Doc2Vec (paragraph vector), fastText, BERT (Bidirectional Encoder Representation from Transformers), and LASER (Language Agnostic Sentence Representation). Our work attempts NLI in two ways, as binary classifi...