Sara Renjit - Academia.edu

Papers by Sara Renjit

CUSAT_NLP@HASOC-Dravidian-CodeMix-FIRE2020: Identifying Offensive Language from Manglish Tweets

Forum for Information Retrieval Evaluation, 2020

With the popularity of social media, communication through blogs, Facebook, Twitter, and other platforms has increased. Initially, English was the only medium of communication; now people can communicate in any language. This has led to people mixing English with their native language. Comments in other languages are sometimes written in transliterated English (Roman) script, while in other cases people use the native script. Identifying sentiment and offensive content in such code-mixed tweets is a necessary task. We present a working model submitted for Task 2 of the sub-track HASOC Offensive Language Identification-DravidianCodeMix in the Forum for Information Retrieval Evaluation, 2020. It is a message-level classification task. In our approach, an embedding-model-based classifier labels comments as offensive or not offensive. We applied this method to the Manglish dataset provided with the sub-track.

CUSAT NLP@AILA-FIRE2019: Similarity in Legal Texts using Document Level Embeddings

Forum for Information Retrieval Evaluation, 2019

Text retrieval plays a role in almost all domains of knowledge understanding. It has applications in the legal field, where there is an extensive collection of structured and unstructured texts. Artificial Intelligence is now applied in this area to understand and retrieve legal documents. This paper explains a working model developed for the track Artificial Intelligence for Legal Assistance in the Forum for Information Retrieval Evaluation, 2019 (AILA-FIRE2019). We use an embedding-model approach to represent these legal texts in a semantic vector space. The similarity between these document embeddings is computed using cosine similarity. The corpus used for building the embedding models is the dataset provided in AILA-FIRE2019.
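The similarity step described in this abstract can be illustrated with a minimal sketch. It assumes document embeddings are already available as plain lists of floats; the function names, the zero-vector convention, and the toy vectors below are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Sort document ids by descending cosine similarity to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy three-dimensional "embeddings" (made up for illustration only).
query = [1.0, 0.0, 1.0]
docs = {
    "case_a": [1.0, 0.0, 1.0],
    "case_b": [0.0, 1.0, 0.0],
    "case_c": [1.0, 1.0, 0.0],
}
ranking = rank_by_similarity(query, docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because cosine similarity depends only on direction and not on vector length, documents of very different sizes can be compared on the same scale.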

Feature based Entailment Recognition for Malayalam Language Texts

International Journal of Advanced Computer Science and Applications

Textual entailment is a relationship between two text fragments, namely the text (premise) and the hypothesis. It has applications in question answering systems, multi-document summarization, information retrieval systems, and social network analysis. In the digital era, recognizing semantic variability is important for understanding inferences in texts. The texts may be sentences, posts, tweets, or user experiences; understanding inferences from customer experiences, for instance, helps companies with customer segmentation. Digital information keeps growing, with textual data in almost all languages, including low-resource languages. This work deals with various machine learning approaches applied to textual entailment recognition, or natural language inference, for Malayalam, a South Indian low-resource language. A performance-based analysis using the machine learning classification techniques Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, AdaBoost, and Naive Bayes is carried out on the MaNLI (Malayalam Natural Language Inference) dataset. Different lexical and surface-level features are used for binary and multiclass classification. As the size of the dataset increases, the performance of feature-based classification drops. A comparison of feature-based models with deep learning approaches highlights this observation. The main focus here is the feature-based analysis with 14 different features and its comparison, which is essential to any NLP classification problem.

CUSATNLP@DravidianLangTech-EACL2021: Language Agnostic Classification of Offensive Content in Tweets

Identifying offensive information in tweets is a vital language processing task. Work on this task has so far concentrated mainly on English and other foreign languages. In this shared task on Offensive Language Identification in Dravidian Languages, held at the First Workshop on Speech and Language Technologies for Dravidian Languages at EACL 2021, the aim is to identify offensive content in the code-mixed Dravidian languages Kannada, Malayalam, and Tamil. Our team used language-agnostic BERT (Bidirectional Encoder Representations from Transformers) for sentence embedding and a Softmax classifier. The language-agnostic representation-based classification helped obtain good performance for all three languages; the results for Malayalam were good enough to obtain third position among the participating teams.
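The classification step mentioned in this abstract, a Softmax layer over a sentence embedding, can be sketched as follows. This is only an illustration: the embedding, weights, biases, and label names are hypothetical placeholders, whereas in the described system the embedding would come from a language-agnostic BERT encoder and the weights would be learned from the shared-task data.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights, biases, labels):
    """Apply one linear layer followed by softmax; return (label, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Hypothetical values: a 3-dimensional stand-in for a sentence embedding
# and hand-picked weights for two classes.
labels = ["not_offensive", "offensive"]
embedding = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 1.0],
           [0.0, 1.0, -1.0]]
biases = [0.0, 0.0]

predicted, probabilities = classify(embedding, weights, biases, labels)
print(predicted, [round(p, 4) for p in probabilities])
```

Since the classifier only sees the embedding vector, the same layer can be applied to any language the encoder supports, which is the point of a language-agnostic representation.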

Siamese Networks for Inference in Malayalam Language Texts

Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications

Natural language inference for Malayalam language using language agnostic sentence representation

PeerJ Computer Science

Natural language inference (NLI) is an essential subtask in many natural language processing applications. It is a directional relationship from premise to hypothesis. A pair of texts is defined as entailed if the meaning of one text can be inferred from the other. NLI is also known as textual entailment recognition, and it recognizes entailed and contradictory sentences in various NLP systems such as question answering, summarization, and information retrieval systems. This paper describes the NLI problem attempted for Malayalam, a low-resource Indian language and the regional language of Kerala. More than 30 million people speak this language. The paper presents the Malayalam NLI dataset, named MaNLI, and applies NLI to Malayalam using different models, namely Doc2Vec (paragraph vector), fastText, BERT (Bidirectional Encoder Representations from Transformers), and LASER (Language-Agnostic Sentence Representations). Our work attempts NLI in two ways, as binary classifi...
