Vijini Liyanage - Academia.edu
Papers by Vijini Liyanage
arXiv (Cornell University), Oct 25, 2023
Thanks to state-of-the-art Large Language Models (LLMs), language generation has reached outstanding levels. These models can generate high-quality content, which makes it challenging to distinguish generated text from human-written content. Despite the advantages of Natural Language Generation, the inability to identify automatically generated text raises ethical concerns about authenticity. Consequently, it is important to design and develop methodologies to detect artificial content. In our work, we present classification models constructed by ensembling transformer models such as SciBERT, DeBERTa and XLNet with Convolutional Neural Networks (CNNs). Our experiments demonstrate that the considered ensemble architectures surpass the performance of the individual transformer models for classification. Furthermore, the proposed SciBERT-CNN ensemble model achieved an F1-score of 98.36% on the ALTA shared task 2023 data.
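As a reading aid for the reported F1-score, the following is a minimal sketch of how a binary F1-score is computed from gold labels and predictions. The abstract does not state which averaging scheme the shared task used, so the single positive class here is an assumption, and the function name `f1_score` and the toy labels are illustrative only, not the authors' evaluation code.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: label 1 = generated text, label 0 = human-written text.
gold = [1, 1, 1, 0, 0]
pred = [1, 1, 0, 1, 0]
score = f1_score(gold, pred)
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3.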
Lecture Notes in Computer Science, 2023
2017 Moratuwa Engineering Research Conference (MERCon), 2017
The software development process encompasses multiple types and differentiated versions of artefacts over the corresponding lifecycle. These artefacts are vulnerable to drift or erosion when the product being developed changes. As a result, different artefacts are updated at different rates. Managing software artefacts is one of the major problems in the software industry. As the software process evolves, inconsistencies between artefacts also evolve, and at different rates. Traceability between software artefacts is considered a very important factor in today's development process, as it helps software professionals track back and forth between artefacts. To identify and visualize different relationships between a selected set of software artefact types, the Software Artefacts Traceability Analyzer (SAT-Analyzer) was designed and developed. At present, this tool supports traceability management for requirement specifications, design specifications and source code. This paper describes the work carried out to extend SAT-Analyzer to support DevOps practices with traceability. The research considers testing, configuration and deployment artefacts for traceability management within DevOps practices. Adding continuous integration support to the tool is a main area of work in this research; hence, SAT-Analyzer is linked with the Jenkins continuous integration tool. At the same time, the existing visualization of SAT-Analyzer was enhanced to support DevOps-related operations and testing, configuration and deployment traceability links. The modified SAT-Analyzer was evaluated with a case example, which is discussed in the paper.
arXiv (Cornell University), Feb 4, 2022
Automatic text generation based on neural language models has achieved performance levels that make the generated text almost indistinguishable from text written by humans. Despite the value that text generation can have in various applications, it can also be employed for malicious tasks. The diffusion of such practices represents a threat to the quality of academic publishing. To address these problems, we propose in this paper two datasets composed of artificially generated research content: a completely synthetic dataset and a partial text substitution dataset. In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers. The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences generated by the Arxiv-NLP model. We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE. The more natural the artificial texts seem, the more difficult they are to detect and the better the benchmark. We also evaluate the difficulty of the task of distinguishing original from generated text by using state-of-the-art classification models.
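To illustrate the kind of comparison BLEU performs between a generated text and its aligned original, here is a minimal sketch of clipped n-gram precision, the core quantity inside BLEU. It is not the authors' evaluation pipeline: full BLEU also combines several n-gram orders and applies a brevity penalty, the whitespace tokenization is a simplifying assumption, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / total


generated = "the the the cat"
original = "the cat sat"
unigram_p = clipped_precision(generated, original, n=1)
bigram_p = clipped_precision(generated, original, n=2)
```

Clipping keeps the repeated "the" in the generated text from being rewarded more than once, since the original contains it only once.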
A Mathematical Word Problem (MWP) differs from a general textual representation in that it contains numerical quantities and units in addition to text. Therefore, MWP generation should be handled carefully. In multilingual MWP generation, language-specific morphological and syntactic features become additional constraints. Standard template-based MWP generation techniques are incapable of identifying these language-specific constraints, particularly in morphologically rich yet low-resource languages such as Sinhala and Tamil. This paper presents the use of a Long Short-Term Memory (LSTM) network that is capable of generating elementary-level MWPs while satisfying the aforementioned constraints. Our approach feeds a combination of character embeddings, word embeddings, and Part of Speech (POS) tag embeddings to the LSTM, in which attention is provided for numerical values and units. We trained our model for three languages, English, Sinhala and Ta...
2019 14th Conference on Industrial and Information Systems (ICIIS)
Existing approaches for automatically generating mathematical word problems lack customizability and creativity because of the template-based mechanisms they employ. We present a solution to this problem using deep neural language generation mechanisms. Our approach uses a character-level Long Short-Term Memory (LSTM) network to generate word problems, and uses Part of Speech (POS) tags to resolve the constraints found in the generated problems. Our approach is capable of generating mathematical word problems in both English and Sinhala with an accuracy of over 90%.