Leila Khalatbari - Academia.edu
Papers by Leila Khalatbari
Cornell University - arXiv, Oct 6, 2022
The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it. In this paper, we aim to lay the groundwork to improve both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP, we use that taxonomy to present a comprehensive map of published generalisation studies, and we make recommendations for which areas might deserve attention in the future. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they aim to solve, the type of data shift they consider, the source by which this data shift is obtained, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 previous papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis of the current state of generalisation research in NLP, and make recommendations for the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to make steps towards making state-of-the-art generalisation testing the new status quo in NLP.
The DNA sequence, which contains all genetic traits, is not a functional entity. Instead, it is converted to protein sequences through transcription and translation. The protein sequence later takes on a 3D structure, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can imagine is carried out by proteins with specific functional responsibilities. Consequently, protein function prediction is an important task in bioinformatics. Protein function can be elucidated from its structure. Protein secondary structure prediction has attracted great attention since it is an input feature of many bioinformatics problems. A wide variety of computational methods has been proposed for protein secondary structure prediction. Nevertheless, they have achieved limited success due to obstacles such as abstruse protein data patterns, noise, class imbalance and the high dimensionality of encoding schemes of amino acid sequence...
Considerable advancements have been made in various NLP tasks based on the impressive power of large pre-trained language models (LLMs). These results have inspired efforts to understand the limits of LLMs so as to evaluate how far we are from achieving human-level general natural language understanding. In this work, we challenge the capability of LLMs with the new task of ETHICAL QUANDARY GENERATIVE QUESTION ANSWERING. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We propose a system, AISOCRATES, that provides an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. AISOCRATES searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also address safety concerns by providing a human controllability option in choosing ethical principles. We show that AISOCRATES generates promising answers to ethical quandary questions with multiple perspectives, 6.92% more often than answers written by human philosophers by one measure, but the system still needs improvement to match the coherence of human philosophers fully. We argue that AISOCRATES is a promising step toward developing an NLP system that incorporates human values explicitly by prompt instructions. We are releasing the code for research purposes.
ArXiv, 2018
The gene or DNA sequence in every cell does not control genetic properties on its own; rather, this is done through translation of DNA into protein and formation of a certain 3D structure. The biological function of a protein is tightly connected to its specific 3D structure. Prediction of the protein secondary structure is a crucial intermediate step towards elucidating its 3D structure and function. Traditional experimental methods for prediction of protein secondary structure are expensive and time-consuming. Therefore, in the past 45 years, various machine learning approaches have been put forth. Nevertheless, their average accuracy has hardly reached beyond 80%. The possible underlying reasons are the abstruse sequence-structure relation, noise in input protein data, class imbalance and the high-dimensional encoding schemes that are used to represent protein sequences. In this paper, we propose an accurate multi-component prediction machine to overcome the challenges of protein secondary...
Computers in Biology and Medicine