Efsun Sarioglu Kayi | Johns Hopkins University Applied Physics Lab
Papers by Efsun Sarioglu Kayi
arXiv (Cornell University), Jan 27, 2024
While large language models (LLMs) are extremely capable at text generation, their outputs are still distinguishable from human-authored text. We explore this separation across many metrics over text, many sampling techniques, many types of text data, and across two popular LLMs, LLaMA and Vicuna. Along the way, we introduce a new metric, recoverability, to highlight differences between human and machine text, and we propose a new sampling technique, burst sampling, designed to close this gap. We find that LLaMA and Vicuna have distinct distributions under many of the metrics, and that this influences our results: recoverability separates real from fake text better than any other metric when using LLaMA. When using Vicuna, burst sampling produces text that is distributionally closer to real text than other sampling techniques do.
2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019
Schizophrenia is a mental disorder that impacts a person's thinking, speech, and actions. It can reduce a person's ability to process auditory information and make decisions. Analyzing this disorder correctly is important because it might help with different ways of reducing its negative effects on patients. Linguists and psychiatrists have been investigating language impairments and speech disorders in people with schizophrenia, which can be challenging. In this study, we attempt to address this issue by analyzing linguistic features, i.e., cohesion, in the writings and speech scripts of schizophrenia patients. Our results show that using referential cohesion with text easability or situation model features provides the best performance for speech, whereas for the writing dataset, readability or a combination of situation model and readability yields the best performance.
Sixteen years ago, the first "surprise language exercise" was conducted, in Cebuano. The evaluation goal of a surprise language exercise is to learn how well systems for a new language can be quickly built. This paper briefly reviews the history of surprise language exercises. Some details from the most recent surprise language exercise, in Lithuanian, are included to help illustrate how the state of the art has advanced over this period.
arXiv (Cornell University), May 1, 2024
PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effective in batch experiments, its performance degrades in streaming settings where documents arrive over time because representations of new tokens may be poorly modeled by the earlier tokens used to select cluster centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase incremental indexing based on hierarchical sharding. Experiments on ClueWeb09 and the multilingual NeuCLIR collection demonstrate the effectiveness of this approach both for the largest collection indexed to date by the ColBERT architecture and in the multilingual setting, respectively. CCS Concepts: Information systems → Search engine indexing; Language models; Web and social media search; Multilingual and cross-lingual retrieval.
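The "centroid plus compressed residual" representation described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the two-dimensional toy vectors, and the uniform clamped quantizer are hypothetical and are not PLAID's actual codec. The sketch only shows why a vector far from every centroid chosen earlier reconstructs poorly, which is the streaming problem the abstract mentions.

```python
from typing import List, Tuple

Vector = List[float]


def nearest_centroid(vec: Vector, centroids: List[Vector]) -> int:
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def quantize(residual: Vector, step: float = 0.25, levels: int = 4) -> List[int]:
    """Map each residual component to a small integer code, clamped to `levels` buckets.

    Hypothetical uniform quantizer, chosen only to keep the example short.
    """
    lo, hi = -(levels // 2), levels // 2 - 1
    return [max(lo, min(hi, round(r / step))) for r in residual]


def encode(vec: Vector, centroids: List[Vector],
           step: float = 0.25, levels: int = 4) -> Tuple[int, List[int]]:
    """Store a vector as (centroid id, quantized residual codes)."""
    cid = nearest_centroid(vec, centroids)
    residual = [v - c for v, c in zip(vec, centroids[cid])]
    return cid, quantize(residual, step, levels)


def decode(cid: int, codes: List[int], centroids: List[Vector],
           step: float = 0.25) -> Vector:
    """Approximate the original vector from its centroid and residual codes."""
    return [c + q * step for c, q in zip(centroids[cid], codes)]


def reconstruction_error(vec: Vector, centroids: List[Vector]) -> float:
    """Squared error between a vector and its encode/decode round trip."""
    cid, codes = encode(vec, centroids)
    approx = decode(cid, codes, centroids)
    return sum((v - a) ** 2 for v, a in zip(vec, approx))


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [1.0, 1.0]]      # fixed from "earlier" documents
    print(reconstruction_error([0.9, 1.2], centroids))  # vector near a centroid
    print(reconstruction_error([3.0, 3.0], centroids))  # vector far from all centroids
```

In this toy setup the residual of the distant vector is clamped by the quantizer, so its reconstruction error is much larger than that of the nearby vector.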
Schizophrenia is one of the most disabling mental health conditions to live with. Approximately one percent of the population has schizophrenia, which makes it fairly common, and it affects many people and their families. Patients with schizophrenia suffer different symptoms: formal thought disorder (FTD), delusions, and emotional flatness. In this paper, we quantitatively and qualitatively analyze the language of patients with schizophrenia, measuring various linguistic features in two modalities: speech and written text. We examine the following features: coherence and cohesion of thoughts, emotions, specificity, level of committed belief (LCB), and personality traits. Our results show that patients with schizophrenia score high in fear and neuroticism compared to healthy controls. In addition, they are more committed to their beliefs, and their writing lacks details. They score lower in most of the linguistic features of cohesion with significant p-values.
More and more, patient information is being stored in digital formats; however, to maximize its usefulness, automated tools need to be built that can effectively and efficiently process these records. Clinical decision support systems are such tools that can recommend the need for a certain medical test or therapy by examining prior patient information. This can help the clinician avoid unnecessary or potentially harmful tests or therapies. In addition, this type of automated analysis of patient data can help medical professionals make clinical decisions much faster and with more confidence. As such, the speed and quality of healthcare would be improved with reduced costs. One popular automated use of clinical reports is predicting the existence or absence of certain conditions in a given report. This type of analysis, called text classification in general, can learn the characteristics of such conditions from a previously labeled dataset of clinical reports. In this research, novel techniques for better performance of automated classification of clinical reports are developed and compared with conventional approaches. As a first step, classifiers using the raw text of the reports with standard preprocessing techniques are implemented. Additionally, biomedical NLP tools are used to extract the relevant information from the reports in a more consistent way. These extracted features are classified using conventional classifiers, including decision trees and support vector machines (SVM). While results show that the classification performance is significantly improved by using the NLP features over using the raw text, this NLP-based classification is computationally expensive and requires a significant number of manual steps to be used effectively across many different clinical areas.
As an alternative, a framework for a topic modeling-based classification system is built. Topic modeling techniques automatically find the interpretable themes that exist in a document collection. These topics are used to represent each report, and different classifiers are built based on this representation. This system has the advantage of being more adaptable to different clinical domains than custom NLP-based classifiers. It also provides dimension reduction because there are fewer potential topic categories than the number of words in a vocabulary. The performance of topic modeling-based classifiers is better than classification using raw text, and it is competitive with classification using NLP features. In addition, they provide a compact and interpretable representation. Results from this dissertation research have significant impacts on the quality and efficiency of healthcare. First of all, the classifiers built in this research can be used to automatically predict the conditions in a clinical report. They can replace the manual review of clinical reports, which can be time-consuming and error-prone. In addition, with the increased accuracy and interpretability they provide, clinicians can have more confidence in utilizing such systems in real-life settings.
Efficient utilization of resources has always been a challenge. Especially in a grid infrastructure, where the number of nodes is comparably higher than in a regular network, the status of the resources is continuously changing and hard to keep track of. A predictive approach, where the status of resources is forecast based on their historical performance, can adapt to the dynamic nature of the environment. In this study, such an enhancement to the scheduling mechanism is analyzed: the utilization of various kinds of resources, such as memory, CPU, network, and IO, is periodically monitored, and future utilization of resources is predicted based on this historical information. The system employs a combined feature extraction and neural network approach: features are extracted for better accuracy and faster results. Linear, feed-forward, and recurrent networks are analyzed for time series prediction of resource performance. Recurrent networks combined with the DWT feature extraction process produced the best predictions with good generalization.
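As a rough illustration of the feature extraction step mentioned above, the sketch below applies a Haar-style wavelet decomposition to a short utilization series. Several things are assumptions: that "DWT" refers to the discrete wavelet transform, that a Haar wavelet is a reasonable stand-in (the abstract does not name the wavelet used), and the averaging form of the transform (pairwise mean and half-difference rather than the orthonormal 1/sqrt(2) scaling). The function names are hypothetical.

```python
from typing import List, Tuple


def haar_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One decomposition level: pairwise averages (trend) and half-differences (detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_features(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Repeatedly decompose the trend; return the final trend and the details of each level."""
    approx = list(signal)
    details: List[List[float]] = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


if __name__ == "__main__":
    cpu_utilization = [4.0, 6.0, 10.0, 12.0]   # made-up monitoring samples
    trend, details = haar_features(cpu_utilization, levels=2)
    print(trend, details)
```

The coarse trend and the per-level details form a compact feature vector that could then be fed to a predictor in place of the raw samples.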
Academic emergency medicine : official journal of the Society for Academic Emergency Medicine, Jan 14, 2016
The authors have previously demonstrated highly reliable automated classification of free-text computed tomography (CT) imaging reports using a hybrid system that pairs linguistic (natural language processing) and statistical (machine learning) techniques. Previously performed for identifying the outcome of orbital fracture in unprocessed radiology reports from a clinical data repository, the performance has not been replicated for more complex outcomes. To validate automated outcome classification performance of a hybrid natural language processing (NLP) and machine learning system for brain CT imaging reports. The hypothesis was that our system has performance characteristics for identifying pediatric traumatic brain injury (TBI). This was a secondary analysis of a subset of 2,121 CT reports from the Pediatric Emergency Care Applied Research Network (PECARN) TBI study. For that project, radiologists dictated CT reports as free text, which were then deidentified and scanned as PDF ...
Proceedings of the 28th International Conference on Computational Linguistics, 2020
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, 2020
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), 2017
This project investigates a novel approach to building computer systems that can recognize visual situations. While much effort in computer vision has focused on identifying isolated objects in images, what people actually do is recognize coherent situations: collections of objects and their interrelations that, taken together, correspond to a known concept, such as "dog-walking", or "a fight breaking out", or "a blind person crossing the street". Situation recognition by humans may appear on the surface to be effortless, but it relies on a complex dynamic interplay among human abilities to perceive objects, systems of relationships among objects, and analogies with stored knowledge and memories. Enabling computers to flexibly recognize visual situations would create a flood of important applications in fields as diverse as autonomous vehicles, medical diagnosis, interpretation of scientific imagery, enhanced human-computer interaction, and personal inf...
Electronic health records (EHRs) contain important clinical information about patients. Efficient and effective use of this information could supplement or even replace manual chart review as a means of studying and improving the quality and safety of healthcare delivery. However, some of these clinical data are in the form of free text and require pre-processing before use in automated systems. A common free text data source is radiology reports, typically dictated by radiologists to explain their interpretations. We sought to demonstrate machine learning classification of computed tomography (CT) imaging reports into binary outcomes, i.e., positive and negative for fracture, using regular text classification and classifiers based on topic modeling. Topic modeling provides interpretable themes (topic distributions) in reports, a representation that is more compact than the commonly used bag-of-words representation and can be processed faster than raw text in subsequent automated pro...
Distributed word embeddings have become ubiquitous in natural language processing as they have been shown to improve performance in many semantic and syntactic tasks. Popular models for learning cross-lingual word embeddings do not consider the morphology of words. We propose an approach to learn bilingual embeddings using parallel data and subword information that is expressed in various forms, i.e., character n-grams, morphemes obtained by unsupervised morphological segmentation, and byte pair encoding. We report results for three low-resource, morphologically rich languages (Swahili, Tagalog, and Somali) and a high-resource language (German) in a simulated low-resource scenario. Our results show that our method that leverages subword information outperforms the model without subword information, both in intrinsic and extrinsic evaluations of the learned embeddings. Specifically, analogy reasoning results show that using subwords helps capture syntactic characteristics. Semanticall...
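Of the subword forms listed in the abstract above, character n-grams are the simplest to show concretely. The sketch below is an illustration under stated assumptions: the `<` and `>` word-boundary markers follow a common convention for character n-gram models and are not confirmed as the paper's exact scheme, and the function name and default n-gram lengths are hypothetical.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 4) -> List[str]:
    """Return all character n-grams of a word, with '<' and '>' marking its boundaries."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


if __name__ == "__main__":
    # Swahili "kitabu" (book) and "vitabu" (books) differ only in the noun-class prefix.
    singular = set(char_ngrams("kitabu", 3, 3))
    plural = set(char_ngrams("vitabu", 3, 3))
    print(sorted(singular & plural))
```

Because the two word forms share most of their n-grams, a model that builds word vectors from n-gram vectors can relate them even when one form is rare in the training data.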
Findings of the Association for Computational Linguistics: EMNLP 2020
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
2012 11th International Conference on Machine Learning and Applications, 2012
Large amounts of electronic clinical data encompass important information in free text format. To help guide medical decision-making, text needs to be efficiently processed and coded. In this research, we investigate techniques to improve classification of Emergency Department computed tomography (CT) reports. The proposed system uses Natural Language Processing (NLP) to generate structured output from patient reports and then applies machine learning techniques to code for the presence of clinically important injuries for traumatic orbital fracture victims. Topic modeling of the corpora is also utilized as an alternative representation of the patient reports. Our results show that both NLP and topic modeling improve raw text classification results. Within NLP features, filtering the codes using modifiers produces the best performance. Topic modeling, on the other hand, shows mixed results. Topic vectors provide good dimensionality reduction and achieve classification results comparable to those with NLP features. However, binary topic classification fails to improve upon raw text classification.