Tommaso Mario Buonocore | University of Pavia

Papers by Tommaso Mario Buonocore

Advancing Italian Biomedical Information Extraction with Transformers-based Models: Methodological Insights and Multicenter Practical Application

Journal of Biomedical Informatics

A Rule-Free Approach for Cardiological Registry Filling from Italian Clinical Notes with Question Answering Transformers

Artificial Intelligence in Medicine

Improving Keyword-Based Topic Classification in Cancer Patient Forums with Multilingual Transformers

IOS Press eBooks, Jun 6, 2022

Online forums play an important role in connecting people who have crossed paths with cancer. These communities create networks of mutual support covering different cancer-related topics, and they contain an extensive amount of heterogeneous information that can be mined for useful insights. This work presents a case study in which users' posts from an Italian cancer patient community were classified by combining count-based and prediction-based representations to identify discussion topics, with the aim of improving message reviewing and filtering. We demonstrate that pairing simple bag-of-words representations based on keyword matching with pre-trained contextual embeddings significantly improves the overall quality of the predictions and allows the model to handle ambiguities and misspellings. By using non-English real-world data, we also investigated the reusability of pre-trained multilingual models like BERT in lower-data regimes, such as those of many local medical institutions.
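The pairing of representations the abstract describes can be sketched schematically. Everything below is invented for illustration: the topic lexicons, the toy hash-based "embedding" (standing in for a real contextual model like multilingual BERT), and the vector sizes are assumptions, not details from the paper.

```python
import math

# Hypothetical keyword lexicons per topic (illustrative only).
TOPIC_KEYWORDS = {
    "treatment": {"chemo", "therapy", "radiation"},
    "support": {"family", "friends", "hope"},
}

def keyword_counts(tokens):
    """Count-based representation: one keyword-match count per topic lexicon."""
    return [sum(t in kws for t in tokens) for kws in TOPIC_KEYWORDS.values()]

def embed(tokens):
    """Stand-in for a prediction-based contextual embedding.
    Here: a toy 2-d hash-derived vector, purely illustrative."""
    v = [0.0, 0.0]
    for t in tokens:
        h = (hash(t) % 1000) / 1000.0
        v[0] += h
        v[1] += 1.0 - h
    n = math.sqrt(v[0] ** 2 + v[1] ** 2) or 1.0
    return [x / n for x in v]

def features(text):
    tokens = text.lower().split()
    # Pair both representations by concatenating counts and embedding.
    return keyword_counts(tokens) + embed(tokens)

vec = features("chemo therapy gave me hope")
print(len(vec))  # → 4 (2 keyword counts + 2 embedding dims)
```

In practice the concatenated vector would feed a downstream classifier; the point of the sketch is only that the exact-match counts and the dense embedding occupy separate slots of one feature vector, so the classifier can fall back on the embedding when keywords are misspelled or ambiguous.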

A Rule-Free Approach for Cardiological Registry Filling from Italian Clinical Notes with Question Answering Transformers

Lecture Notes in Computer Science, 2023

Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

arXiv (Cornell University), Dec 20, 2022

In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that models stemming from broad-coverage checkpoints can benefit greatly from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to deriving biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but that concatenating high-quality data can improve model performance even with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the lessons learned from the study constitute valuable insights towards building biomedical language models that generalize to other less-resourced languages and different domain settings.

A synthetic dataset of liver disorder patients

Data in Brief, Apr 1, 2023

The data in this article include 10,000 synthetic patients with liver disorders, characterized by 70 different variables, including clinical features and patient outcomes such as hospital admission or surgery. Patient data are generated to simulate real patient data as closely as possible, using a publicly available Bayesian network describing a causal model for liver disorders. By varying the network parameters, we also generated an additional set of 500 patients whose characteristics deviate from the initial patient population. We provide an overview of the synthetic data generation process and the associated scripts for generating the cohorts. This dataset can be useful for training and validating machine learning models, especially under dataset shift between training and test sets.
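The generation scheme described above (ancestral sampling from a Bayesian network, then re-sampling with altered parameters to induce dataset shift) can be sketched on a toy scale. The network below is invented for illustration: its variables, structure, and probabilities are assumptions, not the liver-disorder network the dataset actually uses.

```python
import random

random.seed(0)

def sample_patient(p_disorder=0.3):
    """Ancestral sampling: draw root variables first, then children
    conditioned on their parents (toy two-node causal model)."""
    disorder = random.random() < p_disorder
    # Child variable depends on its parent via a conditional probability.
    admission = random.random() < (0.6 if disorder else 0.1)
    return {"disorder": disorder, "admission": admission}

# Main cohort drawn with the baseline parameters.
cohort = [sample_patient() for _ in range(10_000)]
# Shifted cohort: vary a network parameter so the 500 extra patients
# deviate from the initial population, mimicking dataset shift.
shifted = [sample_patient(p_disorder=0.7) for _ in range(500)]

rate = sum(p["disorder"] for p in cohort) / len(cohort)
shifted_rate = sum(p["disorder"] for p in shifted) / len(shifted)
print(rate < shifted_rate)  # the shifted cohort has more disorder cases
```

Training a model on `cohort` and evaluating it on `shifted` then exercises exactly the train/test shift scenario the dataset is built for.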

Localizing in-domain adaptation of transformer-based biomedical language models

Journal of Biomedical Informatics, Aug 1, 2023

COMINTART (COMitato INTelligenza ARTificiale): an Italian-language outreach channel about artificial intelligence

Advancing Italian Biomedical Information Extraction with Large Language Models: Methodological Insights and Multicenter Practical Application

The introduction of computerized medical records in hospitals has reduced burdensome operations like manual writing and information fetching. However, the data contained in medical records are still far underutilized, primarily because extracting them from unstructured textual medical records takes time and effort. Information Extraction, a subfield of Natural Language Processing, can help clinical practitioners overcome this limitation through automated text-mining pipelines. In this work, we created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Large Language Model for this task. Moreover, we conducted several experiments with three external independent datasets to implement an effective multicenter model, with an overall F1-score of 84.77% (precision 83.16%, recall 86.44%). The lessons learned are: (i) the crucial role of a consistent annotation process, and (ii) a fine-tuning strategy that combines classical methods with a "few-shot" approach. This allowed us to establish methodological guidelines that pave the way for future implementations in this field and allow Italian hospitals to tap into important research opportunities.
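The F1, precision, and recall figures quoted above are the standard entity-level metrics for Named Entity Recognition. A minimal sketch of how such scores are computed is below; the span format and the example entities are invented for illustration, and this is the generic strict-match evaluation, not the paper's own evaluation code.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1 over (start, end, label)
    spans: a prediction counts as correct only on an exact match of
    both boundaries and label (strict-match NER evaluation)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: exactly matching entities
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Tiny example: spans given as (start_token, end_token, label).
gold = [(0, 2, "DRUG"), (5, 6, "SYMPTOM")]
pred = [(0, 2, "DRUG"), (5, 6, "DISORDER")]  # second label is wrong
p, r, f = entity_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.5 0.5 0.5
```

Note how the mislabeled span costs both precision and recall at once, which is why strict entity-level F1 is a demanding summary of multicenter NER performance.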

Why did AI get this one wrong? — Tree-based explanations of machine learning model predictions

Artificial Intelligence in Medicine
