Jakub Sido - Academia.edu
Papers by Jakub Sido
With its recent rise, Artificial Intelligence (AI) is becoming a significant phenomenon in our lives. As with many other powerful tools, AI brings many advantages but many risks as well. Predictions and automation can significantly help in our everyday lives. However, sending our data to servers for processing can severely hurt our privacy. In this paper, we describe experiments designed to find out whether we can enjoy the benefits of AI in the privacy of our mobile devices. We focus on text data since such data are easy to store in large quantities for mining by third parties. We measure the performance of deep learning methods in terms of accuracy (when compared to fully-fledged server models) and speed (number of text documents processed in a second). We conclude our paper with the finding that, with a few relatively small modifications, mobile devices can process hundreds to thousands of documents while leveraging deep learning models.
Lecture Notes in Computer Science, Dec 7, 2021
This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each test annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter- and intra-annotator agreement. Besides agreement figures, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 Pearson's correlation coefficient).
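The averaging and evaluation scheme above is simple to reproduce. Below is a minimal sketch (not the authors' code; the similarity scale and data layout are assumptions) of building the gold test score as the mean of 9 annotations and comparing a single annotator and a model against it with Pearson's correlation coefficient:

```python
# Minimal sketch of the annotation-averaging and evaluation scheme.
# The 0-6 similarity scale and the 9-column layout are assumptions.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical annotations: rows = sentence pairs, columns = 9 annotators.
annotations = np.array([
    [5, 6, 5, 5, 6, 5, 4, 5, 6],
    [1, 0, 1, 2, 1, 1, 0, 1, 1],
    [3, 4, 3, 3, 2, 4, 3, 3, 4],
], dtype=float)

gold = annotations.mean(axis=1)  # averaged gold score per sentence pair

# Correlation of a single annotator against the averaged gold:
annotator_r, _ = pearsonr(annotations[:, 0], gold)

# Correlation of hypothetical model predictions against the averaged gold:
model_predictions = np.array([5.2, 0.9, 3.3])
model_r, _ = pearsonr(model_predictions, gold)

print(f"annotator r = {annotator_r:.2f}, model r = {model_r:.2f}")
```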
ArXiv, 2022
This work proposes a new pipeline for leveraging data collected on the Stack Overflow website for pre-training a multimodal model for searching for duplicates on question answering websites. Our multimodal model is trained on question descriptions and source codes in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers to questions that have already been answered. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first, the Stack Overflow Dataset (SOD), represents a massive corpus of paired questions and answers. The second, the Stack Overflow Duplicity Dataset (SODD), contains data for training duplicate detection models.
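To make the intended usage concrete, here is a heavily hedged sketch of the duplicate-scoring pattern described above. MQDD itself is a fine-tuned multimodal Transformer; the trivial hashed bag-of-words encoder below is only a stand-in so the example stays self-contained, and all names are illustrative:

```python
# Sketch of scoring a new (question, code) pair against an already answered
# question. The encoder is a toy stand-in for the MQDD model.
import math

def encode(question: str, code: str, dim: int = 256) -> list[float]:
    """Embed the two modalities (text + source code) into one vector."""
    vec = [0.0] * dim
    for token in question.lower().split() + code.split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

new_q = encode("How do I reverse a list?", "xs[::-1]")
old_q = encode("Reversing a list in Python", "list(reversed(xs))")
print(f"duplicate score: {cosine(new_q, old_q):.2f}")  # above a threshold -> flag as duplicate
```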
ArXiv, 2021
This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each test annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter- and intra-annotator agreement. Besides agreement figures, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 Pearson's correlation coefficient).
In recent years, computational systems have increasingly used natural language to communicate with humans. This work summarises state-of-the-art approaches in the field of generative models, especially in the text domain. It offers a comprehensive study of specific problems known from this domain and related ones, such as adversarial training, reinforcement learning, and artificial neural networks. It also addresses the usage of these models in the context of non-generative approaches and the possibility of combining both. This work was supported by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications. Copies of this report are available at http://www.kiv.zcu.cz/en/research/publications/ or by surface mail on request sent to the following address: University of West Bohemia, Department of Computer Science and Engineering, Univerzitní 8, 30614 Plzeň, Czech Republic. Copyright © 2020 University of West Bohemia, Czech Republic
2019 International Conference on Applied Electronics (AE), 2019
With its recent rise, Artificial Intelligence (AI) is becoming a significant phenomenon in our lives. As with many other powerful tools, AI brings many advantages but many risks as well. Predictions and automation can significantly help in our everyday lives. However, sending our data to servers for processing can severely hurt our privacy. In this paper, we describe experiments designed to find out whether we can enjoy the benefits of AI in the privacy of our mobile devices. We focus on text data since such data are easy to store in large quantities for mining by third parties. We measure the performance of deep learning methods in terms of accuracy (when compared to fully-fledged server models) and speed (number of text documents processed in a second). We conclude our paper with the finding that, with a few relatively small modifications, mobile devices can process hundreds to thousands of documents while leveraging deep learning models.
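The speed metric above (documents processed per second) can be measured with a simple timing loop. The sketch below uses a dummy classifier as a stand-in for the on-device deep learning model; the workload and function names are illustrative:

```python
# Minimal sketch of a documents-per-second throughput measurement.
import time

def classify(document: str) -> int:
    """Stand-in for on-device inference (e.g., a small quantized model)."""
    return len(document.split()) % 2  # dummy prediction

documents = ["some short text document"] * 10_000  # synthetic workload

start = time.perf_counter()
for doc in documents:
    classify(doc)
elapsed = time.perf_counter() - start

print(f"{len(documents) / elapsed:.0f} documents per second")
```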
The use of automated journalism, also known as robotic journalism or artificial intelligence journalism, became an established practice in English-speaking countries less than ten years ago. Narrative Science and Automated Insights developed creative software that automatically generates reports, and several media outlets, including The Associated Press (AP), have started to publish its reports. Language barriers in media landscapes based on Slavic languages, such as Czech, have delayed the introduction of automated journalism in Central and Eastern Europe. This article is a case study of the application of algorithms that transform large data files into news texts at The Czech News Agency (CTK). In 2019, a research team led by Charles University provided The Czech News Agency with algorithms that generate reports on trading results on the Prague Stock Exchange without human intervention. The study deals with the production of algorithms and compares th...
This work deals with curriculum learning for deep learning models on the sentiment analysis task. We design a new way of curriculum learning for text data. We reorder the training dataset to introduce the simpler examples first, estimating the difficulty of the examples by the length of the sentences; the simpler examples are assumed to be shorter. We also experiment with measuring the frequency of the words, a technique designed by earlier researchers. We evaluate changes in the overall accuracy of the models under both curriculum learning techniques. Our experiments do not show an increase in accuracy for either method. Nevertheless, we reach a new state of the art in sentiment analysis for Czech as a by-product of our effort.
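The length-based curriculum amounts to a single sort of the training data. A minimal sketch, assuming (sentence, label) training pairs:

```python
# Minimal sketch of a length-based curriculum: reorder the training set so
# that shorter, presumably simpler, sentences are presented first.
dataset = [
    ("this film was absolutely wonderful from start to finish", 1),
    ("terrible", 0),
    ("not bad at all", 1),
]

# Difficulty estimate: sentence length in tokens (shorter = simpler).
curriculum = sorted(dataset, key=lambda ex: len(ex[0].split()))

for sentence, label in curriculum:
    print(len(sentence.split()), label, sentence)
# A trainer would now iterate over `curriculum` instead of a shuffled dataset.
```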
Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications
In this paper, we present coreference resolution experiments with a newly created multilingual corpus, CorefUD (Nedoluzhko et al., 2021). We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joint models: one for the Slavic languages and one for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can profit from harmonized annotations, and that using joint models helps significantly for the languages with smaller training data.
Theory and Practice of Natural Computing
ArXiv, 2020
In this paper, we describe our method for the detection of lexical semantic change, i.e., word sense changes over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for the SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1: binary change detection, and 4th in Sub-task 2: ranked change detection. We present our method, which is completely unsupervised and language independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between the earlier and later spaces using Canonical Correlation Analysis and an orthogonal transformation; and measuring the cosine similarity between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.
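The final steps of the method translate directly into a few lines of linear algebra. The sketch below assumes word vectors for the two corpora are already trained and row-aligned by a shared vocabulary; for brevity it fits a plain orthogonal Procrustes mapping in place of the paper's combination of Canonical Correlation Analysis and an orthogonal transformation:

```python
# Sketch of aligning two diachronic embedding spaces and measuring change.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
dim, vocab = 50, 200

# Hypothetical embedding matrices for the shared vocabulary (row i = word i).
earlier = rng.normal(size=(vocab, dim))  # vectors from the earlier corpus
later = rng.normal(size=(vocab, dim))    # vectors from the later corpus

# Fit the orthogonal map R that best rotates the earlier space onto the later one.
R, _ = orthogonal_procrustes(earlier, later)

def change_score(word_id: int) -> float:
    """Cosine between the transformed earlier vector and the later vector.
    A low cosine indicates a large semantic change for that word."""
    u, v = earlier[word_id] @ R, later[word_id]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(f"cosine for word 0: {change_score(0):.2f}")
```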
Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications
This paper describes the training process of the first Czech monolingual language representation models based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than the multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss properties of monolingual and multilingual models based upon our results. We publish all the pretrained and fine-tuned models freely for the research community.
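Since the models are published, they can presumably be loaded with a standard toolkit such as Hugging Face transformers. The model identifier below is a placeholder assumption, not the released name:

```python
# Sketch of loading a released Czech BERT-style model with the standard
# Hugging Face transformers API. Substitute the actual published model ID.
from transformers import AutoTokenizer, AutoModel

model_name = "some-org/czech-bert-base"  # placeholder, not the real identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Dnes je krásný den.", return_tensors="pt")  # "Today is a beautiful day."
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```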