Jakub Sido - Academia.edu

Papers by Jakub Sido

Research paper thumbnail of Hluboké učení pro textová data na mobilních zařízeních (Deep Learning for Text Data on Mobile Devices)

With its rise, Artificial Intelligence (AI) is becoming a significant phenomenon in our lives. As with many other powerful tools, AI brings many advantages but many risks as well. Predictions and automation can significantly help in our everyday lives. However, sending our data to servers for processing can severely hurt our privacy. In this paper, we describe experiments designed to find out whether we can enjoy the benefits of AI in the privacy of our mobile devices. We focus on text data since such data are easy to store in large quantities for mining by third parties. We measure the performance of deep learning methods in terms of accuracy (compared to fully-fledged server models) and speed (the number of text documents processed in a second). We conclude our paper with findings that, with a few relatively small modifications, mobile devices can process hundreds to thousands of documents while leveraging deep learning models.

Research paper thumbnail of On Injecting Entropy-Like Features into Deep Neural Networks for Content Relevance Assessment

Lecture Notes in Computer Science, Dec 7, 2021

Research paper thumbnail of Czech News Dataset for Semantic Textual Similarity

This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Thanks to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 Pearson's correlation coefficient).
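The evaluation described above rests on two simple steps: averaging the 9 individual annotations per item into a gold score, and comparing system predictions to the gold scores with Pearson's correlation coefficient. A minimal sketch of both steps; all annotation values and system scores below are made up for illustration:

```python
import statistics

def pearson(xs, ys):
    # Pearson's correlation coefficient between two equal-length lists.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: 9 individual annotations per sentence pair are
# averaged into a more reliable gold score, as in the paper's test set.
raw_annotations = [
    [4, 5, 4, 4, 5, 4, 3, 4, 4],   # sentence pair 1
    [1, 0, 1, 2, 1, 1, 0, 1, 1],   # sentence pair 2
    [3, 3, 2, 3, 4, 3, 3, 2, 3],   # sentence pair 3
]
gold = [statistics.mean(a) for a in raw_annotations]
system = [4.2, 0.9, 3.1]           # invented system predictions
print(round(pearson(gold, system), 2))
```

The same coefficient computed between one annotator's raw scores and the averaged gold scores gives the "average annotator" baseline the paper compares against.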

Research paper thumbnail of MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain

ArXiv, 2022

This work proposes a new pipeline for leveraging data collected on the Stack Overflow website for pre-training a multimodal model for searching duplicates on question answering websites. Our multimodal model is trained on question descriptions and source codes in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers to already answered questions. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first, the Stack Overflow Dataset (SOD), represents a massive corpus of paired questions and answers. The second, the Stack Overflow Duplicity Dataset (SODD), contains data for training duplicate detection models.

Research paper thumbnail of Czech News Dataset for Semantic Textual Similarity

ArXiv, 2021

This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Thanks to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 Pearson's correlation coefficient).

Research paper thumbnail of Natural Language Generation The State of the Art and the Concept of Ph

In recent years, computational systems have increasingly used natural language to communicate with humans. This work summarises state-of-the-art approaches in the field of generative models, especially in the text domain. It offers a comprehensive study of specific problems known from this domain and related ones, such as adversarial training, reinforcement learning, and artificial neural networks. It also addresses the usage of these models in the context of non-generative approaches and the possibility of combining both. This work was supported by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications. Copies of this report are available on http://www.kiv.zcu.cz/en/research/publications/ or by surface mail on request sent to the following address: University of West Bohemia, Department of Computer Science and Engineering, Univerzitní 8, 30614 Plzeň, Czech Republic. Copyright © 2020 University of West Bohemia, Czech Republic

Research paper thumbnail of Deep Learning for Text Data on Mobile Devices

2019 International Conference on Applied Electronics (AE), 2019

With its rise, Artificial Intelligence (AI) is becoming a significant phenomenon in our lives. As with many other powerful tools, AI brings many advantages but many risks as well. Predictions and automation can significantly help in our everyday lives. However, sending our data to servers for processing can severely hurt our privacy. In this paper, we describe experiments designed to find out whether we can enjoy the benefits of AI in the privacy of our mobile devices. We focus on text data since such data are easy to store in large quantities for mining by third parties. We measure the performance of deep learning methods in terms of accuracy (compared to fully-fledged server models) and speed (the number of text documents processed in a second). We conclude our paper with findings that, with a few relatively small modifications, mobile devices can process hundreds to thousands of documents while leveraging deep learning models.
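The speed metric above, documents processed per second, can be measured with a simple timing loop. A hypothetical sketch of such a measurement; the function name and the toy stand-in "model" are illustrative, not the authors' code:

```python
import time

def measure_throughput(process_fn, documents, warmup=2):
    # Time how many documents per second a callable processes.
    # A short warm-up run avoids counting one-time startup costs.
    for doc in documents[:warmup]:
        process_fn(doc)
    start = time.perf_counter()
    for doc in documents:
        process_fn(doc)
    elapsed = time.perf_counter() - start
    return len(documents) / elapsed

# Toy stand-in for an on-device model: a trivial token counter.
docs = ["some short text document"] * 1000
rate = measure_throughput(lambda d: len(d.split()), docs)
print(f"{rate:.0f} docs/sec")
```

With a real deep learning model, `process_fn` would wrap tokenization plus a forward pass, and the measured rate is what the paper compares across device and server configurations.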

Research paper thumbnail of The Robotic Reporter In The Czech News Agency: Automated Journalism And Augmentation In The Newsroom

The use of automated journalism, also known as robotic journalism or artificial intelligence journalism, became an established practice in English-speaking countries less than ten years ago. Narrative Science and Automated Insights developed creative software that automatically generates reports. Several media outlets, including The Associated Press (AP), have started to publish their reports. Language barriers in Slavic-language media landscapes, such as the Czech one, have caused some delays in the introduction of automated journalism in Central and Eastern Europe. This article is a case study of the application of algorithms that transform large data files into news texts at The Czech News Agency (CTK). In 2019, a research team led by Charles University provided The Czech News Agency with algorithms that generate reports on trading results on the Prague Stock Exchange without human intervention. The study deals with the production of algorithms and compares th...

Research paper thumbnail of Curriculum Learning in Sentiment Analysis

This work deals with curriculum learning for deep learning models on the sentiment analysis task. We design a new way of curriculum learning for text data: we reorder the training dataset to introduce the simpler examples first, estimating the difficulty of an example by its sentence length, under the assumption that simpler examples are shorter. We also experiment with ordering by word frequency, a technique proposed by earlier researchers. We evaluate changes in the overall accuracy of the models for both curriculum learning techniques. Our experiments do not show an increase in accuracy for either method. Nevertheless, we reach a new state of the art in sentiment analysis for Czech as a by-product of our effort.
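The length-based curriculum described above amounts to sorting the training set by sentence length before training. A small sketch under that assumption; the example sentences and labels are invented for illustration:

```python
def curriculum_order(examples):
    # Length-based curriculum: shorter sentences are assumed to be
    # easier, so the model sees them first.
    return sorted(examples, key=lambda ex: len(ex["text"].split()))

train = [
    {"text": "terrible acting and a dull plot throughout", "label": 0},
    {"text": "great movie", "label": 1},
    {"text": "not bad at all", "label": 1},
]
ordered = curriculum_order(train)
print([ex["text"] for ex in ordered])
# the shortest example ("great movie") now comes first
```

The frequency-based variant mentioned in the abstract would replace the sort key with a score derived from corpus word frequencies, keeping the rest of the pipeline unchanged.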

Research paper thumbnail of Multilingual Coreference Resolution with Harmonized Annotations

Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications

In this paper, we present coreference resolution experiments with a newly created multilingual corpus, CorefUD (Nedoluzhko et al., 2021). We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joint models: one for the Slavic languages and one for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can profit from harmonized annotations, and using joint models helps significantly for the languages with smaller training data.

Research paper thumbnail of On Injecting Entropy-Like Features into Deep Neural Networks for Content Relevance Assessment

Theory and Practice of Natural Computing

Research paper thumbnail of UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection

ArXiv, 2020

In this paper, we describe our method for the detection of lexical semantic change, i.e., word sense changes over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1: binary change detection, and 4th in Sub-task 2: ranked change detection. Our method is completely unsupervised and language independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between the earlier and later spaces using Canonical Correlation Analysis and an orthogonal transformation; and measuring the cosines between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.
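The orthogonal-transformation step above can be sketched as an orthogonal Procrustes problem solved via SVD, followed by a cosine comparison of the mapped vectors. This is an illustrative sketch, not the authors' implementation: the CCA step and the training of the vector spaces are omitted, and the toy data are random vectors in an artificially rotated "later" space.

```python
import numpy as np

def orthogonal_map(X, Y):
    # Least-squares orthogonal transformation (Procrustes) mapping the
    # earlier space X onto the later space Y; rows hold the vectors of
    # the same words in both corpora.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
earlier = rng.normal(size=(50, 8))            # toy "earlier corpus" vectors
rotation = np.linalg.qr(rng.normal(size=(8, 8)))[0]
later = earlier @ rotation                    # same vectors, rotated space
W = orthogonal_map(earlier, later)

# A semantically stable word keeps a high cosine after mapping;
# a word whose sense changed would score lower.
print(round(cosine(earlier[0] @ W, later[0]), 2))
```

In the actual task, a low cosine for a target word signals a sense change between the two time periods, which directly yields both the binary decision (Sub-task 1) and the ranking (Sub-task 2).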

Research paper thumbnail of Czert – Czech BERT-like Model for Language Representation

Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications

This paper describes the training process of the first Czech monolingual language representation models, based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more Czech data than is included in the multilingual models. We outperform the multilingual models on 9 out of 11 datasets and establish new state-of-the-art results on nine datasets. Finally, we discuss properties of monolingual and multilingual models based upon our results. We publish all the pre-trained and fine-tuned models freely for the research community.

Research paper thumbnail of English Dataset for Automatic Forum Extraction
