Agnieszka Wołk - Academia.edu

Papers by Agnieszka Wołk

Research paper thumbnail of Multilingual Chatbot for E-Commerce: Data Generation and Machine Translation

Pacific Asia Conference on Information Systems, 2021

Research paper thumbnail of Survey on dialogue systems including Slavic languages

Research paper thumbnail of Augmenting SMT with Semantically-Generated Virtual-Parallel Corpora from Monolingual Texts

Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and produces rather noisy output, which later needs further processing and in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English) from monolingual resources by calculating...

Research paper thumbnail of Mixing Textual Data Selection Methods for Improved In-Domain Data Adaptation

The efficient use of machine translation (MT) training data is being revolutionized by the application of advanced data selection techniques. These techniques extract sentences from broad domains and adapt in-domain data for MT. In this research, we attempt to improve in-domain data adaptation methodologies. We focus on three techniques for selecting sentences for analysis. The first is term frequency–inverse document frequency (TF-IDF), which originated in information retrieval (IR). The second, cited in the language modeling literature, is a perplexity-based approach. The third is the Levenshtein distance, which we discuss herein. We propose an effective combination of the three data selection techniques, applied at the corpus level. The results of this study revealed that the individual techniques are not particularly successful in practical applications. However, multilingual resources and a combination-based IR meth...
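As a rough illustration of how such scores might be mixed (this is a hypothetical sketch, not the paper's actual implementation; the function names, unigram language model, and equal weighting are all assumptions), candidate sentences can be ranked by combining a perplexity score under an in-domain model with the minimum Levenshtein distance to any in-domain reference sentence:

```python
import math

def unigram_perplexity(sentence, in_domain_probs, floor=1e-6):
    # Perplexity of a sentence under a unigram in-domain language model;
    # unseen tokens receive a small floor probability.
    tokens = sentence.lower().split()
    log_prob = sum(math.log(in_domain_probs.get(t, floor)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))

def levenshtein(a, b):
    # Token-level edit distance via the classic dynamic program.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        curr = [i]
        for j, tb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ta != tb)))
        prev = curr
    return prev[-1]

def combined_rank(candidates, in_domain_probs, reference_token_lists):
    # Rank candidates by perplexity plus minimum edit distance to any
    # in-domain reference sentence (lower combined score ranks first).
    def score(s):
        toks = s.lower().split()
        lev = min(levenshtein(toks, r) for r in reference_token_lists)
        return unigram_perplexity(s, in_domain_probs) + lev
    return sorted(candidates, key=score)
```

In practice the three scores would be normalized and weighted before summation; the unit weights above are purely illustrative.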

Research paper thumbnail of Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way

Electronics, 2021

Linguists have long focused on qualitative comparison of semantics across languages. Evaluating semantic interpretation between disparate language pairs such as English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) offers a felicitous opportunity to quantify linguistic semantics. Multilingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability of the proposed model across other target languages was assessed via pre-tra...
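The underlying idea of a cross-lingual transfer function can be sketched in its simplest form: a linear map fitted by least squares on a seed dictionary of aligned word pairs, then used to project a source vector and look up its nearest target-language neighbor. The paper itself uses deep learning models; this linear version and all function names are assumptions for illustration only.

```python
import numpy as np

def learn_transfer(source_vecs, target_vecs):
    # Learn a linear map W minimizing ||source @ W - target||_F over a seed
    # dictionary of row-aligned embedding pairs (ordinary least squares).
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

def translate(vec, W, target_matrix, target_words):
    # Project a source-language vector into the target space and return the
    # nearest target-language word by cosine similarity.
    projected = vec @ W
    sims = target_matrix @ projected / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(projected) + 1e-9)
    return target_words[int(np.argmax(sims))]
```

A nonlinear (deep) transfer function replaces `learn_transfer` with a trained network while the nearest-neighbor lookup stays the same.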

Research paper thumbnail of Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks requiring bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. This approach was developed on a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Minimization of data loss was ensured by parameter adaptation. An improvement in MT system score using text processed with our tool is discussed, and in-domain data adaptation results are presented. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show significant improvement in filtering in terms of MT quality.
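The kind of statistical measures involved in bi-sentence filtering can be sketched as follows; the concrete checks, thresholds, and function names here are illustrative assumptions, not the tool's actual parameters.

```python
def length_ratio_ok(src, tgt, low=0.6, high=1.67):
    # Accept pairs whose token-length ratio is plausible for translations.
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    return low <= ratio <= high

def lexical_overlap(src_translated, tgt):
    # Jaccard similarity between a machine-translated source sentence and
    # the candidate target sentence, over lowercased token sets.
    a, b = set(src_translated.lower().split()), set(tgt.lower().split())
    return len(a & b) / max(len(a | b), 1)

def filter_pairs(triples, threshold=0.3):
    # Keep candidate bi-sentences passing both checks; each input triple is
    # (source, machine_translation_of_source, candidate_target).
    return [(s, t) for s, mt, t in triples
            if length_ratio_ok(s, t) and lexical_overlap(mt, t) >= threshold]
```

Parameter adaptation in the paper would correspond to tuning values like `low`, `high`, and `threshold` to minimize data loss on a development set.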

Research paper thumbnail of Hybrid approach to detecting symptoms of depression in social media entries

Sentiment and lexical analyses are widely used to detect depression or anxiety disorders. It has been documented that there are significant differences between the language used by a person with an emotional disorder and that of a healthy individual. Still, the effectiveness of these lexical approaches could be improved further, because current analysis focuses on what the social media entries are about, not how they are written. In this study, we focus on the aspects in which these short texts are similar to each other and on how they were created. We present an innovative approach to the depression screening problem by applying CollGram analysis, a known effective method of obtaining linguistic information from texts. We compare these results with sentiment analysis based on the BERT architecture. Finally, we create a hybrid model achieving a diagnostic accuracy of 71%.
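One simple way to form such a hybrid is late fusion: a logistic layer over a CollGram-style phraseology feature and a BERT sentiment probability. This is only a sketch of the general pattern; the abstract does not specify the fusion method, and the weights below are illustrative, not fitted values.

```python
import math

def hybrid_predict(collgram_score, bert_prob, w_coll=-0.8, w_bert=2.0, bias=-0.5):
    # Late-fusion hybrid: a logistic layer combining a CollGram-style
    # phraseology feature with a BERT sentiment probability.
    # The weights are hypothetical; a real system would fit them on
    # labeled data (e.g. with logistic regression).
    z = w_coll * collgram_score + w_bert * bert_prob + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return p >= 0.5, p
```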

Research paper thumbnail of Big Data Language Model of Contemporary Polish

Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, 2017

Research paper thumbnail of Unsupervised tool for quantification of progress in L2 English phraseological

Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, Sep 24, 2017

This study aimed to aid the enormous effort required to analyze phraseological writing competence by developing an automatic evaluation tool for texts. We attempted to measure both second language (L2) writing proficiency and text quality. In our research, we adapted the CollGram technique, which searches a reference corpus to determine the frequency of each pair of tokens (bi-grams) and calculates the t-score and related information. We used the Level 3 Corpus of Contemporary American English as a reference corpus. Our solution performed well in writing evaluation and is freely available as a web service or as source code for other researchers.
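The bigram t-score underlying CollGram compares the observed co-occurrence count of a word pair with the count expected if the two words were independent. A minimal sketch (function names are assumptions; CollGram itself also computes mutual information and other association measures):

```python
import math

def bigram_t_score(bigram_count, w1_count, w2_count, corpus_size):
    # t-score of a bigram against a reference corpus: the gap between the
    # observed co-occurrence count and the count expected under
    # independence, scaled by sqrt of the observed count.
    if not bigram_count:
        return 0.0
    expected = (w1_count * w2_count) / corpus_size
    return (bigram_count - expected) / math.sqrt(bigram_count)

def text_collgram_profile(text, ref_bigrams, ref_unigrams, ref_size):
    # Mean t-score over a text's adjacent token pairs, looked up in
    # reference-corpus counts; higher values indicate more conventional
    # (strongly associated) phraseology.
    tokens = text.lower().split()
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    scores = [bigram_t_score(ref_bigrams.get(p, 0),
                             ref_unigrams.get(p[0], 0),
                             ref_unigrams.get(p[1], 0),
                             ref_size) for p in pairs]
    return sum(scores) / len(scores)
```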

Research paper thumbnail of Pictogram-based mobile first medical aid communicator

Procedia Computer Science, 2017

Recent progress in communications technology has been very rapid. High-speed mobile Internet access and mobile devices have enabled the development of robust technologies such as machine translation, automated speech recognition, voice synthesis, and even speech-to-speech translation. Communication applications that support sign language recognition are also being introduced and upgraded. Nonetheless, people with speech, hearing, or mental impairment still require special communication assistance, especially for medical purposes; this makes their health and lives dependent on other people. Automatic solutions for speech recognition or voice synthesis from text are poor fits for communication in the medical domain because they depend on error-prone statistical models. Additionally, in emergency cases, rapid information exchange is essential, and systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition depend on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, Internet-dependent solutions cannot be used in most countries requiring humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based communication application that provides easy means of communication for individuals who are speech- or hearing-impaired, have mental health issues impairing communication, or are non-natives who do not speak the local language. It provides support and clarification in communication with such people using intuitive icons and interactive symbols that are easy to find on a mobile device.
Such pictogram-based communication can be quite effective and, ultimately, make some people’s lives happier, easier, and safer. We have developed a conceptual prototype of a patient-physician communicator on a smartwatch that can be used for local as well as remote communication.

Research paper thumbnail of Machine enhanced translation of the Human Phenotype Ontology project

Procedia Computer Science, 2017

Research paper thumbnail of Semantic approach for building generated virtual-parallel corpora from monolingual texts

Poznan Studies in Contemporary Linguistics, 2019

Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and produces rather noisy output, which later needs further processing and in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English, Polish-English) from monolingual resource...

Research paper thumbnail of Early and remote detection of possible heartbeat problems with convolutional neural networks and multipart interactive training

Research paper thumbnail of Implementing Statistical Machine Translation into Mobile Augmented Reality Systems

Advances in Intelligent Systems and Computing, 2016

A statistical machine translation (SMT) capability would be very useful in augmented reality (AR) systems. For example, translating and displaying text in a smartphone camera image would help a traveler read signs and restaurant menus, or medical documents when a health problem arises while visiting a foreign country. Such a system would also be useful for foreign students translating lectures in real time on their mobile devices. However, SMT quality has been neglected in AR systems research, which has focused on other aspects, such as image processing, optical character recognition (OCR), distributed architectures, and user interaction. In addition, general-purpose translation services, such as Google Translate, used in some AR systems are not well tuned to produce high-quality translations in specific domains and depend on an Internet connection. This research devised SMT methods and evaluated their performance for potential use in AR systems. We give particular attention to domain-adapted SMT systems, in which an SMT capability is tuned to a particular domain of text to increase translation quality. We focus on translation between the Polish and English languages, which presents a number of challenges due to fundamental linguistic differences. However, the SMT systems used are readily extensible to other language pairs. SMT techniques are applied to two domains in translation experiments: European Medicines Agency (EMEA) medical leaflets and the Technology, Entertainment, Design (TED) lectures. In addition, field experiments are conducted on random samples of Polish text found in city signs, posters, restaurant menus, lectures on biology and computer science, and medical leaflets.
Texts from these domains are translated by a number of SMT system variants, and the systems’ performance is evaluated by standard translation performance metrics and compared. The results appear very promising and encourage future applications of SMT to AR systems.
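Standard translation performance metrics of the kind used to compare SMT variants include BLEU; a simplified sentence-level version (uniform n-gram weights plus a brevity penalty, with a small floor for zero n-gram overlaps) looks like this. The abstract does not name its exact metrics, so treat this as a generic sketch rather than the study's scoring code.

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    # Simplified sentence-level BLEU: geometric mean of modified n-gram
    # precisions (n = 1..max_n) times a brevity penalty. Assumes a
    # non-empty candidate; zero overlaps get a tiny floor to keep the
    # geometric mean defined.
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Production evaluations would instead use a tested implementation such as `sacrebleu` or `nltk.translate.bleu_score`, which handle smoothing and corpus-level aggregation.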

Research paper thumbnail of A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms

Computational and mathematical methods in medicine, 2017

People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they depend on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition depend on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing im...

Research paper thumbnail of Automatic Parallel Data Mining After Bilingual Document Alignment

Advances in Intelligent Systems and Computing, 2017

Research paper thumbnail of Augmenting SMT with Generated Pseudo-parallel Corpora from Monolingual News Resources

Advances in Intelligent Systems and Computing, 2017

Research paper thumbnail of Exploration for Polish-* bi-lingual translation equivalents from comparable and quasi-comparable corpora

Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, 2016

Research paper thumbnail of Analysis of Complexity Between Spoken and Written Language for Statistical Machine Translation in West-Slavic Group

Advances in Intelligent Systems and Computing, 2016

Research paper thumbnail of Enhancements in Statistical Spoken Language Translation by De-normalization of ASR Results

Journal of Computers, 2016
