Cornelius Glackin - Academia.edu
Papers by Cornelius Glackin
2022 International Joint Conference on Neural Networks (IJCNN)
2020 International Joint Conference on Neural Networks (IJCNN)
Deep learning has recently made a breakthrough in the speech enhancement process. Some architectures are based on a time domain representation, while others operate in the frequency domain; however, a study and comparison of different networks working in time and frequency has not been reported in the literature. In this paper, this comparison between time and frequency domain learning for five Deep Neural Network (DNN) based speech enhancement architectures is presented. The comparison covers the evaluation of the output speech using four objective evaluation metrics: PESQ, STOI, LSD, and SSNR increase. Furthermore, the complexity of the five networks is investigated by comparing the number of parameters and the processing time for each architecture. Finally, some of the factors that affect learning in time and frequency are discussed. The primary results of this paper show that fully connected architectures generate speech with low overall perceptual quality when learning in the time domain. On the other hand, convolutional designs give acceptable performance in both the frequency and time domains. However, time domain implementations show inferior generalization ability. Frequency domain learning proved to be better than time domain learning when the complex spectrogram is used in the training process. Additionally, feature extraction also proved to be very effective in DNN-based supervised speech enhancement, whether it is performed at the beginning or implicitly through bottleneck-layer features. Finally, it is concluded that the choice of working domain is mainly restricted by the type and design of the architecture used.
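As a minimal illustration of the two input domains the abstract compares, the sketch below prepares the same waveform once as overlapping time-domain frames and once as a complex STFT spectrogram. Frame length, hop size, and the stacked real/imaginary layout are assumptions for illustration, not settings taken from the paper.

```python
import numpy as np
from scipy.signal import stft

def time_domain_frames(x, frame_len=512, hop=256):
    """Slice a waveform into overlapping frames for a time-domain DNN."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def complex_spectrogram(x, fs=16000, frame_len=512, hop=256):
    """STFT features for a frequency-domain DNN; real/imag stacked as channels."""
    _, _, Z = stft(x, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    return np.stack([Z.real, Z.imag], axis=0)   # shape: (2, freq_bins, frames)

noisy = np.random.randn(16000).astype(np.float32)   # 1 s of placeholder audio
print(time_domain_frames(noisy).shape, complex_spectrogram(noisy).shape)
```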
2020 International Joint Conference on Neural Networks (IJCNN), 2020
Mapping and masking targets are both widely used in recent Deep Neural Network (DNN) based supervised speech enhancement. Masking targets have been shown to have a positive impact on the intelligibility of the output speech, while mapping targets have been found, in other studies, to generate speech with better quality. However, most of these studies compare the two approaches using the Multilayer Perceptron (MLP) architecture only. With the emergence of new architectures that outperform the MLP, a more general comparison is needed between mapping and masking approaches. In this paper, a complete comparison is conducted between mapping and masking targets using four different DNN-based speech enhancement architectures, to determine how the performance of the networks changes with the chosen training target. The results show that there is no perfect training target with respect to all the different speech quality evaluation metrics, and that there is a trade-off between the denoising process and the intelligibility of the output speech. Furthermore, the generalization ability of the networks is evaluated, and it is concluded that the design of the architecture restricts the choice of training target, because masking targets result in significant performance degradation for the deep convolutional autoencoder architecture.
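A short sketch of the distinction between the two target types discussed above: a mapping target regresses directly onto the clean magnitude, while a masking target learns a per-bin gain such as the ideal ratio mask (IRM). STFT settings, the placeholder signals, and the particular IRM definition are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import stft

def mag(x, fs=16000, nperseg=512):
    return np.abs(stft(x, fs=fs, nperseg=nperseg)[2])

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)            # placeholder clean speech
noise = 0.5 * rng.standard_normal(16000)
S, N, Y = mag(clean), mag(noise), mag(clean + noise)

mapping_target = S                                    # network regresses noisy -> clean magnitude
irm_target = np.sqrt(S**2 / (S**2 + N**2 + 1e-12))    # one common IRM definition, bounded in [0, 1]
enhanced = irm_target * Y                             # masking: gain applied to the noisy magnitude
print(mapping_target.shape, irm_target.min(), irm_target.max())
```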
Meetings occupy 40% of the average working day. According to the Wall Street Journal, CEOs spend 18 hours, civil servants spend 22 hours, and the average office worker spends 16 hours per week in meetings. Meetings are where information is shared, discussions take place and the most important decisions are made. The outcome of meetings should be clearly understood actions, but this is rarely the case, as comprehensive meeting minutes and action points are not often captured. Meetings become ineffective, time is wasted, and travel becomes the biggest obstacle and cost (both monetary and environmental). Video conferencing technology has been developed to provide a low-cost alternative to expensive, time-consuming meetings. However, the video conferencing user experience lacks naturalness, and this inhibits effective communication between the participants. The Augmented Reality (AR) shared experience application proposed in this work will be the next form of video conferencing.
ArXiv, 2021
This paper outlines the EMPATHIC Research & Innovation project, which aims to research, innovate, explore and validate new interaction paradigms and platforms for future generations of Personalized Virtual Coaches to assist elderly people living independently at and around their home. Innovative multimodal face analytics, adaptive spoken dialogue systems, and natural language interfaces are part of what the project investigates and innovates, aiming to help dependent aging persons and their carers. It will use remote, non-intrusive technologies to extract physiological markers of emotional states and adapt the coach's responses accordingly. In doing so, it aims to develop causal models for emotionally believable coach-user interactions, which shall engage elders and thus keep off loneliness, sustain health, enhance quality of life, and simplify access to future telecare services. Through measurable end-user validations performed in Spain, Norway and France (and complementary user evaluati...
The EMPATHIC Research & Innovation project will research, innovate, explore and validate new paradigms and platforms, laying the foundation for future generations of Personalised Virtual Coaches to assist elderly people living independently at and around their home. Innovative multimodal face analytics, adaptive spoken dialogue systems and natural language interfaces are part of what the project will research and innovate, in order to help dependent aging persons and their carers. The project will use remote non-intrusive technologies to extract physiological markers of emotional states in real time for online adaptive responses of the coach, and advance holistic modelling of behavioural, computational, physical and social aspects of a personalised expressive virtual coach. It will develop causal models of coach-user interactional exchanges that engage elders in emotionally believable interactions keeping off loneliness, sustaining health status, enhancing quality of life and simpli...
This paper presents an overview of a strategy for enabling speech recognition to be performed in the cloud whilst preserving the privacy of users. The strategy advocates a demarcation of responsibilities between the client- and server-side components performing the speech recognition task. On the client side resides the acoustic model, which symbolically encodes the audio and encrypts the data before uploading it to the server. The server side then employs searchable encryption-based language modelling to perform the speech recognition task. The paper details the proposed client-side acoustic model components and the proposed server-side searchable encryption which will be the basis of the language modelling. Some preliminary results are presented, and potential problems and their solutions regarding the encrypted communication between client and server are discussed. Preliminary benchmarking results with acceleration of the client and server operations with GPGPU computing are als...
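The toy sketch below only illustrates the client/server split described above, not the project's actual protocol: the client converts audio to symbols and uploads keyed tokens, and the server matches tokens against a token-indexed language model without ever seeing the symbols. The token scheme (HMAC), the symbol set, and the index contents are all assumptions for illustration.

```python
import hmac, hashlib

KEY = b"client-held secret key"          # never leaves the client

def token(symbol: str) -> str:
    """Client side: deterministic keyed token in place of the plaintext symbol."""
    return hmac.new(KEY, symbol.encode(), hashlib.sha256).hexdigest()

# Client: acoustic model output (placeholder phone symbols) -> tokens for upload.
client_symbols = ["hh", "ax", "l", "ow"]
upload = [token(s) for s in client_symbols]

# Server: language model indexed by tokens the client generated in advance,
# so lookups work without revealing the underlying symbols to the server.
server_index = {token(s): prob for s, prob in [("hh", 0.1), ("ax", 0.3), ("l", 0.2), ("ow", 0.4)]}
print([server_index.get(t, 0.0) for t in upload])
```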
ArXiv, 2019
Detecting the elements of deception in a conversation is one of the most challenging problems for the AI community. It becomes even more difficult to design a transparent system which is fully explainable and satisfies the requirements of financial and legal services for deployment. This paper presents an approach for fraud detection in transcribed telephone conversations using linguistic features. The proposed approach exploits the syntactic and semantic information of the transcription to extract both the linguistic markers and the sentiment of the customer's response. We demonstrate the results on real-world financial services data using simple, robust and explainable classifiers such as Naive Bayes, Decision Tree, Nearest Neighbours, and Support Vector Machines.
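A minimal sketch of the classifier comparison named above, using the same four transparent classifier families on simple text features. The transcripts, labels, and TF-IDF features are placeholders and do not reflect the paper's dataset or linguistic-marker extraction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

transcripts = ["I never authorised that payment", "yes I made that transfer myself",
               "someone called me asking for my PIN", "I moved the money to savings"]
labels = [1, 0, 1, 0]                      # 1 = suspected fraud (illustrative only)

for clf in (MultinomialNB(), DecisionTreeClassifier(),
            KNeighborsClassifier(n_neighbors=1), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(transcripts, labels)
    print(type(clf).__name__, model.predict(["they asked for my PIN over the phone"]))
```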
This document describes the Intelligent Voice (IV) speaker diarization system for the first DIHARD challenge. The aim of this challenge is to provide an evaluation protocol to assess speaker diarization on more challenging domains, with speech recorded across a wide array of challenging acoustic and environmental conditions. We developed a new frame-level speaker diarization system built on the success of deep neural network based speaker embeddings, known as d-vectors, in speaker verification systems. In contrast to acoustic features such as MFCCs, frame-level speaker embeddings are much better at discerning speaker identities. We perform spectral clustering on our proposed LSTM-based speaker embeddings to generate speaker log-likelihoods for each frame. An HMM is then used to refine the speaker posterior probabilities by limiting the probability of switching between speakers across frames.
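A minimal sketch of the clustering-plus-smoothing pipeline described above: random vectors stand in for frame-level d-vectors, spectral clustering assigns per-frame speaker labels, and a simple switch-penalty pass stands in for the HMM refinement. The embedding dimensions, cluster count, and smoothing rule are all assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0, 1, (50, 16)),       # placeholder embeddings,
                    rng.normal(3, 1, (50, 16))])      # two synthetic "speakers"

labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(frames)

def smooth(labels, penalty=3):
    """Keep the current speaker unless enough consecutive frames disagree."""
    out, current, run = [], labels[0], 0
    for l in labels:
        run = run + 1 if l != current else 0
        if run >= penalty:
            current, run = l, 0
        out.append(current)
    return np.array(out)

print(smooth(labels)[:10])
```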
A novel application of convolutional neural networks to phone recognition is presented in this paper. Both the TIMIT and NTIMIT speech corpora have been employed. The phonetic transcriptions of these corpora have been used to label spectrogram segments for training the convolutional neural network. A sliding window extracted fixed-size images from the spectrograms produced for the TIMIT and NTIMIT utterances. These images were assigned to the appropriate phone class by parsing the TIMIT and NTIMIT phone transcriptions. The GoogLeNet convolutional neural network was implemented and trained using stochastic gradient descent with mini-batches. Post-training, phonetic rescoring was performed to map each phone set to the smaller standard set, i.e. the 61-phone set was mapped to the 39-phone set. Benchmark results for both datasets are presented for comparison to other state-of-the-art approaches. It will be shown that this convolutional neural network approach is particularly well suited...
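A small sketch of the sliding-window extraction and phone-set folding steps mentioned above. The window width, hop, spectrogram shape, and the handful of 61-to-39 mapping entries shown are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sliding_windows(spectrogram, width=32, hop=8):
    """Yield fixed-size spectrogram patches for the CNN, with their start frames."""
    _, n_frames = spectrogram.shape
    for start in range(0, n_frames - width + 1, hop):
        yield start, spectrogram[:, start:start + width]

fold_61_to_39 = {"ax": "ah", "ix": "ih", "el": "l", "zh": "sh"}   # a few illustrative entries

spec = np.random.rand(128, 300)                 # placeholder log-mel spectrogram
patches = [(start, patch.shape) for start, patch in sliding_windows(spec)]
print(len(patches), patches[0], fold_61_to_39.get("ax"))
```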
2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2017
The objective of this paper is to outline the design specification, implementation and evaluation of a proposed accelerated encryption framework which deploys both homomorphic and symmetric-key encryption to serve privacy-preserving processing; in particular, as a sub-system within the Privacy Preserving Speech Processing framework architecture that forms part of the PPSP-in-Cloud Platform. Following a preliminary study of GPU efficiency gains benchmarked for an AES implementation, we have addressed and resolved the big-integer processing challenges in a parallel implementation of bilinear pairing, thus enabling the creation of partially homomorphic encryption schemes which facilitate applications such as speech processing in the encrypted domain on the cloud. This novel implementation has been validated in laboratory tests using a standard speech corpus and can be used in other application domains to support secure computation and privacy preserving big data storage/processin...
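The framework above builds its schemes from bilinear pairings with GPU-accelerated big-integer arithmetic; as a much simpler stand-in, the toy sketch below shows the property partially homomorphic schemes rely on, namely that combining ciphertexts combines plaintexts, using textbook Paillier with tiny primes. This is an illustrative substitute, not the paper's scheme.

```python
import random
from math import gcd

p, q = 293, 433                               # toy primes; real keys are thousands of bits
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)                          # modular inverse used in decryption

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = 17, 25
print(decrypt((encrypt(a) * encrypt(b)) % n2))   # prints 42: addition performed on ciphertexts
```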
ArXiv, 2016
This paper presents the Intelligent Voice (IV) system submitted to the NIST 2016 Speaker Recognition Evaluation (SRE). The primary emphasis of SRE this year was on developing speaker recognition technology which is robust for novel languages that are much more heterogeneous than those used in the current state of the art, using significantly less training data that does not contain meta-data from those languages. The system is based on the state-of-the-art i-vector/PLDA approach, developed under the fixed training condition, and results are reported on the protocol defined on the development set of the challenge.
Lecture Notes in Computer Science
Implementing Fuzzy Reasoning on a Spiking Neural Network, by Cornelius Glackin, Liam McDaid, Liam Maguire, and Heather Sayers. (Excerpt: Matlab's FCM algorithm was used to perform the clustering.)
Journal of Intelligent Systems, 2008
More-accurate control, resulting in increased yield, higher quality, and minimized costs to the grower, remains the main driving force of greenhouse climate research. Studies using techniques such as computational fluid dynamics, plant sensor measurements, and tracer gas analysis of greenhouses are now widespread. Feedback processes such as mass and energy transfers between the crop and its environment are the focus of this paper. Of the many possible strategies for modeling greenhouse climate, this study investigates greenhouse climate modeling in terms of biological cybernetics. The approach here is simply to represent the most important components of the greenhouse climate with a view to developing an accurate neural controller, the production of which involves system identification to first produce a neural reference model. For the neural model to have the flexibility to predict the dynamics of abrupt weather changes and the complex feedback processes between the crop and its environment, good training data are required. In this paper, different greenhouse control strategies are reviewed, and a biological cybernetic model is constructed. The model is then used to train a neural network, and the training results are presented.
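A minimal sketch of the system-identification step described above: a small neural network is fitted to predict the next-step internal temperature from the current climate state and external driving variables. The toy dynamics, variable names, and network size are assumptions standing in for the paper's biological cybernetic model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T_out = 10 + 5 * np.sin(np.linspace(0, 20, 2000))      # outside temperature (synthetic)
vent = rng.uniform(0, 1, 2000)                          # ventilation opening (synthetic)
T_in = np.zeros(2000)
for t in range(1, 2000):                                # toy greenhouse dynamics
    T_in[t] = T_in[t-1] + 0.05 * (T_out[t] - T_in[t-1]) - 0.5 * vent[t] + 0.3

X = np.column_stack([T_in[:-1], T_out[:-1], vent[:-1]])  # current state + inputs
y = T_in[1:]                                             # next-step internal temperature
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print("one-step-ahead R^2:", round(model.score(X, y), 3))
```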
International Journal of Production Research, 2006
Fuzzy Sets and Systems, 2007
Analysing all prospective companies for acquisition in large market sectors is an onerous task. A strategy that results in a shortlist of companies that meet certain basic criteria is required. The short-listed companies can then be investigated in more detail later if desired. Fuzzy logic systems (FLSs) imbued with the expertise of a focal organisation's financial experts can be of great assistance in this process. In this paper an investigation into the suitability of FLSs for acquisition analysis is presented. The nuances of training and tuning are discussed. In particular, the difficulty of obtaining suitable amounts of expert data is a recurring theme throughout the paper. A strategy for circumventing this issue is presented that relies on the design of a conventional fuzzy logic rule base with the assistance of a financial expert. With the rule base created, various scenarios such as the simulation of multiple experts and the creation of expert training data are investigated. In particular, two scenarios for the creation of simulated expert data are presented: in the first, the responses from the different experts are averaged, and in the second, the responses from all the different experts are preserved in the training data. This paper builds on previous work with scalable membership functions; however, the use of fuzzy C-means clustering and backpropagation training are new developments. Additionally, a type-2 FLS is developed and its potential advantages for this application are discussed. The type-2 system facilitates the inclusion of the opinions of multiple experts. Both the type-1 and type-2 FLSs were trained using the backpropagation algorithm with early stopping and verified with five-fold cross-validation. Multiple runs of the five-fold method were conducted with different random orderings of the data. For this particular application, the type-1 system performed comparably with the type-2 system despite the considerable amount of variation in the expert training data. The training results have proven the methods to be capable of efficient tuning of parameters and of reliable ranking of prospective companies.
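A minimal sketch of the fuzzy C-means step mentioned above, of the kind used to place membership functions from expert data; the data, cluster count, and fuzzifier value are assumptions, and the subsequent backpropagation tuning of the FLS is not shown.

```python
import numpy as np

def fuzzy_cmeans(X, c=3, m=2.0, iters=100, seed=0):
    """Plain-NumPy fuzzy C-means: returns cluster centres and the membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))            # membership matrix (N x c)
    for _ in range(iters):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]    # weighted cluster centres
        d = np.linalg.norm(X[:, None, :] - centres[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))                    # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centres, U

X = np.random.default_rng(1).normal(size=(200, 2))        # placeholder financial ratios
centres, U = fuzzy_cmeans(X)
print(centres)
```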
Frontiers in Computational Neuroscience, 2013