Georgios Rizos | Imperial College London
Papers by Georgios Rizos
arXiv (Cornell University), Jul 18, 2021
Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information. Previous emotional conversion studies do not disentangle emotional information from emotion-independent information that should be preserved, thus transforming it all in a monolithic manner and generating audio of low quality, with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion, which reveals that the proposed model can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which indicates that the proposed model is more valuable for data augmentation.
arXiv (Cornell University), Oct 18, 2019
The Poisson equation is commonly encountered in engineering, for instance in computational fluid dynamics (CFD) where it is needed to compute corrections to the pressure field to ensure the incompressibility of the velocity field. In the present work, we propose a novel fully convolutional neural network (CNN) architecture to infer the solution of the Poisson equation on a 2D Cartesian grid with different resolutions given the right-hand side term, arbitrary boundary conditions and grid parameters. It provides unprecedented versatility for a CNN approach dealing with partial differential equations. The boundary conditions are handled using a novel approach by decomposing the original Poisson problem into a homogeneous Poisson problem plus four inhomogeneous Laplace sub-problems. The model is trained using a novel loss function approximating the continuous $L^p$ norm between the prediction and the target. Even when predicting on grids denser than previously encountered, our model demonstrates encouraging capacity to reproduce the correct solution profile. The proposed model, which outperforms well-known neural network models, can be included in a CFD solver to help with solving the Poisson equation. Analytical test cases indicate that our CNN architecture is capable of predicting the correct solution of a Poisson problem with mean percentage errors below 10%, an improvement by comparison to the first step of conventional iterative methods. Predictions from our model, used as the initial guess to iterative algorithms like Multigrid, can reduce the RMS error after a single iteration by more than 90% compared to a zero initial guess.
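The boundary-condition decomposition described in this abstract relies on the linearity of the Poisson problem. The sketch below only illustrates that superposition principle with a plain Gauss-Seidel finite-difference solver; it is not the authors' CNN. The grid size, right-hand side, boundary data and the function names `gauss_seidel` and `superposition_gap` are all assumptions made for illustration.

```python
def gauss_seidel(f, boundary, h, sweeps=400):
    """Solve the 5-point discretisation of laplace(u) = f on a square grid.

    `boundary` is an n x n grid; only its edge entries are used, as
    Dirichlet values. The interior starts from a zero initial guess.
    """
    n = len(boundary)
    u = [row[:] for row in boundary]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] = 0.0
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  - h * h * f[i][j])
    return u


def superposition_gap(n=9):
    """Largest difference between the full solve and the sum of five sub-solves."""
    h = 1.0 / (n - 1)
    xs = [k * h for k in range(n)]
    # Arbitrary (hypothetical) right-hand side and boundary data.
    f = [[2.0 * xs[j] + xs[i] for j in range(n)] for i in range(n)]
    g = [[xs[j] ** 2 - xs[i] + 1.0 for j in range(n)] for i in range(n)]
    zeros = [[0.0] * n for _ in range(n)]

    def edge(keep):
        # Keep boundary data on one edge only, zero everywhere else.
        return [[g[i][j] if keep(i, j) else 0.0 for j in range(n)]
                for i in range(n)]

    edges = [
        edge(lambda i, j: i == 0),                        # first row, with corners
        edge(lambda i, j: i == n - 1),                    # last row, with corners
        edge(lambda i, j: j == 0 and 0 < i < n - 1),      # first column
        edge(lambda i, j: j == n - 1 and 0 < i < n - 1),  # last column
    ]

    full = gauss_seidel(f, g, h)                # original problem
    parts = [gauss_seidel(f, zeros, h)]         # homogeneous Poisson problem
    parts += [gauss_seidel(zeros, e, h) for e in edges]  # four Laplace problems
    return max(abs(full[i][j] - sum(p[i][j] for p in parts))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    print(superposition_gap())
```

Because every solve is linear in its right-hand side and boundary data, the five partial solutions should add up to the full solution up to floating-point rounding.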
JMIR Research Protocols
Background: Despite efforts, the UK death rate from asthma is the highest in Europe, and 65% of people with asthma in the United Kingdom do not receive the professional care they are entitled to. Experts have recommended the use of digital innovations to help address the issues of poor outcomes and lack of care access. An automated SMS text messaging–based conversational agent (ie, chatbot) created to provide access to asthma support in a familiar format via a mobile phone has the potential to help people with asthma across demographics and at scale. Such a chatbot could help improve the accuracy of self-assessed risk, improve asthma self-management, increase access to professional care, and ultimately reduce asthma attacks and emergencies. Objective: The aims of this study are to determine the feasibility and usability of a text-based conversational agent that processes a patient’s text responses and short sample voice recordings to calculate an estimate of their risk for an asthma e...
Assessing the design of Conversational Agents for Asthma support: Protocol for an Observational Pilot Study (Preprint)
Proceedings of SPIE, Aug 20, 2009
This paper presents an acceleration method, using both algorithmic and architectural means, for fast calculation of local correlation coefficients, which is a basic image-based information processing step for template or pattern matching, image registration, motion or change detection and estimation, compensation of changes, or compression of representations, among other information processing objectives. For real-time applications, the complexity in arithmetic operations as well as in programming and memory access latency had been a divisive issue between the so-called correction-based methods and the Fourier domain methods. In the presented method, the complexity in calculating local correlation coefficients is reduced via equivalent reformulation that leads to efficient array operations or enables the use of multi-dimensional fast Fourier transforms, without losing or sacrificing local and non-linear changes or characteristics.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Asthma affects an estimated 334 million people worldwide, causing over 461,000 deaths. Exacerbations or asthma attacks can be predicted with new sensor technologies. We explore how recordings of the human voice and machine learning can provide better diagnostics for pulmonary diseases like asthma, as well as tools for helping patients better manage it. Past studies have focused on data collection processes that either mimic traditional auscultation, or make multi-sensor measurements, where the application of specialised recording hardware is required, possibly by expert personnel. This is costly and places limits on the size of the studies (e.g., number of study participants and recording devices). In this paper, we consider another avenue, that of modelling self-recorded voice samples made using regular smartphones, along with self-reported clinical diagnosis annotations, specifically of asthma. We propose the usage of self-supervised learning that aims to reduce within-class representation redundancy among heterogeneous samples as an auxiliary task to promote robust, bias-free learning. The application of our method achieves an absolute increase of 1.80% in area under the Precision-Recall curve, compared to not using it, and a total of 3.54% compared to our baseline.
Cornell University - arXiv, Oct 19, 2022
• Sample-free, Bayesian attentive ResNet with squeeze-and-excitation
• Uncertainty-based, data-specific label smoothing
• Bioacoustic call detection on two datasets, one of which is introduced here
• Propagated uncertainty should be used to weigh label smoothing
The Bigger Picture
Neural networks that accompany their predictions with uncertainty values can foster trust in artificial intelligence based decision-making. We leverage the potential of Bayesian deep learning by using its efficiently propagated ...
We study the problem of user multi-label classification in settings where two types of information are available: a) a set of seed users with known labels, b) the online interactions among the users of interest. User labels may refer to topics of different granularities (e.g. broad themes, news stories, etc.), user types (e.g. person, news agency, etc.), political stance (e.g. liberal, conservative) and others. To tackle the problem, we propose a semi-supervised learning framework that represents users by means of network-based features. We propose the use of Absorbing Regularized Commute-Time Embedding (ARCTE) as a means to extract local graph features and devise a computationally more efficient scheme (compared to existing ones) for their extraction. We then compare the results of this representation with a number of previously proposed alternatives on a Twitter dataset of 534K users. We also discuss a few key practical issues as well as the repercussions of the proposed approach ...
2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP)
Despite advances in deep algorithmic music generation, evaluation of generated samples often relies on human evaluation, which is subjective and costly. We focus on designing a homogeneous, objective framework for evaluating samples of algorithmically generated music. Any engineered measures to evaluate generated music typically attempt to define the samples' musicality, but do not capture qualities of music such as theme or mood. We do not seek to assess the musical merit of generated music, but instead explore whether generated samples contain meaningful information pertaining to emotion or mood/theme. We achieve this by measuring the change in predictive performance of a music mood/theme classifier after augmenting its training data with generated samples. We analyse music samples generated by three models (SampleRNN, Jukebox, and DDSP) and employ a homogeneous framework across all methods to allow for objective comparison. This is the first attempt at augmenting a music genre classification dataset with conditionally generated music. We investigate the classification performance improvement using deep music generation and the ability of the generators to make emotional music by using an additional emotion annotation of the dataset. Finally, we use a classifier trained on real data to evaluate the label validity of class-conditionally generated samples.
X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild
Proceedings of the 2020 International Conference on Multimodal Interaction
Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity with respect to the appearance of drivers' faces, both in terms of occlusions (e.g. vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information (for example, the vehicle cabin environment) adds another, less subjective signal towards the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we have developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild (EmotiW) challenge. Our best model outperforms the baseline by an absolute of 15.03% in accuracy on the validation set, and improves the previously best reported result by an absolute of 8.72% on the test set.
Data-Centric Engineering
The Poisson equation is commonly encountered in engineering, for instance, in computational fluid dynamics (CFD) where it is needed to compute corrections to the pressure field to ensure the incompressibility of the velocity field. In the present work, we propose a novel fully convolutional neural network (CNN) architecture to infer the solution of the Poisson equation on a 2D Cartesian grid with different resolutions given the right-hand side term, arbitrary boundary conditions, and grid parameters. It provides unprecedented versatility for a CNN approach dealing with partial differential equations. The boundary conditions are handled using a novel approach by decomposing the original Poisson problem into a homogeneous Poisson problem plus four inhomogeneous Laplace subproblems. The model is trained using a novel loss function approximating the continuous $L^p$ norm between the prediction and the target. Even when predicting on grids denser than previously encountered, our mode...
Proceedings of the First Workshop on Causal Inference and NLP
Despite peer-reviewing being an essential component of academia since the 1600s, it has repeatedly received criticisms for lack of transparency and consistency. We posit that recent work in machine learning and explainable AI provides tools that enable insights into the decisions from a given peer review process. We start by extracting global explanations in the form of linguistic features that affect the acceptance of a scientific paper for publication on an open peer-review dataset. Second, since such global explanations do not justify causal interpretations, we provide a methodology for detecting confounding effects in natural language in order to generate causal explanations, under assumptions, in the form of lexicons. Our proposed linguistic explanation methodology indicates the following on a case dataset of ICLR submissions: a) the organising committee follows, for the most part, the recommendations of reviewers, and b) the paper's main characteristics that led to reviewers recommending acceptance for publication are originality, clarity and substance.
Convolutional Neural Networks for the Solution of the 2D Poisson Equation with Arbitrary Dirichlet Boundary Conditions, Mesh Sizes and Grid Spacings
Bulletin of the American Physical Society, 2019
Modelling Sample Informativeness for Deep Affective Computing
Using data with high quality annotation is crucial in emotion recognition applications, especially because the task is subjective and the raters may exhibit disagreement with respect to each sample. In this paper, we propose a meta-learning methodology that can reason about the training data and detect potentially less informative instances in order to reduce their impact in the training process. The way we inform the meta-learner on the importance of each sample is by utilising recent advances in uncertainty modelling with Bayesian neural networks that can decompose predictive uncertainty into: a) model uncertainty that is due to a lack of observations and b) label uncertainty that is due to inherent randomness in the data labelling, which we adapt for affective computing. Our proposed method for soft data selection exhibits a 6% absolute improvement in Concordance Correlation Coefficient with respect to the baseline in a two-dimensional continuous affect recognition task.
Stargan for Emotional Speech Conversion: Validated by Data Augmentation of End-To-End Emotion Recognition
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we propose an adversarial network implementation for speech emotion conversion as a data augmentation method, validated by a multi-class speech affect recognition task. In our setting, we do not assume the availability of parallel data, and we additionally make it a priority to exploit as much as possible the available training data by adopting a cycle-consistent, class-conditional generative adversarial network with an auxiliary domain classifier. Our generated samples are valuable for data augmentation, achieving a corresponding 2% and 6% absolute increase in Micro- and Macro-F1 compared to the baseline in a 3-class classification paradigm using a deep, end-to-end network. We finally perform a human perception evaluation of the samples, through which we conclude that our samples are indicative of their target emotion, albeit showing a tendency for confusion in cases where the emotional attributes of valence and arousal are inconsistent.
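Several abstracts in this list report gains in Micro-F1 and Macro-F1. As a reminder of how the two averages differ, here is a minimal sketch using the standard definitions; the function name `micro_macro_f1` and the toy labels are illustrative assumptions and are not taken from any of the papers.

```python
def micro_macro_f1(y_true, y_pred, num_classes):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[t] += 1  # missed true class t

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so rare classes weigh equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(num_classes)) / num_classes
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    return micro, macro


if __name__ == "__main__":
    # Toy 3-class example with one frequent and two rare classes.
    print(micro_macro_f1([0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 2, 2], 3))
```

With imbalanced classes the macro average reacts much more strongly to errors on rare classes, which is one reason both figures tend to be reported together.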
MuSe 2020 Challenge and Workshop
Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CAR, the first of its kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
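The two evaluation measures named in this abstract can be sketched as follows. The Concordance Correlation Coefficient (CCC) uses its standard definition with population variances, and the MuSe-Topic score uses the weights quoted above. The function names are hypothetical. Because the abstract does not say how valence and arousal are combined for MuSe-Wild, that step is left out.

```python
def ccc(x, y):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Unlike Pearson correlation, CCC also penalises shifts in mean and scale.
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def muse_topic_score(uar, f1):
    """Weighted score quoted in the abstract: 0.34 * UAR + 0.66 * F1."""
    return 0.34 * uar + 0.66 * f1


if __name__ == "__main__":
    gold = [0.1, 0.4, 0.35, 0.8]
    print(ccc(gold, gold))                      # perfect agreement
    print(ccc(gold, [v + 0.5 for v in gold]))   # same shape, shifted mean
    print(muse_topic_score(0.5, 0.5))
```

A prediction that is perfectly correlated with the gold standard but shifted by a constant scores below 1 under CCC, which is why it is a common choice for continuous affect prediction.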
Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence
Proceedings of the 2021 International Conference on Multimodal Interaction
We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your “why?” with “because you were so Hmmmmm-mmm-mmm”. Today’s Artificial Intelligence (AI), however, is – if at all – largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain; especially when the end user is not an expert. In particular, when the AI’s task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification also has great potential for explainable AI (XAI) in systems that deal with non-audio data – for example, because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model-level, to provide a holistic explanation of the chain of processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible.
Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow’s humane AI’s trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonified XAI and foreseeable risks.
Augment to Prevent
Proceedings of the 28th ACM International Conference on Information and Knowledge Management
In this paper, we address the issue of augmenting text data in supervised Natural Language Processing problems, exemplified by deep online hate speech classification. A great challenge in this domain is that although the presence of hate speech can be deleterious to the quality of service provided by social platforms, it still comprises only a tiny fraction of the content that can be found online, which can lead to performance deterioration due to majority class overfitting. To this end, we perform a thorough study on the application of deep learning to the hate speech detection problem: a) we propose three text-based data augmentation techniques aimed at reducing the degree of class imbalance and maximising the amount of information we can extract from our limited resources and b) we apply them on a selection of top-performing deep architectures and hate speech databases in order to showcase their generalisation properties. The data augmentation techniques are based on a) synonym replacement based on word embedding vector closeness, b) warping of the word tokens along the padded sequence or c) class-conditional, recurrent neural language generation. Our proposed framework yields a significant increase in multi-class hate speech detection, outperforming the baseline in the largest online hate speech database by an absolute 5.7% increase in Macro-F1 score and 30% in hate speech class recall.
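The first augmentation technique listed above, synonym replacement based on word embedding closeness, can be sketched roughly as follows. This is a toy illustration rather than the paper's implementation: the two-dimensional embedding table, the cosine-similarity criterion, the replacement probability `p` and the function names are all assumptions.

```python
import math
import random

# Hypothetical toy embedding table; a real system would load pretrained vectors.
EMBEDDINGS = {
    "bad": [1.0, 0.1],
    "awful": [0.9, 0.2],
    "good": [-1.0, 0.1],
    "great": [-0.9, 0.2],
    "people": [0.0, 1.0],
    "folks": [0.1, 0.95],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_word(word):
    """Closest other vocabulary word by cosine similarity."""
    candidates = (w for w in EMBEDDINGS if w != word)
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))


def augment(tokens, p=0.5, seed=0):
    """Replace each in-vocabulary token by its nearest neighbour with probability p."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in EMBEDDINGS and rng.random() < p:
            out.append(nearest_word(tok))
        else:
            out.append(tok)  # out-of-vocabulary or not selected: keep as is
    return out


if __name__ == "__main__":
    print(augment(["those", "people", "are", "bad"], p=1.0))
```

Generating several such variants of minority-class sentences is one way to reduce class imbalance without collecting new data, though nearest neighbours in embedding space are not always true synonyms.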
Interspeech 2020
The evaluation of scientific submissions through peer review is both the most fundamental component of the publication process and the most frequently criticised and questioned. Academic journals and conferences request reviews from multiple reviewers per submission, which an editor or area chair aggregates into the final acceptance decision. Reviewers are often in disagreement due to varying levels of domain expertise, confidence and motivation, as well as due to the heavy workload and the different interpretations by the reviewers of the score scale. Herein, we explore the possibility of a computational decision support tool for the editor, based on Natural Language Processing, that offers an additional aggregated recommendation. We provide a comparative study of state-of-the-art text modelling methods on the newly crafted, largest review dataset of its kind based on Interspeech 2019, and we are the first to explore uncertainty-aware methods (soft labels, quantile regression) to address the subjectivity inherent in this problem.
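The two uncertainty-aware ideas mentioned at the end of this abstract can be illustrated generically. The sketch below shows the standard pinball (quantile) loss and one simple way to turn several reviewer scores into a soft label. The score scale, the function names and the histogram-based soft label are assumptions for illustration; the abstract does not specify the paper's actual formulation.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


def soft_label(scores, num_classes):
    """Normalised histogram of integer reviewer scores in [0, num_classes)."""
    counts = [0] * num_classes
    for s in scores:
        counts[s] += 1
    return [c / len(scores) for c in counts]


if __name__ == "__main__":
    # Hypothetical submission scored by three reviewers on a 0..3 scale.
    print(soft_label([1, 1, 3], 4))
    print(pinball_loss([2.0], [1.0], q=0.9))  # under-prediction, penalised heavily
    print(pinball_loss([2.0], [3.0], q=0.9))  # over-prediction, penalised lightly
```

Training one regressor per quantile level gives an interval rather than a single predicted score, and the soft label preserves reviewer disagreement instead of collapsing it to a majority vote.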
SPIE Proceedings, 2008
This paper presents an acceleration method, using both algorithmic and architectural means, for fast calculation of local correlation coefficients, which is a basic image-based information processing step for template or pattern matching, image registration, motion or change detection and estimation, compensation of changes, or compression of representations, among other information processing objectives. For real-time applications, the complexity in arithmetic operations as well as in programming and memory access latency had been a divisive issue between the so-called correction-based methods and the Fourier domain methods. In the presented method, the complexity in calculating local correlation coefficients is reduced via equivalent reformulation that leads to efficient array operations or enables the use of multi-dimensional fast Fourier transforms, without losing or sacrificing local and non-linear changes or characteristics. The computation time is further reduced by utilizing modern multi-core architectures, such as the Sony-Toshiba-IBM Cell processor, with high processing speed and low power consumption.
arXiv (Cornell University), Jul 18, 2021
Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to... more Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information. Previous emotional conversion studies do not disentangle emotional information from emotion-independent information that should be preserved, thus transforming it all in a monolithic manner and generating audio of low quality, with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion, which reveals that the proposed model can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2 % in Micro-F1 and 5 % in Macro-F1 compared to the baseline StarGAN model, which indicates that the proposed model is more valuable for data augmentation.
arXiv (Cornell University), Oct 18, 2019
The Poisson equation is commonly encountered in engineering, for instance in computational fluid ... more The Poisson equation is commonly encountered in engineering, for instance in computational fluid dynamics (CFD) where it is needed to compute corrections to the pressure field to ensure the incompressibility of the velocity field. In the present work, we propose a novel fully convolutional neural network (CNN) architecture to infer the solution of the Poisson equation on a 2D Cartesian grid with different resolutions given the right hand side term, arbitrary boundary conditions and grid parameters. It provides unprecedented versatility for a CNN approach dealing with partial differential equations. The boundary conditions are handled using a novel approach by decomposing the original Poisson problem into a homogeneous Poisson problem plus four inhomogeneous Laplace sub-problems. The model is trained using a novel loss function approximating the continuous L p norm between the prediction and the target. Even when predicting on grids denser than previously encountered, our model demonstrates encouraging capacity to reproduce the correct solution profile. The proposed model, which outperforms well-known neural network models, can be included in a CFD solver to help with solving the Poisson equation. Analytical test cases indicate that our CNN architecture is capable of predicting the correct solution of a Poisson problem with mean percentage errors below 10%, an improvement by comparison to the first step of conventional iterative methods. Predictions from our model, used as the initial guess to iterative algorithms like Multigrid, can reduce the RMS error after a single iteration by more than 90% compared to a zero initial guess.
JMIR Research Protocols
Background Despite efforts, the UK death rate from asthma is the highest in Europe, and 65% of pe... more Background Despite efforts, the UK death rate from asthma is the highest in Europe, and 65% of people with asthma in the United Kingdom do not receive the professional care they are entitled to. Experts have recommended the use of digital innovations to help address the issues of poor outcomes and lack of care access. An automated SMS text messaging–based conversational agent (ie, chatbot) created to provide access to asthma support in a familiar format via a mobile phone has the potential to help people with asthma across demographics and at scale. Such a chatbot could help improve the accuracy of self-assessed risk, improve asthma self-management, increase access to professional care, and ultimately reduce asthma attacks and emergencies. Objective The aims of this study are to determine the feasibility and usability of a text-based conversational agent that processes a patient’s text responses and short sample voice recordings to calculate an estimate of their risk for an asthma e...
Assessing the design of Conversational Agents for Asthma support: Protocol for an Observational Pilot Study (Preprint)
Proceedings of SPIE, Aug 20, 2009
This paper presents an acceleration method, using both algorithmic and architectural means, for f... more This paper presents an acceleration method, using both algorithmic and architectural means, for fast calculation of local correlation coefficients, which is a basic image-based information processing step for template or pattern matching, image registration, motion or change detection and estimation, compensation of changes, or compression of representations, among other information processing objectives. For real-time applications, the complexity in arithmetic operations as well as in programming and memory access latency had been a divisive issue between the so-called correction-based methods and the Fourier domain methods. In the presented method, the complexity in calculating local correlation coefficients is reduced via equivalent reformulation that leads to efficient array operations or enables the use of multi-dimensional fast Fourier transforms, without losing or sacrificing local and non-linear changes or characteristics.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Asthma affects an estimated 334 million people worldwide, causing over 461 000 deaths. Exacerbati... more Asthma affects an estimated 334 million people worldwide, causing over 461 000 deaths. Exacerbations or asthma attacks can be predicted with new sensor technologies. We explore how recordings of human voice, and machine learning can provide better diagnostics for pulmonary diseases like asthma, as well as tools for helping patients better manage it. Past studies have focused on data collection processes that either mimic traditional auscultation, or make multi-sensor measurements, where the application of specialised recording hardware is required, possibly by expert personnel. This is costly and places limits on the size of the studies (e. g., number of study participants, and recording devices). In this paper, we consider another avenue, that of modelling self-recorded voice samples made using regular smartphones, along with self-reported clinical diagnosis annotations; specifically of asthma. We propose the usage of self-supervised learning that aims to reduce within-class representation redundancy among heterogeneous samples as an auxiliary task to promote robust, bias-free learning. The application of our method achieves an absolute increase of 1.80% in area under the Precision-Recall curve, compared to not using it, and a total of 3.54% compared to our baseline.
Cornell University - arXiv, Oct 19, 2022
• Sample-free, Bayesian attentive ResNet with squeeze-and-excitation
• Uncertainty-based, data-specific label smoothing
• Bioacoustic call detection on two datasets, one of which is introduced here
• Propagated uncertainty should be used to weigh label smoothing
The Bigger Picture: Neural networks that accompany their predictions with uncertainty values can foster trust in artificial intelligence based decision-making. We leverage the potential of Bayesian deep learning by using its efficiently propagated ...
We study the problem of user multi-label classification in settings where two types of information are available: a) a set of seed users with known labels, b) the online interactions among the users of interest. User labels may refer to topics of different granularities (e.g. broad themes, news stories, etc.), user types (e.g. person, news agency, etc.), political stance (e.g. liberal, conservative) and others. To tackle the problem, we propose a semi-supervised learning framework that represents users by means of network-based features. We propose the use of Absorbing Regularized Commute-Time Embedding (ARCTE) as a means to extract local graph features and devise a computationally more efficient scheme (compared to existing ones) for their extraction. We then compare the results of this representation with a number of previously proposed alternatives on a Twitter dataset of 534K users. We also discuss a few key practical issues as well as the repercussions of the proposed approach ...
2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP)
Despite advances in deep algorithmic music generation, evaluation of generated samples often relies on human evaluation, which is subjective and costly. We focus on designing a homogeneous, objective framework for evaluating samples of algorithmically generated music. Any engineered measures to evaluate generated music typically attempt to define the samples' musicality, but do not capture qualities of music such as theme or mood. We do not seek to assess the musical merit of generated music, but instead explore whether generated samples contain meaningful information pertaining to emotion or mood/theme. We achieve this by measuring the change in predictive performance of a music mood/theme classifier after augmenting its training data with generated samples. We analyse music samples generated by three models (SampleRNN, Jukebox, and DDSP) and employ a homogeneous framework across all methods to allow for objective comparison. This is the first attempt at augmenting a music genre classification dataset with conditionally generated music. We investigate the classification performance improvement using deep music generation and the ability of the generators to make emotional music by using an additional emotion annotation of the dataset. Finally, we use a classifier trained on real data to evaluate the label validity of class-conditionally generated samples.
X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild
Proceedings of the 2020 International Conference on Multimodal Interaction
Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity with respect to the appearance of drivers' faces, both in terms of occlusions (e.g. vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information - for example, the vehicle cabin environment - adds another, less subjective signal towards the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we have developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild (EmotiW) Challenge. Our best model outperforms the baseline by an absolute 15.03% in accuracy on the validation set, and improves the previously best reported result by an absolute 8.72% on the test set.
Data-Centric Engineering
The Poisson equation is commonly encountered in engineering, for instance, in computational fluid dynamics (CFD) where it is needed to compute corrections to the pressure field to ensure the incompressibility of the velocity field. In the present work, we propose a novel fully convolutional neural network (CNN) architecture to infer the solution of the Poisson equation on a 2D Cartesian grid with different resolutions given the right-hand side term, arbitrary boundary conditions, and grid parameters. It provides unprecedented versatility for a CNN approach dealing with partial differential equations. The boundary conditions are handled using a novel approach by decomposing the original Poisson problem into a homogeneous Poisson problem plus four inhomogeneous Laplace subproblems. The model is trained using a novel loss function approximating the continuous $L^p$ norm between the prediction and the target. Even when predicting on grids denser than previously encountered, our mode...
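The decomposition mentioned in this abstract can be written out as a short worked equation. This is only a sketch under assumed notation that does not come from the paper: $\Omega$ is the rectangular domain, $\Gamma_1, \dots, \Gamma_4$ are its four sides, $f$ is the right-hand side, $g$ is the Dirichlet boundary data, and the sign convention $\nabla^2 u = f$ is an assumption.

```latex
\begin{aligned}
&\text{Original problem:} && \nabla^2 u = f \ \text{in } \Omega, \quad u = g \ \text{on } \partial\Omega = \Gamma_1 \cup \dots \cup \Gamma_4, \\
&\text{Homogeneous Poisson:} && \nabla^2 u_0 = f \ \text{in } \Omega, \quad u_0 = 0 \ \text{on } \partial\Omega, \\
&\text{Laplace, } k = 1, \dots, 4: && \nabla^2 u_k = 0 \ \text{in } \Omega, \quad u_k = g \ \text{on } \Gamma_k, \quad u_k = 0 \ \text{on } \partial\Omega \setminus \Gamma_k, \\
&\text{Superposition:} && u = u_0 + \sum_{k=1}^{4} u_k .
\end{aligned}
```

Because the Laplacian is linear, the sum satisfies $\nabla^2 u = f$ in $\Omega$ and takes the value $g$ on each side, so solving the five simpler problems recovers the solution of the original one.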
Proceedings of the First Workshop on Causal Inference and NLP
Despite peer-reviewing being an essential component of academia since the 1600s, it has repeatedly received criticisms for lack of transparency and consistency. We posit that recent work in machine learning and explainable AI provides tools that enable insights into the decisions from a given peer review process. We start by extracting global explanations in the form of linguistic features that affect the acceptance of a scientific paper for publication on an open peer-review dataset. Second, since such global explanations do not justify causal interpretations, we provide a methodology for detecting confounding effects in natural language in order to generate causal explanations, under assumptions, in the form of lexicons. Our proposed linguistic explanation methodology indicates the following on a case dataset of ICLR submissions: a) the organising committee follows, for the most part, the recommendations of reviewers, and, b) the paper's main characteristics that led to reviewers recommending acceptance for publication are originality, clarity and substance.
Convolutional Neural Networks for the Solution of the 2D Poisson Equation with Arbitrary Dirichlet Boundary Conditions, Mesh Sizes and Grid Spacings
Bulletin of the American Physical Society, 2019
Modelling Sample Informativeness for Deep Affective Computing
Using data with high-quality annotation is crucial in emotion recognition applications, especially because the task is subjective and the raters may exhibit disagreement with respect to each sample. In this paper, we propose a meta-learning methodology that can reason about the training data and detect potentially less informative instances in order to reduce their impact in the training process. The way we inform the meta-learner on the importance of each sample is by utilising recent advances in uncertainty modelling with Bayesian neural networks that can decompose predictive uncertainty into: a) model uncertainty that is due to a lack of observations and b) label uncertainty that is due to inherent randomness in the data labelling, which we adapt for affective computing. Our proposed method for soft data selection exhibits a 6% absolute improvement in Concordance Correlation Coefficient with respect to the baseline in a two-dimensional continuous affect recognition task.
Stargan for Emotional Speech Conversion: Validated by Data Augmentation of End-To-End Emotion Recognition
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we propose an adversarial network implementation for speech emotion conversion as a data augmentation method, validated by a multi-class speech affect recognition task. In our setting, we do not assume the availability of parallel data, and we additionally make it a priority to exploit as much as possible the available training data by adopting a cycle-consistent, class-conditional generative adversarial network with an auxiliary domain classifier. Our generated samples are valuable for data augmentation, achieving a corresponding 2% and 6% absolute increase in Micro- and Macro-F1 compared to the baseline in a 3-class classification paradigm using a deep, end-to-end network. We finally perform a human perception evaluation of the samples, through which we conclude that our samples are indicative of their target emotion, albeit showing a tendency for confusion in cases where the emotional attributes of valence and arousal are inconsistent.
MuSe 2020 Challenge and Workshop
Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CAR, the first-of-its-kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
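The two evaluation measures named in this abstract can be illustrated with a minimal sketch. It assumes the standard definition of the Concordance Correlation Coefficient (CCC) with population variances. The function names `ccc` and `muse_topic_score` are hypothetical and are not taken from the challenge code; only the weights 0.34 and 0.66 come from the abstract.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (standard definition, population variances)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population covariance between gold standard and prediction.
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    # The squared mean difference penalises a systematic offset between the two.
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

def muse_topic_score(uar, f1):
    """Weighted combination quoted in the abstract: 0.34 * UAR + 0.66 * F1."""
    return 0.34 * uar + 0.66 * f1
```

Unlike Pearson correlation, CCC drops below 1 when predictions are shifted or rescaled relative to the gold standard, which suits continuous arousal and valence traces.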
Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence
Proceedings of the 2021 International Conference on Multimodal Interaction
We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your “why?” with “because you were so Hmmmmm-mmm-mmm”. Today’s Artificial Intelligence (AI), however, is, if at all, largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain; especially when the end user is not an expert. In particular, when the AI’s task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification also has great potential for explainable AI (XAI) in systems that deal with non-audio data, for example because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model-level, to provide a holistic explanation of the chain of processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible.
Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow’s humane AI’s trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonified XAI and foreseeable risks.
Augment to Prevent
Proceedings of the 28th ACM International Conference on Information and Knowledge Management
In this paper, we address the issue of augmenting text data in supervised Natural Language Processing problems, exemplified by deep online hate speech classification. A great challenge in this domain is that although the presence of hate speech can be deleterious to the quality of service provided by social platforms, it still comprises only a tiny fraction of the content that can be found online, which can lead to performance deterioration due to majority class overfitting. To this end, we perform a thorough study on the application of deep learning to the hate speech detection problem: a) we propose three text-based data augmentation techniques aimed at reducing the degree of class imbalance and at maximising the amount of information we can extract from our limited resources and b) we apply them on a selection of top-performing deep architectures and hate speech databases in order to showcase their generalisation properties. The data augmentation techniques are based on a) synonym replacement based on word embedding vector closeness, b) warping of the word tokens along the padded sequence or c) class-conditional, recurrent neural language generation. Our proposed framework yields a significant increase in multi-class hate speech detection, outperforming the baseline in the largest online hate speech database by an absolute 5.7% increase in Macro-F1 score and 30% in hate speech class recall.
Interspeech 2020
The evaluation of scientific submissions through peer review is both the most fundamental component of the publication process and the most frequently criticised and questioned. Academic journals and conferences request reviews from multiple reviewers per submission, which an editor or area chair aggregates into the final acceptance decision. Reviewers are often in disagreement due to varying levels of domain expertise, confidence, levels of motivation, as well as due to the heavy workload and the different interpretations by the reviewers of the score scale. Herein, we explore the possibility of a computational decision support tool for the editor, based on Natural Language Processing, that offers an additional aggregated recommendation. We provide a comparative study of state-of-the-art text modelling methods on the newly crafted, largest review dataset of its kind based on Interspeech 2019, and we are the first to explore uncertainty-aware methods (soft labels, quantile regression) to address the subjectivity inherent in this problem.
SPIE Proceedings, 2008
This paper presents an acceleration method, using both algorithmic and architectural means, for fast calculation of local correlation coefficients, which is a basic image-based information processing step for template or pattern matching, image registration, motion or change detection and estimation, compensation of changes, or compression of representations, among other information processing objectives. For real-time applications, the complexity in arithmetic operations as well as in programming and memory access latency had been a divisive issue between the so-called correction-based methods and the Fourier domain methods. In the presented method, the complexity in calculating local correlation coefficients is reduced via equivalent reformulation that leads to efficient array operations or enables the use of multi-dimensional fast Fourier transforms, without losing or sacrificing local and non-linear changes or characteristics. The computation time is further reduced by utilizing modern multi-core architectures, such as the Sony-Toshiba-IBM Cell processor, with high processing speed and low power consumption.
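The idea of reformulating local correlation coefficients as array operations can be illustrated with a small NumPy sketch. This is not the paper's algorithm: it swaps in a common integral-image (summed-area table) reformulation of windowed Pearson correlation. The function names `box_sum` and `local_corrcoef`, the window parameter `w`, and the `eps` guard are all assumptions made for illustration.

```python
import numpy as np

def box_sum(a, w):
    """Sum of `a` over every w-by-w window (valid positions only), via an integral image."""
    s = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    s[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    # Four-corner difference of the summed-area table gives each window sum.
    return s[w:, w:] - s[:-w, w:] - s[w:, :-w] + s[:-w, :-w]

def local_corrcoef(a, b, w, eps=1e-12):
    """Pearson correlation of `a` and `b` inside each w-by-w window."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = w * w
    sa, sb = box_sum(a, w), box_sum(b, w)
    # Windowed (unnormalised) covariance and variances from first and second moments.
    cov = box_sum(a * b, w) - sa * sb / n
    var_a = box_sum(a * a, w) - sa * sa / n
    var_b = box_sum(b * b, w) - sb * sb / n
    # eps guards against division by zero in flat (constant) windows.
    return cov / np.sqrt(np.maximum(var_a * var_b, eps))
```

For two images of shape `(H, W)`, the result has shape `(H - w + 1, W - w + 1)`, and the cost per output pixel does not depend on the window size, since each window statistic comes from a constant number of array lookups.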