Didan Deng - Academia.edu
Papers by Didan Deng
arXiv (Cornell University), May 9, 2023
Micro-expressions (MEs) are involuntary and subtle facial expressions that are thought to reveal feelings people are trying to hide. ME spotting detects the temporal intervals containing MEs in videos. Detecting such quick and subtle motions from long videos is difficult. Recent works leverage detailed facial motion representations, such as optical flow, and deep learning models, leading to high computational complexity. To reduce computational complexity and achieve real-time operation, we propose RMES, a real-time ME spotting framework. We represent motion using phase computed by the Riesz Pyramid, and feed this motion representation into a three-stream shallow CNN, which predicts the likelihood of each frame belonging to an ME. In comparison to optical flow, phase provides more localized motion estimates, which are essential for ME spotting, resulting in higher performance. Using phase also reduces the required computation of the ME spotting pipeline by 77.8%. Despite its relative simplicity and low computational complexity, our framework achieves state-of-the-art performance on two public datasets: CAS(ME)2 and SAMM Long Videos.
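The phase representation at the core of RMES comes from the Riesz transform, which pairs an image with two quadrature components from which a local phase can be read off. Below is a minimal single-scale sketch using FFT-domain Riesz filters; the paper builds a full Riesz pyramid and feeds the result to a CNN, so this is only an illustration of the phase computation, not the authors' implementation.

```python
import numpy as np

def riesz_local_phase(img):
    """Estimate local phase and amplitude from the monogenic signal
    of a 2-D image via a single-scale Riesz transform.

    Sketch only: in practice the transform is applied to bandpassed
    pyramid levels, not the raw image.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    norm = np.sqrt(fx ** 2 + fy ** 2)
    norm[0, 0] = 1.0                  # avoid division by zero at DC
    F = np.fft.fft2(img)
    # First-order Riesz transform pair, defined in the frequency domain
    r1 = np.real(np.fft.ifft2(F * (-1j * fx / norm)))
    r2 = np.real(np.fft.ifft2(F * (-1j * fy / norm)))
    amplitude = np.sqrt(img ** 2 + r1 ** 2 + r2 ** 2)
    phase = np.arctan2(np.sqrt(r1 ** 2 + r2 ** 2), img)
    return phase, amplitude
```

Because phase shifts track small local displacements directly, frame-to-frame phase differences give the localized motion estimates the abstract contrasts with optical flow.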
Proceedings of the AAAI Conference on Human Computation and Crowdsourcing
Model explanations are generated by XAI (explainable AI) methods to help people understand and interpret machine learning models. To study XAI methods from the human perspective, we propose a human-based benchmark dataset, the human saliency benchmark (HSB), for evaluating saliency-based XAI methods. Unlike existing human saliency annotations, where class-related features are manually and subjectively labeled, this benchmark collects more objective human attention on visual information with a precise eye-tracking device and a novel crowdsourcing experiment. Taking the labor cost of human experiments into consideration, we further explore the potential of using a prediction model trained on HSB to mimic saliency annotation by humans. Hence, a dense prediction problem is formulated, and we propose an encoder-decoder architecture that combines multi-modal and multi-scale features to produce the human saliency maps. Accordingly, a pretraining-finetuning method is designed t...
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
In this paper, we designed an architecture for estimating multiple emotion descriptors for our entry in the ABAW3 Challenge. Based on the theory of Ekman and Friesen (1969), we designed distinct architectures to measure the sign vehicles (i.e., facial action units) and the message (i.e., discrete emotions, valence and arousal), given their different properties. Quantitative experiments on the ABAW3 challenge dataset have shown the superior performance of our approach over two baseline models.
Given a fixed parameter budget, one can build a single large neural network or create a memory-split ensemble: a pool of several smaller networks with the same total parameter count as the single network. A memory-split ensemble can outperform its single-model counterpart (Lobacheva et al., 2020): a phenomenon known as the memory-split advantage (MSA). The reasons for MSA are not yet fully understood. In particular, it is difficult in practice to predict when it will exist. This paper sheds light on the reasons underlying MSA using random feature theory. We study the dependence of the MSA on several factors: the parameter budget, the training set size, the L2 regularization and the Stochastic Gradient Descent (SGD) hyper-parameters. Using the bias-variance decomposition, we show that MSA exists when the reduction in variance due to the ensemble (i.e., ensemble gain) exceeds the increase in squared bias due to the smaller size of the individual networks (i.e., shrinkage cost). T...
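The ensemble-gain versus shrinkage-cost condition can be written out explicitly. The following is a sketch in my own notation (the paper's symbols may differ): starting from the standard bias-variance decomposition of expected squared error for a predictor,

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big[\hat f(x)\big]}_{\text{variance}}
  + \sigma^2 ,
```

and writing $\hat f_K$ for the $K$-member memory-split ensemble and $\hat f_1$ for the single large network of equal total parameter count, the MSA exists when

```latex
\underbrace{\operatorname{Var}[\hat f_1] - \operatorname{Var}[\hat f_K]}_{\text{ensemble gain}}
\;>\;
\underbrace{\operatorname{Bias}^2[\hat f_K] - \operatorname{Bias}^2[\hat f_1]}_{\text{shrinkage cost}} .
```

Averaging $K$ models reduces variance, while each smaller member is less expressive and so more biased; the MSA appears exactly when the first effect dominates the second.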
ArXiv, 2021
In this paper, we present a novel method and system for neuropsychological performance testing that can establish a link between cognition and emotion. It comprises a portable device used to interact with a cloud service, which stores user information under a username and is logged into by the user through the portable device; the user information is captured directly through the device and processed by an artificial neural network; and this tridimensional information comprises user cognitive reactions, user emotional responses and user chronometrics. A multivariate situational risk assessment is used to evaluate the performance of the subject by capturing the three dimensions of each reaction to a series of 30 dichotomous questions describing various situations of daily life and challenging the user's knowledge, values, ethics, and principles. In industrial application, the timing of this assessment will depend on the user's need to obtain a service from a provider such as openin...
2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019
Appearance-based gaze estimation maps RGB images to estimates of gaze directions. One problem in gaze estimation is that there always exist low-quality samples (outliers) in which the eyes are barely visible. These low-quality samples are mainly caused by blinks, occlusions (e.g. by eyeglasses), blur (e.g. due to motion) and failures of the eye landmark detection. Training on these outliers degrades the performance of gaze estimators, since they carry no or limited information about gaze directions. It is also risky to give estimates based on these images in real-world applications, as these estimates may be unreliable. To solve this problem, we propose an algorithm that detects outliers without supervision. Given input images with only gaze labels, the proposed algorithm learns to predict a gaze estimate and an additional confidence score, which alleviates the impact of outliers during learning. We evaluated this algorithm on the MPIIGaze dataset and on an internal dataset....
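One standard way to let a network learn a confidence score without outlier supervision is a heteroscedastic Gaussian negative log-likelihood, where a learned per-sample variance down-weights the squared-error term. The abstract does not state the paper's exact loss, so the sketch below is only one plausible formulation; the names and the squared-error form are my assumptions.

```python
import numpy as np

def confidence_weighted_loss(pred_gaze, log_sigma, true_gaze):
    """Gaussian negative log-likelihood with a learned confidence.

    Hypothetical sketch: the network outputs a gaze estimate and a
    per-sample log standard deviation. Samples the model deems
    unreliable (large sigma) contribute less through the error term
    but pay a penalty through log_sigma, so sigma only grows on
    genuinely hard samples, i.e. outliers. Not the paper's exact loss.
    """
    sq_err = np.sum((pred_gaze - true_gaze) ** 2, axis=-1)
    sigma2 = np.exp(2.0 * log_sigma)
    return sq_err / (2.0 * sigma2) + log_sigma  # per-sample loss
```

With this shape of loss, low confidence (large sigma) is only profitable when the error is large, which is what lets training suppress the influence of blinks and occluded eyes.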
ArXiv, 2021
When recognizing emotions, subtle nuances of emotion displays often cause ambiguity or uncertainty in emotion perception. Unfortunately, this ambiguity or uncertainty cannot be reflected in hard emotion labels. Emotion predictions with uncertainty can be useful for risk control, but they are relatively scarce in current deep models for emotion recognition. To address this issue, we propose to apply a multi-generational self-distillation algorithm to the emotion recognition task to improve uncertainty estimation. First, we use deep ensembles to capture uncertainty, as an approximation to Bayesian methods. Second, the deep ensemble provides soft labels to its student models, and the student models can learn from the uncertainty embedded in those soft labels. Third, we iteratively train deep ensembles to further improve the performance of emotion recognition and uncertainty estimation. In the end, our algorithm results in a single student model that can estimate...
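The core mechanics of the ensemble-to-student step above can be sketched in a few lines: average the members' predicted distributions into soft labels, whose spread encodes the ensemble's uncertainty, then train the student against those soft labels with a cross-entropy distillation loss. This is a minimal illustration in plain NumPy, not the paper's training code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ensemble_soft_labels(member_logits):
    """Average M members' predictions into soft labels.

    member_logits: (M, N, C) logits. Disagreement among members
    flattens the averaged distribution, encoding uncertainty.
    """
    return softmax(member_logits).mean(axis=0)  # (N, C)

def distillation_loss(student_logits, soft_labels):
    """Per-sample cross-entropy of the student against the soft labels."""
    return -(soft_labels * log_softmax(student_logits)).sum(axis=-1)
```

Iterating this (the trained students form the next-generation ensemble) is what the abstract calls multi-generational self-distillation.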
ArXiv, 2021
In this paper, we introduce a novel model of artificial intelligence, the functional neural network, for modeling human decision-making processes. This neural network is composed of multiple artificial neurons racing in the network. Each of these neurons has a similar structure, programmed independently by the users, and is composed of an intention wheel, a motor core and a sensory core representing the user itself and racing at a specific velocity. We discuss the mathematics of the neuron's formulation and the racing mechanism of multiple nodes in the network, and develop the group decision process with fuzzy logic and the transformation of these conceptual methods into practical methods of simulation and operations. Eventually, we describe some possible future research directions in the fields of finance, education and medicine, including the opportunity to design an intelligent learning agent with application in business operations supervision. We b...
2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
When recognizing emotions, subtle nuances in displays of emotion generate ambiguity or uncertainty in emotion perception. Emotion uncertainty has previously been interpreted as inter-rater disagreement among multiple annotators. In this paper, we consider a more common and challenging scenario: modeling emotion uncertainty when only single emotion labels are available. From a Bayesian perspective, we propose to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors. We further apply iterative self-distillation. Iterative distillation over multiple generations significantly improves performance in both emotion recognition and uncertainty estimation. Our method generates single student models that provide accurate estimates of uncertainty for in-domain samples and a student ensemble that can detect out-of-domain samples. Our experiments on emotion recognition and uncertainty estimation using the Aff-wild2 dataset demonstrate that our algorithm gives more reliable uncertainty estimates than both Temperature Scaling and Monte Carlo Dropout.
The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. Much previous work has focused on instantaneous emotion recognition. The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, held in conjunction with the IEEE World Congress on Computational Intelligence, encouraged participants to address long-term emotion recognition by integrating cues from multiple modalities, including facial expression, audio and language. Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single-modality network. We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM and combines it with utterance-level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms t...
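The fusion pattern described above (temporal summary of the visual stream, concatenated with utterance-level audio and text features, feeding a prediction head) can be sketched very simply. In this hypothetical simplification, mean pooling stands in for the paper's LSTM and a single linear head stands in for the full network; all names are mine.

```python
import numpy as np

def fuse_modalities(visual_seq, audio_feat, text_feat, w, b):
    """Late-fusion sketch for utterance-level sentiment.

    visual_seq: (T, Dv) per-frame visual features for one utterance.
    audio_feat: (Da,) utterance-level audio features.
    text_feat:  (Dt,) utterance-level text features.
    w, b:       linear head, w of shape (Dv + Da + Dt,).
    """
    visual_summary = visual_seq.mean(axis=0)            # stand-in for an LSTM
    fused = np.concatenate([visual_summary, audio_feat, text_feat])
    return fused @ w + b                                # scalar sentiment score
```

The design point the abstract makes is that the head sees all three modalities jointly, so it can exploit cross-modal correlations that no single-modality model has access to.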
2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
We train a unified model to perform three tasks: facial action unit detection, expression classification, and valence-arousal estimation. We address two main challenges of learning the three tasks. First, most existing datasets are highly imbalanced. Second, most existing datasets do not contain labels for all three tasks. To tackle the first challenge, we apply data balancing techniques to the experimental datasets. To tackle the second challenge, we propose an algorithm for the multitask model to learn from missing (incomplete) labels. This algorithm has two steps. First, we train a teacher model to perform all three tasks, where each instance is trained with the ground-truth label of its corresponding task. Second, we treat the outputs of the teacher model as soft labels, and use the soft labels and the ground truth to train the student model. We find that most of the student models outperform their teacher model on all three tasks. Finally, we use model ensembling to boost performance further on the three tasks. Our code is publicly available.
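The missing-label step of the algorithm above can be sketched with a masked loss: where a task has a ground-truth label, the student learns from it; where it does not, the teacher's output fills in as a soft label, so every instance trains every task head. Squared error stands in here for the task-specific losses, and the exact mixing is my assumption, not the paper's stated formula.

```python
import numpy as np

def multitask_student_loss(student_out, teacher_out, labels, mask):
    """Masked multitask distillation loss (illustrative sketch).

    student_out, teacher_out, labels: (N, T) per-task outputs/targets,
    with 0.0 placeholders in `labels` where no annotation exists.
    mask: (N, T), 1.0 where task t has a ground-truth label for
    instance i, 0.0 where the label is missing.
    """
    gt_loss = mask * (student_out - labels) ** 2           # supervised term
    soft_loss = (1.0 - mask) * (student_out - teacher_out) ** 2  # distilled term
    return (gt_loss + soft_loss).mean()
```

Because the teacher was itself trained only on the available labels, this lets its cross-task generalization substitute for the annotations each dataset lacks.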
Proceedings of the AAAI Conference on Artificial Intelligence
Spatial-temporal feature learning is of vital importance for video emotion recognition. Previous deep network structures have often focused on macro-motion, which extends over long time scales, e.g., on the order of seconds. We believe that integrating structures capturing information about both micro- and macro-motion will benefit emotion prediction, because humans perceive both micro- and macro-expressions. In this paper, we propose to combine micro- and macro-motion features to improve video emotion recognition with a two-stream recurrent network, named MIMAMO (Micro-Macro-Motion) Net. Specifically, smaller and shorter micro-motions are analyzed by a two-stream network, while larger and more sustained macro-motions are well captured by a subsequent recurrent network. Assigning specific interpretations to the roles of different parts of the network enables us to make choices of parameters based on prior knowledge: choices that turn out to be optimal. One of the important innovations in our ...