Cigdem Eroglu Erdem - Academia.edu
Papers by Cigdem Eroglu Erdem
2020 28th European Signal Processing Conference (EUSIPCO), 2021
We present a novel curriculum learning (CL) algorithm for face recognition using convolutional neural networks. Curriculum learning is inspired by the fact that humans learn better when the presented information is organized so that easy concepts are covered first, followed by more complex ones. It has been shown in the literature that CL is also beneficial for machine learning tasks by enabling convergence to a better local minimum. In the proposed CL algorithm for face recognition, we divide the training set of face images into subsets of increasing difficulty based on the head pose angle obtained from the absolute sum of the yaw, pitch and roll angles. These subsets are introduced to the deep CNN in order of increasing difficulty. Experimental results on the large-scale CASIA-WebFace-Sub dataset show that the increase in face recognition accuracy is statistically significant when CL is used, as compared to organizing the training data in random batches.
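As a rough sketch of the curriculum ordering described above (a hypothetical illustration; the paper's exact difficulty score, subset count, and training schedule are not reproduced here), one can sort face images by the summed magnitude of their head pose angles and split them into progressively harder subsets:

```python
import numpy as np

def curriculum_subsets(images, poses, n_subsets=3):
    """Split training images into subsets of increasing pose difficulty.

    images: sequence of N face images.
    poses:  (N, 3) array of (yaw, pitch, roll) angles in degrees.
    Difficulty here is |yaw| + |pitch| + |roll|; the paper's exact
    scoring and subset boundaries may differ.
    """
    difficulty = np.abs(np.asarray(poses)).sum(axis=1)
    order = np.argsort(difficulty)              # most frontal faces first
    return [[images[i] for i in chunk]          # easy -> hard subsets
            for chunk in np.array_split(order, n_subsets)]

# Training would then feed these subsets to the CNN in order of
# difficulty, rather than sampling random batches from the full set.
```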
Signal, Image and Video Processing, 2021
Remote photoplethysmography (rPPG) is a non-contact and noninvasive way of measuring human physiological signals, such as the heart rate, using the subtle color changes of skin regions. Since the face of a person is generally visible, facial videos can be used for estimating the heart rate remotely. The rigid and non-rigid motions of the face and illumination variations are the main challenges that affect the accuracy of heart rate estimation. In this paper, we present a new method for estimating the heart rate of a person from the skin region of a facial video using nonlinear mode decomposition (NMD), a recently proposed blind source separation method that has been shown to be more robust to noise. We also propose a new method (history-based consistency check, HBCC) for selecting the best heart rate candidate after decomposition by minimizing a temporal cost function. Experiments on two datasets show that the proposed method (rPPG-NMD) achieves promising results as compared to several state-of-the-art methods for rPPG-based heart rate estimation.
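The history-based consistency check can be illustrated with a minimal sketch (hypothetical; the paper's actual temporal cost function is not reproduced here): among the heart rate candidates produced by the decomposition, keep the one most consistent with the recent estimate history.

```python
import numpy as np

def hbcc_select(candidates_bpm, history_bpm, alpha=0.7):
    """History-based consistency check (illustrative sketch only).

    candidates_bpm: heart rate candidates (BPM) for the current window,
                    e.g. dominant frequencies of NMD modes in the HR band.
    history_bpm:    list of previously accepted heart rate estimates.
    Picks the candidate minimizing deviation from a smoothed history;
    the paper's temporal cost function may be defined differently.
    """
    if not history_bpm:
        return float(np.median(candidates_bpm))
    ref = alpha * history_bpm[-1] + (1 - alpha) * np.mean(history_bpm)
    costs = [abs(c - ref) for c in candidates_bpm]
    return float(candidates_bpm[int(np.argmin(costs))])
```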
Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015
In this paper, we present the methods used for the Bahcesehir University team's submissions to the 2015 Emotion Recognition in the Wild Challenge. The challenge consists of categorical emotion recognition in short video clips extracted from movies based on emotional keywords in the subtitles. The video clips mostly contain expressive faces (single or multiple) as well as audio, which contains the speech of the person in the clip along with other human voices or background sounds/music. We use an audio-visual method based on video summarization by key frame selection. The key frame selection uses a minimum sparse reconstruction approach with the goal of representing the original video in the best possible way. We extract the LPQ features of the key frames and average them to determine a single feature vector that represents the video component of the clip. In order to represent the temporal variations of the facial expression, we also use the LBP-TOP features extracted from the whole…
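A greedy least-squares variant can illustrate the key frame selection idea (a sketch only; the paper formulates it as a minimum sparse reconstruction problem, which this simplification does not reproduce):

```python
import numpy as np

def greedy_key_frames(F, k):
    """Greedy reconstruction-based key frame selection (illustrative).

    F: (n_frames, dim) matrix of per-frame feature vectors.
    Iteratively adds the frame whose inclusion best reduces the error of
    linearly reconstructing all frames from the selected key frames.
    """
    selected = []
    for _ in range(k):
        best, best_err = None, np.inf
        for i in range(len(F)):
            if i in selected:
                continue
            B = F[selected + [i]].T                      # candidate basis
            coeffs, *_ = np.linalg.lstsq(B, F.T, rcond=None)
            err = np.linalg.norm(F.T - B @ coeffs)       # reconstruction error
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
    return selected
```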
Interspeech 2011
We present a Random Sampling Consensus (RANSAC) based training approach for the problem of speaker state recognition from spontaneous speech. Our system is trained and tested with the INTERSPEECH 2011 Speaker State Challenge corpora, which include the Intoxication and Sleepiness Sub-challenges, where each sub-challenge defines a two-class classification task. We perform RANSAC-based training data selection coupled with Support Vector Machine (SVM) based classification to prune possible outliers in the training data. Our experimental evaluations indicate that RANSAC-based training data selection provides 66.32% and 65.38% unweighted average (UA) recall rates on the development and test sets of the Sleepiness Sub-challenge, respectively, and a slight improvement in the Intoxication Sub-challenge performance.
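The RANSAC-style training data selection might look roughly like the following sketch (hyperparameters and the consensus criterion are placeholders, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVC

def ransac_train(X, y, n_iter=50, subset_frac=0.5, seed=0):
    """RANSAC-style training data selection (illustrative sketch).

    Repeatedly trains an SVM on a random subset and keeps the model whose
    consensus set (correctly classified training samples) is largest,
    treating persistently misclassified samples as likely outliers.
    """
    rng = np.random.default_rng(seed)
    best_model, best_consensus = None, -1
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=int(subset_frac * len(X)), replace=False)
        model = SVC(kernel="linear").fit(X[idx], y[idx])
        consensus = int((model.predict(X) == y).sum())
        if consensus > best_consensus:
            best_model, best_consensus = model, consensus
    return best_model
```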
Interspeech 2009
We present a speech-signal-driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. The challenge includes classifier and feature sub-challenges with five-class and two-class classification problems. We investigate prosody-related, spectral and HMM-based features for emotion recognition with Gaussian mixture model (GMM) based classifiers. Spectral features consist of mel-scale cepstral coefficients (MFCC), line spectral frequency (LSF) features and their derivatives, whereas prosody-related features consist of mean-normalized values of pitch, the first derivative of pitch, and intensity. Unsupervised training of HMM structures is employed to define prosody-related temporal features for the emotion recognition problem. We also investigate data fusion of different features and decision fusion of different classifiers, which are not well studied in the emotion recognition framework. Experimental results of automatic emotion recognition with the INTERSPEECH 2009 Emotion Challenge corpus are presented.
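A minimal sketch of the GMM-based classification stage (assuming frame-level features such as MFCCs; the model size and covariance type are placeholders, not the paper's settings):

```python
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_class, n_components=8):
    """Fit one GMM per emotion class on frame-level features (e.g. MFCCs).

    features_by_class: dict mapping class label -> (n_frames, dim) array.
    """
    return {label: GaussianMixture(n_components, covariance_type="diag").fit(X)
            for label, X in features_by_class.items()}

def classify(gmms, utterance_frames):
    """Assign the class whose GMM gives the highest log-likelihood."""
    scores = {label: gmm.score(utterance_frames) for label, gmm in gmms.items()}
    return max(scores, key=scores.get)
```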
Digital Signal Processing
2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA)
2018 26th Signal Processing and Communications Applications Conference (SIU)
2000 10th European Signal Processing Conference, Sep 1, 2000
In this paper, we investigate performance metrics for quantitative evaluation of object-based video segmentation algorithms. The metrics address the case when ground-truth video object planes are available. The proposed metrics are used to evaluate three essentially different approaches to video segmentation: an edge-based [1], a motion-clustering-based [2], and a total-feature-vector-clustering-based [3] algorithm.
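For intuition, a very simple metric of this kind compares an estimated video object plane against its ground truth pixel by pixel (the paper's metrics are defined differently and in more detail):

```python
import numpy as np

def spatial_mismatch(est_mask, gt_mask):
    """Fraction of misclassified pixels between an estimated video object
    plane and its ground truth (a basic example of this class of metric)."""
    est, gt = est_mask.astype(bool), gt_mask.astype(bool)
    return np.logical_xor(est, gt).sum() / gt.size
```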
IEEE Transactions on Affective Computing, 2016
Advances in Computer Vision and Pattern Recognition, 2015
Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205), 2001
We present metrics to evaluate the performance of video object segmentation and tracking methods quantitatively when ground-truth segmentation maps are not available. The proposed metrics are based on the color and motion differences along the boundary of the estimated video object plane and the color histogram differences between the current object plane and its temporal neighbors. These metrics can be used to localize (spatially and/or temporally) regions where segmentation results are good or bad, or combined to yield a single numerical measure indicating the goodness of the boundary segmentation and tracking results. Experimental results are presented that evaluate the segmentation map of the "Man" object in the "Hall Monitor" sequence both in terms of a single numerical measure and in terms of localization of the good and bad segments of the boundary.
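One ingredient of such stand-alone metrics, the color histogram difference between temporally neighboring object planes, can be sketched as follows (a hypothetical simplification of the metrics in the paper):

```python
import numpy as np

def temporal_histogram_diff(obj_prev, obj_curr, bins=32):
    """Color-histogram dissimilarity between an object plane and its
    temporal neighbor. obj_*: (N, 3) arrays of RGB pixels sampled from
    inside each object mask; bin count is a placeholder."""
    def hist(pixels):
        h, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                              range=((0, 256),) * 3)
        return h.ravel() / max(pixels.shape[0], 1)
    # Total variation distance in [0, 1]; high values flag frames where
    # the tracked object plane changed appearance abruptly.
    return 0.5 * np.abs(hist(obj_prev) - hist(obj_curr)).sum()
```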
Signal, Image and Video Processing, 2015
In order to carry out research on audio-visual affect recognition, suitable databases are essential. In this work, we present a re-acted audio-visual database in Turkish, consisting of recordings of subjects expressing various emotional and mental states. The database contains synchronous facial recordings of subjects with a frontal stereo camera and a half-profile mono camera. The subjects first watch visual or audio-visual stimuli on a screen in front of them, which are designed and timed to elicit certain emotions and mental states. The subjects then answer questions about the stimuli in an unscripted way. The target emotions we want to elicit are the six basic ones (happiness, anger, sadness, disgust, fear, surprise) and, additionally, boredom. We also aim to elicit several mental states, such as unsure (including confused, undecided), thinking, concentrating, interested (including curious), and complaining. The database also contains short acted recordings of each subject.
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
We present a new framework for joint analysis of the head gesture and speech prosody patterns of a speaker, towards automatic and realistic synthesis of head gestures from speech prosody. The proposed two-stage analysis aims to "learn" both elementary prosody and head gesture patterns for a particular speaker, as well as the correlations between these head gesture and prosody patterns, from a training video sequence. The resulting audiovisual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural-looking head gestures for the speaker with any input test speech.
2008 15th IEEE International Conference on Image Processing, 2008
This paper presents a framework for unsupervised video analysis in the context of dance performances, where the gestures and 3D movements of a dancer are characterized by the repetition of a set of unknown dance figures. The system is trained in an unsupervised manner using Hidden Markov Models (HMMs) to automatically segment multi-view video recordings of a dancer into recurring elementary temporal body motion patterns and thereby identify the dance figures. That is, a parallel HMM structure is employed to automatically determine the number and the temporal boundaries of the different dance figures in a given dance video. The success of the analysis framework has been evaluated by visualizing these dance figures on a dancing avatar animated by the computed 3D analysis parameters. Experimental results demonstrate that the proposed framework enables synthetic agents and/or robots to learn dance figures from video automatically.
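A single-HMM sketch can convey the segmentation idea (illustrative only; it uses the hmmlearn package, fixes the number of figures in advance, and omits the parallel HMM structure the paper uses to estimate that number automatically):

```python
import numpy as np
from hmmlearn import hmm

def segment_dance(motion_feats, n_figures=5, seed=0):
    """Unsupervised temporal segmentation of a dance video (illustrative).

    motion_feats: (n_frames, dim) body motion features per frame.
    Fits a Gaussian HMM and treats contiguous runs of the decoded state
    sequence as elementary figure segments.
    """
    model = hmm.GaussianHMM(n_components=n_figures, covariance_type="diag",
                            random_state=seed).fit(motion_feats)
    states = model.predict(motion_feats)            # Viterbi decoding
    boundaries = np.flatnonzero(np.diff(states)) + 1
    return np.split(states, boundaries)             # runs = candidate figures
```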
2014 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA) Proceedings, 2014
In this paper, we present an effective framework for multimodal emotion recognition based on a novel approach for automatic peak frame selection from audiovisual video sequences. Given a video with an emotional expression, peak frames are the ones at which the emotion is at its apex. The objective of peak frame selection is to make the training process for the automatic emotion recognition system easier by summarizing the expressed emotion over a video sequence. The main steps of the proposed framework consist of the extraction of video and audio features based on peak frame selection, unimodal classification, and decision-level fusion of the audio and visual results. We evaluated the performance of our approach on the eNTERFACE'05 audiovisual database containing six basic emotion classes. Experimental results demonstrate the effectiveness and superiority of the proposed system over other methods in the literature.
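The decision-level fusion step can be sketched as a weighted combination of the unimodal class posteriors (the weight is a placeholder; the paper's fusion rule may differ):

```python
import numpy as np

def decision_fusion(audio_probs, video_probs, w_audio=0.5):
    """Decision-level fusion of unimodal classifier outputs (illustrative).

    audio_probs, video_probs: per-class posterior estimates from the audio
    and visual classifiers for one clip. Returns the fused class index.
    """
    fused = (w_audio * np.asarray(audio_probs)
             + (1 - w_audio) * np.asarray(video_probs))
    return int(np.argmax(fused))
```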
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
This paper presents a hybrid method for face detection in color images. The well-known Haar feature-based face detector developed by Viola and Jones (VJ), which was designed for gray-scale images, is combined with a skin-color filter, which provides complementary information in color images. The image is first passed through the Haar feature-based face detector, which is adjusted to operate at a point on its ROC curve with a low number of missed faces but a high number of false detections. Then, using the proposed skin-color post-filtering method, many of these false detections can be eliminated easily. We also use a color compensation algorithm to reduce the effects of lighting. Our experimental results on the Bao color face database show that the proposed method is superior in terms of precision to the original VJ algorithm and also to other skin-color-based pre-filtering methods in the literature.
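A minimal OpenCV sketch of the detector-plus-post-filter pipeline (the skin-color range, thresholds, and detector settings are placeholders, not the paper's calibrated values):

```python
import cv2
import numpy as np

# Stock OpenCV Haar cascade stands in for the VJ detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def skin_ratio(bgr_patch):
    """Fraction of pixels falling in a crude YCrCb skin-color range."""
    ycrcb = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, np.array((0, 133, 77), np.uint8),
                       np.array((255, 173, 127), np.uint8))
    return float(np.count_nonzero(mask)) / mask.size

def detect_faces(bgr_image, min_skin=0.4):
    # Permissive detector settings: few missed faces, many false positives.
    boxes = cascade.detectMultiScale(
        cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY),
        scaleFactor=1.1, minNeighbors=1)
    # Skin-color post-filter discards detections with too few skin pixels.
    return [(x, y, w, h) for (x, y, w, h) in boxes
            if skin_ratio(bgr_image[y:y + h, x:x + w]) >= min_skin]
```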
2013 21st Signal Processing and Communications Applications Conference (SIU), 2013
In order to design algorithms for affect recognition from facial expressions and speech, audio-visual databases are needed. The affective databases used by researchers today are generally recorded in laboratory environments and contain acted expressions. In this work, we present a method for the extraction of audio-visual facial clips from movies. The database collected using the proposed method contains English and Turkish clips and can easily be extended to other languages. We also provide facial expression recognition results, which utilize local phase quantization (LPQ) based feature extraction and a support vector machine. Because the number of features is large compared to the number of examples, the affect recognition accuracy improves significantly when feature selection is also performed.
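The feature selection plus SVM setup can be sketched with a standard scikit-learn pipeline (the selector, k, and kernel are placeholders, not the paper's choices):

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Select a subset of the (LPQ-style) features before training the SVM,
# which helps when the feature dimension exceeds the number of clips.
clf = make_pipeline(SelectKBest(f_classif, k=200), SVC(kernel="linear"))
# Usage: clf.fit(X_train, y_train); clf.predict(X_test)
```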
ABSTRACT In order to design algorithms for affect recognition from facial expressions and speech,... more ABSTRACT In order to design algorithms for affect recognition from facial expressions and speech, audio-visual databases are needed. The affective databases used by researchers today are generally recorded in laboratory environments and contain acted expressions. In this work, we present a method for extraction of audio-visual facial clips from movies. The database collected using the proposed method contains English and Turkish clips and can easily be extended for other languages. We also provide facial expresssion recognition results, which utilize local phase quantization based feature extraction and a support vector machine. Due to larger number of features compared to the number of examples, the affect recognition accuracy improves significantly when feature selection is also performed.