Mandana Fasounaki - Academia.edu
Papers by Mandana Fasounaki
2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)
This paper addresses the problem of automatic facial expression recognition in videos, where the goal is to predict discrete emotion labels best describing the emotions expressed in short video clips. Building on a pre-trained convolutional neural network (CNN) model dedicated to analyzing the video frames and an LSTM network designed to process the trajectories of the facial landmarks, this paper investigates several novel directions. First, improved face descriptors based on 2D CNNs and facial landmarks are proposed. Second, the paper investigates methods for fusing the features temporally, including a novel hierarchical recurrent neural network that combines facial landmark trajectories over time. In addition, we propose a simple modification to state-of-the-art expression recognition architectures that adapts them to video processing. In both ensemble approaches, the temporal information is integrated. Comparative experiments on publicly available video-based facial expression recognition datasets verify that the proposed framework outperforms state-of-the-art methods. Moreover, we introduce a near-infrared video dataset containing facial expressions of subjects driving their cars, recorded in real-world conditions.
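As an illustration of the kind of two-stream design this abstract outlines, the sketch below combines per-frame CNN descriptors with an LSTM over facial landmark trajectories and fuses them in a late classifier. It is a minimal PyTorch mock-up, not the authors' architecture: the ResNet-18 backbone, 68-landmark input, hidden sizes, and temporal averaging are assumptions made for the example.

```python
# Hypothetical sketch: frame CNN features + landmark-trajectory LSTM, fused late.
# Dimensions and layer choices are illustrative, not the paper's exact design.
import torch
import torch.nn as nn
from torchvision import models

class VideoExpressionNet(nn.Module):
    def __init__(self, num_classes=7, num_landmarks=68, hidden=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # pre-trained weights would be loaded in practice
        backbone.fc = nn.Identity()                # 512-d frame descriptor
        self.cnn = backbone
        self.lstm = nn.LSTM(input_size=num_landmarks * 2,  # (x, y) per landmark
                            hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(512 + hidden, num_classes)

    def forward(self, frames, landmarks):
        # frames:    (B, T, 3, H, W)  short video clip
        # landmarks: (B, T, num_landmarks * 2) per-frame landmark coordinates
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1).mean(dim=1)  # temporal average of frame features
        _, (h, _) = self.lstm(landmarks)                                   # last hidden state summarizes the trajectory
        return self.classifier(torch.cat([feats, h[-1]], dim=1))

logits = VideoExpressionNet()(torch.randn(2, 8, 3, 224, 224), torch.randn(2, 8, 136))
```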
2018 26th Signal Processing and Communications Applications Conference (SIU), 2018
Text detection is one of the most challenging and commonly addressed applications in computer vision. Detecting text regions is the first step of text recognition systems known as Optical Character Recognition (OCR). This process requires separating text regions from non-text regions. In this paper, we use Maximally Stable Extremal Regions (MSER) to acquire the initial text region candidates. These candidate regions are then reduced in number using geometric and stroke-width properties, and the remaining regions are joined to obtain text groups. Finally, the Tesseract Optical Character Recognition engine is used as a last step to eliminate non-text groups. We evaluated the proposed system on the KAIST and ICDAR datasets for both natural images and computer-generated images. For natural images, 82.7% precision and 52.0% f-accuracy are achieved; for computer-generated images, 64.0% precision and 65.2% f-accuracy.
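A rough sketch of an MSER-based candidate pipeline of the sort described above, using OpenCV; the area and aspect-ratio thresholds are placeholder values, and the final OCR-based filtering is only indicated in a comment.

```python
# Illustrative MSER text-candidate pipeline; thresholds are placeholder assumptions.
import cv2

def candidate_text_boxes(image_bgr, min_area=60, max_aspect=8.0):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)          # raw extremal-region bounding boxes
    keep = []
    for (x, y, w, h) in bboxes:
        area = w * h
        aspect = max(w, h) / max(1, min(w, h))
        if area >= min_area and aspect <= max_aspect:   # simple geometric filtering
            keep.append((x, y, w, h))
    return keep

# Surviving candidate groups could then be passed to an OCR engine (e.g. Tesseract
# via pytesseract) to discard groups that yield no recognizable characters.
```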
European Journal of Science and Technology, 2021
Automatic Speaker Identification (ASI) is one of the active fields of research in signal processing, and various machine learning algorithms have been applied to it. With recent developments in hardware technology and data accumulation, Deep Learning (DL) methods have become the new state-of-the-art approach in several classification and identification tasks. In this paper, we evaluate the performance of traditional methods such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM) and DL-based techniques such as the Factorized Time-Delay Neural Network (FTDNN) and Convolutional Neural Networks (CNN) for text-independent, closed-set automatic speaker identification on two datasets with different conditions. LibriSpeech, one of the experimental datasets, consists of clean audio signals from audiobooks collected from a large number of speakers. The other dataset was collected and prepared by us and contains rather limited speech data with a low signal-to-noise ratio, drawn from real-life conversations between customers and agents in a call center. The duration of the speech signals in the query phase is an important factor affecting the performance of ASI methods. In this work, a CNN architecture is proposed for automatic speaker identification from short speech segments. The architecture is designed to capture the temporal nature of the speech signal in a compact convolutional neural network with a low number of parameters compared to well-known CNN architectures. We show that the proposed CNN-based algorithm performs better on the large, clean dataset, whereas on the other dataset with a limited amount of data, the traditional method outperforms the DL approaches. The top-1 accuracy achieved by the proposed model is 99.5% on 1-second voice instances from the LibriSpeech dataset.
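The sketch below illustrates the general idea of a compact CNN over log-mel spectrograms of roughly 1-second speech segments; it is not the paper's architecture, and the layer sizes, 64 mel bands, and 251-speaker output are assumptions chosen for the example.

```python
# A compact speaker-ID CNN over log-mel spectrograms; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SmallSpeakerCNN(nn.Module):
    def __init__(self, num_speakers, n_mels=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),     # keeps the parameter count low regardless of clip length
        )
        self.classifier = nn.Linear(128, num_speakers)

    def forward(self, mel):              # mel: (B, 1, n_mels, time_frames)
        return self.classifier(self.features(mel).flatten(1))

logits = SmallSpeakerCNN(num_speakers=251)(torch.randn(4, 1, 64, 100))
```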
2021 6th International Conference on Computer Science and Engineering (UBMK)
With the widespread use of voice-controlled services and devices, research into developing robust and fast systems for automatic speaker identification has accelerated. In this paper, we present a Convolutional Neural Network (CNN) architecture for text-independent automatic speaker identification. The primary purpose is to identify a speaker, among many others, using a short speech segment. Most current research focuses on deep CNNs that were initially designed for computer vision tasks, and most existing speaker identification methods require audio samples longer than 3 seconds in the query phase to achieve high accuracy. We created a CNN architecture appropriate for voice- and speech-related classification tasks and propose a model that achieves 99.5% accuracy on LibriSpeech and 90% accuracy on the VoxCeleb1 dataset using only 1-second test utterances in our experiments.
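For the 1-second query setting mentioned here, a hedged example of preparing a short utterance for such a classifier is shown below; the 16 kHz sample rate, mel settings, and the `model` variable are assumptions for illustration, not details from the paper.

```python
# Hypothetical preparation of a 1-second query utterance for a spectrogram-based classifier.
import torch
import torchaudio

def one_second_logmel(waveform, sample_rate=16000, n_mels=64):
    clip = waveform[:, :sample_rate]                       # keep exactly 1 s of audio
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=n_mels)(clip)
    return torch.log(mel + 1e-6).unsqueeze(0)              # (1, channels, n_mels, frames)

# query = one_second_logmel(torch.randn(1, 16000))
# speaker_id = model(query).argmax(dim=1)   # `model` is an assumed trained classifier
```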
Reinforcement learning began to perform at human-level success in game intelligence after the deep learning revolution. Geometry Friends is a puzzle game where deep learning can be leveraged to build successful game-playing agents. In the game, agents collect targets in a two-dimensional environment and try to overcome obstacles along the way. In this paper, a Q-learning approach is applied to the game, and a generalized circle agent for different types of environments is implemented. The agent is trained using only screen pixels as input, processed by a Convolutional Neural Network. Experimental results show that, with the proposed method, the game completion rate and completion times are improved compared to a random agent.
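As a loose illustration of a pixel-input Q-learning agent like the one described, the sketch below defines a small Q-network over stacked screen frames and an epsilon-greedy action rule; the 84x84 frame size, four-frame stack, action count, and epsilon value are assumptions, not the authors' settings.

```python
# Minimal deep Q-network over raw screen pixels; all hyperparameters are illustrative.
import random
import torch
import torch.nn as nn

class PixelQNet(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 8, stride=4), nn.ReLU(),   # 4 stacked grayscale frames
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),                  # one Q-value per action
        )

    def forward(self, frames):                          # frames: (B, 4, 84, 84)
        return self.net(frames)

def epsilon_greedy(q_net, frames, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(q_net.net[-1].out_features)   # explore
    return int(q_net(frames).argmax(dim=1).item())            # exploit

action = epsilon_greedy(PixelQNet(), torch.rand(1, 4, 84, 84))
```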
2021 6th International Conference on Computer Science and Engineering (UBMK)