Zoran Ivanovski | Ss. Cyril & Methodius University in Skopje
Papers by Zoran Ivanovski
In this paper, a selective perceptual-based (SELP) framework is presented to reduce the complexity of popular super-resolution (SR) algorithms while maintaining the desired quality of the enhanced images/video. A perceptual human visual system model is proposed to compute local contrast sensitivity thresholds. The obtained thresholds are used to select which pixels are super-resolved based on the perceived visibility of local edges. Processing only a set of perceptually significant pixels significantly reduces the computational complexity of SR algorithms without losing the achievable visual quality. The proposed SELP framework is integrated into a maximum a posteriori-based SR algorithm as well as a fast two-stage fusion-restoration SR estimator. Simulation results show a significant average reduction in computational complexity with comparable signal-to-noise ratio gains and visual quality.
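The pixel-selection idea in this abstract can be illustrated with a minimal sketch. It is not the SELP model itself: the abstract does not give the perceptual threshold formula, so the sketch assumes a single fixed threshold and uses the max-minus-min intensity range in a 3x3 neighbourhood as a stand-in for local contrast. The function name `select_pixels` and its parameters are hypothetical.

```python
def select_pixels(image, threshold):
    """Mark pixels whose 3x3 local contrast exceeds a threshold.

    Assumption: local contrast is approximated by (max - min) over the
    3x3 neighbourhood, and `threshold` is a fixed constant. The paper's
    perceptual model computes thresholds locally instead.
    """
    rows, cols = len(image), len(image[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the neighbourhood, clipped at the image borders.
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            mask[r][c] = (max(neighbours) - min(neighbours)) > threshold
    return mask
```

Only pixels marked `True` would then be passed to the SR estimator, which is where the complexity saving would come from.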
2023 30th International Conference on Systems, Signals and Image Processing (IWSSIP)
Studies in systems, decision and control, 2022
2020 28th Telecommunications Forum (TELFOR), 2020
Intelligent Traffic Surveillance systems have helped improve road safety by ensuring timely response to events such as traffic accidents and congestion. Our aim is to devise a robust system capable of detecting traffic audio events in a real-life environment. At the core of this system is a deep learning model capable of detecting anomalous events and classifying them based on their acoustic waveform. We present the results of a series of experiments designed to optimize the architecture of this model based on different algorithms for audio processing. The results show that the designed model has performance competitive with approaches published in the literature.
Journal of Electrical Engineering and Information Technologies, 2017
European Signal Processing Conference, Sep 1, 2014
Spectro-temporal features have shown great promise with respect to improving the noise-robustness of Automatic Speech Recognition (ASR) systems. The common approach uses a bank of 2D Gabor filters to process the speech signal spectrogram and generate the output feature vector. This approach suffers from generating a large number of coefficients, thus necessitating the use of feature dimensionality reduction. The proposed Gaussian Power flow Orientation Coefficients (GPOCs) use an alternative approach in which only the largest coefficients output from a bank of 2D Gaussian kernels are used to describe the spectro-temporal patterns of power flow in the auditory spectrogram. Whilst reducing the size of the feature vectors, the algorithm was shown to outperform traditional feature extraction methods, even a reference spectro-temporal approach, for low SNRs. Its performance for high SNRs is comparable but inferior to traditional ASR frontends, while falling behind state-of-the-art algorithms in all noise scenarios.
Journal of The Audio Engineering Society, Apr 26, 2012
Organization of video databases is becoming a difficult task as the amount of video content increases. Video classification based on the content of videos can significantly increase the speed of tasks such as browsing and searching for a particular video in a database. In this paper, a content-based video classification system for the classes indoor and outdoor is presented. The system is intended to be used on a mobile platform with modest resources. The algorithm makes use of the temporal redundancy in videos, which allows using an uncomplicated classification model while still achieving reasonable accuracy. The training and evaluation were done on a video database of 443 videos downloaded from a video sharing service. A total accuracy of 87.36% was achieved.
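One simple way to exploit temporal redundancy, shown here only as an illustrative sketch, is to classify individual frames with a lightweight model and take a majority vote over the frame labels. The abstract does not state that the paper uses majority voting; that choice, the function name `classify_video`, and the label strings are assumptions.

```python
from collections import Counter


def classify_video(frame_labels):
    """Return the most frequent per-frame label as the video label.

    Assumption: each frame has already been labelled (for example
    "indoor" or "outdoor") by some simple per-frame classifier.
    """
    if not frame_labels:
        raise ValueError("at least one frame label is required")
    counts = Counter(frame_labels)
    label, _ = counts.most_common(1)[0]
    return label
```

Because neighbouring frames are highly correlated, occasional per-frame errors are outvoted, which is why a weak frame-level model can still give a usable video-level decision.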
IEEE EUROCON 2017 -17th International Conference on Smart Technologies, 2017
The design of speaker diarisation and recognition systems is a mature research area and their deployment in the real world has gained momentum. There are still a number of parameters of these systems that have to be tuned and optimised for the application scenario at hand. An online call recording diarisation system is designed with integrated speaker identification of the call-centre operators. The parameters of the speaker diarisation and identification algorithms are cross-tuned using a testbench database. The system performance, as assessed by the true positive rate (TPR), is optimised with respect to the delay introduced by the system. As the system is designed to be used online, the TPR-delay trade-off is crucial to its deployment. The finalised system is flexible in that it allows the user to choose the delay or accuracy needed for on-site deployment.
2023 30th International Conference on Systems, Signals and Image Processing (IWSSIP)
Cornell University - arXiv, May 18, 2022
Speech technology is becoming ever more ubiquitous with the advance of speech-enabled devices and services. The use of speech synthesis in Augmentative and Alternative Communication tools has facilitated inclusion of individuals with speech impediments, allowing them to communicate with their surroundings using speech. Although there are numerous speech synthesis systems for the most spoken world languages, there is still a limited offer for smaller languages. We propose and compare three models built using parametric and deep learning techniques for Macedonian, trained on a newly recorded corpus. We target low-resource edge deployment for Augmentative and Alternative Communication and assistive technologies, such as communication boards and screen readers. The listening test results show that parametric speech synthesis performs on par with the more advanced deep learning models. Since it also requires fewer resources and offers full speech rate and pitch control, it is the preferred choice for building a Macedonian TTS system for this application scenario.
Automatic Speech Recognition systems of today are intensely deployed in real-world application scenarios, which are often characterized by suboptimal operating conditions. Thus their noise robustness has become a crucial parameter when assessing ASR in-field performance. The paper examines the noise robustness of traditional ASR feature sets as applied to a Voice Dialing Application built for Macedonian. The analysis focused on the following features:
IEEE EUROCON 2017 -17th International Conference on Smart Technologies, 2017
In this paper, a new approach for high quality automated exposure fusion on mobile or handheld devices is presented. The use of the device's viewfinder screen video feed data is proposed in order to increase the overall performance of the exposure fusion, both in static scenes and in scenes with moving objects. The introduced novelties are computationally inexpensive, since the preview video is of low frame resolution. The proposed extensions are embedded in an existing exposure fusion algorithm, and the performed experimental tests show that the new extended algorithm is better than its predecessor, both visually and in terms of objective quality measures.
The work presents an effective approach for subpixel motion estimation for Super-resolution (SR). The objective is to improve the quality of the estimated SR image by increasing the accuracy of the motion vectors used in the SR procedure. The correction of the motion vectors is based on the appearance of error artifacts in the SR image, introduced due to registration errors. First, SR is performed using full pixel accuracy motion vectors obtained using a full search block matching algorithm (FS-BMA). Then, a machine learning based method is applied to the resulting images in order to detect and classify artifacts introduced due to missing subpixel components of the motion vectors. The outcome of the classification is a subpixel component of the motion vector. In the final step, the SR process is repeated using the corrected (subpixel accuracy) motion vectors.
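The first step of this pipeline, full-pixel motion estimation by full search block matching, can be sketched generically as follows. This is a textbook formulation, not the authors' code: it assumes the sum of absolute differences (SAD) as the matching cost, which the abstract does not specify, and the function name and parameters are hypothetical.

```python
def full_search(cur, ref, top, left, block, radius):
    """Exhaustively search integer displacements within +/- radius.

    Returns ((dy, dx), sad): the displacement of the block at
    (top, left) in `cur` that best matches `ref`, and its SAD cost.
    Assumption: SAD is used as the matching criterion.
    """
    rows, cols = len(ref), len(ref[0])
    best_sad = None
    best_mv = (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r0, c0 = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if r0 < 0 or c0 < 0 or r0 + block > rows or c0 + block > cols:
                continue
            sad = sum(
                abs(cur[top + i][left + j] - ref[r0 + i][c0 + j])
                for i in range(block)
                for j in range(block)
            )
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

The vectors returned here have integer accuracy only; the subpixel component described in the abstract would be estimated in a later stage from the artifacts in the SR result.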
2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), 2013
In this paper we present a no-reference quality assessment algorithm for highly compressed H.264 videos. By analyzing the spatio-temporal artifacts and their effect on perceived visual quality, we produced an array of viable predictors of quality. Using a feature selection method, a small subset of features, each applied to a specific artifact domain, was selected for optimal quality estimation. The features are mapped into video quality scores using a simple linear function, which is computationally efficient and minimizes the effect of learning the content. The resulting algorithm is evaluated on content-independent sets of highly compressed H.264 video sequences from the LIVE video database, where it shows high correlation with the subjective scores.
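The final mapping step described in this abstract, a linear function from selected features to a quality score, has the general form shown in the sketch below. The feature values, weights, and bias are placeholders: the abstract does not list the selected features or the fitted coefficients, and the name `predict_quality` is hypothetical.

```python
def predict_quality(features, weights, bias):
    """Map a feature vector to a quality score: bias + sum(w_i * f_i).

    Assumption: `weights` and `bias` come from a fit against subjective
    scores; the values used by the paper are not given in the abstract.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * f for w, f in zip(weights, features))
```

A model of this form costs one multiply-add per feature at prediction time, which is consistent with the abstract's emphasis on computational efficiency.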
2012 20th Telecommunications Forum (TELFOR), 2012