Ioannis Mademlis | Harokopio University (original) (raw)

Papers by Ioannis Mademlis

arXiv (Cornell University), May 25, 2023

The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NL... more The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) during the past decade. However, the demands of long document analysis are quite different from those of shorter texts, while the ever increasing size of documents uploaded on-line renders automated understanding of lengthy texts a critical issue. Relevant applications include automated Web mining, legal document review, medical records analysis, financial reports analysis, contract management, environmental impact assessment, news aggregation, etc. Despite the relatively recent development of efficient algorithms for analyzing long documents, practical tools in this field are currently flourishing. This article serves as an entry point into this dynamic domain and aims to achieve two objectives. Firstly, it provides an overview of the relevant neural building blocks, serving as a concise tutorial for the field. Secondly, it offers a brief examination of the current state-of-the-art in long document NLP, with a primary focus on two key tasks: document classification and document summarization. Sentiment analysis for long texts is also covered, since it is typically treated as a particular case of document classification. Consequently, this article presents an introductory exploration of document-level analysis, addressing the primary challenges, concerns, and existing solutions. Finally, the article presents publicly available annotated datasets that can facilitate further research in this area.

Computational UAV Cinematography for Intelligent A/V Shooting Based on Semantic Visual Analysis

As audiovisual coverage of sports events using Unmanned Aerial Vehicles (UAVs) is becoming increa... more As audiovisual coverage of sports events using Unmanned Aerial Vehicles (UAVs) is becoming increasingly popular, intelligent audiovisual (A/V) shooting tools are needed to assist the cameramen and directors. Several challenges also arise by employing autonomous UAVs, including the accurate identification of the 2D region of cinematographic attention (RoCA) depicting rapidly moving target ensembles (e.g., athletes) and the automatic control of the UAVs so as to take informative and aesthetically pleasing A/V shots, by performing automatic or semiautomatic visual content analysis with no or minimal human intervention. A novel method implementing computational UAV cinematography for assisting sports coverage, based on semantic, human-centered visual analysis is proposed in this work. Athlete detection and tracking, as well as spatial athlete distribution on the image plane are the semantic features extracted from an aerial video feed captured by a UAV and exploited for the extraction of the RoCA, based solely on present and past athlete detections and their regions of interest (ROIs). A PID controller that visually controls a real or virtual camera in order to track the sports RoCA and produce aesthetically pleasing shots, without using 3D location-related information, is subsequently employed. The proposed method is evaluated on actual UAV A/V footage from soccer matches and promising results are obtained.

Zenodo (CERN European Organization for Nuclear Research), Sep 30, 2021

Automated unsupervised video summarization by key-frame extraction consists in identifying repres... more Automated unsupervised video summarization by key-frame extraction consists in identifying representative video frames, best abridging a complete input sequence, and temporally ordering them to form a video summary, without relying on manually constructed ground-truth key-frame sets. Stateof-the-art unsupervised deep neural approaches consider the desired summary to be a subset of the original sequence, composed of video frames that are sufficient to visually reconstruct the entire input. They typically employ a pre-trained CNN for extracting a vector representation per RGB video frame and a baseline LSTM adversarial learning framework for identifying key-frames. In this paper, to better guide the network towards properly selecting video frames that can faithfully reconstruct the original video, we augment the baseline framework with an additional LSTM autoencoder, which learns in parallel a fixed-length representation of the entire original input sequence. This is exploited during training, where a novel loss term inspired by dictionary learning is added to the network optimization objectives, further biasing key-frame selection towards video frames which are collectively able to recreate the original video. Empirical evaluation on two common public relevant datasets indicates highly favourable results.

Procedural Terrain Generation Using Generative Adversarial Networks

Zenodo (CERN European Organization for Nuclear Research), Nov 22, 2021

Synthetic terrain realism is critical in VR applications based on computer graphics (e.g., games,... more Synthetic terrain realism is critical in VR applications based on computer graphics (e.g., games, simulations). Although fast procedural algorithms for automated terrain generation do exist, they still require human effort. This paper proposes a novel approach to procedural terrain generation, relying on Generative Adversarial Networks (GANs). The neural model is trained using terrestrial Points-of-Interest (PoIs, described by their geodesic coordinates/altitude) and publicly available corresponding satellite images. After training is complete, the GAN can be employed for deriving realistic terrain images on-the- fly, by merely forwarding through it a rough 2D scatter plot of desired PoIs in image form (so-called "altitude image"). We demonstrate that such a GAN is able to translate this rough, quickly produced sketch into an actual photorealistic terrain image. Additionally, we describe a strategy for enhancing the visual diversity of trained model synthetic output images, by tweaking input altitude image orientation during GAN training. Finally, we perform an objective and a subjective evaluation of the proposed method. Results validate the latter's ability to<br> rapidly create life-like terrain images from minimal input data.

Μέθοδοι μηχανικής μάθησης και μηχανικής όρασης για την ευφυή ανάλυση εικονοσειρών

Σε αυτή τη διδακτορική διατριβή, παρουσιάζονται τα αποτελέσματα της έρευνας που διεξήχθη στην περ... more Σε αυτή τη διδακτορική διατριβή, παρουσιάζονται τα αποτελέσματα της έρευνας που διεξήχθη στην περιοχή της ευφυούς ανάλυσης εικονοσειρών με χρήση μεθόδων μηχανικής μάθησης και μηχανικής όρασης. Η έμφαση δόθηκε σε δεδομένα κινηματογραφικής/τηλεοπτικής παραγωγής, προκειμένου να καταδειχθεί το δυναμικό της σύγχρονης τεχνητής νοημοσύνης στη βιομηχανία παραγωγής και μετεπεξεργασίας οπτικοακουστικού υλικού, αλλά οι προτεινόμενοι αλγόριθμοι έχουν ευρύτερη εφαρμογή σε κάθε τύπου εικονοσειρά. Η παρουσιαζόμενη έρευνα αφορά τα προβλήματα της ανίχνευσης στερεοσκοπικών ελαττωμάτων ποιότητας, της αναγνώρισης ανθρώπινων δραστηριοτήτων σε στερεοσκοπικές εικονοσειρές, της αυτόματης συνόψισης στερεοσκοπικών κινηματογραφικών ταινιών σύμφωνα με τις αφηγηματικές τους ιδιότητες και της αυτόματης συνόψισης εικονοσειρών ανθρώπινων δραστηριοτήτων. Η κύρια συνεισφορά μας στο πρόβλημα της ανίχνευσης στερεοσκοπικών ελαττωμάτων ποιότητας συνίσταται στην περιγραφή τεσσάρων αλγορίθμων αυτόματης ανίχνευσης και χαρακτηρισμού ελαττωμάτων για ισάριθμους τύπους ζητημάτων, κατά τη φάση της μετεπεξεργασίας στην παραγωγή κινηματογραφικού ή τηλεοπτικού υλικού. Όσον αφορά το ζήτημα της αναγνώρισης ανθρώπινων δραστηριοτήτων σε στερεοσκοπικές εικονοσειρές, προτείνονται τρόποι εκμετάλλευσης της πληροφορίας περί γεωμετρίας σκηνής την οποία κωδικοποιεί το κανάλι της στερεοσκοπικής παράλλαξης, με στόχο τη βελτίωση της απόδοσης στην αναγνώριση ανθρώπινων δραστηριοτήτων σε φυσικό σκηνικό. Η έρευνά μας επεκτάθηκε στο πρόβλημα της αυτόματης, πολυτροπικής συνόψισης στερεοσκοπικών 3Δ κινηματογραφικών ταινιών σύμφωνα με τις αφηγηματικές τους ιδιότητες, υπό τη μορφή μίας εικονοσειράς περίληψης. Προς αυτή την κατεύθυνση, αναπτύχθηκε μία πλήρης αλγοριθμική σωλήνωση συνόψισης η οποία λαμβάνει υπόψη οπτικά, ηχητικά, γεωμετρικά και αφηγηματικά χαρακτηριστικά των πλάνων και των καρέ της ταινίας. Τέλος, μελετήθηκε το ζήτημα αυτόματης συνόψισης εικονοσειρών δραστηριοτήτων μεγάλης διάρκειας, οι οποίες έχουν ορισμένες κοινές, επαναλαμβανόμενες ιδιότητες (στατική κάμερα, στατικό υπόβαθρο, υψηλό βαθμό οπτικής ομοιότητας μεταξύ των καρέ) και μπορούν να προκύψουν από ποικιλία πηγών (κάμερες επιτήρησης, συνεδρίες καταγραφής σε κινηματογραφικές/τηλεοπτικές παραγωγές κλπ.). Για την επίλυση του προβλήματος, αναπτύχθηκε ένα νέο αλγοριθμικό πλαίσιο συνόψισης εικονοσειρών δραστηριοτήτων, υπό τη μορφή της εξαγωγής ενός συνόλου αντιπροσωπευτικών καρέ-κλειδιών που συνοψίζει βέλτιστα τις διαφορετικές εικονιζόμενες δραστηριότητες.

arXiv (Cornell University), May 3, 2023

Automated detection of contraband items in X-ray images can significantly increase public safety,... more Automated detection of contraband items in X-ray images can significantly increase public safety, by enhancing the productivity and alleviating the mental load of security officers in airports, subways, customs/post offices, etc. The large volume and high throughput of passengers, mailed parcels, etc., during rush hours make it a Big Data analysis task. Modern computer vision algorithms relying on Deep Neural Networks (DNNs) have proven capable of undertaking this task even under resourceconstrained and embedded execution scenarios, e.g., as is the case with fast, single-stage, anchor-based object detectors. This paper proposes a twofold improvement of such algorithms for the Xray analysis domain, introducing two complementary novelties. Firstly, more efficient anchors are obtained by hierarchical clustering the sizes of the ground-truth training set bounding boxes; thus, the resulting anchors follow a natural hierarchy aligned with the semantic structure of the data. Secondly, the default Non-Maximum Suppression (NMS) algorithm at the end of the object detection pipeline is modified to better handle occluded object detection and to reduce the number of false predictions, by inserting the Efficient Intersection over Union (E-IoU) metric into the Weighted Cluster NMS method. E-IoU provides more discriminative geometrical correlations between the candidate bounding boxes/Regions-of-Interest (RoIs). The proposed method is implemented on a common single-stage object detector (YOLOv5) and its experimental evaluation on a relevant public dataset indicates significant accuracy gains over both the baseline and competing approaches. This highlights the potential of Big Data analysis in enhancing public safety.

Vision-based drone control for autonomous UAV cinematography

Multimedia Tools and Applications, Aug 15, 2023

Secure Communications for Autonomous Multiple-UAV Media Production

Springer eBooks, 2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper presents a novel neural module for enhancing existing fast and lightweight 2D human po... more This paper presents a novel neural module for enhancing existing fast and lightweight 2D human pose estimation CNNs, in order to increase their accuracy. A baseline stem CNN is augmented by a collateral module, which is tasked to encode global spatial and semantic information and provide it to the stem network during inference. The latter one outputs the final 2D human pose estimations. Since global information encoding is an inherent subtask of 2D human pose estimation, this particular setup allows the stem network to better focus on the local details of the input image and on precisely localizing each body joint, thus increasing overall 2D human pose estimation accuracy. Furthermore, the collateral module is designed to be lightweight, adding negligible runtime computational cost, so that the unified architecture retains the fast execution property of the stem network. Evaluation of the proposed method on public 2D human pose estimation datasets shows that it increases the accuracy of different baseline stem CNNs, while outperforming all competing fast 2D human pose estimation methods.

Fast multidimensional scaling on big geospatial data using neural networks

Earth Science Informatics

Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

State-of-the-art deep neural unsupervised video summarization methods mostly fall under the adver... more State-of-the-art deep neural unsupervised video summarization methods mostly fall under the adversarial reconstruction framework. This employs a Generative Adversarial Network (GAN) structure and Long Short-Term Memory (LSTM) autoencoders during its training stage. The typical result is a selector LSTM that sequentially receives video frame representations and outputs corresponding scalar importance factors, which are then used to select key-frames. This basic approach has been augmented with an additional Deep Reinforcement Learning (DRL) agent, trained using the Discriminator's output as a reward, which learns to optimize the selector's outputs. However, local minima are a well-known problem in DRL. Thus, this paper presents a novel regularizer for escaping local loss minima, in order to improve unsupervised key-frame extraction. It is an additive loss term employed during a second training phase, that rewards the difference of the neural agent's parameters from those of a previously found good solution. Thus, it encourages the training process to explore more aggressively the parameter space in order to discover a better local loss minimum. Evaluation performed on two public datasets shows considerable increases over the baseline and against the state-of-the-art. CCS CONCEPTS • Computing methodologies → Reinforcement learning; Video summarization; Neural networks.

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Non-Maximum Suppression (NMS) is a post-processing step in almost every visual object detector, t... more Non-Maximum Suppression (NMS) is a post-processing step in almost every visual object detector, tasked with rapidly pruning the number of overlapping detected candidate rectangular Regions-of-Interest (RoIs) and replacing them with a single, more spatially accurate detection (in pixel coordinates). The common Greedy NMS algorithm suffers from drawbacks, due to the need for careful manual tuning. In visual person detection, most NMS methods typically suffer when analyzing crowded scenes with high levels of inbetween occlusions. This paper proposes a modification on a deep neural architecture for NMS, suitable for such cases and capable of efficiently cooperating with recent neural object detectors. The method approaches the NMS problem as a rescoring task, aiming to ideally assign precisely one detection per object. The proposed modification exploits the extraction of RoI representations, semantically capturing the region's visual appearance, from information-rich feature maps computed by the detector's intermediate layers. Experimental evaluation on two common public person detection datasets shows improved accuracy against competing methods, with acceptable inference speed.

Neural Attention-driven Non-Maximum Suppression for Person Detection

IEEE Transactions on Image Processing, 2023

Non-maximum suppression (NMS) is a postprocessing step in almost every visual object detector. NM... more Non-maximum suppression (NMS) is a postprocessing step in almost every visual object detector. NMS aims to prune the number of overlapping detected candidate regionsof-interest (RoIs) on an image, in order to assign a single and spatially accurate detection to each object. The default NMS algorithm (GreedyNMS) is fairly simple and suffers from severe drawbacks, due to its need for manual tuning. A typical case of failure with high application relevance is pedestrian/person detection in the presence of occlusions, where GreedyNMS doesn't provide accurate results. This paper proposes an efficient deep neural architecture for NMS in the person detection scenario, by capturing relations of neighboring RoIs and aiming to ideally assign precisely one detection per person. The presented Seq2Seq-NMS architecture assumes a sequence-to-sequence formulation of the NMS problem, exploits the Multihead Scale-Dot Product Attention mechanism and jointly processes both geometric and visual properties of the input candidate RoIs. Thorough experimental evaluation on three public person detection datasets shows favourable results against competing methods, with acceptable inference runtime requirements. ACCESSIBLE AT: https://www.researchgate.net/publication/370205735_Neural_Attention-driven_Non-Maximum_Suppression_for_Person_Detection

Fast CNN-based Single-Person 2D Human Pose Estimation for Autonomous Systems

IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022

This paper presents a novel Convolutional Neural Network (CNN) architecture for 2D human pose est... more This paper presents a novel Convolutional Neural Network (CNN) architecture for 2D human pose estimation from RGB images that balances between high 2D human pose/skeleton estimation accuracy and rapid inference. Thus, it is suitable for safety-critical embedded AI scenarios in autonomous systems, where computational resources are typically limited and fast execution is often required, but accuracy cannot be sacrificed. The architecture is composed of a shared feature extraction backbone and two parallel heads attached on top of it: one for 2D human body joint regression and one for global human body structure modelling through Image-to-Image Translation (I2I). A corresponding multitask loss function allows training of the unified network for both tasks, through combining a typical 2D body joint regression with a novel I2I term. Along with enhanced information flow between the parallel neural heads via skip synapses, this strategy is able to extract both ample semantic and rich spatial information, while using a less complex CNN; thus it permits fast execution. The proposed architecture is evaluated on public 2D human pose estimation datasets, achieving the best accuracy-speed ratio compared to the state-of-the-art. Additionally, it is evaluated on a pedestrian intention recognition task for self-driving cars, leading to increased accuracy and speed in comparison to competing approaches. ACCESSIBLE AT: https://www.researchgate.net/publication/363882289_Fast_CNN-based_Single-Person_2D_Human_Pose_Estimation_for_Autonomous_Systems

Neural Knowledge Transfer for Sentiment Analysis in Texts with Figurative Language

Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2022

Sentiment analysis in texts, also known as opinion mining, is a significant Natural Language Proc... more Sentiment analysis in texts, also known as opinion mining, is a significant Natural Language Processing (NLP) task, with many applications in automated social media monitoring, customer feedback processing, e-mail scanning, etc. Despite recent progress due to advances in Deep Neural Networks (DNNs), texts containing figurative language (e.g., sarcasm, irony, metaphors) still pose a challenge to existing methods due to the semantic ambiguities they entail. In this paper, a novel setup of neural knowledge transfer is proposed for DNN-based sentiment analysis of figurative texts. It is employed for distilling knowledge from a pretrained binary recognizer of figurative language into a multiclass sentiment classifier, while the latter is being trained under a multitask setting. Thus, hints about figurativeness implicitly help resolve semantic ambiguities. Evaluation on a relevant public dataset indicates that the proposed method leads to state-ofthe-art accuracy. ACCESSIBLE AT: https://www.researchgate.net/publication/362902257_Neural_Knowledge_Transfer_for_Sentiment_Analysis_in_Texts_With_Figurative_Language

AUTH-Persons: a dataset for detecting humans in crowds from aerial views

Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022

Recent advances in artificial intelligence, control and sensing technologies have facilitated the... more Recent advances in artificial intelligence, control and sensing technologies have facilitated the development of autonomous Unmanned Aerial Vehicles (UAVs). Detecting humans from video input captured on-the-fly from UAVs is a critical task for ensuring flight safety, mostly handled with lightweight Deep Neural Networks (DNNs). However the detection of individual people in the case of dense crowds and/or distribution shifts (i.e., significant visual differences between the training and the test sets) is still very challenging. This paper presents AUTH-Persons, a new, annotated, publicly available video dataset, that consists of both real and synthetic footage, suitable for training and evaluating aerial-view person detection algorithms. The synthetic data were collected from 8 visually distinct photorealistic outdoor environments and they mostly contain scenes with crowded areas, where heavy occlusions and high person densities pose challenges to common detectors. This dataset is employed to evaluate the generalization performance of various stateof-the-art detection frameworks, by testing them on environments that are visually distinct from those they have been trained on. Finally, given that Non-Maximum Suppression (NMS) methods at the end of person detection pipelines typically suffer in crowded scenes, the performance of various NMS algorithms is also compared in AUTH-Persons. ACCESSIBLE AT: https://www.researchgate.net/publication/364786488_Auth-Persons_A_Dataset_for_Detecting_Humans_in_Crowds_from_Aerial_Views

An Efficient Framework for Human Action Recognition Based on Graph Convolutional Networks

2022 IEEE International Conference on Image Processing (ICIP), 2022

This paper presents a novel framework for skeleton-based Human Action Recognition (HAR) based on ... more This paper presents a novel framework for skeleton-based Human Action Recognition (HAR) based on Graph Convolution Networks (GCNs). The proposed framework aims to increase human action recognition performance of GCN-based methods by incorporating a missing-joint-handling pre-processing step and a novel adjacency matrix construction method in a single human action recognition pipeline. The missing-joint-handling pre-processing step is utilized to infer missing data in the input sequence, which may occur due to imperfect skeleton extraction, based on imputation methods. The novel adjacency matrix construction method is executed offline to compute an improved weighted adjacency matrix specifically designed for HAR, which is utilized in every layer of the employed GCN. Moreover, both the pre-processing step and the adjacency construction method can be utilized along with any GCN architecture, allowing any GCN-based HAR method to be employed in the proposed framework. Experimental evaluation on two public datasets indicate favorable human action classification scores compared to the employed baseline and all competing methods both for 2D and 3D skeleton-based human action recognition, while using a GCN architecture with less learnable parameters. ACCESSIBLE AT: https://www.researchgate.net/publication/364836699_An_efficient_framework_for_human_action_recognition_based_on_Graph_Convolutional_Networks

Fast Semantic Image Segmentation for Autonomous Systems

2022 IEEE International Conference on Image Processing (ICIP), 2022

Fast semantic image segmentation is crucial for autonomous systems, as it allows an autonomous sy... more Fast semantic image segmentation is crucial for autonomous systems, as it allows an autonomous system (e.g., self-driving car, drone, etc.) to interpret its environment on-the-fly and decide on necessary actions by exploiting dense semantic maps. The speed of semantic segmentation on embedded computational hardware is as important as its accuracy. Thus, this paper proposes a novel framework for semantic image seg-mentation that is both fast and accurate. It augments existing real-time semantic image segmentation architectures by an auxiliary, parallel neural branch that is tasked to predict semantic maps in an alternative manner by utilizing Generative Adversarial Networks (GANs). Additional attention-based neural synapses linking the two branches allow information to flow between them during both the training and the inference stage. Extensive experiments on three public datasets for autonomous driving and for aerial-perspective image analysis indicate non-negligible gains in segmentation accuracy, without compromises on inference speed. ACCESSIBLE AT: https://www.researchgate.net/publication/364836833_Fast_Semantic_Image_Segmentation_for_Autonomous_Systems

Autonomous UAV Cinematography

Proceedings of the 30th ACM International Conference on Multimedia, Oct 10, 2022

The use of camera-equipped Unmanned Aerial Vehicles (UAVs, or"drones") for professional media pro... more The use of camera-equipped Unmanned Aerial Vehicles (UAVs, or"drones") for professional media production is already an exciting commercial reality. Currently available consumer UAVs for cinematography applications are equipped with high-end cameras and a degree of cognitive autonomy relying on artificial intelligence(AI). Current research promises to further exploit the potential of autonomous functionalities in the immediate future, resulting in portable flying robotic cameras with advanced intelligence concerning autonomous landing, subject detection/tracking, cinematic shot execution, 3D localization and environmental mapping, as well as autonomous obstacle avoidance combined with on-line motion replanning. Disciplines driving this progress are computer vision, machine/deep learning and aerial robotics. This Tutorial emphasizes the definition and formalization of UAV cinematography aesthetic components, as well as the use of robotic planning/control methods for autonomously capturing them on footage, without the need for manual tele-operation. Additionally, it focuses on state-of-the-art Imitation Learning and Deep Reinforcement Learning approaches for automated UAV/camera control, path planning and cinematography planning, in the general context of "flying & filming". ACCESSIBLE AT: https://www.researchgate.net/publication/364480491_Autonomous_UAV_Cinematography

Gesture Recognition by Self-Supervised Moving Interest Point Completion for CNN-LSTMs

2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2022

Gesture recognition, i.e., classification of videos depicting humans who perform hand gestures, i... more Gesture recognition, i.e., classification of videos depicting humans who perform hand gestures, is essential for Human-Computer Interaction. To this end, coupled Convolu-tional Neural Networks-Long Short-Term Memory architectures (CNN-LSTMs) are employed for fast semantic video analysis, but the typical transfer learning approach of initializing the CNN backbone using pretraining for whole-image classification is not necessarily ideal for spatiotemporal video understanding tasks. This paper investigates self-supervised CNN pretraining for a novel pretext task, relying on spatiotemporal video frame corruption via a set of low-level image/video processing building blocks that jointly force the CNN to learn to complete missing content. This is likely to coincide with visible moving object boundaries, including human body silhouettes. Such a CNN parameter set initialization is then able to augment gesture recognition performance, after retraining for this video classification downstream task, without inducing any runtime overhead during the inference stage. Evaluation on a gesture recognition dataset for autonomous Unmanned Aerial Vehicle (UAV) handling demonstrates the effectiveness of the proposed method, against both traditional ImageNet initialization and a competing self-supervised pretext task-based initialization. ACCESSIBLE AT: https://www.researchgate.net/publication/361583587_Gesture_Recognition_by_Self-Supervised_Moving_Interest_Point_Completion_for_CNN-LSTMs