Sotirios Xydis - Profile on Academia.edu

Papers by Sotirios Xydis

AEGLE: A big bio-data analytics framework for integrated health-care services

2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015

The AEGLE project aims to build an innovative ICT solution addressing the whole data value chain for health, based on cloud computing enabling dynamic resource allocation, HPC infrastructures for computational acceleration, and advanced visualization techniques. In this paper, we provide an analysis of the addressed Big Data health scenarios and we describe the key enabling technologies, as well as data privacy and regulatory issues to be integrated into AEGLE's ecosystem, enabling advanced health-care analytic services, while also promoting related research activities.

High Performance and Area Efficient Flexible

This paper presents a new methodology for the synthesis of high performance flexible datapaths, targeting computationally intensive digital signal processing kernels of embedded applications. The proposed methodology is based on a novel coarse-grained reconfigurable/flexible architectural template, which enables the combined exploitation of the horizontal and vertical parallelism along with the operation chaining opportunities found in the application's behavioral description. Efficient synthesis techniques exploiting these architectural optimization concepts from a higher level of abstraction are presented and analyzed. Extensive experimentation showed average latency and area reductions up to 33.9% and 53.9%, respectively, and higher hardware area utilization, compared to previously published high performance coarse-grained reconfigurable datapaths.

Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management

Two-dimensional rectangular bin packing (2DBP) is a known abstraction of dynamic storage allocation (DSA). We argue that such abstractions can aid practical purposes. 2DBP algorithms optimize their placements' makespan, i.e., the size of the used address range. Demand paging-enabled virtual memory systems render makespan irrelevant: allocators commonly employ sparse addressing and need worry only about fragmentation caused within page boundaries. But in the embedded domain, where portions of memory are statically pre-allocated, makespan remains a reasonable metric. Recent work has shown that viewing allocators as blackbox 2DBP solvers bears meaning. There exists a 2DBP-based fragmentation metric which often correlates monotonically with maximum resident set size (RSS). Given the field's indeterminacy with respect to fragmentation definitions, as well as the immense value of physical memory savings, we are motivated to set allocator-generated placements against their 2DBP-devised, makespan-optimizing counterparts. Of course, allocators must operate online while 2DBP algorithms work on complete request traces; but since both sides aim for minimum memory wastage, the idea of studying their relationship preserves its intellectual, and practical, interest.
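To make the 2DBP view of allocation concrete, the following is a minimal sketch and not the paper's algorithm. It treats each request as a rectangle whose width is its lifetime and whose height is its size, places the rectangles offline with a simple largest-first, lowest-offset heuristic, and compares the resulting makespan with the peak number of live bytes, which is a lower bound on any placement's makespan. The `Block` type, the function names, and the heuristic itself are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    """One allocation request: alive during [start, end), needs `size` bytes."""
    start: int
    end: int
    size: int


def overlaps_in_time(a: Block, b: Block) -> bool:
    """True if the two half-open lifetimes intersect."""
    return a.start < b.end and b.start < a.end


def first_fit_offline(blocks: List[Block]) -> List[Tuple[Block, int]]:
    """Place blocks largest-first at the lowest offset that does not collide
    with any already-placed block alive at the same time (heuristic, not optimal)."""
    placed: List[Tuple[Block, int]] = []
    for blk in sorted(blocks, key=lambda b: -b.size):
        # Address ranges occupied by blocks whose lifetime overlaps this one.
        busy = sorted((off, off + p.size) for p, off in placed
                      if overlaps_in_time(p, blk))
        offset = 0
        for lo, hi in busy:
            if offset + blk.size <= lo:
                break  # the gap below this range is large enough
            offset = max(offset, hi)
        placed.append((blk, offset))
    return placed


def makespan(placement: List[Tuple[Block, int]]) -> int:
    """Size of the used address range: the highest end address of any block."""
    return max((off + blk.size for blk, off in placement), default=0)


def peak_live_bytes(blocks: List[Block]) -> int:
    """Maximum total size of simultaneously live blocks."""
    events = []
    for b in blocks:
        events.append((b.start, b.size))
        events.append((b.end, -b.size))
    # Tuple ordering processes frees before allocations at the same timestamp.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

The gap between `makespan(...)` and `peak_live_bytes(...)` is one simple way to express wasted address space for a given trace; the paper's own fragmentation metric may be defined differently.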

IEEE Transactions on Parallel and Distributed Systems

International high-energy particle physics research centers, like CERN and Fermilab, require extensive studies and simulations to plan for the upcoming upgrades of the world's largest particle accelerators, and the design of future machines given the technological challenges and tight budgetary constraints. The Beam Longitudinal Dynamics (BLonD) simulator suite incorporates the most detailed and complex physics phenomena in the field of longitudinal beam dynamics, required for providing extremely accurate predictions. Modern challenges in beam dynamics call for longer, larger and more numerous simulation studies to draw meaningful conclusions that will drive the baseline choices for the daily operation of current machines and the design choices of future projects. These studies are extremely time consuming, and would be impractical to perform without a High-Performance Computing oriented simulator framework. In this article, we first design and evaluate a highly-optimized distributed version of BLonD. We combine approximate computing techniques, and leverage a dynamic load-balancing scheme to relax synchronization and improve scalability. In addition, we employ GPUs to accelerate the distributed implementation. We evaluate the highly optimized distributed beam longitudinal dynamics simulator in a supercomputing system and demonstrate speedups of more than two orders of magnitude when run on 32 GPU platforms, w.r.t. the previous state of the art. By driving a wide range of new studies, the proposed high performance beam longitudinal dynamics simulator forms an invaluable tool for accelerator physicists.

Software Design and Optimization of ECG Signal Analysis and Diagnosis for Embedded IoT Devices

Components and Services for IoT Platforms, 2016

The medical domain is one of the most rapidly expanding application areas of Internet of Things (IoT) technology. For chronic diseases, this technology can be highly useful for the patient, providing constant monitoring and the ability for timely intervention of medical staff in case of an emergency. This intended system behavior imposes new requirements on the design and implementation of processing flows implemented on embedded IoT devices, which are already constrained by limited computational capabilities and power budget. This work aims at designing and implementing such a bio-medical signal analysis flow based on the case study of arrhythmia detection using electrocardiogram signals and machine learning techniques. Different architectural decisions of the flow are explored at high level and the final optimized version is implemented on a state-of-the-art IoT node. The evaluation of the execution flow on this device provides information on the actual requirements of each sub-component of the flow combined with an analysis of its behavior as computational requirements of the machine learning algorithms scale up.

2016 5th International Conference on Modern Circuits and Systems Technologies (MOCAST), 2016

Healthcare is one of the most rapidly expanding application areas of the Internet of Things (IoT) technology. IoT devices can be used to enable remote health monitoring of patients with chronic diseases such as cardiovascular diseases (CVD). In this paper we develop an algorithm for ECG analysis and classification for heartbeat diagnosis, and implement it on an IoT-based embedded platform. This algorithm is our proposal for a wearable ECG diagnosis device, suitable for 24-hour continuous monitoring of the patient. We use Discrete Wavelet Transform (DWT) for the ECG analysis, and a Support Vector Machine (SVM) classifier. The best classification accuracy achieved is 98.9%, for a feature vector of size 18, and 2493 support vectors. Different implementations of the algorithm on the Galileo board help demonstrate that the computational cost is such that the ECG analysis and classification can be performed in real time.
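As an illustration of the DWT stage, here is a minimal sketch of a multi-level discrete wavelet decomposition that yields a fixed-size feature vector. The abstract does not state which wavelet family or which coefficients the authors use, so the Haar wavelet, the `feature_vector` selection rule, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math
from typing import List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal: Sequence[float], levels: int) -> List[List[float]]:
    """Multi-level decomposition, returned coarsest band first:
    [approximation_n, detail_n, ..., detail_1]."""
    coeffs: List[List[float]] = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs


def feature_vector(beat: Sequence[float], levels: int, size: int) -> List[float]:
    """Hypothetical feature choice: the first `size` coefficients, coarsest first."""
    flat = [c for band in haar_dwt(beat, levels) for c in band]
    return flat[:size]
```

A vector produced this way (for example, of size 18 as in the abstract) would then be passed to the SVM classifier; since classification cost grows with the number of support vectors times the feature size, both parameters matter for real-time operation on an embedded board.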

Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

This paper introduces an innovative post-implementation Dynamic Frequency Boosting (DFB) technique to release "hidden" performance margins of digital circuit designs currently suppressed by typical critical path constraint design flows, thus defining higher limits of operation speed. The proposed technique goes beyond state-of-the-art and exploits the data-driven path delay variability incorporating an innovative hardware clocking mechanism that detects in real-time the paths' activation. In contrast to timing speculation, the operating speed is adjusted on the nominal path delay activation, achieving error-free acceleration. The proposed technique has been evaluated on three FPGA-based use cases carefully selected to exhibit differing domain characteristics, i.e., (i) a third party DNN inference accelerator IP for CIFAR-10 images achieving an average speedup of 18%, (ii) a highly designer-optimized Optical Digital Equalizer design, in which DFB delivered a speedup of 50%, and (iii) a set of 5 synthetic designs examining high frequency (beyond 400 MHz) applications in FPGAs, achieving accelerations of 20-60% depending on the underlying path variability.

EVOLVE: Towards Converging Big-Data, High-Performance and Cloud-Computing Worlds

2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)

FaaS and Curious: Performance Implications of Serverless Functions on Edge Computing Platforms

Lecture Notes in Computer Science, 2021

2016 26th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2016

Electrocardiogram (ECG) analysis has been established as a key element regarding the evaluation of the human health status. The computational complexity along with the strict constraints of real-time assessment of a heart beat has made the ECG analysis flow a very challenging application for embedded medical devices. Recent advancements in cyber-physical and IoT systems are transforming medical processing towards embedded and wearable devices, thus making energy consumption a first class design objective. In this work, we focus on analysing the power, performance and energy profiles of an ECG analysis and arrhythmia detection software pipeline during its execution on a ZYNQ-based SoC. We evaluate a large set of design alternatives spanning from a pure software-only implementation to HW/SW oriented designs, in which High-Level Synthesis capabilities are utilized. Using the medically validated MIT-BIH ECG database, we examine the efficiency and the sensitivity of the design solutions in different operating frequencies and examine three Quality of Service (QoS) levels concerning the sampling rate of the ECG signal.

IEEE Transactions on Circuits and Systems II: Express Briefs, 2018

Approximate computing forms a promising paradigm shift for energy efficient design by aggressively decreasing power consumption of inherently error tolerant applications. However, approximate computing architectures exacerbate the design complexity due to the diversity of inexact techniques and their impact on final circuit implementations. In this brief, we introduce approximate accelerator synthesis which enables the design of power optimized inexact circuits under error bound constraints, by leveraging the incorporation of diverse multi-level approximate techniques. We show the high efficiency of adopting multi-level approximation techniques and present a systematic approach for integrating multi-level approximation in hardware accelerator synthesis under error bound and voltage island constraints.
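The idea of choosing an approximation level under an error bound can be sketched with a toy example. The code below is not the synthesis flow described in the brief: it models a single, simple approximation (an adder that ignores the `k` least-significant bits of both operands), measures its worst-case error exhaustively, and selects the most aggressive `k` that still respects a given bound. The adder model and every function name are assumptions made for illustration.

```python
def truncated_add(a: int, b: int, k: int) -> int:
    """Approximate adder: clear the k least-significant bits of both operands."""
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)


def worst_case_error(bits: int, k: int) -> int:
    """Exhaustively measure the maximum absolute error over all
    pairs of unsigned `bits`-wide operands."""
    top = 1 << bits
    return max(abs((a + b) - truncated_add(a, b, k))
               for a in range(top) for b in range(top))


def max_truncation_within(bits: int, error_bound: int) -> int:
    """Largest k whose worst-case error does not exceed the bound."""
    best = 0
    for k in range(bits + 1):
        if worst_case_error(bits, k) <= error_bound:
            best = k
    return best
```

For this adder the worst-case error works out to 2 * (2^k - 1), so each extra truncated bit roughly doubles the error. A real flow would weigh such error figures against the power saved at each approximation level and combine several techniques, which is where the design complexity mentioned in the abstract comes from.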

Rapid prototyping and Design Space Exploration methodologies for many-accelerator systems

2015 25th International Conference on Field Programmable Logic and Applications (FPL), 2015

The ever-growing design complexity of modern embedded systems and the need for lower energy consumption have led to design techniques which aim to bridge the gap between the designer's productivity and the design complexity. In particular, Virtual Prototyping enables the system modeling and simulation in multiple abstraction levels, while the automated Design Space Exploration (DSE) aims to find optimized design solutions in a reasonable time. However, there is the need for more efficient techniques for prototyping and co-simulation, as rapid simulation has become a stringent requirement. In addition, as emerging heterogeneous architectures expose even higher design complexity, typical DSE techniques may not achieve high-quality design solutions. Towards this direction, the proposed design flow introduces (a) a set of prototyping techniques which target faster yet accurate simulation, also supporting the system co-simulation with other environments, and (b) a number of DSE methodologies for high-complexity computation and communication architectures.

MAx-DNN: Multi-Level Arithmetic Approximation for Energy-Efficient DNN Hardware Accelerators

2022 IEEE 13th Latin America Symposium on Circuits and System (LASCAS)

© 2019 Association for Computing Machinery. The massive increase of IoT devices and their collected data raises the question of how to analyze all that data. Edge computing provides a suitable compromise, but the question remains: How much processing should be done locally vs. offloaded to other devices? The diverse application requirements and limited resources at the edge extend the challenges. We propose Oops, an optimization framework to adapt the resource management at runtime in a distributed manner. It orchestrates the IoT devices and adapts their operation mode with respect to their constraints and the gateway's limited shared resources. Oops reduces runtime overhead significantly while increasing user utility compared to state-of-the-art.

BLonD++: performance analysis and optimizations for enabling complex, accurate and fast beam dynamics studies

This paper focuses on the performance analysis and optimization for enabling efficient implementations of next generation beam dynamics simulations. Nowadays large worldwide research centers, e.g., CERN and Fermilab, are continuously investing in resources and infrastructures for progressing knowledge in the fields of particle physics, thus requiring careful studies and planning for the upcoming upgrades of the synchrotrons and the design of future machines. Consequently, there is an emerging need for simulations that incorporate a collection of complex physics phenomena and produce extremely accurate predictions while keeping the computing resources and run-time to a minimum. A variety of simulator suites have been developed; however, they have been reported to lack in simulation speed, features and ease-of-use. In this paper we introduce the Beam Longitudinal Dynamics (BLonD) simulator suite from a computer engineering perspective. We analyze its performance to understand its current ...

Fade

Proceedings of the 24th International Workshop on Software and Compilers for Embedded Systems, 2021

Lately, more and more applications are deployed on heterogeneous, power-constrained edge-computing devices. Bringing computation closer to the data contributes to both latency and energy consumption reduction due to the elimination of excessive data transfers. However, while the main concern in such environments is the minimization of energy consumption, the heterogeneity in compute resources found at the edge may lead to Quality of Service (QoS) violations. At the same time, Serverless computing, the next frontier of Cloud computing, has emerged to offer unprecedented elasticity by utilizing fine-grained, stateless functions. The reduction in the execution time and the modest memory footprint of such decomposed applications allow for fine-grained resource multiplexing. In this work, we propose a methodology for application decomposition into fine-grained functions and energy-aware function placement on a cluster of edge devices subject to user-specified QoS guarantees.
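The placement problem stated above can be illustrated with a small greedy sketch. This is not the methodology proposed in the paper: it assumes that per-function latency and energy on each device are already profiled, that the QoS guarantee is a per-function latency bound, and that each device offers a fixed number of function slots. All names and the data layout are hypothetical.

```python
from typing import Dict, List, Optional


def place_functions(
    functions: List[str],
    latency_ms: Dict[str, Dict[str, float]],
    energy_j: Dict[str, Dict[str, float]],
    capacity: Dict[str, int],
    qos_ms: float,
) -> Dict[str, Optional[str]]:
    """For each function, pick the lowest-energy device that still has a free
    slot and meets the latency bound; None marks a function left unplaced."""
    free = dict(capacity)  # remaining slots per device
    placement: Dict[str, Optional[str]] = {}
    for fn in functions:
        candidates = [d for d in free
                      if free[d] > 0 and latency_ms[fn][d] <= qos_ms]
        if not candidates:
            placement[fn] = None
            continue
        best = min(candidates, key=lambda d: energy_j[fn][d])
        free[best] -= 1
        placement[fn] = best
    return placement
```

Because the choice is made one function at a time, this greedy rule can miss a lower-energy global assignment; it is meant only to show how the energy objective and the QoS constraint interact on heterogeneous devices.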

IEEE Transactions on Multi-Scale Computing Systems, 2018

Modern many-core computing platforms execute a diverse set of dynamic workloads in the presence of varying application arrival rates. This inflicts strict requirements on run-time management to efficiently allocate system resources. On the way towards kilo-core processor architectures, centralized resource management approaches will most probably form a severe performance bottleneck, thus focus has been turned to the study of Distributed Run-Time Resource Management (DRTRM) schemes. In this article, we examine the behavior of a DRTRM of dynamic applications with malleable characteristics against stressing incoming application interval rate scenarios, using Intel SCC as the target many-core system. We show that resource allocation is highly affected by application input rate and propose an application-arrival aware DRTRM framework implementing an effective admission control strategy by carefully utilizing voltage and frequency scaling on parts of its resource allocation infrastructure. Through extensive experimental evaluation, we quantitatively analyze the behavior of the introduced DRTRM scheme and show that it achieves up to 44 percent performance gains while consuming 31 percent less energy, in comparison to a state-of-the-art DRTRM solution. In comparison to a centralized RTRM, the respective metric values rise up to 62 and 45 percent performance and energy gains, respectively.
Index Terms: Distributed run-time resource management, many-core systems, Intel SCC, application admission regulation policy
1 INTRODUCTION
Contemporary technological achievements and system designs such as cloud computing drive innovation towards many-core computer architectures and dynamic applications design.
Many-core architectures [2], [3] are nowadays a reality, proposed as an effective computer organization to address the ever increasing user demands for higher performance, reliability and lower power consumption. The nature of applications' design is also evolving by incorporating dynamic characteristics such as high variability in workload, self-awareness and malleability, i.e., seamless run-time adaptivity to available system resources [5], [6]. The combination of highly dynamic, parallel applications and emerging technologies in many-core systems dictates the need for run-time decision making regarding resource distribution, which incorporates sophisticated logic in an effort to meet the requirements of high performance, low power consumption, safety and reliability. Inevitably, this comes at the price of high computational requirements in order to provide results within acceptable time limits. To alleviate the computational bottleneck, the resource allocation paradigm has shifted from centralized to Distributed Run-Time Resource Management (DRTRM) decision making processes. The new paradigm has been adopted, leading to user-space DRTRM implementations [7], [8], design of new Operating Systems [9], [10], and even novel implementations of the Linux Operating System with distributed, replicated kernel instances. Overall, the advantages of DRTRM are increased scalability, inherent distribution of computational burden for decision making, support for heterogeneous systems, as well as enhanced system reliability, i.e., eliminating the single point of failure of centralized approaches. However, the nature of distributed decision making does come with an increased complexity in the resource management process.
The lack of a single point with an overview of the platform leads to limited ability to adjust in scenarios where the need for resources is stressed, since numerous distributed agents need to communicate via exchanged messages in order to enforce a global policy. In this work, we identify this limited adaptivity by examining resource stressful scenarios, resulting from the arrival rate of incoming applications on many-core systems. We show that a very fast and resource hungry scenario of incoming applications can be the breaking point for the efficiency of the distributed framework. The effects of the arrival rate of incoming execution requests have been widely investigated for different target systems such as Cloud infrastructure [13], Map-Reduce Clusters [14] and many-core systems [15] but all solutions ...
The authors are with the School of Electrical and Computer Engineering, ...

Politecnico di Milano-Dipartimento di Elettronica, Informazione e Bioingegneria

Cooperative Data Fusion for Advanced Monitoring and Assessment in Healthcare Infrastructures

VOSsim: A Framework for Enabling Fast Voltage Overscaling Simulation for Approximate Computing Circuits

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Approximate computing emerges as a new design paradigm for generating energy-efficient computing systems. Voltage overscaling (VOS) forms a very promising technique to generate approximate circuits, and its application in combination with other approximate techniques is proven to lead to more efficient solutions. However, the existing design tools fail to provide effective voltage-aware simulation for early exploration of power-error approximate design tradeoffs. In this brief, we propose VOSsim, a framework that extends state-of-the-art industry-strength tools, to enable fast and accurate simulations of voltage overscaled circuits. We extensively evaluate VOSsim showing that it attains 99.2% output and 98.4% power accuracy, with an average speedup of 32× in simulation time compared to high-precision SPICE simulations, i.e., the only available solution today for VOS-aware simulation.