Todor Stefanov | Universiteit Leiden
Papers by Todor Stefanov
arXiv (Cornell University), Jul 12, 2018
In real-time systems, the application's behavior has to be predictable at compile-time to guarantee timing constraints. However, modern streaming applications, which exhibit adaptive behavior due to mode switching at run-time, may degrade system predictability because the behavior of the application during mode transitions is unknown. Therefore, proper temporal analysis of mode transitions is imperative to preserve system predictability. To this end, in this paper, we first introduce Mode-Aware Data Flow (MADF), our new predictable Model of Computation (MoC) that efficiently captures the behavior of adaptive streaming applications. Then, as an important part of the operational semantics of MADF, we propose the Maximum-Overlap Offset (MOO), our novel protocol for mode transitions. The main advantage of this transition protocol is that, in contrast to self-timed transition protocols, it avoids timing interference between modes upon mode transitions. As a result, any mode transition can be analyzed independently of the mode transitions that occurred in the past. Based on this transition protocol, we also propose a hard real-time analysis that guarantees timing constraints by avoiding processor overloading during mode transitions. Using this protocol, we can derive a lower bound and an upper bound on the earliest starting time of the tasks in the new mode during mode transitions such that hard real-time constraints are respected.
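As an illustration of the transition idea, the following is a minimal, hypothetical sketch (not the paper's exact analysis; all names and the simplified timing model are assumptions): the start of the new mode is postponed by the largest per-processor overlap between the old mode's remaining busy interval and the new mode's earliest start, so the two modes never overload a shared processor.

```python
def transition_offset(old_busy_until, new_start, shared_processors):
    """Postpone the new mode just enough to avoid overlap on every
    shared processor (simplified, MOO-style offset computation).

    old_busy_until[p]: time until which processor p is busy with old-mode tasks
    new_start[p]:      earliest start time of new-mode tasks on processor p
    """
    overlaps = [old_busy_until[p] - new_start[p] for p in shared_processors]
    # A negative overlap means the new mode can already start safely there.
    return max(0, *overlaps) if overlaps else 0

# Example: two shared processors; the new mode must be delayed by 3 time units.
print(transition_offset({0: 10, 1: 7}, {0: 7, 1: 8}, [0, 1]))  # -> 3
```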
Software and Compilers for Embedded Systems, Jun 28, 2010
This paper explores the energy-efficient scheduling of real-time tasks on a non-ideal DVS processor in the presence of resource sharing. We assume that tasks are periodic, preemptive, and may access shared resources. For dynamic-priority and fixed-priority scheduling, we use the earliest deadline first (EDF) algorithm and the rate monotonic (RM) algorithm, respectively, to schedule the given set of tasks. Based on the stack resource policy (SRP), we propose an approach, called the blocking-aware two-speed (BATS) algorithm, to synchronize the tasks with shared resources and to calculate appropriate execution speeds so that the shared resources can be accessed in a mutually exclusive manner and the energy consumption can be reduced. In particular, BATS uses a static low speed to execute tasks initially and switches to a high speed dynamically whenever a task blocks a higher-priority task. More specifically, the processor runs at the high speed from the beginning of the blocking until the deadline of the blocked task passes or the processor becomes idle. To guarantee that the deadlines of tasks are met, the static low speed and the dynamic high speeds are derived from a theoretical analysis of the schedulability of tasks. Compared with existing work, BATS achieves more energy savings because its dynamic high speeds are lower than those of existing work and the processor executes tasks at the high speeds less often. The schedulability analysis and the properties of our proposed BATS are provided in this paper. We also evaluated the capabilities of BATS through a series of experiments, which yielded encouraging results.
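To make the two-speed policy concrete, here is a minimal, hypothetical sketch of the switching rule described above (the speed values, task model, and helper names are assumptions, not the paper's implementation): run at a precomputed low speed, switch to the high speed whenever a task blocks a higher-priority one, and stay there until the blocked task's deadline passes or the processor idles.

```python
LOW_SPEED, HIGH_SPEED = 0.6, 1.0  # fractions of f_max, derived offline
                                  # from the schedulability analysis

class BatsProcessor:
    def __init__(self):
        self.speed = LOW_SPEED
        self.high_until = None  # time when the high-speed phase may end

    def on_blocking(self, now, blocked_task_deadline):
        # A task holding a shared resource just blocked a higher-priority
        # task: switch to the high speed until that task's deadline.
        self.speed = HIGH_SPEED
        self.high_until = blocked_task_deadline

    def on_tick(self, now, processor_idle):
        # Return to the static low speed once the blocked task's deadline
        # has passed or the processor becomes idle.
        if self.speed == HIGH_SPEED and (processor_idle or now >= self.high_until):
            self.speed = LOW_SPEED
            self.high_until = None

p = BatsProcessor()
p.on_blocking(now=3.0, blocked_task_deadline=9.0)
p.on_tick(now=9.5, processor_idle=False)
print(p.speed)  # 0.6: back to the low speed after the deadline passed
```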
High power consumption has become the major bottleneck that prevents applying Networks-on-Chip (NoCs) in future many-core systems. Power gating is an effective way to reduce the power consumption of a NoC. However, conventional power gating approaches cause a significant packet latency increase as well as additional power consumption overhead due to the power gating mechanism. One comprehensive way to reduce these negative impacts is to bypass powered-off routers in a NoC when transferring packets. Therefore, in this paper, we propose an express virtual channel based (EVC-based) power gating approach. In our approach, packets can take pre-defined virtual bypass paths to bypass intermediate routers, whether these routers are powered on or powered off. Furthermore, based on our extended router structure, powered-off routers keep a certain transmission ability to transfer packets going through the normal paths. Thus, even packets that do not take a virtual bypass path are less likely to be blocked by the powered-off routers. Compared with a conventional NoC without power gating, our EVC-based power gating approach causes only a 2.67% performance penalty, which is less than the 28.67%, 7.24%, and 5.69% penalties in related approaches. With small hardware overhead, our approach reduces on average 68.29% of the total power consumption in a NoC, which is comparable with the 72.94%, 73.56%, and 75.3% reductions in related approaches.
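The following is a minimal, hypothetical sketch of the per-hop routing decision implied by the abstract (path sets, names, and the router model are assumptions): a packet whose source and destination lie on a pre-defined bypass path skips the intermediate routers' pipelines regardless of their power state; otherwise it takes the normal path, where a powered-off router retains only a limited, slower forwarding ability.

```python
# Pre-defined virtual bypass paths: (source, destination) -> routers traversed.
EVC_PATHS = {("R0", "R3"): ["R0", "R1", "R2", "R3"]}
SLOW_FORWARD_PENALTY = 4  # extra cycles when crossing a powered-off router

def next_hop(packet, current, normal_route, powered_off):
    """Pick the next router for `packet` at `current` (simplified logic).

    normal_route: precomputed normal path, e.g. from dimension-ordered routing.
    Returns (next router, latency penalty in cycles).
    """
    bypass = EVC_PATHS.get((packet["src"], packet["dst"]))
    if bypass and current in bypass[:-1]:
        # On a bypass path the packet skips the intermediate routers'
        # pipelines, so their power state is irrelevant: no penalty.
        return bypass[bypass.index(current) + 1], 0
    nxt = normal_route[normal_route.index(current) + 1]
    # On the normal path, powered-off routers keep a limited transmission
    # ability in the extended router structure: the packet still passes,
    # but at a latency penalty.
    penalty = SLOW_FORWARD_PENALTY if nxt in powered_off else 0
    return nxt, penalty

# The packet crosses R1 via the bypass channel with no penalty,
# even though R1 is powered off.
pkt = {"src": "R0", "dst": "R3"}
print(next_hop(pkt, "R0", ["R0", "R1", "R2", "R3"], powered_off={"R1"}))
```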
Streaming applications often require a parallel Model of Computation (MoC) to specify their application behavior and to facilitate mapping onto Multi-Processor System-on-Chip (MPSoC) platforms. The various performance requirements and resource budgets of embedded systems call for an efficient design space exploration (DSE) approach to select the best design from a design space consisting of a large number of design choices. However, existing DSE approaches explore a design space that includes only architecture and mapping alternatives for an initial application specification given by the application designer. In this article, we first show that a design might not be optimal if alternative specifications of a given application are not taken into account. We further argue that the best alternative specification consists of only independent and load-balanced application tasks. Based on the Polyhedral Process Network (PPN) MoC, we present an approach to analyze and transform an initial PPN into an alternative one that contains only independent processes, if possible. Finally, by prototyping real-life applications on both FPGA-based MPSoCs and desktop multi-core platforms, we demonstrate that mapping the alternative application specification results in a large performance gain compared to approaches in which alternative application specifications are not taken into account.
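As a small illustration of the "independent and load-balanced" criterion argued above, here is a hypothetical sketch (names, the workload model, and the tolerance are assumptions) that checks whether a transformed specification has no inter-process channels left and whether its per-process workloads are balanced.

```python
def is_independent_and_balanced(processes, channels, tolerance=0.10):
    """processes: {name: workload}; channels: [(producer, consumer), ...]."""
    # Independent: the transformed PPN has no data channels between processes.
    independent = len(channels) == 0
    # Load-balanced: no process deviates from the mean workload by more
    # than `tolerance`.
    loads = list(processes.values())
    mean = sum(loads) / len(loads)
    balanced = all(abs(l - mean) / mean <= tolerance for l in loads)
    return independent and balanced

print(is_independent_and_balanced({"P0": 100, "P1": 104, "P2": 98}, []))  # True
```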
2022 25th Euromicro Conference on Digital System Design (DSD), Aug 1, 2022
It has been shown that mode-aware dataflow (MADF) is an advantageous analysis model for adaptive streaming applications. However, no attention has been paid to how to implement and execute an application, modeled and analyzed with the MADF model, on a Multi-Processor System-on-Chip such that the properties of the analysis model are preserved. Therefore, in this paper, we consider this matter and propose a generic parallel implementation and execution approach for adaptive streaming applications modeled with MADF. Our approach can be easily realized on top of existing operating systems while supporting the utilization of a wider range of schedules. In particular, we demonstrate our approach on LITMUS^RT, one of the existing real-time extensions of the Linux kernel. Finally, to show the practical applicability of our approach and its conformity to the analysis model, we present a case study using a real-life adaptive streaming application.
Communications in computer and information science, 2023
2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)
2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), 2018
Miniaturized satellites are currently not considered suitable for critical, high-priority, and complex multi-phased missions due to their low reliability. As hardware-side fault tolerance (FT) solutions designed for larger spacecraft cannot be adopted aboard very small satellites due to budget, energy, and size constraints, we developed a hybrid FT approach based only upon COTS components, commodity processor cores, library IP, and standard software. This approach facilitates fault detection, isolation, and recovery in software, and utilizes fault-coverage techniques across the embedded stack within a multiprocessor system-on-chip (MPSoC). This allows our FPGA-based proof-of-concept implementation to deliver strong fault coverage even for long-duration missions, and also to adapt to varying performance requirements during the mission. The operator of a spacecraft utilizing this approach can define performance profiles, which allow an on-board computer (OBC) to trade processing capacity, fault coverage, and energy consumption against one another using simple heuristics. The software-side FT approach developed also offers advantages if deployed aboard larger spacecraft through spare resource pooling, enabling an OBC to handle permanent faults more efficiently. This FT approach in part mimics a critical biological system's ability to tolerate faults and adapt to permanent failure, and enables graceful aging of an MPSoC.
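The profile-selection heuristic described above might look like the following minimal, hypothetical sketch (the profile values and thresholds are assumptions, not the paper's actual profiles): the OBC picks the profile with the most fault coverage that still fits the current energy budget and meets the mission phase's processing demand.

```python
# Hypothetical performance profiles: (name, relative throughput,
# replication degree used for fault coverage, relative power draw).
PROFILES = [
    ("high-performance", 1.0, 1, 1.0),   # all cores compute, no replication
    ("balanced",         0.5, 2, 0.8),   # dual modular redundancy
    ("high-coverage",    0.33, 3, 0.7),  # triple modular redundancy
]

def select_profile(required_throughput, power_budget):
    """Pick the profile with the highest replication (fault coverage)
    that still satisfies the current mission phase's demands."""
    feasible = [p for p in PROFILES
                if p[1] >= required_throughput and p[3] <= power_budget]
    if not feasible:
        return None  # the operator must relax one constraint
    return max(feasible, key=lambda p: p[2])

print(select_profile(required_throughput=0.4, power_budget=0.9))
# -> ('balanced', 0.5, 2, 0.8)
```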
arXiv (Cornell University), Jul 20, 2022
Deep Learning approaches based on Convolutional Neural Networks (CNNs) are extensively utilized and very successful in a wide range of application areas, including image classification and speech recognition. For the execution of trained CNNs, i.e., model inference, we nowadays witness a shift from the Cloud to the Edge. Unfortunately, deploying and inferring large, compute- and memory-intensive CNNs on edge devices is challenging because these devices typically have limited power budgets and compute/memory resources. One approach to address this challenge is to leverage all available resources across multiple edge devices to deploy and execute a large CNN by properly partitioning the CNN and running each CNN partition on a separate edge device. Although such distribution, deployment, and execution of large CNNs on multiple edge devices is a desirable and beneficial approach, there currently does not exist a design and programming framework that takes a trained CNN model, together with a CNN partitioning specification, and fully automates the CNN model splitting and deployment on multiple edge devices to facilitate distributed CNN inference at the Edge. Therefore, in this paper, we propose a novel framework, called AutoDiCE, for automated splitting of a CNN model into a set of sub-models and automated code generation for distributed and collaborative execution of these sub-models on multiple, possibly heterogeneous, edge devices, while supporting the exploitation of parallelism among and within the edge devices. Our experimental results show that AutoDiCE can deliver distributed CNN inference with reduced energy consumption and memory usage per edge device and, at the same time, improved overall system throughput.
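To illustrate the kind of input such a framework takes, here is a minimal, hypothetical sketch (the spec format, layer names, and helper are assumptions, not AutoDiCE's actual interface): a partitioning specification maps each edge device to a slice of the CNN's layers, and the splitter produces one sub-model per device plus the cut tensors that must be exchanged between consecutive devices.

```python
# Hypothetical partitioning specification: device -> contiguous layer range.
PARTITION_SPEC = {
    "jetson-nano-1": ("conv1", "conv5"),
    "jetson-nano-2": ("conv6", "fc8"),
}

def split_model(layers, spec):
    """layers: ordered list of layer names of the trained CNN.
    Returns {device: sub_model_layers} and the cut tensors between devices."""
    sub_models, cuts, devices = {}, [], list(spec)
    for dev, (first, last) in spec.items():
        sub_models[dev] = layers[layers.index(first): layers.index(last) + 1]
    for a, b in zip(devices, devices[1:]):
        # The output of the last layer on device a must be sent to device b.
        cuts.append((sub_models[a][-1], a, b))
    return sub_models, cuts

layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "conv6", "fc7", "fc8"]
print(split_model(layers, PARTITION_SPEC))
```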
ACM Transactions on Embedded Computing Systems, Feb 8, 2022
Convolutional Neural Networks (CNNs) are biologically inspired computational models that are at the heart of many modern computer vision and natural language processing applications. Some CNN-based applications are executed on mobile and embedded devices. Execution of CNNs on such devices places numerous demands on the CNNs, such as high accuracy, high throughput, low memory cost, and low energy consumption. These requirements are very difficult to satisfy at the same time, so CNN execution at the edge typically involves trade-offs (e.g., high CNN throughput is achieved at the cost of decreased CNN accuracy). In existing methodologies, such trade-offs are either chosen once and remain unchanged during a CNN-based application's execution, or are adapted to the properties of the CNN input data. However, the application needs can also be significantly affected by changes in the application environment, such as a change in the battery level of the edge device. Thus, CNN-based applications need a mechanism that allows them to dynamically adapt their characteristics to changes in the application environment at run-time. Therefore, in this article, we propose a scenario-based run-time switching (SBRS) methodology that implements such a mechanism.
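A minimal, hypothetical sketch of such a run-time switching mechanism (the scenario definitions and thresholds are assumptions, not the SBRS implementation): each scenario pairs a CNN variant with the environment conditions under which it should run, and a monitor switches scenarios when, for example, the battery level changes.

```python
# Hypothetical scenarios: (minimum battery level, CNN variant to execute).
SCENARIOS = [
    (0.50, "resnet50-full"),     # plenty of energy: highest accuracy
    (0.20, "resnet50-pruned"),   # medium battery: cheaper, less accurate
    (0.00, "mobilenet-tiny"),    # low battery: minimal energy per frame
]

def pick_scenario(battery_level):
    for threshold, variant in SCENARIOS:
        if battery_level >= threshold:
            return variant

active = pick_scenario(0.35)   # -> 'resnet50-pruned'
# The application keeps inferring with `active` and re-evaluates the
# scenario whenever the monitored environment parameters change.
print(active)
```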
Loops are an important source of optimization opportunities. In this paper, we propose a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization and programming paradigm as our framework. The proposed method extends our previous work on loop unrolling for reconfigurable architectures by combining unrolling with shifting to relocate the function calls contained in the loop body such that, in every iteration of the transformed loop, software functions (running on the GPP) execute in parallel with multiple instances of the kernel (running on the FPGA). The algorithm is based on profiling information about the kernel's execution times on the GPP and FPGA, memory transfers, and area utilization. In the experimental part, we apply this method to a loop nest extracted from the MPEG2 encoder containing the DCT kernel. The achieved speedup is 19.65x over software execution and 1.8x over loop unrolling.
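The transformation can be pictured with a small, hypothetical Python sketch (the function names and unroll factor are assumptions, and threads stand in for the FPGA): after unrolling by the number of kernel instances that fit on the fabric and shifting the software calls one stage ahead, each transformed iteration overlaps the GPP's software work with the FPGA kernels.

```python
from concurrent.futures import ThreadPoolExecutor

U = 2  # unroll factor: number of kernel instances fitting on the FPGA

def transformed_loop(n, sw_func, hw_kernel):
    """Original loop body: sw_func(i) prepares data, hw_kernel(i) consumes it.
    After unrolling by U and shifting the software calls one stage ahead,
    the kernels of iterations i..i+U-1 run on the FPGA while the GPP
    already executes the software part of iterations i+U..i+2U-1."""
    for k in range(min(U, n)):
        sw_func(k)  # prologue: fill the pipeline
    with ThreadPoolExecutor(max_workers=U) as fpga:
        for i in range(0, n, U):
            futures = [fpga.submit(hw_kernel, i + k)
                       for k in range(U) if i + k < n]
            for k in range(U):                 # shifted software calls
                if i + U + k < n:
                    sw_func(i + U + k)
            for f in futures:
                f.result()  # wait for the kernels before the next iteration

transformed_loop(4, lambda i: print("sw", i), lambda i: print("hw", i))
```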
VLSI Design, Mar 4, 2012
System adaptivity is becoming an important feature of modern embedded multiprocessor systems. To achieve system adaptivity when executing Polyhedral Process Networks (PPNs) on a generic tiled Network-on-Chip (NoC) MPSoC platform, we propose an approach that enables the run-time migration of processes among the available platform resources. In our approach, process migration is enabled by a middleware layer comprising two main components. The first component handles the inter-tile data communication between processes. We develop and evaluate a number of different communication approaches that implement the semantics of the PPN model of computation on a generic NoC platform. The presented communication approaches do not depend on the mapping of processes and have been implemented on a Network-on-Chip multiprocessor platform prototyped on an FPGA. Their comparison in terms of the introduced overhead is presented in two case studies with different communication characteristics. The second middleware component enables the actual run-time migration of PPN processes. To this end, we propose and evaluate a process migration mechanism that leverages the PPN model of computation to guarantee a predictable and efficient migration procedure. The efficiency and applicability of the proposed migration mechanism are shown in a real-life case study.
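A minimal, hypothetical sketch of why the PPN MoC makes migration predictable (the toy process and state model are assumptions): between two firings, a PPN process's state reduces to its iteration point plus the tokens in its FIFOs, so the middleware can stop the process at a firing boundary, ship that compact state to the destination tile, and resume there.

```python
import copy

def make_process():
    """A toy PPN process: each call to fire() is one firing, and between
    firings the whole state is the iteration point plus FIFO contents."""
    state = {"iteration": 0, "out_fifo": []}
    def fire():
        state["out_fifo"].append(state["iteration"] ** 2)  # produce a token
        state["iteration"] += 1
    return state, fire

# Run three firings on the source tile...
state, fire = make_process()
for _ in range(3):
    fire()

# ...then migrate: at a firing boundary the compact state can simply be
# copied (in reality, sent over the NoC to the destination tile).
resumed_state, resumed_fire = make_process()
resumed_state.update(copy.deepcopy(state))

resumed_fire()        # continues exactly where the source tile stopped
print(resumed_state)  # {'iteration': 4, 'out_fifo': [0, 1, 4, 9]}
```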
Embedded streaming applications specified using parallel Models of Computation (MoC) often contain an ample amount of parallelism, which can be exploited using Multi-Processor System-on-Chip (MPSoC) platforms. It has been shown that the various forms of parallelism in an application should be explored to achieve maximum system performance. However, if more parallelism is revealed than needed, it will overload the underlying MPSoC platform. At the same time, the revealed parallelism should be sufficient to fully utilize the MPSoC platform. Therefore, the amount of revealed and exploited parallelism has to be just-enough with respect to the platform constraints. In this paper, we study the problem of exploiting just-enough parallelism by application task unfolding when mapping streaming applications, modeled using the Synchronous Data Flow (SDF) MoC, onto MPSoC platforms in hard real-time systems. We show that our problem of simultaneously unfolding and allocating tasks under hard real-time scheduling has a bounded solution space and derive its upper bounds. Subsequently, we devise an efficient algorithm to solve the problem such that the obtained solution meets a pre-specified quality. Experiments on a set of real-life streaming applications demonstrate that our algorithm produces, within a reasonable amount of time, a system specification with a large performance gain. Finally, we show that our proposed algorithm is on average 100 times faster than one of the state-of-the-art meta-heuristics, the NSGA-II genetic algorithm, while achieving the same quality of solutions.
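The bounded solution space admits a simple intuition, sketched below with assumed names and a simplified utilization argument (this is not the paper's exact bound): an actor must be unfolded at least enough for each copy's workload to fit a processor, and unfolding beyond one copy per processor adds no exploitable parallelism, so the search range per actor is finite.

```python
import math

def unfolding_bounds(wcet, period, num_processors):
    """Bounds on an actor's useful unfolding factor f (simplified argument).

    Lower bound: f copies each serve every f-th firing, so a copy's
    utilization is wcet / (f * period); it must not exceed 1.
    Upper bound: more copies than processors adds no exploitable parallelism.
    """
    f_min = max(1, math.ceil(wcet / period))
    f_max = num_processors
    return f_min, f_max

# An actor with WCET 25 and period 10 needs at least 3 copies to be
# schedulable, and exploring beyond 8 copies on 8 processors is pointless.
print(unfolding_bounds(wcet=25, period=10, num_processors=8))  # -> (3, 8)
```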
IEEE Internet of Things Journal
arXiv (Cornell University), Oct 15, 2022
Nowadays, many AI applications utilizing resource-constrained edge devices (e.g., small mobile robots, tiny IoT devices, etc.) require Convolutional Neural Network (CNN) inference on a distributed system at the edge, because a single edge device has too limited resources to accommodate and execute a large CNN. There are four main partitioning strategies that can be utilized to partition a large CNN model and perform distributed CNN inference on multiple devices at the edge. However, to the best of our knowledge, no research has been conducted to investigate how these four partitioning strategies affect the energy consumption per edge device. Such an investigation is important because it reveals the potential of these partitioning strategies to effectively reduce the per-device energy consumption when a large CNN model is deployed for distributed inference at the edge. Therefore, in this paper, we investigate and compare the per-device energy consumption of CNN model inference at the edge on a distributed system when the four partitioning strategies are utilized. The goal of our investigation and comparison is to find out which partitioning strategies (and under what conditions) have the highest potential to decrease the energy consumption per edge device.
2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2019
High power consumption has become the major bottleneck that prevents applying Networks-on-Chip (NoCs) in future many-core systems. Power gating is an effective way to reduce the power consumption of a NoC. However, conventional power gating approaches cause a significant packet latency increase as well as additional power consumption overhead due to the power gating mechanism. One comprehensive way to reduce these negative impacts is to bypass the powered-off routers in a NoC when transferring packets. Therefore, in this paper, we propose a dynamic bypass (D-bypass) approach, which is based on a reservation mechanism that allows different upstream routers to forward packets through the same powered-off router at different times. With this feature, our D-bypass power gating approach overcomes the drawbacks of related power gating approaches. Compared with a conventional NoC without power gating, our D-bypass approach causes only a 2.55% performance penalty, which is less than the 28.67%, 19.26%, 7.24%, and 6.69% penalties in related approaches. With small hardware overhead, our approach consumes on average only 22.23% of the total power of a NoC, which is slightly better than the 27.06%, 23.89%, 26.45%, and 24.70% in related approaches.
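The reservation mechanism can be sketched as follows (a minimal, hypothetical model; the table layout and timing are assumptions): each powered-off router holds a small always-on reservation table, and an upstream router that wants to forward through it first claims the bypass, so different upstream routers can share the same powered-off router at different times.

```python
class PoweredOffRouter:
    """Sketch of the D-bypass reservation idea: a powered-off router keeps a
    tiny always-on reservation table so upstream routers can take turns
    forwarding packets through it."""
    def __init__(self):
        self.reserved_by = None

    def reserve(self, upstream):
        if self.reserved_by is None:
            self.reserved_by = upstream   # grant the bypass slot
            return True
        return False                      # busy: upstream must wait or reroute

    def release(self, upstream):
        if self.reserved_by == upstream:
            self.reserved_by = None

r = PoweredOffRouter()
print(r.reserve("north"))   # True  : the north neighbor may bypass through r
print(r.reserve("west"))    # False : west must wait until north releases
r.release("north")
print(r.reserve("west"))    # True
```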
2021 25th International Conference Electronics, 2021
This paper discusses the structure and semantics of DAEDALUS, an open-source high-level embedded system design framework. It consists of multiple tools that help make the transition from an electronic system level (ESL) description to a register transfer level (RTL) description of streaming-data multiprocessor systems. The application, platform, and mapping specifications are thoroughly discussed.
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018
A Brain Computer Interface (BCI) character speller allows humans to directly spell characters using eye gazes, thereby building communication between the human brain and a computer. Convolutional Neural Networks (CNNs) have shown better performance than traditional machine learning methods for BCI signal recognition and its application to the character speller. However, current CNN architectures limit further accuracy improvements of signal detection and character spelling, and also require high complexity to achieve competitive accuracy, thereby preventing the use of CNNs in portable BCIs. To address these issues, we propose a novel and simple CNN which effectively learns feature representations from both raw temporal information and raw spatial information. The complexity of the proposed CNN is significantly reduced compared with state-of-the-art CNNs for BCI signal detection. We perform experiments on three benchmark datasets and compare our results with those in previous resea...
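The described architecture, learning from raw temporal and raw spatial information, resembles compact EEG CNNs; below is a minimal, hypothetical PyTorch-style sketch (the layer sizes and names are assumptions, not the paper's exact network): one convolution along the time axis, one across the electrode (spatial) axis, then a classifier.

```python
import torch
import torch.nn as nn

class TemporalSpatialCNN(nn.Module):
    """Hypothetical compact CNN for BCI signals: raw input of shape
    (batch, 1, electrodes, time_samples)."""
    def __init__(self, electrodes=64, time_samples=240, classes=2):
        super().__init__()
        self.temporal = nn.Conv2d(1, 8, kernel_size=(1, 15))          # along time
        self.spatial = nn.Conv2d(8, 16, kernel_size=(electrodes, 1))  # across electrodes
        self.pool = nn.AvgPool2d((1, 8))
        t_out = (time_samples - 15 + 1) // 8
        self.classify = nn.Linear(16 * t_out, classes)

    def forward(self, x):
        x = torch.relu(self.temporal(x))   # (B, 8, electrodes, T-14)
        x = torch.relu(self.spatial(x))    # (B, 16, 1, T-14)
        x = self.pool(x)                   # (B, 16, 1, (T-14)//8)
        return self.classify(x.flatten(1))

model = TemporalSpatialCNN()
out = model(torch.randn(4, 1, 64, 240))   # 4 trials of 64-electrode EEG
print(out.shape)                          # torch.Size([4, 2])
```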