Hannu Tenhunen | KTH Royal Institute of Technology
Papers by Hannu Tenhunen
We propose and analyse an on-chip interconnect design for improving the efficiency of multicore processors. Conventional interconnection networks are usually based on a single homogeneous network with uniform processing of all traffic. While this simplifies the design, it can introduce performance bottlenecks and limit system efficiency. We investigate the traffic patterns of several real-world applications. Based on a directory cache coherence protocol, we characterise and categorise the traffic along various dimensions. We discover that control and unicast packets dominate the network, while the percentages of data and multicast messages are relatively low. Furthermore, we find that most invalidation messages are multicast messages, and most multicast messages are invalidation messages. The multicast invalidation messages usually have a higher number of destination nodes than other multicast messages. These observations lead to the proposed triple-class interconnect, where a dedicated multicast-capable network is responsible for the control messages and the data messages are handled by another network. Using a detailed full-system simulation environment, the proposed design is compared with the homogeneous baseline network as well as two other network designs. Experimental results show that the average network latency and energy delay product of the proposed design improve by 24.4% and 10.2%, respectively, compared with the baseline network.
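As a rough illustration of the traffic-splitting policy such a triple-class interconnect implies, the sketch below routes coherence messages to one of three sub-networks. The message names, and the rule that multi-destination control messages (e.g., invalidations) take the multicast network, are assumptions for illustration rather than the paper's implementation.

```python
from enum import Enum, auto

class Network(Enum):
    CONTROL_UNICAST = auto()    # lightweight network for unicast control
    CONTROL_MULTICAST = auto()  # multicast-capable network (e.g. invalidations)
    DATA = auto()               # network handling data payloads

def select_network(msg_type: str, n_destinations: int) -> Network:
    """Pick a sub-network for a coherence message (hypothetical policy)."""
    if msg_type in ("DATA_REPLY", "WRITEBACK"):  # payload-carrying messages
        return Network.DATA
    if n_destinations > 1:                       # e.g. invalidation multicasts
        return Network.CONTROL_MULTICAST
    return Network.CONTROL_UNICAST               # dominant unicast-control case

# An invalidation sent to five sharers rides the multicast control network:
assert select_network("INVALIDATE", 5) is Network.CONTROL_MULTICAST
```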
Springer eBooks, 2017
The future of Moore's Law is in jeopardy. The number of cores in many-core systems increases steadily with every technology node generation, but voltage scaling does not keep pace with the unabated decrease in transistor size. Higher leakage power and manufacturing variabilities are the consequences and lead to extremely critical power and thermal issues. If not properly addressed, these phenomena can degrade performance or endanger a system's functionality as well as its reliability. In the near future, up to 90% of a many-core chip's area may have to remain inactive; this non-active area is termed Dark Silicon. These issues make the problem of resource management challenging. Future management systems need to be intelligent, anticipatory, and self-adaptive. They are expected to integrate the management of different aspects such as thermal, power, energy, performance, quality of service, process variability, occurrence of faults, and aging effects, all in one. In this paper, we study the contributions in the literature on techniques for dynamic resource management in multi- and many-core systems. We put emphasis on advanced approaches that exhibit learning, self-awareness, and hierarchical monitoring and management. We categorize the existing approaches from a new perspective and argue that a self-aware hierarchical agent-based model is a proper methodology to monitor and manage many-core systems, in particular when they need to deal with different competing goals. In addition, we evaluate the main objectives and trends in resource management of many-core systems in order to pave the way for designing future computer systems ranging from high-performance computers to embedded processors used in the era of the Internet of Things.
Microprocessors and Microsystems, Jun 1, 2013
One of the major design bottlenecks in today's high-performance VLSI systems is the distribution of a single global clock across a chip, due to process variability, power dissipation, and multi-cycle cross-chip signaling. A Network-on-Chip architecture partitioned into several Voltage/Frequency Islands (VFIs) is considered a promising approach for achieving fine-grained system-level power management. In a VFI-based architecture, a clock is utilized for local data synchronization, while inter-island communication is handled ...
IEEE Transactions on Computers, Mar 1, 2016
2019 Fourth International Conference on Fog and Mobile Edge Computing (FMEC), 2019
Protocols enable things to connect and communicate, thus making the Internet of Things possible. The performance of Internet of Things protocols, vital to their widespread utilization, has received much attention. However, one aspect of IoT protocols that is essential to their adoption in the real world is a protocol's feature set. Comparative analyses based on competing features and properties are rarely, if ever, discussed in the literature. In this paper, we define 19 attributes in 5 categories that are essential for IoT stakeholders to consider. These attributes are then used to contrast four IoT protocols: MQTT, HTTP, CoAP, and XMPP. Furthermore, we discuss scenarios where an assessment based on comparative strengths and weaknesses would be beneficial. The provided comparison model can easily be extended to include protocols like MQTT-SN, AMQP, and DDS.
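To make the extensibility claim concrete, here is a minimal sketch of a feature-table comparison model in Python; the attribute names and values shown are invented placeholders, not the paper's 19 attributes.

```python
# Hypothetical feature table; attributes/values are illustrative only.
PROTOCOL_FEATURES = {
    "MQTT": {"transport": "TCP", "qos_levels": 3, "pub_sub": True},
    "HTTP": {"transport": "TCP", "qos_levels": 1, "pub_sub": False},
    "CoAP": {"transport": "UDP", "qos_levels": 2, "pub_sub": False},
    "XMPP": {"transport": "TCP", "qos_levels": 1, "pub_sub": True},
}

def compare(attribute):
    """Tabulate one attribute across all protocols."""
    return {name: feats.get(attribute, "n/a")
            for name, feats in PROTOCOL_FEATURES.items()}

# Extending the model to a new protocol is a one-line addition:
PROTOCOL_FEATURES["AMQP"] = {"transport": "TCP", "qos_levels": 3, "pub_sub": True}
print(compare("qos_levels"))
```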
The architecture and implementation of an adaptive NoC to improve performance and power consumption are presented. On platforms hosting multiple applications, hardware variations and unpredictable workloads make static design-time assignments highly sub-optimal, e.g. in terms of power and performance. As a solution to this problem, adaptive NoCs are designed, which dynamically adapt towards an optimal implementation. This paper addresses the architectural design of an adaptive NoC, which is an essential step towards design automation. The architecture involves two levels of agents: a system-level agent implemented in software on a dedicated general-purpose processor, and local agents implemented as microcontrollers at each network node. The system agent issues specific instructions to perform monitoring and reconfiguration operations, while the local agents operate according to the commands from the system agent. To demonstrate the system architecture, best-effort power management with distribu...
Concurrency and Computation: Practice and Experience, 2018
The importance of optimization and NP-problem solving cannot be overemphasized. The usefulness and popularity of evolutionary computing methods are also well established. There are various types of evolutionary methods; they are mostly sequential, but some of them have parallel implementations as well. We propose a multi-population method to parallelize the Imperialist Competitive Algorithm. The algorithm has been implemented with the Message Passing Interface on two computer platforms, and we have tested our method on both shared-memory and message-passing architectural models. Outstanding performance is obtained, demonstrating that the proposed method is very efficient in terms of both speed and accuracy. In addition, compared with a set of existing well-known parallel algorithms, our approach obtains more accurate results within a shorter time period.
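The following mpi4py skeleton sketches one common way to realize a multi-population (island-model) parallelization of this kind, with ring migration of the best candidate between populations; the ICA operators are reduced to a placeholder, and none of this is the authors' actual code.

```python
from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
random.seed(rank)  # a different random stream per island

def fitness(candidate):
    return sum(x * x for x in candidate)  # toy objective (minimize)

def ica_step(pop):
    # Placeholder for assimilation / revolution / imperialist competition.
    return sorted(pop, key=fitness)

population = [[random.uniform(-10, 10) for _ in range(5)] for _ in range(20)]

for generation in range(100):
    population = ica_step(population)
    if size > 1 and generation % 10 == 9:
        # Ring migration: send the local best to the next rank and adopt
        # the neighbour's best in place of the local worst.
        migrant = comm.sendrecv(population[0],
                                dest=(rank + 1) % size,
                                source=(rank - 1) % size)
        population[-1] = migrant

best = comm.reduce(fitness(population[0]), op=MPI.MIN, root=0)
if rank == 0:
    print("best fitness across islands:", best)
```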
2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2017
Recent advances in computing and sensor technologies have facilitated the emergence of increasingly sophisticated and complex cyber-physical systems and wireless sensor networks. Moreover, integration of cyber-physical systems and wireless sensor networks with other contemporary technologies, such as unmanned aerial vehicles (i.e. drones) and fog computing, enables the creation of completely new smart solutions. Building upon the concept of a Smart Mobile Access Point (SMAP), which is a key element of a smart network, we propose a novel hierarchical placement strategy for SMAPs to improve the scalability of SMAP-based monitoring systems. SMAPs predict communication behavior based on information collected from the network and select the best approach to support the network at any given time. To improve network performance, they can autonomously change their positions; placement of SMAPs therefore plays an important role in such systems. Initial placement of SMAPs is an NP-hard problem. We solve it using a parallel implementation of the genetic algorithm with an efficient evaluation phase. The adopted hierarchical placement approach is scalable, enabling the construction of arbitrarily large SMAP-based systems.
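As a sketch of why the evaluation phase is the natural target for parallelization, the snippet below scores a GA population of candidate SMAP placements across worker processes; the coverage-based fitness, the sensor field, and the radius parameter are illustrative assumptions, not the paper's evaluation function.

```python
from multiprocessing import Pool
from math import dist
import random

random.seed(0)  # fixed seed so spawned workers rebuild the same sensor field
SENSORS = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(200)]
RANGE = 15.0  # hypothetical SMAP communication radius

def fitness(placement):
    """Fraction of sensors within range of at least one SMAP."""
    covered = sum(1 for s in SENSORS
                  if any(dist(s, ap) <= RANGE for ap in placement))
    return covered / len(SENSORS)

def evaluate_population(population):
    # Fitness evaluation dominates GA runtime, so it is the phase
    # worth spreading across worker processes.
    with Pool() as pool:
        return pool.map(fitness, population)

if __name__ == "__main__":
    pop = [[(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(5)]
           for _ in range(32)]
    print("best coverage:", max(evaluate_population(pop)))
```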
2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), 2016
Increasingly sophisticated, complex, and energy-efficient cyber-physical systems and wireless sensor networks are emerging, facilitated by recent advances in computing and sensor technologies. Integration of cyber-physical systems and wireless sensor networks with other contemporary technologies, such as unmanned aerial vehicles and fog or edge computing, enables the creation of completely new smart solutions. We present the concept of a Smart Mobile Access Point (SMAP), which is a key building block for a smart network, and propose an efficient placement approach for such SMAPs. SMAPs predict the behavior of the network based on information collected from it and select the best approach to support the network at any given time. When needed, they autonomously change their positions to obtain a better configuration from the network performance perspective. Placement of SMAPs is therefore an important issue in such a system. Initial placement of SMAPs is an NP-hard problem, and evolutionary algorithms provide an efficient means to solve it. Specifically, we present a parallel implementation of the imperialistic competitive algorithm and an efficient evaluation (fitness) function to solve the initial placement of SMAPs in the fog computing context.
IEEE Design & Test, 2017
With the breakdown of Dennard scaling, we have entered the dark silicon era, where the available power budget is no longer able to feed all the cores available within the same chip at full throttle. At the same time, the extreme downscaling of CMOS technologies has caused an acceleration in device aging and wear-out processes. In the dark silicon era, runtime resource management in many-core systems becomes more challenging, as many aspects have to be considered together, such as power capping, dynamic application mapping, performance improvement, and reliability management. In this paper, we claim that dark silicon can be exploited for reliability purposes by efficiently managing system resources (both cores and power) in order to prolong the system lifetime while achieving the same level of performance.
2016 International Conference on High Performance Computing & Simulation (HPCS), 2016
The widespread importance of optimization and of solving NP-hard problems, like solving systems of nonlinear equations, is indisputable across a diverse range of sciences. Nonlinear equations have undeniably vast uses, with applications in economics, engineering, chemistry, mechanics, medicine, and robotics. Among the different methods for solving systems of nonlinear equations, one of the most popular is Evolutionary Computing (EC). This paper presents an evolutionary algorithm called the Parallel Imperialist Competitive Algorithm (PICA), which is based on a multi-population technique for solving systems of nonlinear equations. To demonstrate the efficiency of the proposed approach, some well-known problems are utilized. The results indicate that PICA has a high success rate and a quick convergence rate.
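For context, evolutionary solvers such as PICA typically recast a system of nonlinear equations as a minimization problem over the sum of squared residuals, which vanishes exactly at a solution; the toy system below is illustrative, not one of the paper's benchmark problems.

```python
def residuals(x, y):
    return [x**2 + y**2 - 4,   # f1(x, y) = 0  (circle)
            x * y - 1]         # f2(x, y) = 0  (hyperbola)

def fitness(candidate):
    """Sum of squared residuals: zero if and only if candidate solves the system."""
    return sum(r * r for r in residuals(*candidate))

# A candidate is a solution when its fitness reaches (numerically) zero:
print(fitness((1.9318516525781366, 0.5176380902050415)))  # ~0.0
```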
Proceedings of the 9th International Symposium on Networks-on-Chip, 2015
The increasingly dynamic workloads running on NoC-based many-core systems necessitate efficient runtime mapping strategies. Given the unpredictable nature of application profiles, selecting a rational region to map an incoming application is an NP-hard problem in view of minimizing congestion and maximizing performance. In this paper, we propose a proactive region selection strategy which prioritizes nodes that offer lower congestion and dispersion. Our proposed strategy, MapPro, quantitatively represents the propagated impact of spatial availability and dispersion on the network with every newly mapped application. This allows us to identify a suitable region to accommodate an incoming application with minimal congestion and dispersion. We cluster the network into squares of different radii to suit applications of different sizes and proactively select a suitable square for a new application, eliminating the overhead caused by typical reactive mapping approaches. We evaluated our proposed strategy over different traffic patterns and observed gains of up to 41% in energy efficiency, 28% in congestion, and 21% in dispersion when compared to state-of-the-art region selection methods.
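A much-simplified sketch of the proactive selection step: per-square availability and congestion metrics are maintained incrementally as applications come and go, so choosing a region for a new application is a cheap lookup rather than a reactive network probe. The metric and data layout below are placeholders, not MapPro's propagated-impact formulation.

```python
def select_square(squares, app_size):
    """Proactive lookup: metrics are kept up to date at map/unmap time,
    so region selection itself is O(#squares)."""
    feasible = [s for s in squares if s["free_nodes"] >= app_size]
    return min(feasible, key=lambda s: s["congestion"], default=None)

# Hypothetical bookkeeping state for three clustered squares:
squares = [
    {"id": 0, "free_nodes": 9,  "congestion": 0.42},
    {"id": 1, "free_nodes": 16, "congestion": 0.17},
    {"id": 2, "free_nodes": 4,  "congestion": 0.05},
]
print(select_square(squares, app_size=8)["id"])  # -> 1 (fits and least congested)
```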
2015 IEEE Sensors Applications Symposium (SAS), 2015
A novel Internet of Things-based architecture supporting scalability and fault tolerance for healthcare is presented in this paper. The wireless system is constructed on top of an energy-efficient 6LoWPAN communication infrastructure to maximize operation time. Fault tolerance is achieved via backup routing between nodes and advanced service mechanisms that maintain connectivity in case of failing connections between system nodes. The presented fault tolerance approach covers many fault situations, such as malfunction of sink node hardware and a traffic bottleneck at a node due to a high receiving data rate. A method for extending the number of medical sensing nodes at a single gateway is presented. A complete system architecture is proposed, providing a range of features from bio-signal acquisition, such as Electrocardiogram (ECG), Electroencephalography (EEG), and Electromyography (EMG), to the representation of graphical waveforms of these gathered bio-signals for remote real-time monitoring.
2014 IEEE Conference on Norbert Wiener in the 21st Century (21CW), 2014
This paper presents the technological development of information and communication systems towards personalized and pervasive healthcare. In recent years, information and communication technology has been providing innovative healthcare solutions to society. With the massive increase in the aging and stressed population, healthcare solutions necessitate the use of more in-home monitoring and tracking systems. In this paper, we have gathered data from the early 1970s to the present on developments in information and communication technology, the use of wireless communication systems, and the state of the art of body area sensor networks in personalized and pervasive healthcare. We also present our work in developing a robust, privacy-compliant, accurate, and cost-effective system that facilitates monitoring of patient status, patient activity, and compliance with therapy. The proposed system will also be capable of supporting patient-oriented services for patient empowerment, self-care, adherence to care plans, and treatment at the point of need.
Lecture Notes in Computer Science, 2012
In this paper, we study two hierarchical N-Body methods for Network-on-Chip (NoC) architectures. Modern Chip Multiprocessor (CMP) designs are mainly based on the shared-bus communication architecture, which suffers from high communication delays as the number of cores increases; NoC-based architectures have therefore been proposed. The N-Body problem is a classical problem of approximating the motion of bodies. Two methods, namely Barnes-Hut (Barnes) and Fast Multipole (FMM), have been developed for fast simulation. The two algorithms have been implemented and studied on conventional computer systems and Graphics Processing Units (GPUs). However, the evaluation of N-Body methods on a NoC platform, a promising unconventional multicore architecture, has not been well addressed. We define a NoC model based on state-of-the-art systems. Evaluation results are presented using a cycle-accurate full-system simulator. Experiments show that Barnes scales better (53.7x for Barnes versus 36.6x for FMM at 64 processing elements) and requires less cache than FMM. However, we observe hot-spot traffic in Barnes. Our analysis and experimental results provide a guideline for studying N-Body methods on a NoC platform.
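For readers unfamiliar with Barnes-Hut, its speed comes from the opening criterion: a tree cell of width s at distance d from a body is summarized by its center of mass whenever s/d falls below a threshold θ. A minimal sketch follows (tree construction omitted; the function and parameter names are ours, not the paper's).

```python
from math import dist

THETA = 0.5  # typical opening-angle parameter

def approximable(cell_width, cell_center, body_pos, theta=THETA):
    """Treat a cell as a single point mass when s / d < theta."""
    d = dist(cell_center, body_pos)
    return d > 0 and cell_width / d < theta

# A 10-unit-wide cell 40 units away can be summarized by its center of mass:
print(approximable(10.0, (40.0, 0.0), (0.0, 0.0)))  # True (0.25 < 0.5)
```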
Lecture Notes in Computer Science, 2013
The Network-on-Chip (NoC) paradigm plays an essential role in designing emerging multicore processors. Three-dimensional (3D) NoC design expands the on-chip network vertically. To achieve high performance in a 3D NoC, it is crucial to reduce the access latency of caches and memories. In this paper, we propose an optimized design, OPTNOC, that provides high performance with low power consumption and manufacturing cost. The proposed scheme shifts the fully connected mesh network to a partially connected network, with the optimization of heterogeneous routers and links. Full-system evaluation shows that, compared to a previous optimized heterogeneous design, OPTNOC can further reduce the execution time by 12.1% and the energy delay product by 23.5%.
2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 2014
In this paper, we propose a run-time mapping algorithm, CASqA, for networked many-core systems. In this algorithm, the level of contiguousness of the allocated processors (α) can be adjusted in a fine-grained fashion. A strictly contiguous allocation (α = 0) decreases the latency and power dissipation of the network and improves the applications' execution time. However, it limits the achievable throughput and increases the turnaround time of the applications. As a result, recent works consider non-contiguous allocation (α = 1) to improve the throughput, traded off against application execution time and network metrics. In contrast, our experiments show that a higher throughput (by 3%) with improved network performance can be achieved using intermediate α values. More precisely, up to a 35% drop in network costs can be gained by adjusting the level of contiguity compared to non-contiguous cases, while the achieved throughput is kept constant. Moreover, CASqA provides at least 32% energy savings in the network compared to other works.
ACM Journal on Emerging Technologies in Computing Systems, 2015
In the era of platforms hosting multiple applications with arbitrary performance requirements, providing a worst-case platform-wide voltage/frequency operating point is neither optimal nor desirable. As a solution to this problem, designs commonly employ dynamic voltage and frequency scaling (DVFS). DVFS promises significant energy and power reductions by providing each application with the operating point (and hence the performance) tailored to its needs. To further enhance the optimization potential, recent works interleave dynamic parallelism with conventional DVFS. The induced parallelism results in performance gains that allow an application to lower its operating point even further (thereby saving energy and power). However, existing works employ costly dedicated hardware (for synchronization) and rely solely on greedy algorithms to make parallelism decisions. To efficiently integrate parallelism with DVFS, we exploit reconfiguration (to reduce DVFS synchronization overheads) and enhance the intelligence of the greedy algorithm (to make optimal parallelism decisions) compared to the state of the art. Specifically, our solution relies on dynamically reconfigurable isolation cells and an autonomous parallelism, voltage, and frequency selection algorithm. The dynamically reconfigurable isolation cells reduce the area overheads of the DVFS circuitry by configuring existing resources to provide synchronization. The autonomous parallelism, voltage, and frequency selection algorithm ensures high power efficiency by combining parallelism with DVFS: it selects the parallelism, voltage, and frequency trio that consumes minimum power while meeting the deadlines on the available resources. Synthesis and simulation results using various applications/algorithms (WLAN, MPEG4, FFT, FIR, matrix multiplication) show that our solution promises significant reductions in area and power consumption (23% and 51%) compared to the state of the art.
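A minimal sketch of the trio-selection idea: enumerate feasible (parallelism, voltage, frequency) combinations and keep the cheapest one that meets the deadline. The power and latency models below (P ≈ C·V²·f, ideal speedup) are crude illustrative stand-ins, not the paper's algorithm.

```python
def select_trio(workload_cycles, deadline_s, free_cores, vf_pairs, c_eff=1e-9):
    """Return (power, cores, volt, freq) with minimum power meeting the deadline."""
    best = None
    for cores in range(1, free_cores + 1):
        for volt, freq_hz in vf_pairs:
            latency = workload_cycles / (freq_hz * cores)  # assumes ideal scaling
            power = cores * c_eff * volt**2 * freq_hz      # dynamic power model
            if latency <= deadline_s and (best is None or power < best[0]):
                best = (power, cores, volt, freq_hz)
    return best  # None if no trio meets the deadline

# Parallelism lets a lower V/f point win: two cores at 0.8 V / 200 MHz beat
# one core at 1.0 V / 500 MHz for this hypothetical workload.
vf = [(0.8, 200e6), (1.0, 500e6), (1.2, 1000e6)]
print(select_trio(workload_cycles=4e8, deadline_s=1.0, free_cores=4, vf_pairs=vf))
```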
2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2014
The dark silicon issue stresses that the fraction of a silicon chip able to switch at full frequency is dropping, and designers will soon face a growing underutilization inherent in future technology scaling. On the other hand, as transistor sizes shrink, susceptibility to internal defects increases, and a large range of defects, such as aging or transient faults, will show up more frequently. In this paper, we propose an online concurrent test scheduling approach for the fraction of the chip that cannot be utilized due to the restricted utilization wall. Dynamic voltage and frequency scaling, including near-threshold operation, is utilized in order to maximize the concurrency of the online testing process under a constant power budget. As the dark area of the system is dynamic and reshapes at runtime, our approach dynamically tests unused cores at runtime to provide tested cores for upcoming applications and hence enhances system reliability. Empirical results show that our proposed concurrent testing approach using dynamic voltage and frequency scaling (DVFS) improves the overall test throughput by over 250% compared to state-of-the-art dark-silicon-aware online testing approaches under the same power budget.