Multicore Programming Research Papers - Academia.edu

Multicore embedded systems introduce new opportunities and challenges. Scaling of computational power is one of the main reasons for a transition to a multicore environment. Parallel design patterns, such as Map Reduce, Task Graph, Thread Pool and Task Parallelism, help derive a parallel approach for calculating the Fast Fourier Transform. By combining these design patterns, a robust application can be obtained. The key issues for concurrent calculation of a Fast Fourier Transform are determined at a higher level, avoiding low-level patch-ups.
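
As an illustration of the Task Parallelism pattern mentioned above (a minimal sketch, not the authors' implementation), a recursive radix-2 Cooley-Tukey FFT can run its two half-size sub-transforms as independent tasks near the top of the recursion; the input length is assumed to be a power of two.

```cpp
#include <cmath>
#include <complex>
#include <future>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 Cooley-Tukey FFT. The two half-size sub-problems are run
// as independent tasks (Task Parallelism pattern) for the first few levels of
// the recursion, then sequentially below that cutoff.
std::vector<cd> fft(std::vector<cd> x, int depth = 0) {
    const std::size_t n = x.size();            // assumed to be a power of two
    if (n == 1) return x;

    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }

    std::vector<cd> fe, fo;
    if (depth < 2) {                           // spawn tasks only near the root
        auto f = std::async(std::launch::async, fft, even, depth + 1);
        fo = fft(odd, depth + 1);
        fe = f.get();
    } else {
        fe = fft(even, depth + 1);
        fo = fft(odd, depth + 1);
    }

    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        cd t = std::polar(1.0, -2.0 * pi * static_cast<double>(k) / n) * fo[k];
        out[k]         = fe[k] + t;
        out[k + n / 2] = fe[k] - t;
    }
    return out;
}
```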

Automobile manufacturers are bound by stringent government regulations for safety and fuel emissions and are motivated to add more advanced features and sophisticated applications to the existing electronic system. Ever-increasing customer demand for a high level of comfort also calls for even more sophistication in the vehicle electronics system. All of this directly makes the vehicle software system more complex and computationally more intensive, which in turn demands very high computational capability from the microprocessor used in the electronic control unit (ECU). In this regard, multicore processors have already been adopted in some of the task-intensive ECUs such as powertrain, image processing and infotainment. To achieve greater performance from these multicore processors, the parallelized ECU software needs to be efficiently scheduled for execution by the underlying operating system, so that all the computational cores are utilized to the maximum extent possible and the real-time constraints are met. In this paper, we propose a dynamic task scheduler for a multicore engine control ECU that provides maximum CPU utilization, minimized preemption overhead and minimum average waiting time, while all tasks meet their real-time deadlines, compared with the static priority scheduling suggested by the Automotive Open Systems Architecture (AUTOSAR).
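
The abstract contrasts dynamic scheduling with AUTOSAR's static priorities. As a generic illustration only (not the paper's actual scheduler), a dynamic-priority ready queue can order tasks by absolute deadline, earliest first:

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Generic illustration of a dynamic-priority ready queue: tasks are ordered by
// absolute deadline (earliest first), re-evaluated at every release, instead of
// by the fixed priorities used in a static-priority scheduler.
struct Task {
    int      id;
    uint64_t absolute_deadline_us;  // release time + relative deadline
    uint64_t remaining_exec_us;
};

struct EarlierDeadline {
    bool operator()(const Task& a, const Task& b) const {
        return a.absolute_deadline_us > b.absolute_deadline_us;  // min-heap on deadline
    }
};

using ReadyQueue = std::priority_queue<Task, std::vector<Task>, EarlierDeadline>;

// Dispatch: pop the task whose deadline is closest, run it for one time slice,
// and push it back if it still has work left.
void dispatch_one(ReadyQueue& rq, uint64_t slice_us) {
    if (rq.empty()) return;
    Task t = rq.top();
    rq.pop();
    const uint64_t run = std::min(slice_us, t.remaining_exec_us);
    t.remaining_exec_us -= run;
    if (t.remaining_exec_us > 0) rq.push(t);
}
```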

In the embedded world, symmetric multiprocessing architectures are currently the most popular; however, more embedded hardware platforms are being developed with asymmetric multiprocessor architectures. These may enable higher performance and provide cleaner separation of subsystems. Telecom applications are typically designed using a planar architecture pattern. The goal of our experiments is to compare the performance and cross-plane influence in dual-core symmetric and asymmetric multiprocessing environments. Besides a pronounced performance difference, a cross-influence between the different planes has been verified.

This book, “Multi-Core Architectures and Programming”, gives an introductory conceptual overview of multicore processors, their architecture, and programming using the OpenMP API. It outlines the multicore architecture and its functional blocks, such as interconnection, cache and memory. It explains how process scheduling in an operating system is performed on a multicore processor, and discusses memory programming on multicore processors using the OpenMP API and its libraries for the C language.
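
A minimal OpenMP example of the kind such a book discusses (not taken from the book itself): a loop parallelised across the available cores with a reduction.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1000000;
    double sum = 0.0;

    // Distribute the loop iterations across all available cores; each thread
    // accumulates a private partial sum that OpenMP combines at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += 1.0 / (i + 1);
    }

    std::printf("threads available: %d, harmonic sum: %f\n",
                omp_get_max_threads(), sum);
    return 0;
}
```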

Multicore platforms allow developers to optimize applications by intelligently partitioning different workloads onto different processor cores. Currently, application programs are optimized to use multiple processor resources, resulting in faster application performance. Our earlier research focused on native threads for Java on Windows threads, Pthreads and Intel TBB; we developed, respectively, NativeThreads, NativePthread and Java Native Intel TBB on the Windows 32-bit platform. This article aims to identify future directions for native threads for Java on Windows threads, Pthreads and Intel TBB through JNI on Windows 64-bit and other platforms. Furthermore, it articulates additional openings to pursue upcoming developments in parallel programming models through Java.
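
As a sketch of what a native-thread bridge for Java can look like (class and method names here are hypothetical and are not the NativeThreads/NativePthread libraries the article refers to), a JNI entry point can hand work to a POSIX thread:

```cpp
#include <jni.h>
#include <pthread.h>
#include <cstdio>

// Worker executed on a native POSIX thread, outside the JVM's own thread machinery.
static void* worker(void* /*arg*/) {
    std::printf("hello from a native pthread\n");
    return nullptr;
}

// Hypothetical JNI entry point: Java class NativeDemo with a native method startWorker().
extern "C" JNIEXPORT void JNICALL
Java_NativeDemo_startWorker(JNIEnv* /*env*/, jobject /*self*/) {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);
    pthread_join(tid, nullptr);   // join so the example is self-contained
}
```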

The Leibniz Supercomputing Centre publishes in this booklet the complete material of the Intel MIC programming workshop that took place at LRZ on June 26–28, 2017. The workshop discussed Intel’s Many Integrated Core (MIC) architecture and various programming models for Intel Xeon Phi co-/processors. It covered a wide range of topics, from the hardware of the Intel Xeon Phi co-/processors, through the basic programming models, vectorisation and MCDRAM usage, up to tools and strategies for analysing and improving the performance of applications. The workshop mainly concentrated on techniques relevant for Knights Landing (KNL) based systems. During a plenary session on the last day, eight invited speakers from IPCC@LRZ, IPCC@TUM, IPCC@IT4Innovations, Intel, RRZE, the University of Regensburg, IPP and MPCDF talked about Intel Xeon Phi experience and best-practice recommendations. Hands-on sessions were carried out on the Knights Corner (KNC) based system SuperMIC and two KNL test systems at LRZ.

Task-intensive electronic control units (ECUs) in the automotive domain, equipped with multicore processors, real-time operating systems (RTOSs) and various application software, should perform efficiently and time-deterministically. The parallel computational capability offered by this multicore hardware can only be exploited if the ECU application software is parallelized. Given such parallelized software, the real-time operating system's scheduler component should schedule the time-critical tasks so that all the computational cores are utilized to a greater extent and the safety-critical deadlines are met. As original equipment manufacturers (OEMs) are always motivated to add more sophisticated features to existing ECUs, a large number of task sets must be effectively scheduled for execution within bounded time limits. In this paper, a hybrid scheduling algorithm is proposed that meticulously calculates the running slack of every task and estimates the probability of meeting its deadline either by remaining in the same partitioned queue or by migrating to another. The algorithm was run and tested using a scheduling simulator with different real-time task models of periodic tasks, and was compared with the existing static priority scheduler suggested by the Automotive Open Systems Architecture (AUTOSAR). The performance parameters considered are the percentage of core utilization, the average response time and the task deadline-miss rate. It has been verified that the proposed algorithm offers considerable improvements over the existing partitioned static priority scheduler on each of these performance parameters.
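
The abstract describes a scheduler that tracks each task's running slack. A generic formulation of that quantity (a sketch, not the paper's exact algorithm) is slack = absolute deadline − current time − remaining execution time, with a non-positive value signalling that the deadline can no longer be met on the current core:

```cpp
#include <cstdint>

// Running slack of a task at time `now_us`: how much further delay can still be
// tolerated before the remaining work would miss the absolute deadline.
// A non-positive slack is a hint to migrate the task to a less loaded core
// (or to accept that the deadline will be missed).
int64_t running_slack_us(uint64_t absolute_deadline_us,
                         uint64_t now_us,
                         uint64_t remaining_exec_us) {
    return static_cast<int64_t>(absolute_deadline_us) -
           static_cast<int64_t>(now_us) -
           static_cast<int64_t>(remaining_exec_us);
}
```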

The process of mapping high-performance embedded applications to today's multiprocessor system-on-chip devices suffers from a complex toolchain and programming process. The problem is the expression of parallelism in a purely imperative programming language, commonly C. This traditional approach limits the mapping, partitioning and generation of optimized parallel code, and consequently the achievable performance and power consumption of applications from different domains. The Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb (ALMA) European project aims to overcome these hurdles through the introduction and exploitation of a Scilab-based toolchain that enables the efficient mapping of applications onto multiprocessor platforms from a high level of abstraction. The holistic solution of the ALMA toolchain allows the complexity of both the application and the architecture to be hidden, which leads to better acceptance, reduced development cost, and shorter time-to-market. Driven by the technology restrictions in chip design, the end of exponential growth in clock speeds and an unavoidable increase in the demand for computing performance, ALMA is a fundamental step forward in the necessary introduction of novel computing paradigms and methodologies.

This research paper compares two multi-core processor machines, the Intel Core i7-4960X (Ivy Bridge E) and the AMD Phenom II X6. It starts by introducing a single-core processor machine to motivate the need for multi-core processors. Then, it explains the multi-core processor machine and the issues that arise in implementing it. It also presents real-life example machines such as the TILEPro64 and the Epiphany-IV 64-core 28nm microprocessor (E64G401). The methodology used to compare the Intel Core i7 and AMD Phenom II processors starts by explaining how processor performance is measured and then lists the technical specifications most important and relevant to the comparison. After that, the comparison is run using different metrics, such as power, the use of Hyper-Threading technology, the operating frequency, the use of AES encryption and decryption, and the characteristics of the cache memory such as its size, classification and memory controller. Finally, a rough conclusion is reached about which of them has better overall performance.
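
For reference, the textbook relation usually behind such measurements (not specific to either processor) is the classic CPU performance equation:

```latex
\text{CPU time} = \frac{\text{instruction count} \times \text{CPI}}{\text{clock frequency}},
\qquad
\text{speedup of A over B} = \frac{\text{CPU time}_B}{\text{CPU time}_A}
```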

In this work, we consider the C++ Actor Framework (CAF), a recent proposal that revamped interest in building concurrent and distributed applications using the actor programming model in C++. CAF has been optimized for high-throughput computing, but message latency between actors is greatly influenced by the message data rate: at low and moderate rates the latency is higher than at high data rates. To this end, we propose a modification of the polling strategies in the work-stealing CAF scheduler, which can reduce message latency at low and moderate data rates by up to two orders of magnitude without compromising the overall throughput and message latency at maximum pressure. The proposed technique uses a lightweight event notification protocol that is general enough to be used to optimize the runtime of other frameworks experiencing similar issues.
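
The mechanism described, sleeping workers woken by a lightweight notification instead of polling continuously, can be sketched as follows (a generic illustration, not the actual CAF scheduler code):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// A work queue whose consumer first spins briefly (good at high message rates),
// then blocks on a condition variable until the producer notifies it (good at
// low and moderate rates, where continuous polling only adds latency and wastes CPU).
template <typename Msg>
class NotifyingQueue {
public:
    void push(Msg m) {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            items_.push_back(std::move(m));
        }
        cv_.notify_one();                  // lightweight wake-up of a sleeping worker
    }

    Msg pop() {
        for (int spin = 0; spin < 1000; ++spin) {     // short polling phase
            std::lock_guard<std::mutex> lk(mtx_);
            if (!items_.empty()) return take_locked();
        }
        std::unique_lock<std::mutex> lk(mtx_);        // then sleep until notified
        cv_.wait(lk, [this] { return !items_.empty(); });
        return take_locked();
    }

private:
    Msg take_locked() {
        Msg m = std::move(items_.front());
        items_.pop_front();
        return m;
    }

    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<Msg> items_;
};
```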

Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), belong to a new set of many-core-on-a-chip systems with a software-managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this new class of architectures.
In this paper, we use dense matrix multiplication as a case study to present a general methodology for mapping applications to these kinds of architectures. Our methodology has the following characteristics: (1) balanced distribution of work among threads to fully exploit available resources; (2) optimal register tiling and tile-traversal order, calculated analytically and parametrized according to the register file size of the processor used, resulting in minimal memory transfers and optimal register usage; (3) architecture-specific optimizations to further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, measurements of power consumption show that the C64 is very power-efficient, providing 530 MFLOPS/W for the problem under consideration.
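
Register tiling of the kind described can be sketched as follows (a simplified illustration on a generic CPU, not the C64-specific code from the paper): the innermost block of C is held in scalar accumulators so that each element of A and B loaded from memory is reused several times.

```cpp
#include <vector>

// Dense matrix multiplication C = A * B with a 2x2 register tile on C.
// Matrices are stored row-major; n is assumed to be even for brevity.
void matmul_tiled(const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C, int n) {
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < n; j += 2) {
            // The 2x2 tile of C stays in registers across the whole k loop.
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < n; ++k) {
                const double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                const double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
    }
}
```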

This paper develops applications based on active RFID and wireless GSM messaging to construct an active student attendance system that sends a message to parents' cellular phones informing them whether their children have arrived safely in the classroom in the morning. The system is also used to relieve traffic congestion around kindergartens, especially when parents drive to pick up their children after class at rush hour or on rainy days. Finally, the problems encountered are summarized and discussed in light of our development experience.

Nowadays, multi-core architectures have become mainstream in the microprocessor industry. However, as the number of cores integrated on a single chip grows, the need for an adequate programming model becomes more important. In recent years the OpenCL programming model has attracted the attention of the multi-core design community. This paper presents an OpenCL-compliant architecture and demonstrates that this programming model can be successfully used for general-purpose multi-core architectures.
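
A minimal OpenCL kernel of the kind such an architecture would execute (a generic vector addition, not tied to the architecture presented in the paper); the host-side setup of platform, context, queue and buffers is omitted:

```cpp
// OpenCL C kernel source, embedded as a C++ raw string. Each work-item handles
// one element, so the same kernel scales across however many cores the
// OpenCL device exposes.
static const char* kVecAddSource = R"CLC(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c,
                      const unsigned int n) {
    const size_t i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
)CLC";
```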

Multicore architectures are becoming available in embedded systems. However, parallelizing sequential software is a challenging task, and a structured approach is needed to exploit parallel opportunities. We therefore propose a top-down approach based on a layered model of parallel design patterns. As a proof of concept, this approach has been applied to a number of algorithms, including the Fast Fourier Transform.

The User Centric Smart Card Ownership Model (UCOM) provides an open and dynamic smart card environment enabling cardholders to request the installation or deletion of any application to which they are entitled. Since smart cards in this model are not under the control of a centralised authority, it is difficult for an application provider to ascertain their trustworthiness. At present, the secure channel protocols proposed for the smart card environment do not provide the assurance required by the UCOM. In this paper, we explore the reasons behind their failure to meet the UCOM requirements and then propose a secure and trusted channel protocol that meets them. The proposed protocol is also suitable for GlobalPlatform's consumer-centric smart cards. A comparison of the proposed protocol with existing smart card and selected Internet protocols is provided; we then analyse the protocol with the CasperFDR tool and, finally, detail the implementation and the performance measurements.

Existing mobility models for wireless sensor networks generally do not preserve a uniform scattering of the sensor nodes within the monitored area. This paper proposes a coverage-preserving random mobility model called DPRMM. Direction and velocity are chosen randomly according to local information about sensor density, so that sensors move towards the least covered regions within their neighborhood. We show that this guarantees rapid convergence to the steady state while preserving a uniform coverage degree over the monitored region. The simulations we have carried out corroborate the analytical study: our experiments show that the average distance travelled by a target without being detected is improved by roughly a factor of 2 or more using DPRMM.
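
A rough sketch of the density-driven direction choice the abstract describes (an illustration of the idea only, not the published DPRMM rules): among a set of candidate directions, a sensor prefers the one whose surrounding neighbourhood currently contains the fewest other sensors, while keeping the choice random.

```cpp
#include <random>
#include <vector>

// Pick a movement direction for one sensor: each of the equal angular sectors
// around the node is weighted by the inverse of the number of neighbours it
// currently contains, so sparsely covered directions are the most likely to be
// chosen, but the decision stays random.
int choose_direction(const std::vector<int>& neighbours_per_sector,
                     std::mt19937& rng) {
    std::vector<double> weights;
    weights.reserve(neighbours_per_sector.size());
    for (int count : neighbours_per_sector) {
        weights.push_back(1.0 / (1.0 + count));   // fewer neighbours => higher weight
    }
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return pick(rng);                             // index of the chosen sector
}
```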

The problem of time series subsequence matching occurs in a wide spectrum of subject areas. Dynamic Time Warping (DTW) is currently the best similarity measure, but despite various existing speedup techniques it is still computationally expensive. For this reason, the scientific community is trying to accelerate DTW calculation by means of parallel hardware. There are implementations of DTW-based subsequence matching on GPUs and FPGAs, but none for accelerators based on the Intel Many Integrated Core architecture. This paper presents a parallel algorithm for time series subsequence matching based on the DTW distance, adapted to the Intel Xeon Phi coprocessor. Experimental results on synthetic and real data sets are presented and confirm the efficiency of the algorithm.
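
For reference, the classic O(m·n) dynamic-programming form of the DTW distance that such subsequence-matching algorithms parallelise (a sequential sketch, without the pruning and vectorisation applied on the Xeon Phi):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Classic dynamic-programming DTW distance between two series. Cell (i, j)
// holds the cost of the best warping path aligning x[0..i-1] with y[0..j-1].
double dtw_distance(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t m = x.size(), n = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> d(m + 1, std::vector<double>(n + 1, inf));
    d[0][0] = 0.0;

    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const double cost = std::fabs(x[i - 1] - y[j - 1]);
            d[i][j] = cost + std::min({d[i - 1][j],        // insertion
                                       d[i][j - 1],        // deletion
                                       d[i - 1][j - 1]});  // match
        }
    }
    return d[m][n];
}
```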

In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models, including OpenCL, POSIX threads and OpenMP, and typical optimization strategies such as parallelization and vectorization. Since the straightforward porting of the already existing OpenCL version of the code ran into performance problems that require further analysis, we focused our efforts on the implementation and optimization of two core building-block kernels for FEASTFLOW: an axpy vector operation and a sparse matrix-vector multiplication (spmv). Our experimental results on these building blocks indicate that the Xeon Phi can serve as a promising accelerator for our software infrastructure.
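
The two building blocks named in the abstract are standard kernels; shown below are generic OpenMP versions of them (illustrative only, not the FEASTFLOW implementations), with the sparse matrix held in CSR format.

```cpp
#include <cstddef>
#include <vector>

// y = a*x + y (axpy): a memory-bound streaming kernel.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(x.size()); ++i) {
        y[i] += a * x[i];
    }
}

// Sparse matrix-vector product y = A*x with A stored in CSR format:
// row_ptr has n_rows+1 entries, col_idx/values hold the nonzeros row by row.
void spmv_csr(const std::vector<int>& row_ptr,
              const std::vector<int>& col_idx,
              const std::vector<double>& values,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const std::ptrdiff_t n_rows = static_cast<std::ptrdiff_t>(row_ptr.size()) - 1;
    #pragma omp parallel for
    for (std::ptrdiff_t row = 0; row < n_rows; ++row) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
            sum += values[k] * x[col_idx[k]];
        }
        y[row] = sum;
    }
}
```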

The design of contemporary multi-core architectures has progressively diversified from more conventional architectures. Instead of simply “gluing” together a number of slightly modified existing uniprocessor cores, a new class of multi-core architectures is emerging that is the result of a more radical exploration of the multiprocessor architecture design space. An important feature of these new architectures is the integration of a large number of simple cores with software-managed embedded memory, in place of a hardware-managed cache hierarchy. These two subsystems communicate through a powerful on-chip interconnection network capable of providing very high bandwidth. However, what the programming model of this new class of multi-core architectures should be remains an open question. In this report we present an implementation of the LU application for Cyclops-64, an architecture that fits into the above category. Through this experience, we identified a number of program development methodologies that are used extensively on cache-based parallel systems to improve performance but behave poorly on Cyclops-64. These include algorithmic design, the interaction between the high-level algorithm and the architecture, and architecture-specific optimizations. Moreover, we identified methodologies that improve performance on both kinds of systems. Along with the description of our LU algorithm and its experimental evaluation, we analyze and explore the impact of these methodologies on the performance of LU and provide alternatives wherever they fail on our architecture. As a result, we achieve a performance of 11.19 GFlops with double-precision floating-point numbers, even for a small matrix of size 512 x 512. To our knowledge, this is the highest GFlops-per-chip rate reported so far for this application.

The emergence of multi-core processors has led to the expansion of parallel programming into all areas. OpenMP appears to be one of the most suitable APIs for new processor architectures, a choice justified by its ease of use compared with other parallel programming alternatives. However, due to many factors, developing efficient OpenMP programs is a challenging task. In this work, we present a new model for predicting the performance of OpenMP programs on multi-core machines. Experimental results obtained on a matrix-matrix product demonstrate the simplicity and accuracy of the performance predicted by the model.
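
A common shape for such a model (a generic Amdahl-style sketch, not the specific model proposed in the paper) predicts the runtime of an OpenMP region on p cores from a serial part, a perfectly parallel part, and a per-thread management overhead:

```latex
T(p) \approx T_{\text{serial}} + \frac{T_{\text{parallel}}}{p} + p \cdot t_{\text{overhead}}
```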

The rising demand for high-performing embedded systems has made FPGAs ubiquitous. Combining the strengths of an FPGA and a general-purpose processor on one chip not only simplifies the PCB layout, it also removes the communication bottleneck between processor and FPGA. Moreover, it allows the designer to partition applications and map parts of an application to either the programmable logic or the processing system. A case study on a voice-over-Ethernet system illustrates partitioning based on processing requirements; this partitioning is called the Planar Design Pattern. Management of the communication channel is done by one of the processor cores, while the high-speed data streaming itself is done in programmable logic.

This paper addresses the problem of designing scaling strategies for elastic data stream processing. Elasticity allows applications to rapidly change their configuration on-the-fly (e.g., the amount of used resources) in response to dynamic workload fluctuations. In this work we face this problem by adopting the Model Predictive Control technique, a control-theoretic method aimed at finding the optimal application configuration along a limited prediction horizon in the future by solving an online optimization problem. Our control strategies are designed to address latency constraints, using Queueing Theory models, and energy consumption, by changing the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support available in modern multicore CPUs. The proactive capabilities, in addition to the latency- and energy-awareness, represent the novel features of our approach. To validate our methodology, we develop a thorough set of experiments on a high-frequency trading application. The results demonstrate the high degree of flexibility and configurability of our approach, and show the effectiveness of our elastic scaling strategies compared with existing state-of-the-art techniques used in similar scenarios.
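
The latency model such a strategy relies on can be illustrated with a basic queueing estimate (a much-simplified sketch of the idea, not the controller from the paper): for each candidate pair of core count and CPU frequency, estimate response time with an M/M/1-style formula and keep the cheapest configuration that satisfies the latency bound.

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Config {
    int    cores;
    double freq_ghz;
};

// Pick the configuration with the lowest (rough) power cost that still meets the
// latency bound. Service capacity is modelled as cores * freq * base_rate, the
// response time with the M/M/1 formula W = 1 / (mu - lambda), and power as
// proportional to cores * freq^3, a common DVFS approximation.
Config choose_config(const std::vector<Config>& candidates,
                     double arrival_rate,       // tuples per second (lambda)
                     double base_rate_per_ghz,  // tuples per second per core per GHz
                     double latency_bound_s) {
    Config best{0, 0.0};
    double best_power = std::numeric_limits<double>::infinity();

    for (const Config& c : candidates) {
        const double mu = c.cores * c.freq_ghz * base_rate_per_ghz;
        if (mu <= arrival_rate) continue;                 // would be unstable
        const double latency = 1.0 / (mu - arrival_rate); // M/M/1 response time
        if (latency > latency_bound_s) continue;
        const double power = c.cores * std::pow(c.freq_ghz, 3.0);
        if (power < best_power) {
            best_power = power;
            best = c;
        }
    }
    return best;   // {0, 0.0} means no candidate met the bound
}
```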
