Multithreading Research Papers - Academia.edu
To support their massively multithreaded architecture, GPUs use a very large register file (RF) whose capacity exceeds even that of the L1 and L2 caches. In stark contrast, traditional CPUs use a tiny RF and much larger caches to optimize latency. Because of these differences, and because the RF has a crucial impact on GPU performance, novel and intelligent techniques are required for managing the GPU RF. In this paper, we survey techniques for designing and managing the GPU RF. We discuss techniques related to the performance, energy and reliability aspects of the RF. To emphasize the similarities and differences between the techniques, we classify them along several parameters. The aim of this paper is to synthesize state-of-the-art developments in RF management and to stimulate further research in this area.
The problem of testing randomness is motivated by the need to evaluate the quality of the random number generators used in many practical applications, including computer simulation, cryptography and communications. In particular, the quality of the generated random numbers affects the quality of such applications. In this study, the authors focus on one of the most popular approaches for testing randomness, the Poker test. Two versions of the Poker test are known: the classical Poker test and the approximated Poker test; the latter was motivated by the difficulty of implementing the classical approach at the time it was designed. The paper is motivated by practical applications such as cryptography and Monte Carlo simulation. Moreover, pseudo-random numbers are often required for simulations performed on parallel computers. This motivates a parallel implementation of the classical Poker test in MATLAB using a MEX-file (MEX stands for MATLAB Executable) with one, two, three and four threads, and the authors compare the computational performance of these configurations. The results show that the speedup with two threads is close to that with three threads, and both are greater than with one thread. With four threads, however, the speedup is significantly greater than with one, two or three threads.
- by Wael Abdel-Rehim and +1
- Cryptography, Matlab, Multithreading, Tests for randomness
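As a rough illustration of the classical Poker test described in the abstract above, the following Python sketch groups a digit stream into hands of five, classifies each hand by its pattern of repeated values, and computes a chi-square statistic against the exact pattern probabilities. It is not the authors' MATLAB/MEX implementation; the function names, the choice of decimal digits (`d=10`) and the chunking scheme are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import product

HAND = 5  # digits per hand (assumed; the usual choice for the poker test)

def pattern(hand):
    """Classify a hand by its sorted multiplicities, e.g. (2, 1, 1, 1) is one pair."""
    return tuple(sorted(Counter(hand).values(), reverse=True))

def expected_probs(d=10):
    """Exact pattern probabilities, found by enumerating all d**HAND equally likely hands."""
    counts = Counter(pattern(h) for h in product(range(d), repeat=HAND))
    total = d ** HAND
    return {p: c / total for p, c in counts.items()}

def count_patterns(digits):
    """Tally the patterns of consecutive, non-overlapping hands in one chunk."""
    return Counter(pattern(digits[i:i + HAND])
                   for i in range(0, len(digits) - HAND + 1, HAND))

def poker_chi2(digits, d=10, workers=4):
    """Chi-square statistic of the poker test, with the tally split across threads."""
    n_hands = len(digits) // HAND
    if n_hands == 0:
        raise ValueError("need at least one full hand")
    per = -(-n_hands // workers) * HAND  # chunk length, always a multiple of HAND
    chunks = [digits[i:i + per] for i in range(0, n_hands * HAND, per)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        observed = sum(pool.map(count_patterns, chunks), Counter())
    return sum((observed[p] - n_hands * q) ** 2 / (n_hands * q)
               for p, q in expected_probs(d).items())
```

With `d=10` there are seven patterns, so the statistic would be compared against a chi-square distribution with six degrees of freedom. In CPython the global interpreter lock limits the speedup threads can give to pure-Python counting, so this sketch shows the partition-and-merge structure rather than the speedups reported in the abstract.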
Initially introduced as special-purpose accelerators for graphics applications, GPUs have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU-GPU heterogeneous computing, demand effective cache management to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to give readers insight into cache management techniques for GPUs and to motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.
A Chunk List is a new, concurrent, chunk-based data structure that is easily modifiable and allows for fast run-time operations.
- by Hervé Jourdren and +2
- Parallel Programming, Multithreading, High performance, Debugging
A parallel programming model is a set of software technologies for expressing parallel algorithms and matching applications to the underlying parallel systems. It encompasses applications, languages, libraries, compilers, communication systems, and parallel I/O. Programmers have to choose a suitable parallel programming model, or a mixture of models, to develop parallel applications on a particular platform. Multithreaded programs are written in many programming languages, usually with increased performance. Thread libraries can be implicit or explicit. OpenMP, MPI and Intel Threading Building Blocks (TBB) are implicit thread libraries. Pthreads and Windows Threads are explicit thread libraries. Threads can be accessed through different programming interfaces. Many software libraries provide an interface for threads, usually based on the POSIX Threads, Windows Threads, OpenMP, MPI and Threading Building Blocks frameworks. These frameworks provide different levels of abstraction from the operating system's underlying thread implementation. General parallelism is the execution of separate tasks in parallel. These multithreading libraries provide different features. For example, Java supports flexible and easy use of threads, yet it has no built-in methods for setting the affinity of threads to processors, because green threads are scheduled by the virtual machine itself, whereas Windows or POSIX threads can fix thread affinity. Native threads are scheduled by the operating system that hosts the virtual machine. This research gives an overview of how Java can use Win32, POSIX and TBB threads through JNI, which enables Java threads to add features of other threading libraries to Java.
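The explicit/implicit distinction drawn in the abstract above can be sketched in Python, chosen here only for brevity. The mapping is an analogy and an assumption of this example: `threading.Thread` plays the role of an explicit library, where the programmer creates and joins each thread, and `ThreadPoolExecutor` plays the role of an implicit one, where a runtime decides how tasks map onto threads. The function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

def explicit_squares(values):
    """Explicit style: the programmer creates, starts and joins every thread."""
    results = [None] * len(values)

    def work(i):
        results[i] = square(values[i])  # each thread writes only its own slot

    threads = [threading.Thread(target=work, args=(i,)) for i in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def implicit_squares(values, workers=4):
    """Implicit style: the pool decides which thread runs which task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))
```

Both functions return the same list; the difference lies in who manages thread lifetimes and scheduling, which is the level-of-abstraction point the abstract makes.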
All new computers have multicore processors. To exploit this hardware parallelism for improved performance, the predominant approach today is multithreading using shared variables and locks. This approach has potential data races that can create a nondeterministic program. This paper presents a promising new approach to parallel programming that is both lock-free and deterministic. The standard forall primitive for parallel execution of for-loop iterations is extended into a more highly structured primitive called a Parallel Operation (POP). Each parallel process created by a POP may read shared variables (or shared collections) freely. Shared collections modified by a POP must be selected from a special set of predefined Parallel Access Collections (PAC). Each PAC has several Write Modes that govern parallel updates in a deterministic way. This paper presents an overview of a Prototype Library that implements this POP-PAC approach for the C++ language, including performance results for two benchmark parallel programs.
Models are used in performance analysis when the analyst needs to predict the effect of system changes that go beyond what can be measured. The model can be obtained from a combination of system knowledge and experimentation. This thesis addresses an experimental approach to obtaining layered queueing network (LQN) models of distributed systems. It applies an approach called SAME (Software Architecture and Model Extraction), which was developed to interpret application-level traces, and extends it to interpreting kernel-level traces. Kernel-level traces have the benefit that application instrumentation is not required and that communication with attached devices can be modeled, but they lack application context information. The research shows that modeling from kernel traces is feasible in systems that communicate via TCP messages, including Java remote procedure calls. This covers most web-based systems. Systems using middleware pose special problems. Some experiments combined kernel-level and application-level tracing. Tools are described that adapt the kernel traces to SAME and that extract calibration information for CPU demand parameters.
API Testing + Pre-Load Testing + Azure
Multithreading as Pre-Load Testing
Scaling in Azure
Database Throughput Unit (DTU): DTUs provide a way to describe the relative capacity of a performance level. DTUs are based on a blended measure of CPU, memory, reads, and writes. As DTUs increase, the power offered by the performance level increases.
Agent-based modeling (ABM) is a bottom-up modeling approach, where each entity of the system being modeled is uniquely represented as an independent decision-making agent. Large-scale emergent behavior in ABMs is population sensitive. As such, the number of agents in a simulation should be able to reflect the reality of the system being modeled, which can be on the order of millions or billions of individuals in certain domains. A natural way to reach acceptable scalability on commodity multi-core processors is to decompose models such that each component can be processed independently and concurrently by a different thread. In this paper we present a multithreaded Java implementation of the PPHPC ABM, with two goals in mind: 1) compare the performance of this implementation with an existing NetLogo implementation; and, 2) study how different parallelization strategies impact simulation performance on a shared memory architecture. Results show that: 1) model parallelization can yield considerable performance gains; 2) distinct parallelization strategies offer specific trade-offs in terms of performance and simulation reproducibility; and, 3) PPHPC is a valid reference model for comparing distinct implementations or parallelization strategies, from both performance and statistical accuracy perspectives.
Most of us have a large collection of images on our computers. The problem with having a lot of pictures is that duplicates tend to accumulate along the way, so it is prudent to manage storage space efficiently. Detecting duplicate images in a set of images is a time-consuming task that can be automated, and the duplicate data can then be removed to save space. As we use our phones more, unwanted duplicate photo and picture files accumulate on the device at random, potentially in every folder. Duplicate photos and pictures consume a lot of phone memory and slow down the phone's performance, and finding and removing them manually is difficult. Since human vision cannot reliably judge structural similarity with the naked eye, we propose a novel approach based on structural information degradation. As a practical solution to this problem, we create a structural similarity index and demonstrate it with a set of images from our database. Finding similar and duplicate photos among these samples can be time-consuming, and duplicate photo finders come in handy in this situation. Finally, we compare the computation time and power required when processing on multiple cores versus a single core, and provide benchmarks and graphical representations for each.
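To make the idea of a structural similarity index concrete, here is a minimal Python sketch. It assumes the widely used SSIM formula with the conventional constants 0.01 and 0.03, applied once over the whole image instead of over local windows, and it assumes grayscale images given as equal-length flat lists of pixel values. The abstract does not state which variant the authors used, and `find_duplicates` with its `threshold` parameter is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from statistics import fmean

def ssim(x, y, dynamic_range=255):
    """Global (single-window) SSIM of two equal-length lists of grayscale pixels."""
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))

def find_duplicates(images, threshold=0.95, workers=4):
    """Return name pairs whose SSIM is at least `threshold`; pairs are scored by a thread pool."""
    pairs = list(combinations(images, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(lambda p: ssim(images[p[0]], images[p[1]]), pairs)
    return [p for p, s in zip(pairs, scores) if s >= threshold]
```

Identical images score 1, and the score falls as structure diverges. Every pair is scored independently, which is what makes the workload easy to spread over several cores; a process pool or a native implementation would be needed to see real multi-core gains in CPython.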
Today, in the world of social media, many applications enable us to share data with people who are far away. These social media applications run on a variety of platforms. Our project is a desktop social media application through which we can chat and share files with people living in different parts of the world. The project was written in the Python programming language using its modules. It uses a client-server model with TCP for communication and has a simple GUI.
In counting active voter data in Indonesia, a system is sometimes needed to simplify the process. The process depends on the performance of the counting system, which must work quickly, precisely, and accurately. Here the authors develop a system to run this process using the PHP programming language. These requirements are difficult to meet if the system is developed with single threading. To overcome this problem, one solution is multithreading with pthreads, divided into several clusters running in parallel, on a computer with suitable specifications. This paper therefore presents a comparative analysis of vote counting using single-threaded and multithreaded PHP applications on computers with different specifications, with the aim of determining which vote-counting process, and what computer specifications, are needed to execute faster when producing active voter data for the election vote count.
With the increasing need for higher computational capacity and speed for visual data such as images and videos, multicore computing technology has become a viable solution. The Graphite multicore architecture simulator provides the necessary parallel computing environment not only for computational tests but also for optimized modelling. This paper discusses an efficient model for a multicore architecture based on an image processing algorithm, incorporating both Dynamic Voltage and Frequency Scaling (DVFS) and the concept of heterogeneity.
The aim of this work is to develop a producer-consumer application in Java. Using this application, the typical problems of concurrent applications are discussed, and the runtime behavior of the resulting application is observed and analyzed.
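The producer-consumer pattern this work is built around can be sketched briefly. The thesis itself targets Java; the sketch below uses Python for brevity and is not taken from the thesis. The bounded buffer size, the sentinel object used to signal the end of the stream, and the squaring step that stands in for real work are all assumptions of this example.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream (an assumption of this sketch)

def producer(buf, items):
    for item in items:
        buf.put(item)       # blocks while the bounded buffer is full
    buf.put(SENTINEL)

def consumer(buf, results):
    while True:
        item = buf.get()    # blocks while the buffer is empty
        if item is SENTINEL:
            break
        results.append(item * item)  # stand-in for real processing

def run(items, capacity=2):
    """Run one producer and one consumer over a bounded FIFO buffer."""
    buf = queue.Queue(maxsize=capacity)
    results = []
    threads = [threading.Thread(target=producer, args=(buf, items)),
               threading.Thread(target=consumer, args=(buf, results))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

`queue.Queue` handles the locking and the blocking on full and empty buffers internally, which are the classic problem areas of this pattern. With a single producer and a single consumer the FIFO buffer preserves order, so `run(range(5))` returns `[0, 1, 4, 9, 16]`.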
Current distributed computing systems composed of commodity computers, such as Networks of Workstations (NOW), are obliged to deploy multicore processors to raise their performance. However, because multicore processors did not exist when traditional standard programming models and APIs for distributed computing such as MPI and PVM were designed, these traditional models are not suitable for programming multicore processors. In this paper, we argue