Wu Feng - Academia.edu
Papers by Wu Feng
Proceedings. 13th International Conference on Computer Communications and Networks (IEEE Cat. No.04EX969)
The cost of data transfers over PCI-Express often limits application performance on traditional discrete GPUs. To address this, AMD Fusion introduces a novel architecture that fuses the CPU and GPU onto a single die and connects the two with a high-performance memory controller. This architecture features a shared memory space between CPU and GPU, enabling several new memory-access techniques that are not available on discrete architectures. For instance, a kernel running on the GPU can now directly access a host memory buffer, and vice versa. As an initial step towards understanding the implications of the fused CPU+GPU architecture for heterogeneous computing, we characterize the performance impact of various memory-access techniques on applications running on an AMD Fusion platform (i.e., Llano A8-3850). The experimental results show that AMD Fusion can outperform a discrete GPU of the same performance class by as much as 4-fold for a memory-bound kernel.
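The zero-copy access pattern described here (a GPU kernel touching a host buffer in place rather than staging it over PCI-Express) is what the paper evaluates with OpenCL on the Llano APU. Purely as a hedged illustration of the same idea, the CUDA sketch below uses mapped, page-locked host memory; the kernel and sizes are invented for the example, not taken from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: reads and writes the mapped host buffers in place.
__global__ void scale(const float *in, float *out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;

    cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapped host memory

    // Page-locked host allocations that the GPU can address directly.
    cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Device-visible aliases of the host buffers: no explicit cudaMemcpy.
    cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", h_out[42]);              // result visible on the host
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

On a fused CPU+GPU part the same access avoids the PCI-Express hop entirely, which is where the paper's advantage for memory-bound kernels comes from.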
Multicore processors have quickly become ubiquitous in supercomputing, cluster computing, datacenter computing, and even personal computing. Software advances, however, continue to lag behind. In the past, software designers could simply rely on clock-speed increases to improve the performance of their software. With clock speeds now stagnant, software designers need to tap into the increased horsepower of multiple cores in a processor by creating software artifacts that support parallelism. Rather than forcing designers to write such software artifacts from scratch, we propose a pluggable framework that designers can reuse for lightweight task offloading in a parallel computing environment of multiple cores, whether those cores are co-located on a processor within a compute node, spread across compute nodes in a tightly-coupled system like a supercomputer, or spread across compute nodes in a loosely-coupled one like a cloud computer. To demonstrate the efficacy of our framework, we use it to implement lightweight task offloading (or software acceleration) for a popular parallel sequence-search application called mpiBLAST. Our experimental results on a 9-node, 36-core AMD Opteron cluster show that using mpiBLAST with our pluggable framework results in a 205% speed-up.
The general-purpose graphics processing unit (GPGPU) continues to make significant strides in high-end computing by delivering unprecedented performance at a commodity price. However, the many-core architecture of the GPGPU currently allows only data-parallel applications to extract the full potential of the hardware. Applications that require frequent synchronization during their execution do not experience much performance gain from the GPGPU. This is mainly due to the lack of explicit hardware or software support for inter-thread communication across the entire GPGPU chip. In this paper, we design, implement, and evaluate a highly efficient software barrier that synchronizes all the thread blocks of an offloaded kernel on the GPGPU without having to transfer execution control back to the host processor. We show that our custom software barrier achieves a threefold performance improvement over the existing approach, i.e., synchronization via the host processor. To illustrate the aforementioned performance benefit, we parallelize a data-serial application, specifically an optimal sequence-search algorithm called Smith-Waterman (SWat), that requires frequent barrier synchronization across the many cores of the NVIDIA GeForce GTX 280 GPGPU. Our parallelization consists of a suite of optimization techniques: optimal data layout, coalesced memory accesses, and blocked data decomposition. When coupled with our custom software-barrier implementation, we achieve nearly a nine-fold speed-up over the serial implementation of SWat. We also show that our solution delivers 25× faster on-chip execution than the naïve implementation.
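As a rough sketch of what such a device-side barrier can look like (illustrative code, not the paper's implementation), the fragment below synchronizes all thread blocks through a single global counter. It only works when every block of the kernel is resident on the GPU at once, which is the standard caveat for this technique; the names, grid size, and two-phase kernel are invented for the example.

```cuda
#include <cuda_runtime.h>

// Grid-wide software barrier on one global counter (assumes all blocks resident).
// The caller passes goal = k * gridDim.x on the k-th crossing so the counter
// never needs to be reset between barriers.
__device__ void gpu_barrier_counter(unsigned int *count, unsigned int goal) {
    __syncthreads();                     // every thread of this block has arrived
    __threadfence();                     // publish this block's prior global writes
    if (threadIdx.x == 0) {
        atomicAdd(count, 1u);            // announce the block's arrival
        while (atomicAdd(count, 0u) < goal) { /* spin until all blocks arrive */ }
    }
    __syncthreads();                     // release the rest of the block
}

// Usage: a kernel with a grid-wide dependency between two phases.
__global__ void two_phase(float *data, unsigned int *count) {
    // ... phase 1: each block writes its part of data ...
    gpu_barrier_counter(count, gridDim.x);       // 1st crossing
    // ... phase 2: blocks may now read what other blocks wrote ...
    gpu_barrier_counter(count, 2 * gridDim.x);   // 2nd crossing
}

int main() {
    unsigned int *count;
    float *data;
    cudaMalloc((void **)&count, sizeof(unsigned int));
    cudaMemset(count, 0, sizeof(unsigned int));
    cudaMalloc((void **)&data, 1024 * sizeof(float));
    // Keep the grid small enough that all blocks fit on the device at once.
    two_phase<<<8, 128>>>(data, count);
    cudaDeviceSynchronize();
    cudaFree(count);
    cudaFree(data);
    return 0;
}
```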
Proceedings of the …, May 17, 2010
Computational neuroscience is being revolutionized with the advent of multi-electrode arrays that provide real-time, dynamic perspectives into brain function. Mining neuronal spike streams from these chips is critical to understand the firing patterns of neurons and gain insight into the underlying cellular activity. To address this need, we present a solution that uses a massively parallel graphics processing unit (GPU) to mine the stream of spikes. We focus on mining frequent episodes that capture coordinated events ...
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70×, 13×, and 24×, respectively.
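For a sense of how the lock-free flavor avoids a contended counter, here is a hedged sketch in the same spirit: each block posts to its own arrival flag, and block 0 aggregates the flags before releasing everyone through per-block departure flags. The names are illustrative, this is a device-side fragment rather than the paper's code, and, as with any software grid barrier, all blocks must be co-resident on the GPU.

```cuda
// Lock-free grid-wide barrier using per-block arrive/depart flags.
// arrive[] and depart[] are zero-initialized int arrays of length gridDim.x in
// global memory, and 'goal' must change on every crossing (1, 2, 3, ...).
__device__ void gpu_barrier_lockfree(volatile int *arrive, volatile int *depart,
                                     int goal) {
    int bid = blockIdx.x;
    __syncthreads();
    __threadfence();                           // publish prior global writes
    if (threadIdx.x == 0) arrive[bid] = goal;  // this block has arrived

    if (bid == 0) {
        // Threads of block 0 collectively poll every block's arrival flag ...
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            while (arrive[i] != goal) { /* spin */ }
        __syncthreads();
        // ... and then release every block.
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            depart[i] = goal;
    }

    if (threadIdx.x == 0)
        while (depart[bid] != goal) { /* spin */ }
    __syncthreads();
}
```

On current CUDA hardware, a cooperatively launched kernel can use the built-in grid-wide sync of cooperative groups instead, but the hand-rolled version above conveys the mechanism the paper measures.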
We present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Use cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs.
2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
This paper presents an eco-friendly daemon that reduces power and energy consumption while better maintaining high performance via an accurate workload characterization that infers "processor stall cycles due to off-chip activities." The eco-friendly daemon is an interval-based, run-time algorithm that uses the workload characterization to dynamically adjust a processor's frequency and voltage to reduce power and energy consumption with little impact on application performance. Using the NAS Parallel Benchmarks as our workload, we then evaluate our eco-friendly daemon on a cluster computer. The results indicate that our workload characterization allows the power-aware daemon to more tightly control performance (5% loss instead of 11%) while delivering substantial energy savings (11% instead of 8%).
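As a hedged sketch of what such an interval-based control loop might look like (not the authors' daemon), the fragment below derives an off-chip intensity ratio from stall and total cycles each second and lowers the frequency when the processor is mostly waiting on memory or I/O. The counter readers are dummy stand-ins (real code would use performance counters), the frequency list and thresholds are invented, and the sysfs path assumes Linux's cpufreq userspace governor.

```cuda
#include <stdio.h>
#include <unistd.h>

// Stand-ins for real counter reads (e.g., via perf_event_open); they return
// fixed dummy values here so the sketch compiles and runs.
static unsigned long long read_total_cycles(void)         { return 1000000ULL; }
static unsigned long long read_offchip_stall_cycles(void) { return  600000ULL; }

// Write a target frequency to the cpufreq userspace governor (assumed path).
static void set_frequency_khz(unsigned int khz) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
    if (f) { fprintf(f, "%u\n", khz); fclose(f); }
}

int main(void) {
    const unsigned int freqs_khz[] = { 2400000, 1800000, 1200000 }; // illustrative
    for (;;) {                                           // daemon loop
        unsigned long long c0 = read_total_cycles();
        unsigned long long s0 = read_offchip_stall_cycles();
        sleep(1);                                        // one control interval
        double stalls = (double)(read_offchip_stall_cycles() - s0);
        double cycles = (double)(read_total_cycles() - c0);
        double beta   = cycles > 0 ? stalls / cycles : 0; // off-chip intensity

        // Mostly stalled on off-chip activity: a lower frequency costs little
        // time; mostly compute-bound: run at full speed. Thresholds are made up.
        if      (beta > 0.6) set_frequency_khz(freqs_khz[2]);
        else if (beta > 0.3) set_frequency_khz(freqs_khz[1]);
        else                 set_frequency_khz(freqs_khz[0]);
    }
    return 0;
}
```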
Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14, 2014
Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering - ICPE '15, 2015
As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput concurrent FIFO queue for many-core architectures that avoids the bottlenecks common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to run as much as three orders of magnitude (1000×) faster than lock-free and combining queues on GPU platforms and two times (2×) faster on CPU devices. These results deliver critical insight into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can actually increase throughput.
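To make the closing observation concrete, the host-side C++ sketch below shows a blocking, ticket-based bounded FIFO of the general kind the paper argues for (a sketch of the idea, not the paper's queue): each producer and consumer claims a slot with a single fetch_add and then waits on that slot's sequence number, so there is no retry loop and very little cross-thread contention. The class and its parameters are invented for illustration.

```cuda
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>

// Bounded MPMC ring buffer with per-slot sequence numbers (blocking, not lock-free).
template <typename T, size_t N>                 // N must be a power of two
class TicketQueue {
    struct Slot { std::atomic<uint64_t> seq; T val; };
    Slot slots[N];
    std::atomic<uint64_t> head{0}, tail{0};
public:
    TicketQueue() { for (size_t i = 0; i < N; ++i) slots[i].seq.store(i); }

    void enqueue(const T &v) {
        uint64_t t = tail.fetch_add(1);         // claim a slot with one atomic op
        Slot &s = slots[t & (N - 1)];
        while (s.seq.load(std::memory_order_acquire) != t) { /* wait for slot */ }
        s.val = v;
        s.seq.store(t + 1, std::memory_order_release);   // publish to dequeuers
    }

    T dequeue() {
        uint64_t h = head.fetch_add(1);         // claim the matching slot
        Slot &s = slots[h & (N - 1)];
        while (s.seq.load(std::memory_order_acquire) != h + 1) { /* wait for data */ }
        T v = s.val;
        s.seq.store(h + N, std::memory_order_release);   // recycle slot for producers
        return v;
    }
};

int main() {
    TicketQueue<int, 1024> q;
    std::thread prod([&] { for (int i = 0; i < 100000; ++i) q.enqueue(i); });
    long long sum = 0;
    for (int i = 0; i < 100000; ++i) sum += q.dequeue();
    prod.join();
    printf("sum = %lld\n", sum);
    return 0;
}
```

The queue is not lock-free (a stalled thread can delay its slot's successors), which is exactly the trade the paper shows can pay off in throughput.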
Proceedings 20th IEEE International Conference on Distributed Computing Systems, 2000
Several studies in network traffic characterization have concluded that network traffic is self-similar and therefore not readily amenable to statistical multiplexing in a distributed computing system. This paper examines the effects of the TCP protocol stack on network traffic via an experimental study of different implementations of TCP. We show that even when aggregate application traffic smooths out as more applications' traffic is multiplexed, TCP introduces burstiness into the aggregate traffic load, reducing network performance when statistical multiplexing is used within the network gateways.
ACM/IEEE SC 2000 Conference (SC'00), 2000
Distributed computational grids depend on TCP to ensure reliable end-to-end communication between nodes across the wide-area network (WAN). Unfortunately, TCP performance can be abysmal even when buffers on the end hosts are manually optimized. Recent studies blame the self-similar nature of aggregate network traffic for TCP's poor performance because such traffic is not readily amenable to statistical multiplexing in the Internet, and hence in computational grids. In this paper, we identify a source of self-similarity previously ignored, a source that is readily controllable: TCP itself. Via an experimental study, we examine the effects of the TCP stack on network traffic using different implementations of TCP. We show that even when aggregate application traffic ought to smooth out as more applications' traffic is multiplexed, TCP induces burstiness into the aggregate traffic load, thus adversely impacting network performance. Furthermore, our results indicate that TCP performance will worsen as WAN speeds continue to increase.
2009 IEEE International Symposium on Parallel & Distributed Processing, 2009
The graphics processing unit (GPU) has emerged as a computational accelerator that dramatically reduces the time to discovery in high-end computing (HEC). However, while today's state-of-the-art GPU can easily reduce the execution time of a parallel code by many orders of magnitude, it arguably comes at the expense of significant power and energy consumption. For example, the NVIDIA GTX 280 video card is rated at 236 watts, which is as much as the rest of a compute node, thus requiring a 500-W power supply. As a consequence, the GPU has been viewed as a "non-green" computing solution. This paper seeks to characterize, and perhaps debunk, the notion of a "power-hungry GPU" via an empirical study of the performance, power, and energy characteristics of GPUs for scientific computing. Specifically, we take an important biological code that runs in a traditional CPU environment and transform and map it to a hybrid CPU+GPU environment. The end result is that our hybrid CPU+GPU environment, hereafter referred to simply as the GPU environment, delivers an energy-delay product that is multiple orders of magnitude better than a traditional CPU environment, whether unicore or multicore.
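Since the headline result is stated in terms of the energy-delay product (EDP), a tiny worked example may help: with average power P and runtime T, energy is E = P·T and EDP = E·T = P·T², so a large reduction in runtime can swamp a higher power draw. The numbers below are illustrative only, not the paper's measurements.

```cuda
#include <cstdio>

// Energy-delay product for a run with average power p (watts) and time t (s).
static double edp(double p_watts, double t_seconds) {
    double energy = p_watts * t_seconds;   // joules
    return energy * t_seconds;             // joule-seconds
}

int main() {
    // Illustrative numbers only: a CPU-only run vs. a faster but hungrier CPU+GPU run.
    double cpu = edp(150.0, 1000.0);       // 150 W for 1000 s
    double gpu = edp(400.0,   40.0);       // 400 W for 40 s
    printf("CPU EDP = %.3e J*s\nGPU EDP = %.3e J*s\nratio = %.1fx\n",
           cpu, gpu, cpu / gpu);
    return 0;
}
```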
Lecture Notes in Computer Science, 2015
The computing community is facing several big data challenges due to the unprecedented growth in the volume and variety of data. Many large-scale Internet companies use distributed NoSQL data stores to mitigate these challenges. These NoSQL data-store installations require massive computing infrastructure, which consumes a significant amount of energy and contributes to operational costs. This cost is further aggravated by the lack of energy proportionality in servers. Therefore, in this paper, we study the energy proportionality of servers in the context of a distributed NoSQL data store, namely Apache Cassandra. Towards this goal, we measure the power consumption and performance of a Cassandra cluster. We then use power and resource provisioning techniques to improve the energy proportionality of the cluster and study the feasibility of achieving an energy-proportional data store. Our results show that a hybrid (i.e., power and resource) provisioning technique provides the best power savings: as much as 55%.
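For readers unfamiliar with the term, energy proportionality asks that a server's power track its utilization. One simple way to quantify the shortfall (an assumption on my part, not necessarily the metric used in the paper) is the average gap between the measured power curve and an ideal line that scales linearly from zero to peak power, as sketched below with made-up numbers.

```cuda
#include <cstdio>
#include <vector>

// Average gap between a measured power curve and the ideal proportional line
// (0 W at 0% load up to peak power at 100% load), as a fraction of peak power.
// Assumes util and power have the same length and are sorted by utilization.
static double proportionality_gap(const std::vector<double> &util,   // 0..1
                                  const std::vector<double> &power)  // watts
{
    double peak = power.back();
    double gap = 0.0;
    for (size_t i = 0; i < util.size(); ++i)
        gap += (power[i] - util[i] * peak) / peak;
    return gap / util.size();
}

int main() {
    // Illustrative utilization/power points for a server, not measured data.
    std::vector<double> util  = {0.0, 0.25, 0.5, 0.75, 1.0};
    std::vector<double> power = {120, 165,  200, 240,  280};
    printf("average proportionality gap = %.1f%% of peak power\n",
           100.0 * proportionality_gap(util, power));
    return 0;
}
```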
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 2011
It has traditionally been viewed that as the scale of a supercomputer increases, its energy efficiency decreases due to performance that scales sub-linearly and power consumption that scales at least linearly with size. However, based on the first three years of the Green500, this view does not hold true for the fastest supercomputers in the world. Many reasons for this counterintuitive trend have been proposed, with improvements in feature size, more efficient networks, and larger numbers of slower cores being amongst the most prevalent. Consequently, this paper provides an analysis of emerging trends in the Green500 and delves more deeply into how larger-scale supercomputers compete with smaller-scale supercomputers with respect to energy efficiency. In addition, our analysis provides a compelling early indicator of the future of exascale computing. We then close with a discussion on the evolution of the Green500 based on community feedback.
2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2014
The increasing demand for computation and the commensurate rise in the power density of data centers have led to increased costs associated with constructing and operating a data center. Exacerbating such costs, data centers are often over-provisioned to avoid costly outages associated with the potential overloading of electrical circuitry. However, such over-provisioning is often unnecessary since a data center rarely operates at its maximum capacity. It is imperative that we maximize the use of the available power budget in order to enhance the efficiency of data centers. On the other hand, introducing power constraints to improve the efficiency of a data center can cause unacceptable violations of performance agreements (i.e., throughput and response-time constraints). As such, we present a thorough empirical study of performance under power constraints as well as a runtime system that sets appropriate power constraints to meet strict performance targets. The runtime system is built on a load-prediction model and an optimization framework. We then present the effects of our runtime system on energy proportionality, average power, performance, and instantaneous power consumption of enterprise applications. Our results shed light on mechanisms to tune the power provisioned for a server under strict performance targets and on opportunities to improve energy proportionality and instantaneous power consumption via power limiting.
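The actuation step of such a runtime ultimately reduces to applying a package-level power cap. A minimal, hedged sketch of that single step on Linux via the RAPL powercap sysfs interface is below; the sysfs path varies by platform, the "predicted load" is a stand-in for the paper's prediction model, and the cap values are invented.

```cuda
#include <cstdio>

// Apply a package power cap in microwatts via the Linux powercap (RAPL) sysfs
// interface. The path below is the common intel-rapl layout; adjust per machine.
static int set_power_cap_uw(long long microwatts) {
    const char *path =
        "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw";
    FILE *f = fopen(path, "w");
    if (!f) { perror("open power cap"); return -1; }
    fprintf(f, "%lld\n", microwatts);
    fclose(f);
    return 0;
}

int main() {
    // Illustrative policy: cap tighter when predicted load is low.
    double predicted_load = 0.4;                       // stand-in for the model
    long long cap_uw = predicted_load < 0.5 ? 60000000LL   // 60 W
                                            : 95000000LL;  // 95 W
    if (set_power_cap_uw(cap_uw) == 0)
        printf("package power cap set to %.0f W\n", cap_uw / 1e6);
    return 0;
}
```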
Lecture Notes in Computer Science, 1993