Peng Wu - Academia.edu
Papers by Peng Wu
Communications of the ACM, 2001
When Java (TM) was first introduced, there was a perception (properly founded at the time) that its many benefits came at a significant performance cost. In few areas were the performance deficiencies of Java so blatant as in numerical computing. Our own measurements, with second-generation Java virtual machines, showed differences in performance of up to one hundred-fold relative to C or Fortran. The initial experiences with such poor performance caused many developers of high-performance numerical applications to reject Java out of hand as a platform for their applications.
One glaring weakness of Java for numerical programming is its lack of support for complex numbers. Simply creating a Complex number class leads to poor performance relative to Fortran. We show in this paper, however, that the combination of such a Complex class and a compiler that understands its semantics does indeed lead to Fortran-like performance. This performance gain is achieved while leaving the Java language completely unchanged and maintaining full compatibility with existing Java Virtual Machines. We quantify the effectiveness of our approach through experiments with linear algebra, electromagnetics, and computational fluid-dynamics kernels.
IBM Journal of Research and Development, 2005
We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Gene®/L machine and the more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.
Communications of the ACM, 2008
Transactional memory (TM) [13] is a concurrency control paradigm that provides atomic and isolated execution for regions of code. TM is considered by many researchers to be one of the most promising solutions to address the problem of programming multicore processors. Its most appealing feature is that most programmers only need to reason locally about shared data accesses, mark the code region to be executed transactionally, and let the underlying system ensure the correct concurrent execution. This model promises to provide the scalability of fine-grained locking while avoiding common pitfalls of lock composition such as deadlock. In this article, we explore the performance of a highly optimized STM and observe that the overall performance of TM is much worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm. Different implementations of transactional memory systems make trade-offs that impact both performance and programmability. Larus and Rajwar [16] present an overview of design trade-offs for implementations of transactional memory systems. We summarize some of the design choices here:
In 1994, the first multimedia extension, MAX-1, was introduced to general-purpose processors by HP. Almost ten years later, the means of accessing the computing power of multimedia extensions are still mostly limited to assembly programming, intrinsic functions, and the use of system libraries. Because of the similarities between multimedia extensions and vector processors, it is believed that traditional vectorization can be used to compile for multimedia extensions. Can traditional vectorization effectively vectorize multimedia applications for multimedia extensions? If not, what additional techniques are needed? To answer these two questions, we conducted a code study on the Berkeley Multimedia Workload. Through this study, we identified several new challenges that arise in vectorizing for multimedia extensions and proposed some solutions to these challenges.
The widespread presence of SIMD devices in today's microprocessors has made compiler techniques for these devices tremendously important. One of the most important and difficult issues that must be addressed by these techniques is the generation of the data permutation instructions needed for non-contiguous and misaligned memory references. These instructions are expensive and, therefore, it is of crucial importance to minimize their number to improve performance and, in many cases, enable speedups over scalar code.
SIGPLAN Notices, 2003
Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose these parameters. It is widely believed that model-driven optimization does not compete with empirical optimization, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the empirical optimization engine in ATLAS (a system for generating a dense numerical linear algebra library called the BLAS) with a model-driven optimization engine that used detailed models to estimate values for optimization parameters, and then measured the relative performance of the two systems on three different hardware platforms. Our experiments show that model-driven optimization can be surprisingly effective, and can generate code whose performance is comparable to that of code generated by empirical optimizers for the BLAS.
Chemical Communications, 2005
Dendrimers. Multivalent, Bifunctional Dendrimers Prepared by Click Chemistry [containing mannose binding units and coumarin fluorescent units]. (WU, P.; MALKOCH, M.; HUNT, J. N.; VESTBERG, R.; KALTGRAD, E.; FINN*, M. G.; FOKIN, V. V.; SHARPLESS*, K. B.; HAWKER, C. J.; Chem. Commun. (Cambridge) 2005, 46, 5775-5777; Dep. Chem., Scripps Res. Inst.,
Macromolecules, 2005
The high fidelity and efficiency of Click chemistry are exploited in the synthesis of a library of chain-end-functionalized dendritic macromolecules. In this example, the selectivity of the Cu-catalyzed [3 + 2π] cycloaddition reaction of azides with terminal acetylenes, coupled with mild reaction conditions, permits unprecedented functional group tolerance during the derivatization of dendrimeric and hyperbranched scaffolds. The resulting dendritic libraries are structurally diverse, encompassing a variety of backbones/surface functional groups, and are prepared in almost quantitative yields under very mild conditions. The robust and simple nature of this procedure, combined with its applicability to many aspects of polymer synthesis and materials chemistry, demonstrates an evolving synergy between advanced organic chemistry and functional materials.
Angewandte Chemie International Edition, 2004
... Peng Wu, Alina K. Feldman, Anne K. Nugent, Craig J. Hawker,* Arnulf Scheel, Brigitte Voit, Jeffrey Pyun, Jean M. J. Fréchet, K. Barry ... All second-generation dendrons were isolated as pure white solids by simple filtration or aqueous workup in yields exceeding 90% for the ...
Richard Eckersley, Richard Angstadt, Charles M. Ellerston, Richard Hendel, Naomi B. Pascal, and Anita Walker Scott. Writing Ethnographic Fieldnotes. Robert M. Emerson, Rachel I. Fretz, and Linda L. Shaw. A True Story. As we were preparing the second edition, Booth got a call from a former student who, as had all of his students, been directed again and again by Booth to revise his work. Now a professional in his mid-forties, he called to tell Booth about a dream he had had the night before: "You were standing before Saint Peter at the Pearly Gates," he said, "hoping for admission. He looked at you, hesitant and dubious, then finally said, 'Sorry, Booth, we need another draft.'"