Scientific Computing Kernels on the Cell Processor
References
S. Williams, J. Shalf, L. Oliker, et al., The Potential of the Cell Processor for Scientific Computing, Computing Frontiers, pp. 9–20 (May 2006).
M. Kondo, H. Okawara, H. Nakamura, et al., Scima: A Novel Processor Architecture for High Performance Computing, 4th International Conference on High Performance Computing in the Asia Pacific Region, volume 1, pp. 355–360 (May 2000).
P. Keltcher, S. Richardson, S. Siu, et al., An Equal Area Comparison of Embedded DRAM and SRAM Memory Architectures for a Chip Multiprocessor. Technical report, HP Laboratories (April 2000).
S. Tomar, S. Kim, N. Vijaykrishnan, et al., Use of Local Memory for Efficient Java Execution, Proceedings of the International Conference on Computer Design, pp. 468–473 (September 2001).
M. Kandemir, J. Ramanujam, M. Irwin, et al., Dynamic Management of Scratch-pad Memory Space, Proceedings of the Design Automation Conference, pp. 690–695 (June 2001).
P. Francesco, P. Marchal, D. Atienza, et al., An Integrated Hardware/Software Approach for Run-time Scratchpad Management, Proceedings of the 41st Design Automation Conference, pp. 238–243 (June 2004).
B. Khailany, W. Dally, S. Rixner, et al., Imagine: Media Processing with Streams, IEEE Micro, 21(2):35–46 (March–April 2001).
M. Oka and M. Suzuoki, Designing and Programming the Emotion Engine, IEEE Micro, 19(6):20–28 (November 1999).
A. Kunimatsu, N. Ide, T. Sato, et al., Vector Unit Architecture for Emotion Synthesis, IEEE Micro, 20(2):40–47 (March 2000).
M. Suzuoki, et al., A Microprocessor with a 128-bit CPU, Ten Floating-point MACs, Four Floating-point Dividers, and an MPEG-2 Decoder, IEEE Journal of Solid-State Circuits, 34(11):1608–1618 (November 1999).
B. Flachs, S. Asano, S.H. Dhong, et al., A Streaming Processing Unit for a Cell Processor, ISSCC Dig. Tech. Papers, pp. 134–135 (February 2005).
D. Pham, S. Asano, M. Bollier, et al., The Design and Implementation of a First-generation Cell Processor, ISSCC Dig. Tech. Papers, pp. 184–185 (February 2005).
S. M. Mueller, C. Jacobi, C. Hwa-Joon, et al., The Vector Floating-point Unit in a Synergistic Processor Element of a Cell Processor, 17th IEEE Annual Symposium on Computer Arithmetic (ARITH), pp. 59–67 (June 2005).
J. A. Kahle, M. N. Day, H. P. Hofstee, et al., Introduction to the Cell Multiprocessor. IBM Journal of R&D, 49(4) (2005).
N. Park, B. Hong, and V. K. Prasanna, Analysis of Memory Hierarchy Performance of Block Data Layout. International Conference on Parallel Processing (ICPP), p. 35 (August 2002).
L. Cannon, A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, p. 228 (1969).
Cell Broadband Engine Architecture and its First Implementation. http://www-128.ibm.com/developerworks/power/library/pa-cellperf/
Y. Saad, Iterative Methods for Sparse Linear Systems, PWS, Boston, MA (1996).
G. Blelloch, M. Heroux, and M. Zagha, Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors, Technical Report CMU-CS-93-173, CMU (1993).
R. Vuduc, Automatic Performance Tuning of Sparse Matrix Kernels, PhD thesis, University of California at Berkeley (2003).
E.-J. Im, K. Yelick, and R. Vuduc, Sparsity: Optimization Framework for Sparse Matrix Kernels, International Journal of High Performance Computing Applications, pp. 135–158 (2004).
E. F. D’Azevedo, M. R. Fahey, and R. T. Mills, Vectorized Sparse Matrix Multiply for Compressed Row Storage Format, International Conference on Computational Science (ICCS), pp. 99–106 (2005).
Z. Li and Y. Song, Automatic Tiling of Iterative Stencil Loops, ACM Transactions on Programming Languages and Systems, 26(6):975–1028 (2004).
David Wonnacott, Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations. International Parallel and Distributed Processing Symposium (IPDPS), pp. 171–180 (2000).
G. Jin, J. Mellor-Crummey, and R. Fowler, Increasing Temporal Locality with Skewing and Recursive Blocking, Proc. SC2001 (2001).
S. Kamil, K. Datta, S. Williams, et al., Implicit and Explicit Optimizations for Stencil Computations, ACM Workshop on Memory System Performance and Correctness, pp. 51–60 (October 2005).
S. Kamil, P. Husbands, L. Oliker, et al., Impact of Modern Memory Subsystems on Cache Optimizations for Stencil Computations, ACM Workshop on Memory System Performance, pp. 36–43 (June 2005).
L. Oliker, R. Biswas, J. Borrill, et al., A Performance Evaluation of the Cray X1 for Scientific Applications, Proc. 6th International Meeting on High Performance Computing for Computational Science, pp. 51–65 (2004).
A. Chow, G. Fossum, and D. Brokenshire, A Programming Example: Large FFT on the Cell Broadband Engine, Proceedings of the 2005 Global Signal Processing Expo (GSPx) (October 2005).
J. Greene and R. Cooper, A Parallel 64k Complex FFT Algorithm for the IBM/Sony/Toshiba Cell Broadband Engine Processor, Proceedings of the 2005 Global Signal Processing Expo (GSPx) (October 2005).