Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight

References

  1. Adams, M.F., Brown, J., Shalf, J., Straalen, B.V., Strohmaier, E., Williams, S.: HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems. Lawrence Berkeley National Lab, Berkeley (2014)
  2. Aldinucci, M., Danelutto, M., Drocco, M., Kilpatrick, P., Misale, C., Peretti Pezzi, G., Torquati, M.: A parallel pattern for iterative stencil + reduce. J. Supercomput. 74(11), 5690–5705 (2018). https://doi.org/10.1007/s11227-016-1871-z
  3. Ao, Y., Liu, Y., Yang, C., Liu, F., Zhang, P., Lu, Y., Du, Y.: Performance Evaluation of HPGMG on Tianhe-2: Early Experience, pp. 230–243. Springer, Cham (2015)
  4. Ao, Y., Yang, C., Wang, X., Xue, W., Fu, H., Liu, F., Gan, L., Xu, P., Ma, W.: 26 PFLOPS stencil computations for atmospheric modeling on Sunway TaihuLight. In: 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, Orlando, May 29–June 2, 2017, pp. 535–544 (2017)
  5. Basu, P., Hall, M., Williams, S., Straalen, B.V., Oliker, L., Colella, P.: In: 2015 IEEE International Parallel and Distributed Processing Symposium
  6. Basu, P., Hall, M., Williams, S., Van Straalen, B., Oliker, L.: Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid, pp. 9–16. Association for Computing Machinery, Inc (2014)
  7. Basu, P., Venkat, A., Hall, M., Williams, S., Van Straalen, B., Oliker, L.: Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid. IEEE Computer Society (2013)
  8. Basu, P., Williams, S., Van Straalen, B., Oliker, L., Colella, P., Hall, M.: Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers. Parallel Comput. 64(C), 50–64 (2017)
  9. Cao, W., Xu, C.F., Wang, Z.H., Yao, L., Liu, H.Y.: CPU/GPU computing for a multi-block structured grid based high-order flow solver on a large heterogeneous system. Clust. Comput. 17(2), 255–270 (2014). https://doi.org/10.1007/s10586-013-0332-1
  10. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pp. 4:1–4:12. IEEE Press, Piscataway (2008). http://dl.acm.org/citation.cfm?id=1413370.1413375
  11. Datta, K., Williams, S., Volkov, V., Carter, J., Oliker, L., Shalf, J., Yelick, K.: Auto-tuning Stencil Computations on Multicore and Accelerators. CRC Press, Boca Raton (2010)
  12. Dong, W., Kang, L., Quan, Z., Li, K., Li, K., Hao, Z., Xie, X.H.: Implementing molecular dynamics simulation on Sunway TaihuLight system. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 443–450 (2016). https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0070
  13. Dongarra, J.: Confessions of an accidental benchmarker. http://sc13.supercomputing.org/sites/default/files/WorkshopsArchive/pdfs/wp156s1.pdf
  14. Dongarra, J., Heroux, M.A., Luszczek, P.: High-performance conjugate-gradient benchmark: a new metric for ranking high-performance computing systems. Int. J. High Perform. Comput. Appl. 30, 3–10 (2015). https://doi.org/10.1177/1094342015593158
  15. Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurr. Comput. 15, 803–820 (2003). https://doi.org/10.1002/cpe.728
  16. Fu, H., He, C., Chen, B., Yin, Z., Zhang, Z., Zhang, W., Zhang, T., Xue, W., Liu, W., Yin, W., Yang, G., Chen, X.: 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, pp. 2:1–2:12. ACM, New York (2017)
  17. Fu, H., Liao, J., Ding, N., Duan, X., Gan, L., Liang, Y., Wang, X., Yang, J., Zheng, Y., Liu, W., Wang, L., Yang, G.: Redesigning CAM-SE for peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, pp. 1:1–1:12. ACM, New York (2017)
  18. Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C., Xue, W., Liu, F., Qiao, F., Zhao, W., Yin, X., Hou, C., Zhang, C., Ge, W., Zhang, J., Wang, Y., Zhou, C., Yang, G.: The Sunway TaihuLight supercomputer: system and applications. Sci. China Inf. Sci. 59, 072001 (2016). https://doi.org/10.1007/s11432-016-5588-7
  19. Hagedorn, B., Stoltzfus, L., Steuwer, M., Gorlatch, S., Dubach, C.: High performance stencil code generation with Lift. In: CGO. ACM, pp. 100–112 (2018)
  20. Holewinski, J., Pouchet, L.N., Sadayappan, P.: High-performance code generation for stencil computations on GPU architectures. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pp. 311–320. ACM, New York (2012)
  21. https://graph500.org (2017)
  22. Jiang, L., Yang, C., Ao, Y., Ma, W.: Towards highly efficient DGEMM on the emerging SW26010 many-core processor. In: The 46th International Conference on Parallel Processing (2017)
  23. Köstler, H., Feichtinger, C., Rüde, U., Aoki, T.: A Geometric Multigrid Solver on Tsubame 2.0, pp. 155–173. Springer Berlin Heidelberg, Berlin, Heidelberg (2014)
  24. Köstler, H., Ritter, D., Feichtinger, C.: A Geometric Multigrid Solver on GPU Clusters, pp. 407–422. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
  25. Kwack, J., Bauer, G.H.: HPCG and HPGMG Benchmark Tests on Multiple Program, Multiple Data (MPMD) Mode on Blue Waters—A Cray XE6/XK7 Hybrid System. https://cug.org/proceedings/cug2017_proceedings/includes/files/pap118s2-file1.pdf (2017)
  26. Ma, W., Gao, K., Long, G.: Highly optimized code generation for stencil codes with computation reuse for GPUs. J. Comput. Sci. Technol. 31(6), 1262–1274 (2016)
  27. Maruyama, N., Aoki, T.: Optimizing Stencil Computations for NVIDIA Kepler GPUs (2014)
  28. Meuer, H., Strohmaier, E., Dongarra, J., Simon, H., Martin, M.: Top 500 Supercomputer Lists (2016). http://www.top500.org
  29. Nguyen, A., Satish, N., Chhugani, J., Kim, C., Dubey, P.: 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13 (2010)
  30. Qiao, F., Zhao, W., Yin, X., Huang, X., Liu, X., Shu, Q., Wang, G., Song, Z., Li, X., Liu, H., Yang, G., Yuan, Y.: A highly effective global surface wave numerical simulation with ultra-high resolution. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pp. 5:1–5:11. IEEE Press, Piscataway (2016). http://dl.acm.org/citation.cfm?id=3014904.3014911
  31. Sakharnykh, N.: https://github.com/e-ago/hpgmg-cuda-async (2016)
  32. Sakharnykh, N.: Beyond GPU Memory Limits with Unified Memory on Pascal. https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/ (2016)
  33. Stock, K., Kong, M., Grosser, T., Pouchet, L.N., Rastello, F., Ramanujam, J., Sadayappan, P.: A framework for enhancing data reuse via associative reordering. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pp. 65–76. ACM, New York (2014)
  34. Tan, G., Li, L., Triechle, S., Phillips, E., Bao, Y., Sun, N.: Fast implementation of DGEMM on Fermi GPU. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p. 35. ACM (2011)
  35. Williams, S.: HPGMG. https://crd.lbl.gov/assets/pubs_presos/HPGMG-FV-FF2-Proxy-App.pdf
  36. Williams, S., Kalamkar, D.D., Singh, A., Deshpande, A.M., Straalen, B.V., Smelyanskiy, M., Almgren, A., Dubey, P., Shalf, J., Oliker, L.: Optimization of geometric multigrid for emerging multi- and manycore processors. In: High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pp. 1–11 (2012)
  37. Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The potential of the cell processor for scientific computing. In: Proceedings of the 3rd Conference on Computing Frontiers, CF ’06, pp. 9–20. ACM, New York (2006)
  38. Yang, C., Xue, W., Fu, H., You, H., Wang, X., Ao, Y., Liu, F., Gan, L., Xu, P., Wang, L., Yang, G., Zheng, W.: 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pp. 6:1–6:12. IEEE Press, Piscataway (2016)
  39. Zhang, J., Zhou, C., Wang, Y., Ju, L., Du, Q., Chi, X., Xu, D., Chen, D., Liu, Y., Liu, Z.: Extreme-scale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 34–45 (2016)
  40. Zhang, Y., Mueller, F.: Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In: 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2012, San Jose, March 31–April 04, 2012, pp. 155–164 (2012)
