DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures (original) (raw)

References

Parkhurst, J., Darringer, J., Grundmann, B.: From single core to multi-core: preparing for a new exponential. In: 2006 IEEE/ACM International Conference on Computer Aided Design, pp. 67–72 (2006). https://doi.org/10.1109/ICCAD.2006.320067
Wulf, W.A., McKee, S.A.: Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News 23(1), 20–24 (1995). https://doi.org/10.1145/216585.216588
Article Google Scholar
Patterson, D.A., Anderson, T.E., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C.E., Thomas, R., Yelick, K.A.: A case for intelligent RAM. IEEE Micro 17(2), 34–44 (1997). https://doi.org/10.1109/40.592312
Article Google Scholar
Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., Snelting, G.: Invasive computing: an overview. In: Multiprocessor System-on-Chip, pp. 241–268 (2011). https://doi.org/10.1007/978-1-4419-6460-1_11
Wentzlaff, D., Griffin, P., Hoffmann, H., Bao, L., Edwards, B., Ramey, C., Mattina, M., Miao III, C., Brown, J.F., Agarwal, A.: On-chip interconnection architecture of the tile processor. IEEE Micro 27(5), 15–31 (2007). https://doi.org/10.1109/MM.2007.89
Bell, S., Edwards, B., Amann, J., Conlin, R., Joyce, K., Leung, V., MacKay, J., Reif, M., Bao, L., III JFB, Mattina, M., Miao, C., Ramey, C., Wentzlaff, D., Anderson, W., Berger, E., Fairbanks, N., Khan, D., Montenegro, F., Stickney, J., Zook, J.: TILE64 - processor: a 64-Core SoC with mesh interconnect. In: 2008 IEEE International Solid-State Circuits Conference, ISSCC 2008, Digest of Technical Papers, San Francisco, CA, USA, February 3–7, 2008, IEEE, San Francisco, CA, pp 88–89 (2008). https://doi.org/10.1109/ISSCC.2008.4523070
Lotfi-Kamran, P., Grot, B., Ferdman, M., Volos, S., Kocberber, O., Picorel, J., Adileh, A., Jevdjic, D., Idgunji, S., Ozer, E., Falsafi, B.: Scale-out Processors. In: Proceedings of the 39th Annual International Symposium on Computer Architecture, IEEE Computer Society, USA, ISCA ’12, pp. 500–511 (2012)
Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, D., Ruhl, G., Jenkins, D., Wilson, H., Borkar, N., Schrom, G., Pailet, F., Jain, S., Jacob, T., Yada, S., Marella, S., Salihundam, P., Erraguntla, V., Konow, M., Riepen, M., Droege, G., Lindemann, J., Gries, M., Apel, T., Henriss, K., Lund-Larsen, T., Steibl, S., Borkar, S., De, V., Wijngaart, R.V.D., Mattson, T.: A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS. In: 2010 IEEE International Solid-State Circuits Conference—(ISSCC), pp. 108–109 (2010). https://doi.org/10.1109/ISSCC.2010.5434077
Mittal, S.: A survey on evaluating and optimizing performance of Intel Xeon Phi. Practice and Experience, Concurrency and Computation (2020)
Siegl, P., Buchty, R., Berekovic, M.: Data-centric computing frontiers: a survey on processing-in-memory. In: Jacob, B. (ed) Proceedings of the Second International Symposium on Memory Systems, MEMSYS 2016, Alexandria, VA, USA, 2016, ACM, pp. 295–308 (2016). https://doi.org/10.1145/2989081.2989087
Kogge, P.: Memory Intensive Computing, the 3rd Wall, and the Need for Innovation in Architecture. (2017) https://memsys.io/wp-content/uploads/2017/12/The_Wall.pdf
Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., Schröder-Preikschat, W.: OctoPOS: a parallel operating system for invasive computing. In: Proceedings of the International Workshop on Systems for Future Multi-Core Architectures. EuroSys, pp. 9–14 (2011)
Kranz, D.A., Johnson, K.L., Agarwal, A., Kubiatowicz, J., Lim, B.: Integrating message-passing and shared-memory: early experience. In: Chen, M.C., Halstead, R. (eds) Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), San Diego, California, USA, 1993, ACM, pp. 54–63, (1993). https://doi.org/10.1145/155332.155338
Moir, M., Shavit, N.: Concurrent data structures. In: Handbook of Data Structures and Applications (2004)
MPI Forum: MPI: A Message Passing Interface Standard Version 3.1 (2015). https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
Corbet, J.: Ringing in a new asynchronous I/O API. (2019) https://lwn.net/Articles/776703/
Michael, M.M., Scott, M.L.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: ACM Symposium on Principles of Distributed Computing, pp. 267–275 (1996). https://doi.org/10.1145/248052.248106
Wang, Y., Wang, R., Herdrich, A., Tsai, J., Solihin, Y.: CAF: core to core communication acceleration framework. In: Conference on Parallel Architectures and Compilation (PACT), pp. 351–362 (2016). https://doi.org/10.1145/2967938.2967954
Lee, S., Tiwari, D., Solihin, Y., Tuck, J.: HAQu: hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. In: Conference on High-Performance Computer Architecture (HPCA), pp. 99–110 (2011). https://doi.org/10.1109/HPCA.2011.5749720
Petrovic, D., Ropars, T., Schiper, A.: Leveraging hardware message passing for efficient thread synchronization. TOPC 2(4), 24:1–24:26 (2016). https://doi.org/10.1145/2858652
Article Google Scholar
Sánchez, D., Yoo, R.M., Kozyrakis, C.: Flexible architectural support for fine-grain scheduling. In: ASPLOS Conference Proceedings, pp. 311–322 (2010). https://doi.org/10.1145/1736020.1736055
Lee, J., Nicopoulos, C., Lee, H.G., Panth, S., Lim, S.K., Kim, J.: IsoNet: hardware-based job queue management for many-core architectures. IEEE Trans. VLSI Syst. 21(6), 1080–1093 (2013). https://doi.org/10.1109/TVLSI.2012.2202699
Article Google Scholar
Pujari, R.K., Wild, T., Herkersdorf, A.: TCU: a multi-objective hardware thread mapping unit for HPC clusters. In: High Performance Computing, ISC, pp. 39–58 (2016). https://doi.org/10.1007/978-3-319-41321-1_3
Kumar, S., Hughes, C.J., Nguyen, A.D.: Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In: Symposium on Computer Architecture (ISCA), pp. 162–173 (2007). https://doi.org/10.1145/1250662.1250683
Sharma, R.R., Rajasekhar, Y., Sass, R.: Exploring hardware work queue support for lightweight threads in MPSoCs. In: Conference on Reconfigurable Computing and FPGAs (ReConFig), pp. 1–6 (2012). https://doi.org/10.1109/ReConFig.2012.6416747
Brewer, E.A., Chong, F.T., Liu, L.T., Sharma, S.D., Kubiatowicz, J.: Remote queues: exposing message queues for optimization and atomicity. In: ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 42–53 (1995). https://doi.org/10.1145/215399.215416
Rheindt, S., Schenk, A., Srivatsa, A., Wild, T., Herkersdorf, A.: CaCAO: complex and compositional atomic operations for NoC-based manycore platforms. In: Conference on Architecture of Computing Systems (ARCS), pp 139–152 (2018). https://doi.org/10.1007/978-3-319-77610-1_11
Rheindt, S., Maier, S., Schmaus, F., Wild, T., Schröder-Preikschat, W., Herkersdorf, A.: SHARQ: software-defined hardware-managed queues for tile-based manycore architectures. In: International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIX), Springer, Samos, Greece, pp. 212–225 (2019). https://doi.org/10.1007/978-3-030-27562-4_15
Schmaus, F., Maier, S., Langer, T., Rabenstein, J., Hönig, T., Bauer, L., Henkel, J., Schröder-Preikschat, W.: System software for resource arbitration on future many-* architectures. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp. 967–975 (2020). https://doi.org/10.1109/IPDPSW50202.2020.00160
Moerman, F.: Open event machine: a multi-core run-time designed for performance. In: 2014 6th European Embedded Design in Education and Research Conference (EDERC), pp. 41–45 (2014)
Cataldo, R., Fernandes, R., Martin, K.J.M., Sepulveda, J., Susin, A., Marcon, C., Diguet, J.: Subutai: distributed synchronization primitives in NoC interfaces for legacy parallel-applications. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC) (2018)
Zaib, A., Wild, T., Herkersdorf, A., Heisswolf, J., Becker, J., Weichslgartner, A., Teich, J.: Efficient task spawning for shared memory and message passing in many-core architectures. J. Syst. Archit. - Embed. Syst. Des. 77, 72–82 (2017). https://doi.org/10.1016/j.sysarc.2017.03.004
Article Google Scholar
Heisswolf, J., Zaib. A., Weichslgartner, A., Karle, M., Singh, M., Wild, T., Teich, J., Herkersdorf, A., Becker, J.: The invasive network on chip: a multi-objective many-core communication infrastructure. In: Conference on Architecture of Computing Systems (ARCS), Workshop Proceedings, pp. 1–8 (2014)
HkJ, Chu, et al.: Zero-copy TCP in solaris. USENIX Annu. Tech. Conf. 1, 253–264 (1996)
Google Scholar
Intel Corporation: Intel 82574 GbE Controller Family—Datasheet. www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf, rev. 3.4 (2014)
Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., et al.: The NAS parallel benchmarks. Int. J. Supercomput. Appl. 5(3), 63–73 (1991)
Google Scholar
Subhlok, J., Venkataramaiah, S., Singh, A.: Characterizing NAS benchmark performance on shared heterogeneous networks. In: Parallel and Distributed Processing Symposium (IPDPS) (2002). https://doi.org/10.1109/IPDPS.2002.1015659
Maier, S., Hönig, T., Wägemann, P., Schröder-Preikschat, W.: Asynchronous abstract machines: anti-noise system software for many-core processors. In: Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), ACM, pp. 19–26 (2019). https://doi.org/10.1145/3322789.3328744

Download references