An optimized RDMA QP communication mechanism for hyperscale AI infrastructure
References
Chen, G., Lu, Y., Li, B., et al.: MP-RDMA: enabling RDMA with multi-path transport in datacenters. IEEE/ACM Trans. Netw. 27(6), 2308–2323 (2019)
Choi, M., Lee, S., Kim, Y.: UD-assisted multi-path transport in RDMA. In: 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), 2022, pp. 127–129. IEEE (2022). https://doi.org/10.1109/ICTC55196.2022.9952631
Fent, P., van Renen, A., Kipf, A., et al.: Low-latency communication for fast DBMS using RDMA and shared memory. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), 2020, pp. 1477–1488. IEEE (2020)
Guo, Z., Liu, S., Zhang, Z.L.: Traffic control for RDMA-enabled data center networks: a survey. IEEE Syst. J. 14(1), 677–688 (2019)
He, Z., Chen, Y., Hua, B.: RoUD: scalable RDMA over UD in lossy data center networks. In: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2023, pp. 36–46. IEEE (2023)
Hu, J., Zeng, C., Wang, Z., et al.: Load balancing in PFC-enabled datacenter networks. In: Proceedings of the 6th Asia–Pacific Workshop on Networking, 2022, pp. 21–28 (2022)
Jia, C., Liu, J., Jin, X., et al.: Improving the performance of distributed TensorFlow with RDMA. Int. J. Parallel Program. 46, 674–685 (2018)
Kang, N., Wang, Z., Yang, F., et al.: csRNA: connection-scalable RDMA NIC architecture in datacenter environment. In: 2022 IEEE 40th International Conference on Computer Design (ICCD), 2022, pp. 398–406. IEEE (2022)
Lee, S., Kim, Y., Woo, H., et al.: Efficient user-level multi-path utilization in RDMA networks. IEEE Access 9, 127619–127629 (2021)
Lu, Y., Chen, G., Li, B., et al.: Multi-path transport for RDMA in datacenters. In: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 357–371 (2018)
Ma, S., Ma, T., Chen, K., et al.: A survey of storage systems in the RDMA era. IEEE Trans. Parallel Distrib. Syst. 33(12), 4395–4409 (2022)
Ma, T., Chen, K., Ma, S., et al.: Thinking more about RDMA memory semantics. In: 2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 456–467. IEEE (2021)
Park, J., Son, Y., Yeom, H.Y., et al.: SoftDC: software-based dynamically connected transport. Clust. Comput. 23, 347–357 (2020)
Pathak, A.R., Pandey, M., Rautaray, S.S.: Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation. Clust. Comput. J. Netw. Softw. Tools Appl. 23(2), 953–988 (2020). https://doi.org/10.1007/s10586-019-02960-y
Shen, D., Luo, J., Dong, F., et al.: Enabling distributed and optimal RDMA resource sharing in large-scale data center networks: modeling, analysis, and implementation. IEEE/ACM Trans. Netw. 31(6), 2745–2760 (2023)
Shu, J., Chen, Y., Wang, Q., et al.: TH-DPMS: design and implementation of an RDMA-enabled distributed persistent memory storage system. ACM Trans. Storage 16(4), 1–31 (2020)
Sur, S., Jin, H.W., Chai, L., et al.: RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006, pp. 32–39 (2006)
Tang, J., Wang, X., Dai, H.: Scalable RDMA transport with efficient connection sharing. In: IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 2023, pp. 1–10. IEEE (2023)
Taranov, K., Di Girolamo, S., Hoefler, T.: CoRM: compactable remote memory over RDMA. In: Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 1811–1824 (2021)
Wang, J., Lin, B.: RDMA reliability evaluation model for large-scale data center networks. In: 2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI), 2023, pp. 342–347. IEEE (2023)
Wang, X., Chen, G., Yin, X., et al.: StaR: Breaking the scalability limit for RDMA. In: 2021 IEEE 29th International Conference on Network Protocols (ICNP), 2021, pp. 1–11. IEEE (2021)
Wang, Z., Luo, L., Ning, Q., et al.: SRNIC: a scalable architecture for RDMA NICs. In: 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 1–14 (2023)