TY - GEN
T1 - Evaluating characteristics of CUDA communication primitives on high-bandwidth interconnects
AU - Pearson, Carl
AU - Dakkak, Abdul
AU - Hashash, Sarah
AU - Li, Cheng
AU - Chung, I. Hsin
AU - Xiong, Jinjun
AU - Hwu, Wen Mei
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/4/4
Y1 - 2019/4/4
N2 - Data-intensive applications such as machine learning and analyt-ics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominating modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches, clock frequencies, and avoids measuring synchronization costs imposed by the measurement methodology where possible. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights for how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.
AB - Data-intensive applications such as machine learning and analyt-ics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominating modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches, clock frequencies, and avoids measuring synchronization costs imposed by the measurement methodology where possible. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights for how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.
KW - Benchmarking
KW - CUDA
KW - GPU
KW - NUMA
KW - NVLink
KW - POWER
KW - X86
UR - https://www.scopus.com/pages/publications/85064824155
U2 - 10.1145/3297663.3310299
DO - 10.1145/3297663.3310299
M3 - Conference contribution
AN - SCOPUS:85064824155
T3 - ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
SP - 209
EP - 218
BT - ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
PB - Association for Computing Machinery, Inc
T2 - 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019
Y2 - 7 April 2019 through 11 April 2019
ER -