TY - GEN
T1 - Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources
AU - Ward, Logan
AU - Pauloski, J. Gregory
AU - Hayot-Sasson, Valerie
AU - Chard, Ryan
AU - Babuji, Yadu
AU - Sivaraman, Ganesh
AU - Choudhury, Sutanay
AU - Chard, Kyle
AU - Thakur, Rajeev
AU - Foster, Ian
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Applications that fuse machine learning and simulation can benefit from the use of multiple computing resources, with, for example, simulation codes running on highly parallel supercomputers and AI training and inference tasks on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such heterogeneous systems. A unique aspect of our approach is our use of cloud-hosted management services to manage challenging aspects of cross-resource authentication and authorization, function-as-a-service (FaaS) function invocation, and data transfer. We show that these methods can achieve performance parity with systems that rely on direct connection between resources. We achieve parity by integrating the FaaS system and data transfer capabilities with a system that passes data by reference among managers and workers, and a user-configurable steering algorithm to hide data transfer latencies. We anticipate that this ease of use can enable routine use of heterogeneous resources in computational science.
KW - Computational Steering
KW - Distributed Systems
KW - Function-as-a-Service
KW - Heterogeneous Computing
KW - Machine Learning
UR - https://www.scopus.com/pages/publications/85169299864
U2 - 10.1109/IPDPSW59300.2023.00018
DO - 10.1109/IPDPSW59300.2023.00018
M3 - Conference contribution
AN - SCOPUS:85169299864
T3 - 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023
SP - 32
EP - 41
BT - 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023
Y2 - 15 May 2023 through 19 May 2023
ER -