TY - GEN
T1 - Predictive Modeling of HPC Job Queue Times
T2 - 2025 Practice and Experience in Advanced Research Computing, PEARC 2025
AU - Gaikwad, Bipin
AU - Simakov, Nikolay A.
AU - Furlani, Thomas
AU - White, Joseph Patrick
AU - Patra, Abani
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s)
PY - 2025/7/18
Y1 - 2025/7/18
AB - This work presents a framework for estimating job wait times in High-Performance Computing (HPC) scheduling queues, leveraging historical job scheduling data and real-time system metrics. Using machine learning techniques, specifically Random Forest and Multi-Layer Perceptron (MLP) models, we demonstrate high accuracy in predicting wait times, achieving 94.2% reliability within a 10-minute error margin. The framework incorporates key features such as requested resources, queue occupancy, and system utilization, with ablation studies revealing the significance of these features. Additionally, the framework offers users wait time estimates for different resource configurations, enabling them to select optimal resources, reduce delays, and accelerate computational workloads. Our approach provides valuable insights for both users and administrators to optimize job scheduling, contributing to more efficient resource management and faster time to scientific results.
UR - https://www.scopus.com/pages/publications/105013079306
U2 - 10.1145/3708035.3736067
DO - 10.1145/3708035.3736067
M3 - Conference contribution
AN - SCOPUS:105013079306
T3 - PEARC 2025 - Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration
BT - PEARC 2025 - Practice and Experience in Advanced Research Computing 2025
PB - Association for Computing Machinery, Inc
Y2 - 20 July 2025 through 24 July 2025
ER -