TY - GEN
T1 - Continuous HPC Performance Monitoring
T2 - 2025 Practice and Experience in Advanced Research Computing, PEARC 2025
AU - Joseph, Jyothismaria
AU - Simakov, Nikolay A.
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s)
PY - 2025/7/18
Y1 - 2025/7/18
AB - High-performance computing (HPC) resources are used in a wide range of scientific and engineering calculations. Because these resources have high initial and running costs, keeping them performing optimally is crucial. One of several strategies for doing so is continuous performance monitoring, in which a set of applications and their input parameters is run regularly to identify performance issues proactively. Some sites hesitate to adopt this strategy because it takes CPU cycles away from actual users. The goal of this work is to identify node availability, in terms of both node count and duration, on busy HPC systems. Such availability windows can be used to tailor test jobs so that user impact is minimized. Two systems were analyzed: a small one with 118 nodes at the Center for Computational Research at the University at Buffalo, and a large one with 1,160 nodes at the Texas Advanced Computing Center. Even on days with utilization of 90% and above, there are plenty of opportunities for test jobs. For example, on the small cluster, 8 nodes for 30 minutes were available for an average of 2.3 hours per day; that is, for 9.6% of the day the scheduler has the opportunity to schedule such a job. On the large system, 32 nodes for 30 minutes were available for an average of 9.2 hours per day (38% of the day). Thus, there is room for test jobs, but it is not evident that the scheduler can take advantage of it, so a proper strategy must be used, for example, lowering test job priorities.
KW - HPC
KW - Scheduling
KW - Simulation
KW - Slurm
KW - Workload
UR - https://www.scopus.com/pages/publications/105013080679
U2 - 10.1145/3708035.3736036
DO - 10.1145/3708035.3736036
M3 - Conference contribution
AN - SCOPUS:105013080679
T3 - PEARC 2025 - Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration
BT - PEARC 2025 - Practice and Experience in Advanced Research Computing 2025
PB - Association for Computing Machinery, Inc
Y2 - 20 July 2025 through 24 July 2025
ER -