
Continuous HPC Performance Monitoring: Can It Run Without Affecting User Jobs?

  • SUNY Buffalo

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

High-performance computing (HPC) resources are used in a wide range of scientific and engineering calculations. These resources have high initial and running costs, so keeping them at optimal performance is crucial. There are a number of strategies to ensure this. One of them is continuous performance monitoring, in which a set of applications with fixed input parameters is executed regularly to identify performance issues proactively. Some sites hesitate to use such a strategy because it takes CPU cycles away from actual users. The goal of this work is to identify node availability, both size- and time-wise, on busy HPC systems. Such availability spots can be used to tailor test jobs so that user impact is minimized. Two systems were analyzed: a small one (118 nodes) from the Center for Computational Research at the University at Buffalo and a large one (1,160 nodes) from the Texas Advanced Computing Center. It was found that on days with 90% utilization and above, there are plenty of opportunities for test jobs. For example, on the small cluster, 8 nodes for 30 minutes are available for an average of 2.3 hours throughout the day; that is, for 9.6% of the day the scheduler has the opportunity to schedule such a job. On the large system, 32 nodes for 30 minutes were available on average 9.2 hours a day (or 38% of the day). Thus, there is room for test jobs, but it is not evident that the scheduler can take advantage of it, so a proper strategy must be used, for example, lowering test job priorities.
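The availability figures above can be illustrated with a minimal sketch. It assumes a hypothetical trace of idle-node counts sampled at a fixed interval and counts the sample times at which a test job of a given size and length could start and still fit. The function names, the trace format, and the rule "enough nodes idle in every sample the job spans" are assumptions made for illustration; the abstract does not describe the paper's actual method, and a real analysis would also need to check that the same physical nodes stay idle.

```python
def startable_minutes(idle_nodes, step_min, nodes_needed, duration_min):
    """Minutes of the trace at which a test job could start and fit.

    idle_nodes   -- idle-node count per sample (hypothetical trace format)
    step_min     -- sampling interval in minutes
    nodes_needed -- size of the test job in nodes
    duration_min -- length of the test job in minutes
    """
    # Number of consecutive samples the job must span (ceiling division).
    samples_needed = -(-duration_min // step_min)
    run = 0      # consecutive qualifying samples from this one onward
    starts = 0   # sample times at which the job would fit
    # Walk backwards so `run` counts the qualifying stretch that begins here.
    for idle in reversed(idle_nodes):
        run = run + 1 if idle >= nodes_needed else 0
        if run >= samples_needed:
            starts += 1
    return starts * step_min


def share_of_day(hours):
    """Convert hours of availability into a percentage of a 24-hour day."""
    return hours / 24 * 100


# Made-up 4-hour trace sampled every 10 minutes: one hour with 8 idle nodes.
trace = [0] * 10 + [8] * 6 + [0] * 8
print(startable_minutes(trace, 10, 8, 30))  # 40

# Percentages quoted in the abstract, recomputed from the hours.
print(round(share_of_day(2.3), 1))  # 9.6
print(round(share_of_day(9.2)))     # 38
```

In the made-up trace, the one-hour idle stretch yields only 40 startable minutes, because a 30-minute job that starts in the last 20 minutes of the stretch would not finish before the nodes are taken.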

Original language: English
Title of host publication: PEARC 2025 - Practice and Experience in Advanced Research Computing 2025
Subtitle of host publication: The Power of Collaboration
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400713989
DOIs
State: Published - Jul 18 2025
Event: 2025 Practice and Experience in Advanced Research Computing, PEARC 2025 - Columbus, United States
Duration: Jul 20 2025 to Jul 24 2025

Publication series

Name: PEARC 2025 - Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration

Conference

Conference: 2025 Practice and Experience in Advanced Research Computing, PEARC 2025
Country/Territory: United States
City: Columbus
Period: 07/20/25 to 07/24/25

Keywords

  • HPC
  • Scheduling
  • Simulation
  • Slurm
  • Workload
