Comprehensive, open-source resource usage measurement and analysis for HPC systems

  • University of Texas at Austin
  • SUNY Buffalo
  • National Institute of Standards and Technology
  • Indiana University Bloomington

Research output: Contribution to journal › Article › peer-review

16 Scopus citations

Abstract

The important role that high-performance computing (HPC) resources play in science and engineering research, coupled with their high cost (capital, power and manpower), short life and oversubscription, requires us to optimize their usage. That outcome is only possible if adequate analytical data are collected and used to drive systems management at different granularities: job, application, user and system. This paper presents a method for comprehensive job-, application- and system-level resource use measurement and analysis, together with its implementation. The steps in the method are (1) system-wide collection of comprehensive resource use and performance statistics at the job and node levels, in a uniform format across all resources, and (2) mapping and storage of the resultant job-wise data in a relational database, which enables further implementation and transformation of the data into the formats required by specific statistical and analytical algorithms. Analyses can be carried out at different levels of granularity: job, user, application or system-wide. Measurements are based on a new lightweight, job-centric measurement tool, 'TACC-Stats', which gathers a comprehensive set of resource use metrics on all compute nodes, and on data logged by the system scheduler. The data mapping and analysis tools are an extension of the XDMoD project. The method is illustrated with analyses of resource use for the Texas Advanced Computing Center's Lonestar4, Ranger and Stampede supercomputers and the HPC cluster at the Center for Computational Research. The illustrations focus on resource use at the system, job and application levels; they reveal system usage patterns as well as anomalous behavior due to failure or misuse. The method can be applied to any system that runs the TACC-Stats measurement tool and a tool to extract job execution environment data from the system scheduler.
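
The two steps described in the abstract (storing job-wise data in a relational database, then analyzing it at job, user, application or system granularity) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `node_usage` table, its columns, the sample rows, and the `build_db` and `summarize` helpers are hypothetical and do not reflect the actual TACC-Stats output format or the XDMoD schema.

```python
import sqlite3

# Hypothetical per-node records, one row per (job, node):
# (job_id, username, application, node, cpu_seconds, mem_gb)
NODE_SAMPLES = [
    ("job1", "alice", "namd", "n001", 3600.0, 12.0),
    ("job1", "alice", "namd", "n002", 3400.0, 11.0),
    ("job2", "bob", "wrf", "n003", 1800.0, 20.0),
    ("job3", "alice", "wrf", "n004", 900.0, 8.0),
]

# Granularities that map directly onto a grouping column (illustrative names).
GROUP_COLUMNS = {"job": "job_id", "user": "username", "application": "application"}


def build_db(samples):
    """Store job-wise node records in an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node_usage ("
        "job_id TEXT, username TEXT, application TEXT, "
        "node TEXT, cpu_seconds REAL, mem_gb REAL)"
    )
    conn.executemany("INSERT INTO node_usage VALUES (?, ?, ?, ?, ?, ?)", samples)
    conn.commit()
    return conn


def summarize(conn, level):
    """Return total CPU seconds and peak per-node memory at the given granularity."""
    if level == "system":
        row = conn.execute(
            "SELECT SUM(cpu_seconds), MAX(mem_gb) FROM node_usage"
        ).fetchone()
        return {"system": row}
    # The column name comes from a fixed whitelist, never from user input.
    column = GROUP_COLUMNS[level]
    query = (
        f"SELECT {column}, SUM(cpu_seconds), MAX(mem_gb) "
        f"FROM node_usage GROUP BY {column} ORDER BY {column}"
    )
    return {key: (cpu, mem) for key, cpu, mem in conn.execute(query)}


if __name__ == "__main__":
    conn = build_db(NODE_SAMPLES)
    for level in ("job", "user", "application", "system"):
        print(level, summarize(conn, level))
```

Keeping the node-level rows in one table and choosing the grouping column at query time is only one possible design; it is used here because it lets the same data answer questions at every granularity the abstract lists.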

Original language: English
Pages (from-to): 2191-2209
Number of pages: 19
Journal: Concurrency and Computation: Practice and Experience
Volume: 26
Issue number: 13
State: Published - Sep 10 2014

Keywords

  • HPC resource management
  • performance analysis
  • SUPReMM
  • system management
  • TACC-Stats
  • usage analysis
  • XDMoD
  • XSEDE
