
Slurm simulator: Improving Slurm scheduler performance on large HPC systems by utilization of multiple controllers and node sharing

  • SUNY Buffalo

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

9 Scopus citations

Abstract

A Slurm simulator was used to study the potential benefits of using multiple Slurm controllers and node sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub-clusters with separate Slurm controllers can offer better scheduling performance and better responsiveness, because the added computational capability increases the efficiency of the backfill scheduler. The disadvantages are additional hardware, more maintenance, and the inability to run jobs across the sub-clusters. Node sharing can increase system throughput by allowing several sub-node jobs to be executed on the same node. However, node sharing is more computationally demanding and might not be advantageous on larger systems. The Slurm simulator allows the potential benefits of these configurations to be estimated and provides information on the advantages to be expected from deploying them. In this work, multiple Slurm controllers and node sharing were tested on a TACC Stampede 2 system consisting of two distinct node types: 4,200 Intel Xeon Phi Knights Landing (KNL) nodes and 1,736 Intel Xeon Skylake-X (SLX) nodes. For this system, using separate controllers for the KNL and SLX nodes, with node sharing allowed on the SLX nodes, resulted in a 40% reduction in waiting times for jobs executed on the SLX nodes. This improvement can be attributed to the better performance of the backfill scheduler: it scheduled 30% more SLX jobs, showed a 30% reduction in the fraction of cycles that hit the time limit, and nearly doubled the number of job scheduling attempts.

Original language: English
Title of host publication: Practice and Experience in Advanced Research Computing 2018
Subtitle of host publication: Seamless Creativity, PEARC 2018
Publisher: Association for Computing Machinery
ISBN (Print): 9781450364461
DOIs
State: Published - Jul 22 2018
Event: 2018 Practice and Experience in Advanced Research Computing Conference: Seamless Creativity, PEARC 2018 - Pittsburgh, United States
Duration: Jul 22 2018 to Jul 26 2018

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2018 Practice and Experience in Advanced Research Computing Conference: Seamless Creativity, PEARC 2018
Country/Territory: United States
City: Pittsburgh
Period: 07/22/18 to 07/26/18

Keywords

  • HPC
  • Scheduling
  • Simulation
  • Workload
