A Stagewise Hyperparameter Scheduler to Improve Generalization

  • University of Virginia
  • University of Michigan, Ann Arbor

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Scopus citations

Abstract

Stochastic gradient descent (SGD) augmented with various momentum variants (e.g., heavy ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many learning tasks. Tuning the optimizer's hyperparameters is arguably the most time-consuming part of model training. Many new momentum variants, despite their empirical advantage over classical SHB/NAG, introduce even more hyperparameters to tune. Automating this tedious and error-prone tuning is essential for AutoML. This paper focuses on how to efficiently tune a large class of multistage momentum variants to improve generalization. We use the general formulation of quasi-hyperbolic momentum (QHM) and extend "constant and drop", the widespread learning rate α scheduler in which α is set large initially and then dropped every few epochs, to other hyperparameters (e.g., batch size b, momentum parameter β, instant discount factor ν). Multistage QHM is a unified framework that covers a large family of momentum variants as special cases (e.g., vanilla SGD/SHB/NAG). Existing works mainly focus on scheduling α's decay, whereas multistage QHM additionally allows hyperparameters such as b, β, and ν to vary across stages, and demonstrates better generalization ability than tuning α alone. Our tuning strategies have rigorous justifications rather than relying on blind trial and error. We theoretically prove why our tuning strategies could improve generalization. We also show the convergence of multistage QHM for general nonconvex objective functions. Our strategies simplify the tuning process and empirically beat competitive optimizers in test accuracy.
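
As a rough illustration of the setup the abstract describes, the sketch below applies the standard QHM update (momentum buffer g ← βg + (1 − β)∇, then θ ← θ − α[(1 − ν)∇ + νg]) under a stagewise "constant and drop" schedule. It is not the paper's implementation: the `Stage` container, the function names, the toy quadratic objective, and every schedule value are assumptions chosen for illustration, the gradient is deterministic rather than stochastic, and batch size b is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Hyperparameters held constant within one stage (hypothetical container)."""
    epochs: int
    alpha: float  # learning rate
    beta: float   # momentum parameter
    nu: float     # instant discount factor


def qhm_step(theta, buf, grad, alpha, beta, nu):
    """One QHM update on a scalar parameter."""
    # Exponentially weighted average of past gradients.
    buf = beta * buf + (1.0 - beta) * grad
    # Step along a nu-weighted mix of the raw gradient and the buffer.
    theta = theta - alpha * ((1.0 - nu) * grad + nu * buf)
    return theta, buf


def run_multistage_qhm(grad_fn, theta0, stages, steps_per_epoch=10):
    """Run QHM, switching hyperparameters only at stage boundaries."""
    theta, buf = theta0, 0.0
    for stage in stages:
        for _ in range(stage.epochs * steps_per_epoch):
            grad = grad_fn(theta)
            theta, buf = qhm_step(theta, buf, grad,
                                  stage.alpha, stage.beta, stage.nu)
    return theta


# Toy objective f(theta) = 0.5 * theta**2, whose gradient is theta.
# Schedule values are illustrative only: alpha starts large and is dropped.
stages = [
    Stage(epochs=5, alpha=0.5, beta=0.9, nu=0.7),
    Stage(epochs=5, alpha=0.05, beta=0.9, nu=0.7),
]
theta_final = run_multistage_qhm(lambda t: t, 4.0, stages)
```

In this parameterization, setting ν = 0 reduces the step to plain gradient descent, ν = 1 gives a normalized form of heavy ball momentum, and ν = β gives a normalized form of NAG, which is why a single update rule can stand in for the special cases the abstract lists.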

Original language: English
Title of host publication: KDD 2021 - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 1530-1540
Number of pages: 11
ISBN (Electronic): 9781450383325
DOIs
State: Published - Aug 14 2021
Event: 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021 - Virtual, Online, Singapore
Duration: Aug 14 2021 to Aug 18 2021

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021
Country/Territory: Singapore
City: Virtual, Online
Period: 08/14/21 to 08/18/21

Keywords

  • deep learning optimization
  • generalization bound
  • hyperparameter tuning
  • multistage qhm
  • sgd momentum
