TY - GEN
T1 - REVISITING LARGE-SCALE NON-CONVEX DISTRIBUTIONALLY ROBUST OPTIMIZATION
AU - Zhang, Qi
AU - Zhou, Yi
AU - Khan, Simon
AU - Prater-Bennette, Ashley
AU - Shen, Lixin
AU - Zou, Shaofeng
N1 - Publisher Copyright:
© 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Distributionally robust optimization (DRO) is a powerful technique to train robust machine learning models that perform well under distribution shifts. Compared with empirical risk minimization (ERM), DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions. This paper revisits the important problem of DRO with non-convex smooth loss functions. For this problem, Jin et al. (2021) showed that its dual problem is generalized (L0, L1)smooth condition and gradient noise satisfies the affine variance condition, designed an algorithm of mini-batch normalized gradient descent with momentum, and proved its convergence and complexity. In this paper, we show that the dual problem and the gradient noise satisfy simpler yet more precise partially generalized smoothness condition and partially affine variance condition by studying the optimization variable and dual variable separately, which further yields much simpler convergence analysis. We develop a double stochastic gradient descent with clipping (D-SGD-C) algorithm that converges to an ϵ-stationary point with O(ϵ−4) gradient complexity, which matches with results in Jin et al. (2021). Our proof is much simpler, thanks to the more precise characterization of partially generalized smoothness and partially affine variance noise. We further design a variance-reduced method that achieves a lower gradient complexity of O(ϵ−3). Our theoretical results and insights are further verified numerically on a number of tasks, and our algorithms outperform the existing DRO method (Jin et al., 2021).
AB - Distributionally robust optimization (DRO) is a powerful technique to train robust machine learning models that perform well under distribution shifts. Compared with empirical risk minimization (ERM), DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions. This paper revisits the important problem of DRO with non-convex smooth loss functions. For this problem, Jin et al. (2021) showed that its dual problem is generalized (L0, L1)smooth condition and gradient noise satisfies the affine variance condition, designed an algorithm of mini-batch normalized gradient descent with momentum, and proved its convergence and complexity. In this paper, we show that the dual problem and the gradient noise satisfy simpler yet more precise partially generalized smoothness condition and partially affine variance condition by studying the optimization variable and dual variable separately, which further yields much simpler convergence analysis. We develop a double stochastic gradient descent with clipping (D-SGD-C) algorithm that converges to an ϵ-stationary point with O(ϵ−4) gradient complexity, which matches with results in Jin et al. (2021). Our proof is much simpler, thanks to the more precise characterization of partially generalized smoothness and partially affine variance noise. We further design a variance-reduced method that achieves a lower gradient complexity of O(ϵ−3). Our theoretical results and insights are further verified numerically on a number of tasks, and our algorithms outperform the existing DRO method (Jin et al., 2021).
UR - https://www.scopus.com/pages/publications/105010221501
M3 - Conference contribution
AN - SCOPUS:105010221501
T3 - 13th International Conference on Learning Representations, ICLR 2025
SP - 50363
EP - 50389
BT - 13th International Conference on Learning Representations, ICLR 2025
PB - International Conference on Learning Representations, ICLR
T2 - 13th International Conference on Learning Representations, ICLR 2025
Y2 - 24 April 2025 through 28 April 2025
ER -