Optimality of training/test size and resampling effectiveness in cross-validation

  • SUNY Buffalo

Research output: Contribution to journal › Article › peer-review

61 Scopus citations

Abstract

An important question in cross-validation (CV) is whether rules can be established to allow optimal sample size selection of the training/test set, for fixed values of the total sample size n. We study the cases of repeated train–test CV and k-fold CV for certain decision rules that are used frequently. We begin by defining the resampling effectiveness of repeated train–test CV estimators of the generalization error and study its relation to optimal training sample size selection. We then define optimality via simple statistical rules that allow us to select the optimal training sample size and the number of folds. We show that: (1) there exist decision rules for which closed form solutions of the optimal training/test sample size can be obtained; (2) in a broad class of loss functions the optimal training sample size equals half of the total sample size, independently of the data distribution and the data analytic task. We study optimal selection of the number of folds in k-fold CV and address the case of classification via logistic regression and support vector machines, substantiating our claims theoretically and empirically in both small and large sample sizes. We contrast our results with standard practice in the use of CV.
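
As a rough illustration of the setting described in the abstract, the sketch below computes a repeated train–test CV estimate of the generalization error for a fixed total sample size n and several training sizes, then measures by simulation how much that estimate varies across datasets. It does not reproduce the paper's definitions of resampling effectiveness or optimality, which the abstract does not state. The decision rule (the training-sample mean), the squared-error loss, the Gaussian data, and the function names `repeated_train_test_cv` and `estimator_variance` are all assumptions chosen for illustration.

```python
import random
import statistics


def repeated_train_test_cv(data, n_train, n_repeats, rng):
    """Average held-out loss over random train/test splits of `data`.

    Assumed decision rule: predict every test point with the training mean.
    Assumed loss: squared error.
    """
    split_errors = []
    for _ in range(n_repeats):
        shuffled = data[:]  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        prediction = statistics.fmean(train)
        losses = [(y - prediction) ** 2 for y in test]
        split_errors.append(statistics.fmean(losses))
    return statistics.fmean(split_errors)


def estimator_variance(n, n_train, n_repeats, n_datasets, seed=0):
    """Monte Carlo variance of the CV estimate across simulated datasets.

    Each dataset holds n draws from a standard normal (an assumption).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_datasets):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(repeated_train_test_cv(data, n_train, n_repeats, rng))
    return statistics.variance(estimates)


if __name__ == "__main__":
    n = 20  # total sample size, held fixed
    for n_train in (5, 10, 15):
        var = estimator_variance(n, n_train, n_repeats=50, n_datasets=300)
        print(f"n_train={n_train:2d}  variance of CV estimate: {var:.4f}")
```

Comparing the printed variances across training sizes is one informal way to explore how the train/test split affects the estimator; whether the pattern matches the paper's result depends on its precise optimality criterion.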

Original language: English
Pages (from-to): 286-301
Number of pages: 16
Journal: Journal of Statistical Planning and Inference
Volume: 199
State: Published - Mar 2019

Keywords

  • Cross-validation estimator
  • Generalization error
  • Optimality
  • Resampling effectiveness
  • Training sample size
