Generalization

Introduction

Generalization concerns the question: If a neural network performs well on a data set, does it perform well on previously unseen data which come from the same kind of problem?

To make this question precise, we typically assume that there exists a true underlying distribution μ from which we sample data points (x_i, y_i). We consider the empirical risk

    \hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell\big(h(\theta, x_i), y_i\big)

and the population risk

    R(\theta) = \mathbb{E}_{(x,y) \sim \mu}\big[\ell(h(\theta, x), y)\big].

Depending on the literature, these quantities may also be referred to as losses rather than risks; we typically reserve the term 'loss' for the loss function ℓ, which measures the (dis)similarity between the true label y_i and the output of the neural network. On the optimization side, we often talk about minimizing the loss, which corresponds to the empirical risk.

The question becomes: If the empirical risk is low, is the population risk low as well?
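
To make the two quantities concrete, here is a minimal numerical sketch in which the one-dimensional data distribution, the squared loss and the linear toy 'network' are all made up for illustration: the empirical risk is an average over a finite sample, while the population risk is approximated by a Monte-Carlo average over a very large fresh sample.

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(prediction, y):
        # squared loss as the loss function ell
        return (prediction - y) ** 2

    def model(theta, x):
        # toy stand-in for a neural network: a single linear unit
        return theta * x

    def empirical_risk(theta, xs, ys):
        return np.mean(loss(model(theta, xs), ys))

    def sample(n):
        # toy data distribution mu: x ~ N(0, 1), y = 2x + small noise
        x = rng.normal(size=n)
        return x, 2.0 * x + 0.1 * rng.normal(size=n)

    theta = 1.8
    x_train, y_train = sample(50)        # a small data set
    x_pop, y_pop = sample(1_000_000)     # stands in for the distribution mu

    print("empirical risk: ", empirical_risk(theta, x_train, y_train))
    print("population risk:", empirical_risk(theta, x_pop, y_pop))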

A typical assumption (which may or may not be reasonable in practice) is that the (x_i, y_i) are iid samples from the distribution μ. This is often 'approximately' satisfied if we obtain data in a natural fashion. Whether such an object as μ exists, and on which space it lives, is sometimes clear and sometimes a benevolent assumption.

A priori and a posteriori estimates

There are two types of error estimates in applied mathematics:

  • A priori error estimates are based on theoretical guarantees. They can be derived before running any calculations.

  • A posteriori error estimates are calculated from the numerical solution once it is known (possibly building on theoretical guarantees).

A typical strategy in deep learning is to divide the given data set into three parts:

    1. The training set, which is used to optimize the parameters.

    2. The test set, which is used to determine whether we should stop training because we are starting to overfit (i.e. we keep improving on the training set while getting worse on data points that the neural network does not see).

    3. The validation set, which is not used during training, but used afterwards to estimate how well the model performs on unseen data.
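
A minimal sketch of such a split, assuming the data is given as NumPy arrays X and Y (the 80/10/10 proportions are an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)

    def three_way_split(X, Y, frac_train=0.8, frac_test=0.1):
        # Shuffle once so that the three parts are disjoint random subsets.
        idx = rng.permutation(len(X))
        n_train = int(frac_train * len(X))
        n_test = int(frac_test * len(X))
        train = idx[:n_train]
        test = idx[n_train:n_train + n_test]
        val = idx[n_train + n_test:]     # held out entirely until after training
        return (X[train], Y[train]), (X[test], Y[test]), (X[val], Y[val])

    # usage with made-up data
    X = rng.normal(size=(1000, 16))
    Y = rng.integers(0, 2, size=1000)
    train_set, test_set, val_set = three_way_split(X, Y)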

This is an example of an a posteriori approach. Since the neural network has never seen the validation samples during training, its parameters θ -- which are random variables that depend on the training set -- are independent of the random choice of the validation set. If the samples (x_i, y_i) of the validation set are drawn independently from the data distribution μ, we therefore have the bound

    \mathbb{E}\left[\, \left| \frac{1}{n} \sum_{i=1}^n \ell\big(h(\theta, x_i), y_i\big) - \mathbb{E}_{(x,y) \sim \mu}\big[\ell(h(\theta, x), y)\big] \right| \,\right] \le \sqrt{\frac{\mathrm{Var}_{(x,y) \sim \mu}\big(\ell(h(\theta, x), y)\big)}{n}}

by the law of large numbers, since for iid random variables Z_i we generally have

    \mathbb{E}\left[\, \left| \frac{1}{n} \sum_{i=1}^n Z_i - \mathbb{E}[Z_1] \right| \,\right] \le \sqrt{\mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^n Z_i \right)} = \sqrt{\frac{\mathrm{Var}(Z_1)}{n}}.

There are two reasons why this may not be satisfying:

    1. At times, we may wish for a priori estimates to get a good idea of how well we would do before we even have the data set.

    2. Since the estimate scales like 1/sqrt(n), we may need an immense number of data points in the holdout set. For example, if ℓ is the zero-one loss in classification, we can bound the variance naively by 1. If we want the guarantee to be accurate up to at most 1%, our validation set must have size at least n = 1/0.01^2 = 10,000. This is pretty large and not always feasible.

The scaling is not just an artifact of our method of proof - the central limit theorem tells us that oscillations of size 1/sqrt(n) around the mean are standard.
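
The arithmetic behind this sample-size requirement is easy to reproduce; the sketch below uses the naive variance bound of 1 and the 1/sqrt(n) scaling to compute the required holdout size for a few illustrative target accuracies.

    import math

    def holdout_size(target_accuracy, variance_bound=1.0):
        # We want sqrt(variance_bound / n) <= target_accuracy,
        # i.e. n >= variance_bound / target_accuracy**2.
        return math.ceil(variance_bound / target_accuracy ** 2)

    for target in (0.05, 0.01, 0.001):
        print(f"deviation at most {target}: need n >= {holdout_size(target):,}")
    # a 1% guarantee requires n >= 10,000, as computed above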

A priori estimates use the size of the training set rather than that of the validation set. In exchange, they make stronger assumptions on the neural network we learn: they control either its number of non-zero weights (which can be reduced by 'pruning', i.e. setting all parameters that are 'close' to zero exactly to zero) or a measure of the magnitude of its weights. The latter may be controlled by the implicit bias of an optimization algorithm, or by adding an explicit regularizer (e.g. weight decay).
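
Both notions of a 'small' network are easy to state in code. The sketch below shows the two mechanisms on a made-up parameter vector: pruning reduces the number of non-zero weights, while a weight-decay term penalizes the magnitude of the weights inside the empirical risk; the threshold and the decay coefficient are arbitrary illustrative values.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 0.1 * rng.normal(size=10_000)   # made-up flattened parameter vector

    # Pruning: set every weight whose magnitude is below a threshold exactly to zero.
    tau = 0.05
    theta_pruned = np.where(np.abs(theta) < tau, 0.0, theta)
    print("non-zero weights after pruning:", np.count_nonzero(theta_pruned))

    # Weight decay: add an explicit L2 penalty on the weight magnitudes to the risk.
    def regularized_risk(empirical_risk_value, theta, weight_decay=1e-4):
        return empirical_risk_value + weight_decay * np.sum(theta ** 2)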

Statistical learning theory

We need to control the discrepancy between the expectation of the loss with respect to the underlying data distribution and the average loss on the data set (the Monte-Carlo integral). Controlling this for an individual function like an empirical risk minimizer is difficult, since the ERM parameters are of course heavily correlated with the data points.

A popular approach towards a priori estimates is to control the 'generalization gap' uniformly over a function class and to only consider the ERM inside this class. If we can bound

    \sup_{\theta \in \Theta} \big| R(\theta) - \hat{R}_n(\theta) \big| \le \varepsilon

over a set of parameters Θ, and we only consider those parameters to be admissible, then an empirical risk minimizer is guaranteed to have low population risk as well.
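
Written out, the reason is the usual chain of inequalities: if \hat{\theta} minimizes the empirical risk over Θ and \theta^* minimizes the population risk over Θ, then

    R(\hat{\theta}) \le \hat{R}_n(\hat{\theta}) + \varepsilon \le \hat{R}_n(\theta^*) + \varepsilon \le R(\theta^*) + 2\varepsilon,

so the empirical risk minimizer comes within 2ε of the best population risk achievable in the class.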

We can never bound the generalization gap with certainty (it is always possible that we got exceptionally unlucky with our draw of random points), but we can bound it in expectation or with high probability. The high probability bounds are more meaningful, since they give us a statement about our draw of random numbers rather than just a vague intuition that we are doing something sensible.

To obtain high probability bounds, we need two ingredients:

  1. A measure of the 'complexity' of the function class H = { x -> h(θ, x) : θ ∈ Θ } which allows us to bound the expected discrepancy between the Monte-Carlo integral and the expectation, and

  2. Concentration inequalities to derive high probability bounds from bounds in expectation.

It is more customary in statistical learning theory to work directly with the (parameterized) function class H rather than the parameter space Θ, and we will often speak of the risk/loss of the function, rather than the loss of the parameters associated to the function.

Wasserstein distances

We wish to control the generalization gap, i.e. the discrepancy between empirical risk and population risk

    \left| \frac{1}{n} \sum_{i=1}^n \ell\big(h(x_i), y_i\big) - \mathbb{E}_{(x,y) \sim \mu}\big[\ell(h(x), y)\big] \right|.

A classical quantity is found if we replace ℓ(h(x), y) by a general function f which we assume to be Lipschitz-continuous:

    \sup_{\mathrm{Lip}(f) \le 1} \left| \frac{1}{n} \sum_{i=1}^n f(x_i, y_i) - \mathbb{E}_{(x,y) \sim \mu}\big[f(x, y)\big] \right|.

This is the 1-Wasserstein distance W_1(μ, μ_n) between the data distribution μ and the empirical measure

    \mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{(x_i, y_i)}.

Wasserstein distances are the foundation of optimal transport theory and are popular ways to quantify (dis)similarity between data distributions. Unfortunately, they suffer from the curse of dimensionality when comparing continuous distributions and empirical measures: If μ has a density with respect to the Lebesgue measure (e.g. if it is the uniform distribution on the hypercube [0,1]^d), then W_1(μ, μ_n) ≥ c n^{-1/d}.

Assume for simplicity that μ is the uniform distribution on [0,1]^d and let c_d be the volume of the d-dimensional unit ball. Then

    \mu\left( \bigcup_{i=1}^n B_r(x_i) \right) \le n \, c_d \, r^d < \frac{1}{2}

if r < (2 c_d n)^{-1/d}. Thus about half of the domain has distance at least ~ n^{-1/d} from all data points. In high dimension, all practically relevant data sets are too small to have samples close to all points, unless the data really lives on a lower-dimensional submanifold of the data space.
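
To get a feeling for the numbers, the following sketch inverts the relation: ignoring constants, pushing n^{-1/d} below a modest accuracy of 0.1 requires roughly n ≥ 10^d samples, i.e. exponentially many in the dimension.

    # ignoring constants: n^(-1/d) <= eps requires n >= eps^(-d)
    eps = 0.1
    for d in (1, 2, 5, 10, 20):
        print(f"d = {d:>2}: need roughly n >= {eps ** (-d):.0e} samples")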

This is the low-dimensional manifold hypothesis. It is hard to verify in practice, but it is plausible that realistic data has an intrinsic structure. If we were to pick pixel values between 0 and 1 at random, we would get white noise, not a sensible image - and this white noise is exactly what the d-dimensional uniform distribution on the unit cube describes.

Still, even if the data only has a few dozen intrinsic dimensions somehow embedded in a much higher-dimensional space, this may be too much for Wasserstein distances to be a useful measure of similarity. Luckily, the class of neural network functions is much smaller than the class of all Lipschitz functions.

Concentration inequalities

We understand well that empirical averages come closer to the true mean as the number of data samples increases. For random variables with bounded variance, the central limit theorem gives precise meaning to this, as the estimate fluctuates around the true mean like a Gaussian with standard deviation scaling like 1/sqrt(n). From this intuition, we would expect a Gaussian tail bound of the form

    \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1] \right| \ge t \right) \le 2 \exp\left( - \frac{n t^2}{2 \sigma^2} \right).

However, in general, the convergence to the Gaussian is too slow to prove this -- and often, it is in fact false! If all we know about the variables X_i is that they have a finite variance, then there is no reason for the asymptotic Gaussian tail estimates to be relevant for finite n already.

We can say much more, however, if we make a stronger assumption: Rather than just assuming that the second moments are finite, we assume that the much more restrictive quantity E[ e^{λ (X_i - E X_i)} ] is finite for all λ in R (and satisfies a certain growth estimate). This is the class of 'sub-Gaussian' random variables, i.e. random variables whose tails decay at least as fast as those of a Gaussian. In this class of variables, we can prove estimates like the one above, i.e. we can make sure that the probability of large deviations (deviations on a scale much larger than 1/sqrt(n)) decays rapidly in n.
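
The growth estimate alluded to above is usually formulated as a bound on the moment generating function: a random variable X is called sub-Gaussian with variance proxy σ^2 if

    \mathbb{E}\big[ e^{\lambda (X - \mathbb{E}X)} \big] \le e^{\sigma^2 \lambda^2 / 2} \qquad \text{for all } \lambda \in \mathbb{R}.

Via the Chernoff bound, this implies Gaussian-type tails \mathbb{P}(|X - \mathbb{E}X| \ge t) \le 2 e^{-t^2/(2\sigma^2)}, and the property is stable under averaging independent copies, which is what makes it useful here.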

Luckily for us, if X_i is a bounded random variable, then it is sub-Gaussian. In particular, if the loss function is bounded (or if we can otherwise guarantee that we do not see large values by controlling x_i and h), then we can use concentration of measure.

The main tools here are concentration inequalities. In machine learning, the most relevant concentration inequality is McDiarmid's inequality. Ramon van Handel's lecture notes are a great source on concentration of measure.
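
For reference, one common formulation of McDiarmid's (bounded differences) inequality: if X_1, ..., X_n are independent and the function f changes by at most c_i when only its i-th argument is changed, then for every t > 0

    \mathbb{P}\big( f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)] \ge t \big) \le \exp\left( - \frac{2 t^2}{\sum_{i=1}^n c_i^2} \right).

Applied to the empirical mean of variables with values in [0, 1] (so c_i = 1/n), this recovers Hoeffding's inequality with tail bound \exp(-2 n t^2).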

Bounding the generalization gap

Our approach to obtaining a priori estimates uses Rademacher complexities. We draw from the following sources:

  • Chapter 26 in the textbook by Shai Shalev-Shwartz and Shai Ben-David (you can find a pdf version for download here). The textbook is an excellent source for machine learning in general, but does not go into great detail on deep learning specifically.

  • Neural network-specific aspects are taken mostly from Matus Telgarsky's lecture notes.

  • A weakness of both sources is that they do not address McDiarmid's concentration inequality, which is crucial when proving high probability bounds. These lecture notes by Ramon van Handel are a wonderful source for Rademacher complexity, concentration inequalities and high-dimensional probability theory in general. They are not the quickest read, though, and build up from topic to topic.
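
As a preview of the main object, the empirical Rademacher complexity of a class H on a sample x_1, ..., x_n is E_σ[ sup_{h ∈ H} (1/n) Σ_i σ_i h(x_i) ] with iid uniform random signs σ_i ∈ {±1}. The sketch below estimates it by Monte Carlo for a made-up finite class of one-dimensional linear functions h_w(x) = w·x with |w| ≤ 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_rademacher(hypothesis_outputs, n_draws=2000):
        # hypothesis_outputs has shape (num_hypotheses, n): entry (k, i) is h_k(x_i).
        n = hypothesis_outputs.shape[1]
        total = 0.0
        for _ in range(n_draws):
            sigma = rng.choice([-1.0, 1.0], size=n)          # Rademacher signs
            total += np.max(hypothesis_outputs @ sigma) / n  # sup over the finite class
        return total / n_draws

    # made-up sample and a grid discretization of the class {x -> w*x : |w| <= 1}
    x = rng.normal(size=100)
    weights = np.linspace(-1.0, 1.0, 201)
    outputs = np.outer(weights, x)                           # shape (201, 100)

    print("estimated empirical Rademacher complexity:", empirical_rademacher(outputs))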

Further topics

  • Uniform accuracy. Risk bounds give accuracy 'on average': If the expectation of the loss function is very small, then at a typical point we also expect the loss to be small. This says nothing about accuracy everywhere - the difference is between approximation in L^1 or L^2 on the one hand and L^infty on the other. We have only talked about the first two. This is by design, as the latter is intractable: Neural networks have to face the curse of dimensionality here, see e.g. this article.

  • VC dimension. The Vapnik-Chervonenkis dimension (or VC dimension) of a function class is another measure of complexity for function classes (or 'hypothesis classes' in the parlance of statistical learning theory). As a rule of thumb, we use VC dimension if we are controlling the number of parameters of a neural network and Rademacher complexity if we are controlling their magnitude in a suitable way. The same sources apply: Chapter 6 in this textbook concerns VC dimension, and this part of the lecture notes treats it for neural networks.

  • The bias-variance tradeoff is a traditional paradigm of statistical learning: If the function class which we use for approximation is too small, then it will give us a bad predictor in many situations. The choice of function class introduces a high bias. If the function class is too large, on the other hand, there are many functions which can interpolate the training data exactly. Especially if there is a lot of noise in the data, this may not even be desirable. Minimizing empirical risk may barely correlate with minimizing population risk. Here, we have high variance: Functions which perform essentially the same on the training set may have entirely different generalization properties.

  • Double-descent. In deep learning, we work with immensely expressive function classes, yet we often achieve good generalization. The bias-variance tradeoff seems to not fully capture what is going on. The double-descent phenomenon was originally observed by Belkin et al. in this article. They suggest that there are two regimes at play:

    1. The under-parametrized regime. Here, we have (relatively) high bias, but low variance. As we increase the number of parameters to where we are just interpolating the data, the variance gets higher and generalization becomes worse. At the threshold where we have exactly enough parameters to fit the training data, the 'generalization gap' between empirical risk and population risk is large.

    2. The over-parametrized regime. As we increase the number of parameters beyond the interpolation threshold, the generalization gap decreases again. This is certainly not true for every minimizer of the empirical risk, but it seems to be the case for the ones we find by optimization (implicit bias). If the number of parameters becomes high enough, we achieve lower population risk/higher classification accuracy/... in the over-parametrized regime than in the under-parametrized case.

Heuristically, we can picture this as a positive effect of high flexibility: When we have just enough parameters to fit the training data, we need to strain to make everything fit. If we have all the flexibility in the world, we can more comfortably explain the general pattern behind the data in a simpler way.

See also the graphic here for an illustration of the idea.

  • Online learning and active learning. We have focussed on the case where we are given iid data. While this is often more or less realistic, there are two situations in which it is not:

    1. Online learning. Here we keep learning while we deploy the model, and we keep getting a stream of incoming data. Data given at similar times may or may not be correlated.

    2. Active learning. Here we can generate training data ourselves. This is realistic e.g. in scientific applications. If we vary several variables in the same experiment, we typically do not just randomly select values for the experiment, but either explore a parameter grid or try to select the most informative experiment we could perform.