Optimization for Deep Learning

We mostly follow these lecture notes, which will be continuously updated throughout the semester. Some additional sources are listed below. If you have an article or lecture notes which you find particularly well written, insightful, or complementary to these materials, please send me an email and I'll add the reference.

Additional Literature

Adaptive learning rate algorithms

There are a few analyses of adaptive learning rate algorithms.

- You can find the analysis of (radial) Adagrad for non-convex optimization here.
- SignSGD is studied in this article for non-convex optimization. It is mostly motivated by a reduced need for communication between different 'workers' if the optimization is split up over several devices.

Initialization

Much of our analysis concerns the late phase of training. convexity or PL assumptions are reasonable at most once the optimizer has found a 'valley' in the loss landscape. Initialization, on the other hand, is very important in the early phase of training, to cut down the time until we reach a good part of the loss landscape - otherwise, we could be stuck on a plateau for a very long time.

Typically, we choose random initialization schemes to make sure that we get 'diverse' features. This allows us to pick up different properties of the data. Other initialization schemes (such as orthogonal initialization) have also been explored. Diversity is imporant:

If two neurons (a_i, w_i, b_i) and (a_j, w_j, b_j) are the same at initialization, they remain the same throughout training. We have thus wasted a parameter.
When proving that there exists a path of non-increasing energy to the global minimum, the first two path segments were used just to get to a 'diverse' place, from which the path was very simple.
Some global convergence results can be proved for diverse initialization in the infinite width limit, see e.g. these articles by Chizat and Bach (2018, 2020) and myself.

For random initialization, the mean of parameter distribution is zero and the variance is fixed in order to avoid the vanishing and exploding gradients phenomenon, which easily affects deep networks: The chain rule characterizes the gradient as the composition of many derivatives. For very deep layers, the gradients are therefore exponentially large or small (in the depth of the network) if the parameters are initialized incorrectly. This leads to very inefficient training.

The correct initialization depends on the behavior of the activation function at zero. If sigma'(0) is not zero, then this can be used to compute the variance. Typical differentiable activation functions satisfy sigma'(0) = 1 (consider e.g. tanh). ReLU needs a larger activation by a factor of 2, as half the incoming features are set to zero.

The network architecture also plays a role - very deep networks also need to be reasonably wide for efficient training. See e.g. these articles [1], [2] by Boris Hanin for a careful consideration.

Implicit bias of optimization algorithms

Implicit bias (or global minimum selection) studies two related, but separate questions:

1. Of the many empirical minimizers of the loss function in the setting of overparametrized learning, which minimizers is an optimization algorithm most likely to find? This often depends on the initialization as well as the algorithm and its hyperparameters.
2. Do those minimizers have favorable properties from the point of view of generalization, i.e. do they correspond to parameters which work well not just on the training set, but also on previously unseen data? This links the function model as well as generalization to the optimization process and is therefore highly non-trivial.

These phenomena have been studied from many different perspectives, and the list below is not exhaustive. If you have a reference which you find particularly insightful or important, please email me and I will add it.

See Section 5.5 (page 39) of this review article for a discussion of global minimum selection and a simple one-dimensional motivation. In particular, the authors demonstrate empirically that SGD chooses different minima then GD, with better generalization properties. The examples are mostly based on this work.
The earliest work in this field that I am aware of is this article, which suggests that optimization algorithms may select minimizers where the objective function is 'flat' in some sense (or that such algorithms may be benefitial). Their concept of 'flatness' is defined on a small positive length scale: They look at minimizers θ such that L(θ') is low for all θ' in a small box around θ.

The authors further link flatness to generalization properties. The argument is based on the MDL (minimum description length) principle: If you need to specify fewer bits of the parameter vector of a function to get good accuracy, then the function is more likely to generalize well. You can think of this as a version of Occam's razor.

The right notion of 'flatness' is hard to pin down and varies considerably. In some examples, an explicit infinitesimal notion (depends only on derivatives of some order at the point) can be found.

In a continuous time model for SGD which captures the vanishing noise at the global minimum (but not the low rank of real SGD noise), I find one notion of flatness in this article (Theorem 2.11). For additive noise, another notion of flatness is found in the vanishing noise limit in Theorem 2.12 in the same article.

In a model with persistent noise on the set of global minimizers (induced by 'label noise', i.e. adding small random perturbations to the values y_i we try to fit), another notion of flatness is found in this article on a discrete time-stepping scheme and this continuous time analysis.

All these notions of flatness depend only on the Hessian. In the label noise case, the dependence is simply through the Laplacian, whereas a considerably more involved dependence is observed in the vanishing noise case.

In a simpler model, there is an explicit path-dependent regularizer which describes the bias towards 'simple' minimizers, as explored here. The authors find that a longer training time corresponds to better generalization. This is not surprising - if we immediately find parameters which interpolate the training data right next to our initialization, then main bias is from the initial condition, and the optimization algorithm barely matters.

"However, it must be kept in mind that, as explained in [36], there is a tension between generalisation and optimisation: a longer training time might improve generalisation but comes at the cost of... a longer training time."

Many questions in this area remain open.

Deep linear networks

A deep linear neural network is a neural network with linear activation function f(x) = W^L ... W^1x. The resulting function is linear in its data x and non-linear in its parameters, since the matrices W^1, ..., W^L are all multiplied together. In particular, the expressivity of the function class does not change and deep linear networks are just linear function. Such models are nevertheless interesting for two reasons:

Matrix factorization (see below)
As a toy models for real neural networks.

For the second problem, they serve as more tractable models where we can study optimization algorithms more explicitly. An overview of important results can be found here.

Optimization in other areas of machine learning

In some deep learning models, optimal parameters can be found directly by solving a linear system of equations (e.g. least squares regression), but most problems require optimization to find good parameters.

A version of LASSO regression (least absolute shrinkage and selection operator) is based on minimizing L(a) = ||Xa - y||_2^2 + lambda * ||a||_1 for some regularization parameter lambda>0. This is a classical tool in statistics if the number of data points is smaller than the number of variables per data point = number of variables in the vector a, i.e. in the overparametrized regime. The first term forces compliance with the data, while the second one forces us to use a vector a which is "not too large" to prevent overfitting.

The main technical complication is that the regularizer is the l^1-norm, not the Euclidean norm. This has benefits: the optimal vector a for the LASSO is sparse, while the optimal vector for L'(a) = ||Xa - y||_2^2 + lambda * ||a||_2^2 with an l^2-norm penalty (''ridge regression'') usually is not. Sparsity allows compressing data to the important variables and makes the model more interpretable: We can get an intuitive feeling for linear combinations of five variables, but not 500, and it may be worth knowing which markers are the strongest predictors for an outcome if we can only take a few measurements.

Finding the minimizer requires solving an optimization problem which is convex, but not smooth. The l^1-norm is Lipschitz-continuous, but its gradients are discontinuous. See Section 3.5.2 in the notes for an analysis of gradient descent in such problems. Other versions of LASSO with hard constraints rather than a 'soft' penalty are common and require constrained optimization techniques (see below).

Many data sets have a matrix structure (or more generally tensor structure). Consider e.g. a data set where rows are users of a service and columns are products of the service. The matrix entry would be a score corresponding to the rating that the user gave the product (e.g. film reviews on Netflix or product ratings on Amazon). Similarly, the rows could be patients and the columns may correspond to different pieces of medical information.

Typically, most entries of these matrices are unknown - very few people order and/or rate every product that is available on amazon... But for the medical provider or service provider, it may be attractive to know what the unknown entries should be (e.g. to flag high risk patients or to recommend the next movie). This is a matrix completion problem.

Without further assumptions, matrix completion is of course impossible. It is often reasonable to assume that the data matrix is well approximated by a matrix that has low rank = a matrix which only has a few linearly independent rows/columns. Intuitively, this would mean that there are some archetypes such that most users are a mix between these (people who love cooking shows/sports/true crime/science fiction/...). The number of archetypes may be 'large' (say 100) but remains several orders of magnitude smaller than the number of users.

Even if the data matrix X is known, it is an interesting question: What is a good low rank approximation of a matrix? Rank is a lower semi-continuous function on the space of matrices, so the set of matrices of rank at most k is closed. It even has a manifold structure on 'most' of the set, but it is not particularly easy to work with. Instead, we encode the low rank condition in the construction of an approximating matrix itself:

Minimize || X - UV|| where U is an n by k matrix (tall and skinny) and V is a k by n matrix (short and wide). The matrix UV has dimension n by n and rank at most k << n by construction. On the other hand, we need to solve a non-convex optimization problem to find U and V. The same techniques that we use for neural networks can be applied here. The energy landscape is nice in this setting: All local minimizers are global minimizers, so optimization algorithms should converge to something good.

Some specialized tools may be available in this slightly more tractable setting (such as quasi-Newton type methods - see below for a discussion and here for a reference).

There are extensions to the case where we want to interpret the factors U and V in a certain way (for example, we may want all entries of one or both matrices to be non-negative).