My current research interests lie in supervised machine learning, in particular the theory of artificial neural networks. The review article [6] contains many results of my research.
The continuous perspective on machine learningConsider two timestepping schemes x_{t+1} = x_{t} + h f(x_{t}) and x_{t+1} = x_{t} + h/6 (z_{t,1} + 2z_{t,2} + 2z_{t,3} + z_{t,4}) where
z_{t,1} = f(x_{t}), z_{t,2} = f(x_{t}+ h/2 z_{t,1}), z_{t,3} = f(x_{t}+ h/2 z_{t,2}), z_{t,3} = f(x_{t}+ h z_{t,3}).
While the second scheme looks significantly more complicated, both are timestepping schemes for the same ODE dx/dt = f(x). To analyze the schemes appropriately, we need to understand the ODE and then analyze the approximation properties of the different methods. This is the subject of classical numerical analysis.
Machine learning historically developed the other way around. Much study has been devoted to analyzing problems involving a finite number of data points and tunable parameters, and possibly their statistical generalization properties. Considering continuous problems first and their discretization second is a more classical approach which may give rise to new persepectives and methods. Developing this point of view is initially a question of function spaces and function representation.
Approximation theory and function spacesIt has been known since the late eighties/early nineties that neural networks with a single hidden layer can approximate any continuous function on a compact set uniformly, but the layer may have to be very wide and the coefficients very large. In this situation, the function represented by the neural network is the difference of two functions (partial sums for positive and negative weights in the outermost layer), each of which may be orders of magnitude larger than the target function. Even if the function behaves well on a set of training samples, it may not generalize well (i.e. it behaves poorly at unseen data points).
There are examples of functions which can be approximated by deep networks with much fewer parameters compared to shallow networks. In [1], we use results on function representation to show that functions which can be expressed by a neural network with a hidden single layer are differentiable except on a countable union of affine subspaces. In particular, the distance function from a curved manifold can never be represented exactly by a neural network with one hidden layer, even if it is infinitely wide. Adding a second hidden layer, many such functions are easy to represent [5].
If the weights of the neural network small in a suitable sense, the function does not depend too much on cancellations between the partial sums and we can use tools from statistical learning theory to ensure that it generalizes well. We therefore need 'few' data points to decide whether a function behaves well also on new data. On the other hand, we show that reasonably smooth problems cannot be solved in any function class which has these favorable properties from the learning perspective [4].
Training neural networksWhile it is understood that neural networks can approximate functions, it is not easy in practice to choose appropriate parameters. Current strategies use (stochastic) gradient descent based methods to optimize parameters iteratively. This works (surprisingly) well in practice (with suitable initial conditions), even though the objective function is nonconvex.
There are several explanations for this performance. Some rely on stochasticity, but they generally assume a different type of noise than is derived rigorously from stochastic gradient descent. These works do not match practical observations in that the initial condition has no impact and the same limit is achieved for any initial parameter distribution. Other works show that stochastic noise may not be needed to achieve convergence. If a network has an excessive number of tunable parameters compared to the number of data samples, it can be shown that it behaves like the linearization around its initial condition for all time. For such linear models, it is often possible to show global convergence. This approach cannot explain why neural networks often generalize well, or why they outperform linear models in practice.
In [2], I use a method developed by Chizat and Bach to show that for neural networks with one hidden layer, ReLU activation and suitable initialization, gradient descent reduces a risk functional to its infimum value if and only if its gradient converges to some fixed value. The result is obtained in the nonlinear regime and shows that it suffices to show the uniqueness of a limiting object, while convergence along subsequences is guaranteed by compactness. The result gives insight into how to choose suitable initial conditions in practice. The approach is insensitive to whether or not the risk functional has a minimizer, and in [3] we show that if a minimizer does not exist, risk decay may be very slow in high dimension. In an upcoming preprint, we extend the result to infinitely deep residual neural networks (ResNets). In [10], I demonstrate that SGD with noise which scales like realistic noise in machine learning converges to a minimum exponentially fast under certain assumptions, albeit with a potentially prohibitively large random coefficient. For a similar noise model, I investigate in [11] which kind of minimizers are chosen by SGD.
Use of symmetryMany current neural network architectures, especially in image classification, use convolutional layers. This corresponds to a prior belief that the function which is to be approximated should be invariant under certain symmetries (when determining whether an image shows a cat, it does not matter where the cat is in the image). Convolutional networks outperform fully connected ones in many applications, but the mechanism is not fully understood.
The relative importance of the four statements above is not yet understood.
Articles
Other sourcesRecordings of my talks on function spaces for multilayer neural networks, the training of shallow neural networks in the meanfield regime and stochastic gradient descent for overparametrized regression problems can be found online. A presentation on the competition between approximation power and learnability of function classes will become available shortly.
