# Artificial Neural Networks

My current research interests lie in supervised machine learning, in particular the theory of artificial neural networks. The review article [6] contains many results of my research.

## The continuous perspective on machine learning

Consider two time-stepping schemes x_{t+1} = x_{t} + h f(x_{t}) and x_{t+1} = x_{t} + h/6 (z_{t,1} + 2z_{t,2} + 2z_{t,3} + z_{t,4}) where

z_{t,1} = f(x_{t}), z_{t,2} = f(x_{t}+ h/2 z_{t,1}), z_{t,3} = f(x_{t}+ h/2 z_{t,2}), z_{t,3} = f(x_{t}+ h z_{t,3}).

While the second scheme looks significantly more complicated, both are time-stepping schemes for the same ODE dx/dt = f(x). To analyze the schemes appropriately, we need to understand the ODE and then analyze the approximation properties of the different methods. This is the subject of classical numerical analysis.

Machine learning historically developed the other way around. Much study has been devoted to analyzing problems involving a finite number of data points and tunable parameters, and possibly their statistical generalization properties. Considering continuous problems first and their discretization second is a more classical approach which may give rise to new persepectives and methods. Developing this point of view is initially a question of function spaces and function representation.

## Approximation theory and function spaces

It has been known since the late eighties/early nineties that neural networks with a single hidden layer can approximate any continuous function on a compact set uniformly, but the layer may have to be very wide and the coefficients very large. In this situation, the function represented by the neural network is the difference of two functions (partial sums for positive and negative weights in the outermost layer), each of which may be orders of magnitude larger than the target function. Even if the function behaves well on a set of training samples, it may not generalize well (i.e. it behaves poorly at unseen data points).

There are examples of functions which can be approximated by deep networks with much fewer parameters compared to shallow networks. In [1], we use results on function representation to show that functions which can be expressed by a neural network with a hidden single layer are differentiable except on a countable union of affine subspaces. In particular, the distance function from a curved manifold can never be represented exactly by a neural network with one hidden layer, even if it is infinitely wide. Adding a second hidden layer, many such functions are easy to represent [5].

If the weights of the neural network small in a suitable sense, the function does not depend too much on cancellations between the partial sums and we can use tools from statistical learning theory to ensure that it generalizes well. We therefore need 'few' data points to decide whether a function behaves well also on new data. On the other hand, we show that reasonably smooth problems cannot be solved in any function class which has these favorable properties from the learning perspective [4].

## Training neural networks

While it is understood that neural networks can approximate functions, it is not easy in practice to choose appropriate parameters. Current strategies use (stochastic) gradient descent based methods to optimize parameters iteratively. This works (surprisingly) well in practice (with suitable initial conditions), even though the objective function is non-convex.

There are several explanations for this performance. Some rely on stochasticity, but they generally assume a different type of noise than is derived rigorously from stochastic gradient descent. These works do not match practical observations in that the initial condition has no impact and the same limit is achieved for any initial parameter distribution. Other works show that stochastic noise may not be needed to achieve convergence. If a network has an excessive number of tunable parameters compared to the number of data samples, it can be shown that it behaves like the linearization around its initial condition for all time. For such linear models, it is often possible to show global convergence. This approach cannot explain why neural networks often generalize well, or why they outperform linear models in practice.

In [2], I use a method developed by Chizat and Bach to show that for neural networks with one hidden layer, ReLU activation and suitable initialization, gradient descent reduces a risk functional to its infimum value if and only if its gradient converges to some fixed value. The result is obtained in the non-linear regime and shows that it suffices to show the uniqueness of a limiting object, while convergence along subsequences is guaranteed by compactness. The result gives insight into how to choose suitable initial conditions in practice. The approach is insensitive to whether or not the risk functional has a minimizer, and in [3] we show that if a minimizer does not exist, risk decay may be very slow in high dimension. In an upcoming preprint, we extend the result to infinitely deep residual neural networks (ResNets).

In [10], I demonstrate that SGD with noise which scales like realistic noise in machine learning converges to a minimum exponentially fast under certain assumptions, albeit with a potentially prohibitively large random coefficient. For a similar noise model, I investigate in [11] which kind of minimizers are chosen by SGD.

## Use of symmetry

Many current neural network architectures, especially in image classification, use convolutional layers. This corresponds to a prior belief that the function which is to be approximated should be invariant under certain symmetries (when determining whether an image shows a cat, it does not matter where the cat is in the image). Convolutional networks outperform fully connected ones in many applications, but the mechanism is not fully understood.

Any convolutional network can be understood as a fully connected network where parameters have a band structure. The approximation power of fully connected networks is thus greater than that of convolutional networks, but the approximation power of convolutional networks with a given number of parameters is greater. It is not clear, how much approximation power we gain from the architecture.

The class of convolutional neural networks with a certain path norm is smaller than the class of fully connected neural networks with the same path norm. To estimate the generalization error, fewer data points may be required (or in the class of fully connected networks, a given data sample may correspond to multiple data points by symmetry).

Optimization algorithms explore a much smaller parameter space when choosing parameters in convolutional architectures. The restricted energy landscape may exhibit fewer stationary points which the parameters can get stuck in.

Training a fully connected network on a finite training set which is typically not invariant under the symmetry group action, the result of training does not have to exhibit the correct symmetry. Initial conditions for training algorithms are typically random, making convergence to a symmetric function very unlikely. Imposing the symmetry a priori thus corresponds to a projection on the correct subspace which reduces the generalization error.

Convolutional networks typically have small filters, i.e. information is passed iteratively through different scales where pixels first interact only with their neighbors and in deeper layers with larger and larger neighborhoods. Small filter sizes may have an effect on the learnability of convolutional neural networks.

The relative importance of the four statements above is not yet understood.

## Articles

**Representation Formulas and Pointwise Properties for Barron Functions**

(with Weinan E) 2020 ArXiv

**On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime**

2020 ArXiv

**Can shallow neural networks beat the curse of dimensionality? A mean field training perspective**

(with Weinan E), 2020 ArXiv

**Kolmogorov Width Decay and Poor Approximators in Machine Learning: Shallow Neural Networks, Random Feature Models and Neural Tangent Kernels**

(with Weinan E), 2020 ArXiv

**On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics**

(with Weinan E) 2020 ArXiv

**Towards a Mathematical Understanding of Neural Network-Based Machine Learning: what we know and what we don't**

(with Weinan E, Chao Ma and Lei Wu) 2020 ArXiv

**A priori estimates for classification problems using neural networks**

(with Weinan E) 2020 ArXiv

**On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers**

(with Weinan E) 2020 ArXiv

**Some observations on partial differential equations in Barron and multi-layer spaces**

(with Weinan E) 2020 ArXiv

**Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis**

2021 ArXiv

**Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis**

2021 ArXiv

## Other sources

Recordings of my talks on function spaces for multi-layer neural networks, the training of shallow neural networks in the mean-field regime and stochastic gradient descent for overparametrized regression problems can be found online. A presentation on the competition between approximation power and learnability of function classes will become available shortly.