What are artificial neural networks?
The class of artificial neural networks is a collection of parametrized functions. In the simplest case, such a function is given by
f(x) = ( A^(L) o sigma o A^(L-1) o sigma o ... o sigma o A^(1) )(x)
where x in R^d is the input and the A^(l) : R^(d_(l-1)) -> R^(d_l), l = 1, ..., L, are affine linear maps. Here d_0 = d is the input dimension and d_L is the output dimension.
The parameters of the function are the entries of the affine linear maps A^(l). We refer to them as weights (for the linear part) and biases (for the affine shift).
The function sigma : R-> R is called the activation function of the neural network. By an abuse of notation, we also consider sigma : R^k -> R^k with sigma(z_1, ..., z_k) = ( sigma(z_1), ..., sigma(z_k) ).
The architecture of a neural network refers to the choice of network type (we have chosen a standard feedforward network), the depth L, the widths (i.e. the intermediate dimensions d_l), and the activation function - in short, all important information which is not weights or biases.
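The definition above can be turned into a few lines of code. The following is a minimal sketch, not a definitive implementation: the activation function (here ReLU), the widths, and the random initialization are illustrative choices, not mandated by the text.

```python
import numpy as np

def relu(z):
    # Componentwise activation sigma, applied entry by entry
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Evaluate f(x) = ( A^(L) o sigma o ... o sigma o A^(1) )(x),
    where each affine map is A^(l)(z) = W_l z + b_l."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)           # hidden layers: affine map, then sigma
    W, b = weights[-1], biases[-1]
    return W @ a + b                  # final affine map, no activation

# Toy architecture with widths d_0 = 3, d_1 = 4, d_2 = 2 (chosen arbitrarily)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
x = np.ones(3)
print(forward(x, weights, biases).shape)  # (2,)
```

The entries of the matrices W_l are the weights and the vectors b_l are the biases; everything else in the snippet (depth, widths, activation) is architecture.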
Note that in the image, data flows from the input (left) to the output (right), opposite to the notational convention for the composition of functions in mathematics.
To a mathematician, this structure may look exotic and very different from the linearly parametrized function classes that we are used to (or even from mildly non-linear classes such as free-knot splines). Indeed, neural networks are notoriously hard to understand, and what happens between input and output is often considered a 'black box'. We will see later that this non-linearity is both an asset and a challenge.
Artificial neural networks are inspired by a simple model of the human brain. In the model, a set of neurons is arranged in layers.
Every neuron can be on or off (values 0 or 1), which in the biological motivation would correspond to sending an impulse or not sending it.
A neuron determines whether to send the impulse based on a weighted average of the output of the neurons from the previous layer.
This model corresponds exactly to the functions described above if the activation function sigma is the Heaviside function, which takes the value 1 if z > 0 and 0 if z <= 0. In order to be able to optimize the parameters of a neural network, smoother activation functions are selected in modern applications.
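The contrast between the on/off neuron of the biological model and the smoother activations used in practice can be sketched as follows. This is only an illustration; the sigmoid and ReLU shown here are two common choices among many.

```python
import numpy as np

def heaviside(z):
    # 1 if z > 0, else 0: the on/off neuron of the biological model.
    # Its derivative is zero almost everywhere, so gradients carry no signal.
    return (z > 0).astype(float)

def sigmoid(z):
    # Smooth surrogate for the Heaviside step, differentiable everywhere
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Piecewise linear activation, popular in modern networks
    return np.maximum(z, 0.0)

z = np.linspace(-3.0, 3.0, 7)
print(heaviside(z))
print(sigmoid(z))
print(relu(z))
```

The key point is differentiability: gradient-based training needs an activation whose derivative is informative, which the Heaviside step is not.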
Deep learning is the study of deep neural networks (i.e. neural networks with many layers).
When are neural networks used?
Neural networks are used in many practical applications, from image classification to machine translation to AIs for strategy games. Typical applications have data that is ostensibly very high-dimensional (number of pixels in an image, number of words in a language), but may have some low-dimensional structure (sensible images are few compared to random noise assignments of pixel values).
High-dimensional analysis is intrinsically hard. For example, if we want to cover the d-dimensional cube [0,1]^d by smaller cubes of side-length eps > 0, then we need approximately eps^(-d) smaller cubes. Whenever constants grow exponentially in the dimension d, we speak of the curse of dimensionality.
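The covering count can be computed directly. A short sketch of the eps^(-d) growth:

```python
import math

def cubes_needed(eps, d):
    # Covering [0,1]^d by cubes of side eps needs ceil(1/eps) cubes
    # per axis, hence ceil(1/eps)**d in total -- roughly eps**(-d).
    return math.ceil(1.0 / eps) ** d

for d in (1, 2, 10, 100):
    print(d, cubes_needed(0.1, d))
```

With eps = 0.1, the count goes from 10 cubes in one dimension to 10^100 cubes in dimension 100 - the exponential blow-up that the term "curse of dimensionality" refers to.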
Neural networks appear to be able to adapt to 'hidden' low-dimensional structures in many high-dimensional problems and avoid the curse of dimensionality. The exact conditions and mechanisms are not understood.
A very brief history of Deep Learning
Artificial neural networks were initially considered in the 1940s. Early studies focused on neural networks with a single hidden layer and Heaviside activation, a model dubbed the 'perceptron'. This model was found to be lacking in several respects in the late 1960s (for example, it is not possible to express the function which takes the maximum of three or more variables this way).
In the late 1980s and early 1990s, a burst of research appeared in theoretical machine learning, demonstrating that artificial neural networks are indeed able to approximate any (reasonable) function if the architecture is chosen right. Convolutional neural networks (neural networks whose linear maps have a convolutional structure with short-range spatial dependencies) performed well on simple image classification tasks. However, techniques and computational power were lacking to work with large and in particular deep neural networks.
Around 2012, increases in computational power, progress in GPU computing, and innovations in model design (e.g. ResNets) led to rapid advances with neural networks, which began to outperform all other models on a variety of tasks. In 2017, a neural-network-based AI for the strategy game Go beat a top human player for the first time - a benchmark long believed to be decades away.
Since then, neural networks have been used in science (e.g. protein folding), stock trading, medicine, self-driving cars, processing loan applications, criminal justice, machine translation, super-resolution image reconstruction, and many other applications.
Topics in Deep Learning theory
Approximation. Which problems can be (efficiently) solved by neural networks? Which functions can be (efficiently) approximated in a certain sense? How wide/deep should the neural network be?
Generalization. When using a finite data set to design a good function for a purpose, will it also perform well on previously unseen data?
Training/Optimization. How do we decide on a neural network architecture for a given purpose? Given the architecture, how do we find the right weights and biases?
These will be the major chapters for this class. There are many others, which we will not be able to touch on, such as domain transfer (if I train a self-driving car in Texas, can it drive in Pennsylvania?), reinforcement learning (development of AI for strategy games) and practical aspects of high-performance computing (e.g. training on multiple GPUs in parallel). The chart below is not a full representation of the vast area of deep learning.
Problems for Deep Learning and its applications
Adversarial examples. While neural networks are immensely successful, they can also be frighteningly sensitive to small changes in the given data. Most prominently, an image which is classified correctly by a neural network with high confidence can be perturbed in ways imperceptible to the human eye such that it is classified incorrectly, also with high confidence.
Using such fragile models in high-impact fields like self-driving vehicles or medical imaging can be mildly terrifying. An early analysis with some visual examples was given by Szegedy et al. in 2013.
Lack of guarantees and guidelines. In applied and numerical mathematics, an algorithm is only as good as we can prove it is. If we solve a differential equation for a fluid density, can we guarantee that the density is always non-negative? Can we guarantee that the numerical solution is within a certain distance of the true solution? This is the business of obtaining a priori and/or a posteriori error estimates. Very little can be said in this spirit in the context of deep learning. This lack of rigorous understanding of algorithms and models led, in 2017, to the claim that machine learning has become alchemy - at the same time that AlphaGo delivered one of the major public successes of Deep Learning.
Energy consumption. As there are no guarantees or error bounds, there is little rigorous guidance on how to choose a neural network architecture; choosing hyperparameters appropriately requires the intuition and experience of the user, and quite possibly a healthy amount of trial and error. Some claim that Deep Learning is an art rather than a science. Training a neural network (finding the right weights and biases) can take days or weeks on multiple GPUs, and there is no guarantee that the specified architecture will give a good enough result. Computational cost and energy consumption can be considerable.
Interpretability. The inner workings of a neural network remain a mystery. The map from parameters to function is highly non-injective and hard to understand. A neural network may provide an answer to the question we ask, but it can be very hard to understand what it is basing the answer on. Depending on the application, this may be prohibitive - imagine getting a medical diagnosis, but none of your doctors understand why you were diagnosed this way.
Bad data and false confidence. On the most basic level, machine learning algorithms find patterns in data sets. These can be meaningful relationships or spurious correlations. Machine learning does not discern the truth of the world beyond the data set. This can lead to terrible errors if the data set does not reflect the real world well.
For example, blonde women are likely overrepresented in collections of online images. Data sets of celebrity images are biased towards symmetrical facial features. Russian economic data from the 1970s and the 1990s is entirely incomparable due to the collapse of the Soviet Union. Depending on the application, a particular bias may not matter - but using data sets with historic racist and sexist biases in banking, medicine, or criminal justice is clearly a terrible idea.
What is worse, computer algorithms lend decisions an air of rationality and unbiased scientific clarity, giving us false confidence in models that merely propagate the biases not just of the time they were created, but possibly of a distant past if old data is used.
Scientific and social progress. When scientific progress outpaces social or legal progress, ethical and social issues may arise. Deep Learning, like any other tool, can be used in ways that we may find troubling. Facial recognition can be used by authoritarian regimes to track and prosecute minorities and opposition figures. Closer to home, neural networks have recently been used to develop a face search engine for sets of online images.
Some more rigorous discussion of the ethical and social challenges of progress in machine learning happens in the Data Justice Lab of the Texas A&M Institute of Data Science (TAMIDS).
When (not) to use deep learning
Deep learning has been immensely successful in applications where other strategies have failed. This is its strength: Sometimes, when there is no other workable method, it succeeds. This has been particularly true in high-dimensional problems in data science where other methods suffer from the curse of dimensionality (deterioration of performance which is exponential in the dimension).
On the other hand, it may take a lot of effort to get deep learning to work, and there is no guarantee that it will. Without rigorous guidelines for the design of algorithms, the process may boil down to trial and error - so if it does not work, we can only conclude that it did not work this way, not that it cannot work. The effort may be worth it for some problems. But if there is a reliable, efficient method which provably works and gives reliable error estimates, it is quite possibly better to use that method instead.
Additionally, neural networks generally need large amounts of data. Other methods are often preferable in applications where data is unavailable or hard to produce.