Hey Guys,
We are reading a lot about AI/deep learning doing some really smart stuff, such as playing the game of Go and recognizing images and speech. The core part of deep learning is the deep neural network (DNN). Before we understand the DNN, we need to look at the shallow neural network and its property called universal approximation: given an input variable [latex]x[/latex] and an output variable [latex]y[/latex] (labels), there exists a function [latex]f[/latex] that maps [latex]x[/latex] to [latex]y[/latex]: [latex]y = f(x)[/latex]. We also call [latex]f[/latex] the model. For example, [latex]x[/latex] could be an image with 1 million pixels and [latex]y[/latex] could be a binary variable that says whether a cat is present in the image or not, as shown in the image below.

Problem at hand:
Before predicting whether an image has a cat or not, we need to build the model by showing many cat and non-cat images to it. We have a set of samples (cat/non-cat images) of the input [latex]x_i[/latex] and the corresponding labels [latex]y_i[/latex] (whether the image contains a cat or not). The problem is to learn a function [latex]\hat{f}[/latex] (the model) using the given [latex]x_i[/latex] and [latex]y_i[/latex] that approximates the actual underlying function [latex]f[/latex]. We can divide the process of learning [latex]\hat{f}[/latex] into 2 main sub-problems:
- Expressiveness of the model: The underlying function [latex]f[/latex] is unknown and we need to select a model, i.e. a function [latex]\hat{f}[/latex] represented using some parameters [latex]\theta[/latex]. For example, we can simply say [latex]\hat{f}(x) = ax + b[/latex], which represents a line with parameters [latex]\theta = (a, b)[/latex]. But a line can represent only some functions from [latex]x[/latex] to [latex]y[/latex]: no matter what the value of [latex]\theta[/latex] is, we will not be able to approximate some functions [latex]f[/latex] using [latex]\hat{f}[/latex] (for example, a non-linear function; see the short sketch below). Consider the figure below: it shows coffee consumption in cups vs. time of day, and you can see that the function required to predict the cups of coffee given the time of day is not a line; a non-linear function is required to approximate the underlying function. In many cases the input data will have more than two dimensions and we will not be able to visualize the data and choose the function by eye. So we need a function that has the capability to express a broad class of functions. Having a very flexible function can also cause over-fitting when data is scarce, but we will not be discussing that in this post; we assume that we have a lot of data.

- Learning method: The learning method deals with learning the parameters [latex]\theta[/latex] given the data [latex]x_i[/latex] and [latex]y_i[/latex]. For example, given the data there are several methods to fit a line to it. We will not be dealing with this in this post.
We assume that we have the best learning method and try to understand the expressiveness alone. There are several papers that handle this from a mathematical perspective; I have included them in the references. Here, we take a graphical perspective to intuitively understand the expressiveness of the neural network, using functions ranging from very simple to very complex, with one-dimensional input and one-dimensional output.
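To make the expressiveness point concrete, here is a minimal NumPy sketch (my own illustration, taking [latex]x^2[/latex] as an arbitrary non-linear target): it fits the best possible line and shows that the error stays large no matter which parameters the line uses.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 200)
y = x ** 2                                   # a non-linear target function (illustrative choice)

# Best-fitting line in the least-squares sense
a, b = np.polyfit(x, y, deg=1)
y_line = a * x + b

print("best line: a=%.3f, b=%.3f" % (a, b))
print("max abs error of the line fit: %.3f" % np.max(np.abs(y - y_line)))
# The error cannot be driven to zero by any choice of (a, b),
# because a line simply cannot express a parabola.
```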
Before we look at the expressiveness of the neural network, let us quickly review the single-layer neural network, also called a shallow network.
What are Neural Networks:
A shallow network is a function which takes the input and maps it to the output as follows:
[latex]h = \sigma(W_1 x + b_1), \qquad y = W_2 h + b_2[/latex]
where [latex]W_1, b_1, W_2, b_2[/latex] are the parameters of the network and [latex]\sigma[/latex] is called the activation function (many such functions are used in neural networks). Here [latex]x[/latex] and [latex]y[/latex] can be multi-dimensional vectors. The number of dimensions of [latex]h[/latex] is also called the number of neurons in the neural network. Each [latex]h[/latex] can be further mapped to one or more hidden representations; since this network has only one hidden representation, it is called a single-layer neural network. It can be graphically represented as below:

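As a rough sketch of the mapping above (the weight names, shapes and values here are illustrative, not taken from this post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_net(x, W1, b1, W2, b2):
    """Single hidden layer: h = sigmoid(W1 x + b1), then y = W2 h + b2."""
    h = sigmoid(W1 @ x + b1)      # hidden representation; its length is the number of neurons
    return W2 @ h + b2

# Example: 3-dimensional input, 5 neurons, 1-dimensional output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(1, 5)), rng.normal(size=1)
print(shallow_net(np.array([0.1, -0.2, 0.3]), W1, b1, W2, b2))
```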
How Neural Networks can approximate any function:
To understand the approximation property of neural networks, let us fix the activation function to the sigmoid [latex]\sigma(a^\top x + b) = \frac{1}{1 + e^{-(a^\top x + b)}}[/latex], which takes a vector [latex]x[/latex] as input and outputs a scalar. (The argument can be generalized to any other activation.)
If we consider [latex]x[/latex] as one-dimensional, the value of the function for different values of [latex]a[/latex] and [latex]b[/latex] is shown below.

Left: b=0; right: b=-0.5. Different colors show different values of a.
You can observe from the figure that the value of [latex]a[/latex] changes the slope of the function in a small interval around the bias [latex]b[/latex]. You can play with the function here to understand more.
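If you want to reproduce the effect of [latex]a[/latex] and [latex]b[/latex] numerically, here is a small sketch (the parameter values below are chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-2, 2, 9)
for a in (1, 5, 20):              # larger a -> steeper slope around the center
    for b in (0.0, -0.5):         # b shifts where the steep (linear-looking) part sits
        print(f"a={a:2d}, b={b:4.1f}:", np.round(sigmoid(a * x + b), 2))
```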
Now let us consider a function, randomly select some points of the input variable, denote them as [latex]x_i[/latex], and evaluate the corresponding function values [latex]y_i = f(x_i)[/latex]. (We don't consider observation noise.)
FUNCTION-1: Line x+.5 in the interval [0,1].
The underlying function and the approximation learned by the NN with a single neuron are shown below. As you can see, the neural network is able to approximate the line well with a single neuron in the observed data interval [0,1].

The x-axis indicates the input x and the y-axis indicates the output y. The blue line indicates the underlying function [latex]f=x+.5[/latex], the green points indicate the observed data points [latex](x_i,y_i)[/latex], and the red curve indicates the approximate function learned by the NN with a single neuron.
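A rough way to reproduce this kind of experiment is sketched below, using scikit-learn's MLPRegressor with a logistic (sigmoid) activation purely as a stand-in for the actual training setup (which is not shown here):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200).reshape(-1, 1)   # observed interval [0, 1]
y = x.ravel() + 0.5                                   # underlying function f(x) = x + 0.5

# One hidden neuron with a sigmoid (logistic) activation
net = MLPRegressor(hidden_layer_sizes=(1,), activation="logistic",
                   solver="lbfgs", max_iter=5000, random_state=0)
net.fit(x, y)

x_test = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
print(np.round(net.predict(x_test), 3))   # should stay close to x + 0.5
```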
FUNCTION-2: Line x+.5 in the interval [-0.5,1.5]
Let us consider the same function and increase the interval of the observed data to [-0.5, 1.5]. Learning the neural network from this data, you can see that a single neuron is able to approximate this function where the data is observed (of course with different learnt parameters).
OK, let us see a little more formally how that is possible with a single neuron. The function of a single neuron is given by [latex]\sigma(ax + b)[/latex], with parameters [latex]a[/latex] and [latex]b[/latex]. As you can see from the sigmoid property, [latex]a[/latex] controls the slope of the linear part of the sigmoid and the bias [latex]b[/latex] controls the center point of the linear part. So no matter what the observed data interval or the slope of the line is, a single neuron can approximate it.
FUNCTION-3: Two lines
Let us consider a slightly more complicated function where there are two different lines over two different intervals of the x-axis; the graph is shown below. The red curve indicates the function learned by the 2-neuron neural network. You can imagine that each line is approximated by a single neuron. The neural network can't approximate the discontinuous part of the function, and we can observe more error near zero.

FUNCTION-4: Three lines
In this example we try to approximate 3 lines using 3 neurons. Each of the sigmoid functions learned by the neural network is shown below the three-line curve. As you can clearly see, each of the sigmoids is trying to approximate one segment of the function.

FUNCTION-5: Sine wave
Let us look at a slightly more complex function, such as a sinusoidal function in the interval [0,1].

The above graph shows the approximate function learned by the neural network for different numbers of neurons. As you can see, with one neuron the network tries to fit the central part of the sine wave, where there is a long, almost linear region, and the added neurons try to fit the other parts of the sine wave.
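A sketch of this kind of experiment (assuming one full sine period on [0,1]; the exact frequency and training setup used here are not stated), showing how the fit improves as neurons are added:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel())            # one full sine period on [0, 1]

for n_neurons in (1, 3, 10):
    net = MLPRegressor(hidden_layer_sizes=(n_neurons,), activation="logistic",
                       solver="lbfgs", max_iter=5000, random_state=0)
    net.fit(x, y)
    err = np.mean((net.predict(x) - y) ** 2)
    print(f"{n_neurons:2d} neurons -> training MSE {err:.4f}")
```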
MANY MORE FUNCTIONS

parabola with linear trend

2nd degree polynomial

3rd degree polynomial

Complex function
In all the above functions, you can see that the NN is trying to approximate each small interval of the function using sigmoids.
No matter what the function is, the input can be divided into small chunks where the function is continuous, and each part can be approximated by a sigmoid function. This is why neural networks are called universal approximators.
So adding more neurons increases the expressive power of the NN. You can think of the NN as a piecewise interpolation method that uses its activation function as the kernel.
Now you can generalize this concept to any activation: the NN will try to approximate each small chunk of the function using its activation. The same concept can be understood in this nice blog using a threshold activation instead of a smooth activation.
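As a small illustration of the same idea with a different activation (ReLU here, chosen only for this sketch), the learned approximation becomes piecewise linear instead of a sum of sigmoids:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(0.0, 1.0, 400).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel())

# Same width, different activation: each ReLU unit contributes a "hinge",
# so the fit is a piecewise-linear interpolation of the target.
net = MLPRegressor(hidden_layer_sizes=(10,), activation="relu",
                   solver="lbfgs", max_iter=5000, random_state=0)
net.fit(x, y)
print(np.round(net.predict(x[::80]), 3))
```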
If neural networks are universal approximators, why do we need to explore any other models?
What are the problems:
Let us consider the function shown below. Now try to apply the above concept of approximating the function piece by piece using sigmoids. How many neurons do we need?

We need approximately 60 neurons to approximate this simple function. So universal approximation does not specify the number of neurons required to approximate a given function to within some error. In practice this number will be really high for even a marginally complex function. The neural network fails to exploit the nice compositional structure present in the function.
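One way to see this effect yourself is to keep widening the hidden layer until the error drops below a target. The sketch below does that for a placeholder oscillating function (not the exact example above), so the neuron count it prints is only indicative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def neurons_needed(f, x, target_mse, max_neurons=100):
    """Increase the hidden width until the training error falls below target_mse."""
    y = f(x.ravel())
    for n in range(1, max_neurons + 1):
        net = MLPRegressor(hidden_layer_sizes=(n,), activation="logistic",
                           solver="lbfgs", max_iter=5000, random_state=0)
        net.fit(x, y)
        if np.mean((net.predict(x) - y) ** 2) < target_mse:
            return n
    return max_neurons

x = np.linspace(0.0, 1.0, 400).reshape(-1, 1)
# Placeholder target: a rapidly oscillating function (illustrative choice)
print(neurons_needed(lambda t: np.sin(8 * np.pi * t), x, target_mse=1e-3))
```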
In my next post we will see how the DNN can overcome this problem.
Thanks,
Achuth
References:
[1] Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators.” Neural networks 2.5 (1989): 359-366. (Main and first paper)
[2] Scarselli, Franco, and Ah Chung Tsoi. "Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results." Neural networks 11.1 (1998): 15-37.
[3] Cybenko, George. “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals, and Systems (MCSS) 2.4 (1989): 303-314.