Universal Approximation of Neural Networks

Universal approximation beyond continuous functions

 

Neural networks are well known as universal approximators.

What does this mean?

Well, it means that neural networks with at least one hidden layer can approximate a large class of functions to any desired degree of accuracy.


An obvious question is: what exactly is this class of functions?

One might immediately think of continuous functions. However, classification tasks are not continuous problems. Classification involves discrete categories or classes, such as classifying integers as even or odd. The outputs of these tasks are not smooth or continuous but distinct and categorical. How, then, can a neural network, which excels at approximating continuous functions, effectively handle classification tasks? In this article, I will address the application of universal approximation beyond the class of continuous functions.



Formally, we state the main theorem below:


Universal Approximation Theorem (Hornik, Stinchcombe, and White, 1989): Standard multilayer feedforward networks with as few as one hidden layer, using an arbitrary squashing function, are capable of approximating any Borel measurable function from one finite-dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.

In other words, if a function is Borel measurable, then there exists a standard multilayer feedforward network that can approximate it, given the following conditions (a small numerical sketch follows the list):

  • There is at least one hidden layer.

  • A squashing function is used.

  • The approximation is between finite-dimensional spaces.

  • There are sufficiently many hidden units (neurons).
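
To make these conditions concrete, here is a minimal NumPy sketch of the setup the theorem describes: one hidden layer of sigmoid (squashing) units, trained by plain gradient descent to approximate a discontinuous, but Borel measurable, step function. The network width, learning rate, and step count are arbitrary illustrative choices and may need tuning; this is a sketch, not the theorem's proof technique.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # A squashing function: non-decreasing, tends to 0 and 1 at the extremes.
    return 1.0 / (1.0 + np.exp(-z))

def target(x):
    # Discontinuous, but Borel measurable: a step at x = 0.3.
    return (x > 0.3).astype(float)

n_hidden = 20                            # "sufficiently many" hidden units
W1 = rng.normal(0.0, 1.0, (n_hidden, 1))
b1 = rng.normal(0.0, 1.0, n_hidden)
w2 = rng.normal(0.0, 0.1, n_hidden)
b2 = 0.0

x = np.linspace(-1.0, 1.0, 256).reshape(-1, 1)
y = target(x).ravel()

lr = 0.05
for step in range(20000):
    h = sigmoid(x @ W1.T + b1)           # hidden activations, shape (256, n_hidden)
    pred = h @ w2 + b2                   # linear output, shape (256,)
    err = pred - y
    # Gradient descent on mean squared error.
    dh = np.outer(err, w2) * h * (1.0 - h)
    w2 -= lr * (h.T @ err) / len(x)
    b2 -= lr * err.mean()
    W1 -= lr * (dh.T @ x) / len(x)
    b1 -= lr * dh.mean(axis=0)

print("final MSE:", np.mean(err ** 2))

Even though the target jumps at x = 0.3, the mean squared error can be driven down further by adding hidden units, which is exactly the kind of approximation the theorem promises.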


There are two main questions that you might have at this point:

  • What is a Borel measurable function?

  • What is a squashing function?


Borel Measurable Functions:

A Borel set is any set that can be constructed from open sets using three operations: countable unions, countable intersections, and complements. The collection of all such sets is called the Borel sigma-algebra.

A Borel measurable function is one that preserves this Borel structure when mapping from the input space to the output space: the preimage of every Borel set in the output space is a Borel set in the input space.
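
More formally, for a function between Euclidean spaces the standard criterion reads:

\[
f : \mathbb{R}^n \to \mathbb{R}^m \ \text{is Borel measurable} \iff f^{-1}(B) \in \mathcal{B}(\mathbb{R}^n) \ \text{for every } B \in \mathcal{B}(\mathbb{R}^m),
\]

where \(\mathcal{B}(\cdot)\) denotes the Borel sigma-algebra of the space.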


Squashing Functions:

Following Hornik, Stinchcombe, and White, a squashing function is a non-decreasing function ψ : ℝ → [0, 1] with ψ(λ) → 0 as λ → −∞ and ψ(λ) → 1 as λ → +∞. Familiar examples are the logistic sigmoid and the Heaviside step function.

Because squashing functions are non-decreasing, they have at most countably many discontinuities, and are therefore Borel measurable. This means they can be integrated and used in probabilistic contexts.
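
A quick numerical sanity check of the defining properties (monotonicity and the limits at ±∞), using the logistic sigmoid and the Heaviside step as examples; the grid here is an arbitrary choice:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def heaviside(z):
    # Discontinuous at 0, yet still a valid squashing function.
    return (z >= 0).astype(float)

zs = np.linspace(-10.0, 10.0, 1001)
for name, psi in [("logistic", logistic), ("heaviside", heaviside)]:
    vals = psi(zs)
    assert np.all(np.diff(vals) >= 0), "squashing functions are non-decreasing"
    print(f"{name}: psi(-10) = {vals[0]:.4f}, psi(+10) = {vals[-1]:.4f}")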


Classification of integers into even and odd:

Let's now take the example of classifying integers as even or odd. The labeling function f(n) = n mod 2 is defined on a countable set, and every subset of the integers is a Borel set (a countable union of singletons), so f is Borel measurable. The theorem above therefore guarantees that some single-hidden-layer network with squashing units can approximate it.

Exercise

Ex. Train a neural network to classify integers as even or odd.**

** If you get stuck, I will provide a notebook in the next post.
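
If you would like a head start before that notebook arrives, here is a minimal sketch of one possible setup. The binary encoding, the bit width, the network size, and the use of scikit-learn are my own illustrative choices, not a prescribed solution:

import numpy as np
from sklearn.neural_network import MLPClassifier

def to_bits(n, width=8):
    # Binary encoding of a non-negative integer, least significant bit first.
    return [(n >> i) & 1 for i in range(width)]

ints = np.arange(256)
X = np.array([to_bits(n) for n in ints])
y = ints % 2                                   # 0 = even, 1 = odd

# One hidden layer with a logistic (squashing) activation, as in the theorem.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    max_iter=3000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))

With this encoding, the parity of n is literally the first input bit, so even a small network separates the classes perfectly; the interesting part of the exercise is what happens if you feed the network the raw integer instead.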


Conclusion

 

In theory, the universal approximation theorem assures that neural networks can approximate any Borel measurable function to any desired degree of accuracy, provided they have a sufficient number of neurons and layers. This powerful result extends beyond merely continuous functions, encompassing a vast class of functions that arise in various real-world scenarios.

However, translating this theoretical guarantee into practical application is fraught with challenges. One significant issue is the complexity involved in determining the optimal architecture of the neural network. While the theorem provides a theoretical foundation, it does not offer guidance on the specific structure and parameters required to approximate a given function effectively. As a result, practitioners often rely on empirical methods, such as trial and error or heuristic algorithms, to design and tune their networks, which can be time-consuming and computationally expensive.


By Sadiah Z.

 

[References]:

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
