No Free Lunch theorem

There ain’t no such thing as a free lunch!

If you are familiar with the world of machine learning in general, you have probably heard of the well-known “No Free Lunch” theorem. My goal in this article is to present the theorem as simply as possible and to emphasize why understanding its implications is important when formulating an AIOPS strategy.

The “No Free Lunch” theorem for ML says that no single machine learning algorithm performs best on all problems. There is no one optimized algorithm for solving every problem, or even for solving the same problem across different situations.


There are, broadly, two No Free Lunch theorems: one for supervised machine learning (Wolpert 1996) and one for search and optimization (Wolpert and Macready 1997).

In this article, we are going to focus on the one David Wolpert formulated for supervised machine learning.

The theorem dates back to 1996, when, in a paper titled “The Lack of A Priori Distinctions Between Learning Algorithms,” David Wolpert studied whether useful theoretical results can be obtained from a training data set and a learning algorithm alone, with no assumptions about the target function. Wolpert demonstrated that for any two algorithms A and B, there are as many scenarios in which A performs worse than B as there are scenarios in which A performs better. This holds even if one of the two algorithms guesses at random. Wolpert’s mathematical proof shows that, averaged over all possible problem instances drawn from a uniform probability distribution, the performance of A and B is the same.
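To make this concrete, here is a minimal sketch in Python (the toy domain and both learners are my own illustration, not from Wolpert’s paper). It enumerates every possible target function over a tiny boolean domain and measures each learner’s average accuracy on the points outside the training set:

```python
import itertools

# Tiny domain: all 3-bit inputs; the first 4 points are the training set,
# the remaining 4 are the off-training-set test points.
X = list(itertools.product([0, 1], repeat=3))
train, test = X[:4], X[4:]

def learner_a(train_pairs, x):
    # "Majority" learner: always predict the most common training label.
    labels = [y for _, y in train_pairs]
    return int(sum(labels) * 2 >= len(labels))

def learner_b(train_pairs, x):
    # "Anti-majority" learner: predict the opposite of learner_a.
    return 1 - learner_a(train_pairs, x)

def avg_ots_accuracy(learner):
    # Average off-training-set accuracy over *every* possible target
    # function, i.e. every labelling of the 8 domain points (2^8 = 256).
    total = 0.0
    for labels in itertools.product([0, 1], repeat=len(X)):
        target = dict(zip(X, labels))
        train_pairs = [(x, target[x]) for x in train]
        correct = sum(learner(train_pairs, x) == target[x] for x in test)
        total += correct / len(test)
    return total / 2 ** len(X)

print(avg_ots_accuracy(learner_a))  # 0.5
print(avg_ots_accuracy(learner_b))  # 0.5
```

Both learners average exactly 0.5, and so would any other learner that only sees the training labels: averaged over every conceivable labelling of the unseen points, no strategy has an edge.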

Since machine learning algorithms embed different assumptions for different kinds of problems, the assumptions built into any one algorithm will not fit all data sets equally well. It follows that for every data set a given algorithm models well, there are data sets it will not be able to model effectively.
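As an illustration, the sketch below (assuming scikit-learn is available; the toy data sets and the two model choices are my own, picked for contrast) cross-validates a linear model and a nearest-neighbour model on two data sets whose structure rewards different assumptions:

```python
from sklearn.datasets import make_circles, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# One roughly linearly separable data set, one with concentric circular classes.
linear_data = make_classification(n_samples=300, n_features=2, n_informative=2,
                                  n_redundant=0, random_state=0)
circle_data = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for name, (X, y) in [("linear", linear_data), ("circles", circle_data)]:
    for model in (LogisticRegression(), KNeighborsClassifier()):
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name:8s} {type(model).__name__:22s} {score:.2f}")
```

On the circles, the linear model’s accuracy falls to roughly chance, because its core assumption, a straight decision boundary, simply does not hold there; the nearest-neighbour model pays for its own assumptions elsewhere, for instance on high-dimensional or sparse data.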

To put it in everyday terms: suppose you want to go out for a drink on a Saturday night and agree to meet your friends at 8:30. You get in the car, and the GPS shows a 40-minute drive. An alternative route shows only 30 minutes but adds an extra 5 miles, so you take it and arrive on time. Can we always choose the extra-5-miles route as the default? The answer is no. During off-peak hours, the original route would be just as fast, and the longer route would simply have you paying more for gas. A choice that wins in one situation loses in another.

So finding a machine learning algorithm that works “for free” is almost impossible. To formulate an appropriate machine learning problem, you must use knowledge about your data and analyse the context of the environment that you and your data live in. This also means that there is no such thing as a universally best machine learning algorithm, and no use-case-independent reason to favor one algorithm over the others.

In solving key AIOPS problems like Incident Prediction and Problem Identification, algorithm selection must be built into the design process, just as the search for optimal hyper-parameters is for a given algorithm. So if a vendor is selling a black-box algorithm or set of algorithms, you can be sure that, at least for a portion of potential users and use cases, it will not be the best algorithm for the job.
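A minimal sketch of what building that selection step in might look like (the helper name, candidate list, and scoring choices here are hypothetical, assuming scikit-learn): cross-validate a short list of candidate algorithms on your own data and keep the winner, exactly the way you would search for hyper-parameters:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def select_model(X, y, candidates=None, cv=5):
    """Cross-validate each candidate on *your* data and return the best scorer.

    There is no use-case-independent winner, so the candidate list should
    reflect what you know about the data at hand.
    """
    if candidates is None:
        candidates = [LogisticRegression(max_iter=1000),
                      SVC(),
                      KNeighborsClassifier(),
                      RandomForestClassifier(random_state=0)]
    scored = [(cross_val_score(m, X, y, cv=cv).mean(), m) for m in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]  # best estimator; refit on all data before deployment
```

Usage would be something like best = select_model(X_train, y_train). The specific candidates matter less than the fact that the comparison runs on your data as part of the design; a fixed black-box algorithm skips exactly this step.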

Casey Kindiger
