The Curse of Dimensionality

I recently had a discussion with a colleague. He was of the opinion that nowadays more and more people are disclosing data about themselves on the Internet and that it is therefore becoming easier and easier to make very targeted offers to a person who visits a website.

Then I explained the Curse of Dimensionality to him. Everyone who has dealt with data mining knows it. Interestingly, the phenomenon is extremely easy to explain, and yet it still amazes many people. That’s why I decided to dedicate a few lines to the topic in my blog.

The point is that there is a small problem with all the available data: not only does the amount of data increase (e.g. the number of people), but so does its dimension. And the number of dimensions increases much faster than, for example, the number of Internet users. This can easily be explained with an example.

Case 1

A provider has a product that is well received by all female visitors to his website. Therefore, he wants to offer this product to all women. For the sake of simplicity, we assume that as many men as women visit his website (and that he knows the gender of all visitors). It is obvious that he will offer the product to 50% of his visitors.

Only 1 dimension is known (i.e. relevant), so the product is offered to 50% of the visitors (represented by the green area in the figure above).

Case 2

Let us now assume that the same provider knows an additional dimension, namely the age of his visitors, and that an evaluation shows him that his product is particularly appealing to women aged 30 to 50. As another simplification, we assume here that the ages of his visitors are uniformly distributed from 15 to 75 years. He will no longer offer the product to all women, but only to the third of them between 30 and 50 years of age (20 of the 60 years in the range). That is now only 33% of the 50% women, so he will offer the product to only about 17% of the visitors.

With 2 known (relevant) dimensions, the share of relevant visitors in our example decreases to about 17% (represented by the green area in the figure above).
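The arithmetic of the two cases can be sketched in a few lines of Python. This is only an illustration under the simplifications made above (half of the visitors are women, ages are uniformly distributed from 15 to 75) plus one further assumption: that gender and age are independent, so the shares can simply be multiplied. The function name `matching_share` is made up for this example.

```python
from fractions import Fraction


def matching_share(selectivities):
    """Multiply the per-dimension shares (assumes independent dimensions)."""
    share = Fraction(1)
    for s in selectivities:
        share *= s
    return share


# Share of visitors kept by each dimension, taken from the example in the text.
gender = Fraction(1, 2)            # women among all visitors
age = Fraction(50 - 30, 75 - 15)   # ages 30 to 50 out of a uniform 15 to 75 range

case_1 = matching_share([gender])
case_2 = matching_share([gender, age])

print(f"Case 1: {float(case_1):.0%}")  # prints "Case 1: 50%"
print(f"Case 2: {float(case_2):.0%}")  # prints "Case 2: 17%"
```

Exact fractions are used so that the second case comes out as exactly 1/6 and is only rounded to 17% when printed.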

Today, data with far more than 2 dimensions is more or less publicly available about most of us, including for example:

  • Where we live
  • Where we work
  • Languages we speak
  • Our travel destinations
  • Topics we are interested in
  • Our habits
  • With whom we meet
  • etc.

So if I have a very specific offer, the proportion of people I’m interested in decreases much faster than the number of people I can reach with my offer increases. My offer may then describe the relevant target group very precisely, but the higher the relevant dimensionality of the data, the smaller the probability that I will find a suitable customer. This is called the Curse of Dimensionality.
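To get a feeling for how quickly this happens, here is a rough sketch in Python. The numbers are purely hypothetical: an audience of one million people, and every relevant dimension keeping one third of the people who are still left (as the age range did in Case 2). It also assumes that the dimensions are independent, which real data rarely is.

```python
def expected_matches(audience, keep_per_dimension, dimensions):
    """Expected number of people matching on every dimension.

    Assumes each dimension independently keeps the same share of people.
    """
    return audience * keep_per_dimension ** dimensions


AUDIENCE = 1_000_000   # hypothetical number of reachable people
KEEP = 1 / 3           # hypothetical share kept by each relevant dimension

for d in (1, 2, 5, 10, 15):
    matches = expected_matches(AUDIENCE, KEEP, d)
    print(f"{d:2d} dimensions: about {matches:,.1f} matching people")

# Doubling the audience does not even make up for one extra dimension,
# because each additional dimension divides the matches by three.
doubled = expected_matches(2 * AUDIENCE, KEEP, 6)
original = expected_matches(AUDIENCE, KEEP, 5)
print(doubled < original)  # prints "True"
```

Under these assumptions, fewer than one matching person is expected among the million once 15 dimensions are relevant.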

The good thing for all those who want to sell me something: usually not all dimensions are relevant. Sometimes very little data is enough. For example, if one of my wife’s colleagues is looking for sights in Stockholm, Google will show me cheap flight tickets to Sweden. And if one of my colleagues buys a Harley, I find out because, before his birthday, I am told that I could give him motorcycle gloves.

That is how simply the world works!