Cost Estimation for Insurance - Part 1: The Coin Perspective

Date: 29.08.2024


In this post, I present part 1 of an analysis of the problem of cost estimation for insurance. Specifically, I focus on formulating the problem correctly from a Bayesian perspective. To do so, I simplify the problem by considering a collection of coins.


What is the Problem?

We are tasked with estimating the cost of insurance for a given customer based on historical data.

Imagine an insurance company where a new customer requests a quote for car insurance. The customer and their car have certain features that also appear in the historical data. The goal is to estimate the cost of insurance for this customer.

Naive Approach

This problem is often treated as a standard regression problem: a straightforward approach is to fit a regression model that estimates the insurance cost.

The model is trained using historical data and relevant features. Once trained, the model predicts the insurance cost for the new customer.

While this approach typically works well, let’s delve a bit deeper.

The Assumptions

Standard regression models inherently rely on certain assumptions. The key assumption in these models is that the data is generated according to the model's structure. For example, in linear regression, we assume that the data is generated as follows:

  1. There exists a model with fixed parameters $\beta$.
  2. The features $X$ are generated from some underlying distribution.
  3. Given the features, the target $Y$ is generated from a normal distribution with a mean of $X\beta$ and some variance.

Based on these assumptions, we can apply maximum likelihood estimation to estimate the parameters $\beta$. Once estimated, the model can be used to predict the target for new data. Additionally, with a sufficiently large number of observations, we can prove that the method will converge to the true parameters.
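To make these assumptions concrete, the generative model and the resulting estimator can be written out as follows; $\sigma^2$ below stands in for the unspecified noise variance:

$$Y \mid X \sim \mathcal{N}(X\beta,\ \sigma^2 I), \qquad \hat{\beta} = \arg\max_{\beta}\ \log p(Y \mid X, \beta, \sigma^2) = (X^\top X)^{-1} X^\top Y.$$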

The Problem

The problem with this approach is that the model assumption is just false. The data isn't generated by the model but rather by real-life events. The observation that a customer incurred a particular cost is the result of numerous factors, and if we knew all of them, we could predict the cost of insurance perfectly.

Unfortunately, we don't have access to all those factors, and the problem remains intractable. As a result, we use regression models in the hope that they will generalize well to new data, and they often do perform reasonably well.

However, the fundamental flaw in this approximation remains, and it's important to be aware of it. In the following sections, we will attempt to reformulate the problem from a Bayesian perspective and possibly find a justification for using regression models in this context.

The Bayesian Perspective

The Bayesian perspective offers a different approach to the problem. Instead of assuming that there is a specific model generating the data, we will focus on our beliefs about the data. Let’s consider the insurance problem and the available data.

The data consists of a collection of claims that customers have made in the past. What does it mean when a customer files a claim? It indicates that the customer experienced an accident, resulting in damage to their property or that of another party.

Now, this is a starting point. A customer had an accident. What is the probability of that event? Initially, we know nothing. This is a person we have never met, and we have no insight into their driving skills. Perhaps they never drive, or maybe there are no other cars in the world. They could be the only person on the road.

Let’s delve deeper: What is the cost of the claim? We have no information about that either. Perhaps the customer is driving a Ferrari, or they could be riding a bicycle. They might even be operating a tank.

While this reasoning may seem a bit absurd, it highlights the reality that we often lack information about the customer. Consequently, we must make assumptions about both the world and the customer.

One reasonable assumption is that the new customer is "similar" to those we have encountered in the past. But what does it mean to be "similar"? This question introduces complexity. To clarify, let’s simplify the problem and explore a more straightforward scenario first.

Rephrasing as a Coin Problem

Let’s first imagine a collection of coins. The coins look identical, weigh the same, and feel the same. Additionally, we can only toss each coin once.

What Can We Do?

The first reasonable assumption is that the order of the coins doesn’t matter. Since we select the coins at random, there’s no reason to believe that the order is significant. The only observable characteristic is whether a toss results in heads or tails. Therefore, the best we can do is estimate the probability of heads for the entire collection. Whether the coins are actually identical is something we cannot determine from just this information.
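As a minimal sketch of what this estimate looks like in Bayesian terms, here is a conjugate Beta-Binomial update starting from a uniform prior; the function name and the example counts are illustrative, not part of the original setup.

```python
def heads_posterior(heads, tosses, a=1.0, b=1.0):
    """Update a Beta(a, b) prior on the collection-wide heads probability.

    With one toss per coin, the total counts of heads and tails are all
    the information we have, so the posterior is Beta(a + heads, b + tails).
    Returns the posterior mean and the posterior parameters.
    """
    a_post, b_post = a + heads, b + (tosses - heads)
    return a_post / (a_post + b_post), (a_post, b_post)

# Example: 7 heads out of 10 coins, each tossed exactly once.
print(heads_posterior(7, 10))  # posterior mean is 8 / 12, roughly 0.67
```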

Now, imagine we have a new collection of coins. This time, the coins are not identical; each one is unique in its own specific way. No two coins are the same. What can we do now? Not much more than before. Even though each coin might have a different probability of landing heads, we can still only estimate the overall probability of heads for the entire collection. Here, we continue to assume that each coin is independent of the others and that the order of the coins doesn’t matter.

Adding Informative Features to Coins

Now, let's consider a new collection of coins. This time, the coins have features, and some coins share the same attributes. What can we do with this new information? We should take advantage of it, but how?

To simplify, let’s imagine we have only one binary feature: the coins are either blue or red. What can we do now? We can treat each color as a separate collection of identical coins. It’s important to note that there is no reason to assume that the blue coins are identical to one another; they could all be different or drawn from a distribution. The same applies to the red coins.

Furthermore, if we assume that blue coins are completely different from red coins, we discard the possibility that the coins could be similar. The only visible difference is their color, and in real life the color of a coin should not affect how it lands.

At this point, we can see how our prior beliefs influence the model. Essentially, the color acts as additional information, leading us to implicitly conclude that the coins should be similar. If we were to swap the blue coins for six-sided dice, where one face shows "heads" and the other five show "tails," we would hesitate to assume that the probabilities are similar. Instead, we would treat the dice as a separate collection, which would be the more natural approach.

It’s important to recognize that our prior beliefs are inherently biased, but they matter. We cannot make our beliefs completely objective unless we either ignore the information or treat the two sets as entirely distinct.

What Should We Do?

Let’s explore the possible alternatives:

  1. The blue and red coins are identical, and the color is merely a label.
  2. The blue and red coins are different, and the probability of heads for one color is completely independent of that for the other.
  3. The blue and red coins are different, but the probabilities of heads for the two colors are related.

Among these options, the first is the most reasonable but also the only one that can be disproved with data. The second option is the most conservative, but it is quite useless in practice. The third option is a middle ground, and we will proceed with that.

We can formulate a prior belief that specifies:

  • There is a high probability that the blue and red coins are identical and have the same probability of heads.
  • There is a small probability that the blue and red coins are different and have distinct probabilities of heads.

As we gather more observations, we can update our beliefs regarding the probabilities of both coin types as needed. By adopting a Bayesian perspective and assigning a non-zero probability to every possibility, we allow the Bayesian computation to adjust our beliefs toward the most likely outcome. At the same time, since we start with a prior that favors a reasonable assumption, we require less data to make a good estimate, ultimately leading to greater profit.
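Here is a minimal sketch of this two-hypothesis prior, assuming uniform Beta(1, 1) priors on the unknown heads probabilities and a 0.9 prior weight on the "identical" hypothesis; the function, the prior weight, and the example counts are all illustrative choices, not the only way to encode this belief.

```python
from math import exp, lgamma

def log_beta(a, b):
    # Logarithm of the Beta function B(a, b).
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def coin_mixture_posterior(k_blue, n_blue, k_red, n_red, prior_same=0.9):
    """Compare 'blue and red are identical' against 'they are different'.

    Both hypotheses use uniform Beta(1, 1) priors on the unknown heads
    probabilities. The binomial coefficients cancel between the two
    hypotheses, so sequence-level marginal likelihoods are enough.
    """
    k, n = k_blue + k_red, n_blue + n_red

    # Marginal likelihood if both colors share a single probability.
    ml_same = exp(log_beta(k + 1, n - k + 1))

    # Marginal likelihood if each color has its own probability.
    ml_diff = exp(log_beta(k_blue + 1, n_blue - k_blue + 1)
                  + log_beta(k_red + 1, n_red - k_red + 1))

    # Posterior probability that the two colors are identical.
    post_same = prior_same * ml_same / (
        prior_same * ml_same + (1 - prior_same) * ml_diff)

    # Posterior predictive probability of heads for a new blue coin:
    # a mixture of the pooled estimate and the blue-only estimate,
    # weighted by the posterior probabilities of the two hypotheses.
    p_blue = (post_same * (k + 1) / (n + 2)
              + (1 - post_same) * (k_blue + 1) / (n_blue + 2))
    return post_same, p_blue

# Example: 12 heads in 20 blue tosses versus 4 heads in 20 red tosses.
print(coin_mixture_posterior(12, 20, 4, 20))
```

With counts this lopsided, the evidence roughly cancels the 0.9 prior weight, so the two hypotheses end up about equally likely and the prediction for a blue coin lands between the pooled and the blue-only estimates; with similar counts for both colors, the pooled estimate would dominate.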

What About Continuous Features?

We must again make assumptions. Previously, we assumed it is highly probable that color has no effect, but there is still a chance that the coins might differ. We need to adopt a similar approach here. We must make reasonable assumptions about the effects of the continuous features while allowing our prior information to account for all possibilities. How to achieve that, however, is a question for another post.

Thank you for reading! I hope you found this post insightful. If you have any questions or comments, feel free to reach out at lukasz@ls314.com.