An oft-repeated rule of thumb in any sort of statistical model fitting is "you can't fit a model with more parameters than datapoints". This idea appears to be as widespread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints, and this is much more than mere trivia with which you can impress the nerdiest of your friends: as I will show here, this fact can prove very useful in real-world scientific applications.
A model with more parameters than datapoints is known as an underdetermined system, and it's a common misconception that such a model cannot be solved under any circumstances. In this post I will consider this misconception, which I call the model complexity myth. I'll start by showing where the myth holds true, first from an intuitive point of view, and then from a more mathematically heavy point of view. I'll build from this mathematical treatment and discuss how underdetermined models may be addressed from a frequentist standpoint, and then from a Bayesian standpoint. (If you're unclear about the general differences between frequentist and Bayesian approaches, I might suggest reading my posts on the subject.) Finally, I'll discuss some practical examples of where such an underdetermined model can be useful, and demonstrate one of these examples: quantitatively accounting for measurement biases in scientific data.
While the model complexity myth is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive. In this section I want to motivate the reason for the underdetermination issue in simple linear models, first from an intuitive view, and then from a more mathematical one.
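To see the failure mode in its starkest form before we dive in, here is a minimal sketch (my own illustration, with arbitrary example numbers) of what happens when you naively ask for a three-parameter polynomial fit to two points: the matrix appearing in the least-squares normal equations is singular, and the direct solve fails outright:

import numpy as np

# two data points, but three unknown polynomial coefficients
x = np.array([0.0, 1.0])
y = np.array([5.0, 7.0])

# design matrix with columns [1, x, x^2]: shape (2, 3)
X = x[:, None] ** np.arange(3)

# X^T X is 3x3 but only rank 2, so the normal equations
# have no unique solution and the direct solve fails
try:
    theta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
except np.linalg.LinAlgError as err:
    print("Cannot solve:", err)  # prints: Cannot solve: Singular matrix

The rest of this section unpacks why this happens, and why it is specific to this naive approach.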
I'll start by defining some functions to create plots for the examples below; you can skip reading this code block for now:
# Code to create figures
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')

def plot_simple_line():
    rng = np.random.RandomState(42)
    x = 10 * rng.rand(20)
    y = 2 * x + 5 + rng.randn(20)
    p = np.polyfit(x, y, 1)
    xfit = np.linspace(0, 10)
    yfit = np.polyval(p, xfit)
    plt.plot(x, y, 'ok')
    plt.plot(xfit, yfit, color='gray')
    plt.text(9.8, 1, "y = {0:.2f}x + {1:.2f}".format(*p),
             ha='right', size=14);

def plot_underdetermined_fits(p, brange=(-0.5, 1.5), xlim=(-3, 3),
                              plot_conditioned=False):
    rng = np.random.RandomState(42)
    x, y = rng.rand(2, p).round(2)
    xfit = np.linspace(xlim[0], xlim[1])
    for r in rng.rand(20):
        # add a datapoint to make model specified
        b = brange[0] + r * (brange[1] - brange[0])
        xx = np.concatenate([x, [0]])
        yy = np.concatenate([y, [b]])
        # fit and plot the now fully-specified degree-p polynomial
        theta = np.polyfit(xx, yy, p)
        plt.plot(xfit, np.polyval(theta, xfit), color='#BBBBBB')
    plt.plot(x, y, 'ok')

    if plot_conditioned:
        # regularized (conditioned) solution of the normal equations,
        # using the original p points only
        X = x[:, None] ** np.arange(p + 1)
        theta = np.linalg.solve(np.dot(X.T, X) + 1E-3 * np.eye(p + 1),
                                np.dot(X.T, y))
        plt.plot(xfit, np.polyval(theta[::-1], xfit), color='black', lw=2)
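With these helpers in place, the figures below can be generated directly. For example (illustrative calls of my own, matching the signatures defined above):

plot_simple_line()            # 20 points, 2 parameters: well-determined
plt.figure()
plot_underdetermined_fits(2)  # 2 points, 3 parameters: underdetermined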