This article is by Sydney Firmin and originally appeared on the Alteryx Data Science Blog here: https://community.alteryx.com/t5/Data-Science-Blog/All-Models-Are-Wrong/ba-p/348080
All Models Are Wrong
The three broad categories of assumptions made by statistical models are distributional assumptions (assumptions about the distribution of values in a variable or the distribution of observational errors), structural assumptions (assumptions about the functional relationship between variables), and cross-variation assumptions (assumptions about the joint probability distribution of variables).
For example, a linear regression model assumes that the relationships between variables in a data set are linear (and only linear). In the eyes of a linear model, any distance between the observations that make up the data set and the modeled line is just noise (i.e., random or unexplained fluctuations in the data) and can ultimately be ignored.
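To make the structural assumption concrete, here is a minimal sketch in Python with NumPy. The data are hypothetical and generated for illustration only: the true relationship is quadratic, but the model fitted to it is a straight line. The variable names, the noise level, and the use of a correlation as a check are all assumptions of this example, not part of the original article.

```python
import numpy as np

# Hypothetical data: the true relationship is quadratic, not linear.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)

# Fit a straight line by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# If the residuals were pure noise, they would be unrelated to x**2.
# A strong correlation means the line is treating real structure as noise.
curvature_corr = np.corrcoef(residuals, x**2)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"correlation between residuals and x^2: {curvature_corr:.2f}")
```

The fitted line is nearly flat, and what it leaves behind as "noise" follows the curve it could not represent. Checking residuals for leftover structure like this is one simple way to see whether a linearity assumption is holding up.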
George Box stated that all models are wrong specifically in the context of statistical models. Because a model is by nature a simplified and idealized representation of something, all models will be wrong in some sense. Models will never be “the truth” if truth means entirely representative of reality. It is very important to consider the assumptions made in generating a model because models are only truly helpful when those assumptions hold.
Maps and Miniatures
There is an aphorism that references the map-territory relation, attributed to Alfred Korzybski:
A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness.
Maps are useful because they are abstractions of a real object at a more manageable scale, but they will always exclude some level of detail. Depending on how much area a map includes, there may also be some distortion due to the projection of the map (caused by the tricky process of converting a spherical globe to a flat representation).
The only truly accurate map would be a 1:1 replication of the territory it represents; however, a map like that would be no more helpful than navigating the territory itself.
Consider the quote from poet Paul Valery:
Everything simple is false. Everything which is complex is unusable.
If All Models Are Wrong, Why Bother?
George Box’s aphorism is not without its critics.
The problems many statisticians have with this quote seem to fall broadly into two categories:
- Models being wrong is an obvious statement. Of course all models are wrong; they’re models.
- This quote is used as an excuse for bad models.
Statistician J. Michael Steele has been critical of the adage (see this personal essay). Steele’s primary argument is that “wrong” only comes into play if the model does not correctly answer the question that it claims to answer (e.g., that a building on a map is mislabeled, not that the building is represented by a little square). Steele goes on to state:
The majority of published statistical methods hunger for one honest example.
Steele argues that statistical models are often not up to an adequate fitness measure, and many models developed by statisticians are not sufficient for their intended use cases.
In the article Statistics as a Science, Not an Art: The Way to Survive in Data Science, Mark van der Laan (Statistics at UC Berkeley) cites the Box quote as a contributing cause of bad statistical models and dismisses it as “complete nonsense.” He goes on to write:
The foundation of statistics (…) could not have been to arbitrarily select a “convenient” statistical model. However, that is precisely what most statisticians blithely do, proudly referring to the quote, “All models are wrong, but some are useful.” Due to this, models that are so unrealistic that they are indexed by a finite dimensional parameter are still the status quo, even though everybody agrees they are known to be false.
As a solution, Van der Laan calls on statisticians to stop using Box’s quote and to commit to taking data, statistics, and the scientific method seriously. He urges them to spend time learning how the data in a given data set were generated, and to develop realistic statistical models using machine learning and data-adaptive estimation techniques rather than more traditional parametric models.
But Some Models Are Useful
Despite the limitations of models, many models can be very useful. Because they are simplified, models are often helpful in understanding a certain component or facet of a system.
In the context of data science, machine learning and statistical models can be useful to estimate (predict) unknown values. In many contexts, if the model’s assumptions hold up, an uncertain estimate provided by a strong statistical model can still be very helpful for making decisions.
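As a rough sketch of what an uncertain but useful estimate can look like, the Python example below fits a line to hypothetical data and reports a prediction with an approximate 95% range. The data, the variable names, and the choice of x_new are assumptions made for illustration. The range is deliberately simplified: it uses only the residual spread, ignores the uncertainty in the fitted coefficients, and is only meaningful if the errors are roughly normal with constant variance.

```python
import numpy as np

# Hypothetical data in which the linear assumption actually holds.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Fit a straight line and estimate the spread of the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
sigma = np.sqrt(np.sum(residuals**2) / (x.size - 2))  # two fitted parameters

# Point estimate for a new value, with an approximate 95% range.
x_new = 5.0
estimate = slope * x_new + intercept
lower, upper = estimate - 1.96 * sigma, estimate + 1.96 * sigma

print(f"estimate at x={x_new}: {estimate:.2f}")
print(f"approximate 95% range: {lower:.2f} to {upper:.2f}")
```

The estimate is not exact, but a range like this can be enough to support a decision, provided the assumptions behind it have been checked.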