# Small Samples and the Overuse of Hypothesis Tests

With powerful computers and statistical packages, modelers can now run an enormous number of tests effortlessly. But should they? This article discusses how bank risk modelers should approach statistical testing when faced with tiny data sets.

In the stress testing endeavor, most notably in PPNR modeling, bank risk modelers often try to do a lot with a very small quantity of data. It is not uncommon for stress testing teams to forecast portfolio origination volume, for instance, with as few as 40 quarterly observations. Data resources this thin must have a profound impact on the choice of modeling approach.

The econometrics discipline, whose history extends back only to the 1930s, was developed in concert with embryonic efforts at economic data collection. Protocols for dealing with very small data sets, established by the pioneers of econometrics, can easily be accessed by modern modelers. In the era of big data, in which models using billions of observations are fairly common, one wonders whether some of these econometric founding principles have been forgotten.

## The overuse and misuse of statistical tests

The issue at hand is the overuse and misuse of statistical tests in constructing stress testing models. While it is tempting to believe that it is always better to run more and more tests, statistical theory and practice consistently warn of the dangers of such an attitude. In general, given a paucity of resources, the key for modelers is to remain “humble” and retain realistic expectations of the number and quality of insights that can be gleaned from the data. This process also involves relying on strong, sound, and well-thought-out prior expectations, as well as intuition, while using the data sparingly and efficiently to help guide the analysis. It also involves taking action behind the scenes to source more data.

An article by Helen Walker, published in 1940, defines degrees of freedom as “the number of observations minus the number of necessary relations among these observations.” Alternatively, we can say that the concept measures the number of observations minus the number of pieces of information on which our understanding of the data has been conditioned. A sample standard deviation, for example, has (n-1) degrees of freedom because the calculation is conditioned on an estimate of the population mean. If the calculation relies on the estimation of k separate entities, I will have (n-k) degrees of freedom available in constructing my model.
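To make the counting concrete, here is a minimal Python sketch. It is not part of the original discussion, and the function names are illustrative. It computes a sample standard deviation by dividing by (n-1), since one quantity, the mean, has been estimated from the same data, and it tallies the degrees of freedom left after k quantities have been estimated.

```python
import math

def sample_std(values):
    """Sample standard deviation using (n - 1) degrees of freedom."""
    n = len(values)
    mean = sum(values) / n  # one estimated quantity: the mean
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - 1))

def remaining_degrees_of_freedom(n_obs, n_estimated):
    """Observations minus the number of quantities estimated from them."""
    return n_obs - n_estimated
```

With n=40 and, say, five estimated coefficients (a hypothetical count), `remaining_degrees_of_freedom(40, 5)` leaves 35 degrees of freedom.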

Now suppose that I run a string of 1,000 tests and I am interested in the properties of the 1,001st test. Because, technically, the 1,001st test is conditional on these 1,000 previously implemented tests, I have only (n-1,000) degrees of freedom available for the next test. If, in building my stress test model, n=40, I have a distinct logical problem in implementing the test. Technically, I cannot conduct it.

Most applied econometricians, however, take a slightly less puritanical view of their craft. It is common for statisticians to run a few key tests without worrying too much about the consequences of constructing a sequence of tests. That said, good econometricians tip their hat to the theory and show restraint rather than conducting an egregious number of tests.
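A related, standard way to see why restraint matters is the family-wise error rate. This is a different argument from the degrees-of-freedom reasoning above, and the sketch below rests on two simplifying assumptions: every null hypothesis is true, and the tests are independent, each run at a 5% significance level. The function name is illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and each test has size alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    print(n_tests, round(familywise_error_rate(n_tests), 3))
```

Under these assumptions, the chance of at least one spurious rejection is about 64% after 20 tests, even though each individual test is well behaved.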

## The power and size of tests are also critical concerns

Even very well-built statistical tests yield errors, a point to keep in mind when setting out to conduct diagnostic tests. Some of these error rates can usually be well controlled (typically the probability of a false positive result, known as the “size” of the test), so long as the assumptions on which the test is built are maintained. Other error rates (the rate of false negatives) are typically not controlled but depend critically on the amount of data brought to bear on the question at hand. The probability of a correct positive test (one minus the rate of false negatives) is known as the “power” of the test. Statisticians try to control the size while maximizing the power. Power is, unsurprisingly, typically low in very small samples.
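The dependence of power on sample size can be sketched with a textbook case. The snippet below assumes a two-sided z-test of a zero mean with known unit variance, a simplification chosen for illustration rather than any test used in stress testing. The function name and the effect size of 0.2 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, n_obs, alpha=0.05):
    """Rejection probability of a two-sided z-test of mean = 0.

    Assumes independent normal observations with known unit variance
    and a true mean equal to `effect`.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * sqrt(n_obs)  # mean of the z-statistic when the effect is real
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

for n_obs in (40, 400):
    # First figure: size (no true effect). Second: power (true effect of 0.2).
    print(n_obs, round(z_test_power(0.0, n_obs), 3), round(z_test_power(0.2, n_obs), 3))
```

With a zero effect the function returns the size, which is 5% by construction at any sample size. With a nonzero effect the rejection probability rises with n, so it is far lower at 40 observations than at 400.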

If I choose to run a statistical test, am I required to act on what the test finds? Does this remain true if I know that the test has poor size and power properties?

Suppose I estimate a model with 40 observations and then run a diagnostic test for, say, normality. The test was developed using asymptotic principles (basically an infinitely large data set). Because I have such a small series, the test’s size is unlikely to be well approximated by its stated nominal significance level (which is usually set to 5%). Suppose the test indicates non-normality. Was this result caused by the size distortion (the probability of erroneously finding non-normality), or does the test truly indicate that the residuals of the model follow some other (unspecified) distribution?

If I had a large amount of data, I would be able to answer this question accurately and the result of the test would be reliable and useful. With 40 observations, the most prudent response would be to doubt the result of the test, regardless of what it actually indicates.
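One way to gauge the size distortion described above is a small Monte Carlo experiment. The sketch below uses the Jarque-Bera statistic as a stand-in normality test, which is my choice for illustration since no particular test is named here. It draws truly normal samples, applies the asymptotic chi-square(2) critical value at the nominal 5% level, and records how often normality is rejected. The function names, replication count, and seed are all assumptions.

```python
import math
import random

# Upper 5% point of a chi-square(2) distribution, about 5.99.
# For 2 degrees of freedom the tail probability is exp(-x / 2).
CHI2_2DF_CRIT_5PCT = -2.0 * math.log(0.05)

def jarque_bera_stat(x):
    """Jarque-Bera statistic from sample skewness and kurtosis."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

def empirical_size(n_obs, n_reps=2000, seed=0):
    """Share of truly normal samples rejected at the nominal 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if jarque_bera_stat(sample) > CHI2_2DF_CRIT_5PCT:
            rejections += 1
    return rejections / n_reps
```

Comparing `empirical_size(40)` with `empirical_size(400)` gives a rough sense of how far the actual rejection rate can sit from the nominal 5% in a sample of this length. The result applies only to this statistic and this simulated setting, not to diagnostic tests in general.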

## Finding non-normality

Suppose instead that you are confident that the test has sound properties. You have found non-normality: Now what? In the modeling literature, there are usually no suggestions about which actions you should take to resolve the situation. Most estimators retain sound asymptotic properties under non-normality. In small samples, a finding of non-normality typically acts only as a beacon – warning modelers to guard against problems in calculating other statistics. Even if the test is sound, it is difficult to ascertain exactly how our research is furthered by knowledge of the result. In this case, given the tiny sample, it is unlikely that the test actually is sound.

If a diagnostic test has dubious small sample properties, and if the outcome will have no influence over our subsequent decision-making, in our view, the test simply should not be applied. Only construct a test if the result will actually affect the subsequent analysis.

## Dealing with strong prior views

The next question concerns the use and interpretation of tests when strong prior views exist regarding the likely underlying reality. Such a view may relate to a particular statistical feature of the data – like issues of stationarity – or to the inclusion of a given set of economic variables in the specification of the regression equation. In these cases, even though we have little data, and even though our tests may have poor size and power properties, we really have no choice but to run some tests in order to convince the model user that our specification is a reasonable one.

Ideally, the tests performed will merely confirm the veracity of our prior views based on our previously established intuitive understanding of the problem.

If the result is confounding, however, given that we have only 40 observations, the tests are unlikely to shake our previously stated prior views. If, for example, our behavioral model states that term deposit volume really must be driven by the observed term spread, and if this variable yields a p-value of 9%, should we drop the variable from our regression? The evidence on which this result is based is very weak. In cases where the prior view is well thought out and appropriate, like this example, we would typically not need to shift ground until considerably more confounding evidence were to surface.

If, instead, the prior suggested a “toss-up” between a range of hypotheses, the test result would be our guiding light. We would not bet the house on the outcome, but the test result would be better than nothing. Toss-ups, however, are very rare in situations where the behavioral model structure has been carefully thought out before any data has been interrogated.

## Running tests with limited data

With the advent of fast computers and powerful statistical packages, modelers now have the ability to run a huge number of tests effortlessly. Early econometricians, like the aforementioned Ms. Walker, would look on in envy at the ease with which quite elaborate testing schemes can now be performed.

Just because tests can be implemented does not mean that they necessarily should be. Modern modelers, faced with tiny data sets, should follow the lead of the ancients (many of whom are still alive) and limit themselves to running only a few carefully chosen tests on very deliberately specified models.

Regulators, likewise, should not expect model development teams to blindly run every diagnostic test that has ever been conceived.

###### Featured Experts

#### Jamie Stark

Risk model validation consultant; product strategist; data intelligence and artificial intelligence scientist and researcher

#### Douglas Dwyer

Credit risk modeling expert; machine learning researcher; quantitative and statistical data analyst

###### As Published In:

### Looks at the best practices of today that will form the successful risk management practices of the future.
