Recently I have found that many statisticians are speaking of moving away from significance.
I understand that many studies base their conclusions on p-values, which I agree can be misleading at times. Things like sample size play a big role in the outcome if we use significance.
I have been taught to use significance when fitting ARIMA models. For example, when a parameter has a p-value > 0.05 it is said to be insignificant and should be excluded from the model. This is the way that I have been taught to fit models. This applies in regression as well.
Since we want to move away from significance, are we using metrics like MSE instead? I have also found that more people are starting to use confidence intervals instead of performing hypothesis tests.
This answer has been of great help, but I still don’t understand how we will go about selecting models?
Would we rather perform cross-validation and use bootstrapping for model selection than look at significance? (I think this might be part of what answers my question.)
Warning: pessimistic/cynical post
We do not want to move away from significance.
That is a false premise that led to your question.
Recently I have found that many statisticians are speaking of moving away from significance.
Since we want to move away from significance…
We do not want to move away from significance. Significance is important. It is an indicator that a data set is large enough for an observed effect to be unlikely to be due to random noise. We still want experimenters to aim for experiments that will be significant. Insignificant experiments, those which likely reflect noise, are not very useful; the interpretation of the outcome is uncertain (is it a ‘true’ effect or is it noise?). Significance means that the experiment is able to give outcomes with relatively more certain interpretations (the outcome is likely not noise but instead some true falsification of the null hypothesis).
A problem with significance is in the wrong focus of research.
What we want to move away from is the trend in science to perform and report on experiments only for the sake of being significant. The problem with significance is that it can be fake. The expression of significance is only as good as the model that has been used to compute it.
That means that, even though significance means that a result is unlikely to occur under the hypothesis predicting no effect, a researcher can still easily find a significant result when there is no true effect.
As a result we now have a big noodle-soup of reports on research data with only tiny effects. If something is a big effect then it is likely to have already been proven. But we now have an enormous army of eager (and pressured) scientists trying to find something new, so they will focus on something (anything) small and, by doing a significant experiment, make it big.
A problem with significance is the methodology of expressing the error between experiments based only on the error within an experiment.
The current experimental scientific ‘world’ is being driven by these incentives to publish significant results (it doesn’t matter what) rather than meaningful ones. The problem with that is that, due to technological developments, we have been able to increase the scale of experimental work and do massive testing, allowing small effects to be made significantly visible. This places a focus on finding small differences in parameters of the population distributions (it is a resourceful niche for many researchers), while the individual people within those populations have much more variation and differences.
We have a focus on the average, rather than the specific/individual, because differences between averages, no matter how small, can easily be made significant (in practice not always easy, but the principle is simple, it is just increasing the quantity of testing).
For example: Say we sample the height of 10 thousand men in Paris and 10 thousand in Berlin. If we estimate the distribution mean and standard deviation as $(\mu = 173.31 \,\text{cm}, \sigma = 5.29 \,\text{cm})$ in Paris and $(\mu = 173.09 \,\text{cm}, \sigma = 5.74 \,\text{cm})$ in Berlin, then a t-test might lead us to conclude that we found a significant effect and that men are on average taller in Paris than in Berlin.
But look at the histograms of the (made up) samples below. The distributions are much the same; because of the large spread/variance we may consider the small difference in the means not so important (and we should also be careful in our expression of the standard error, because the methodology may have a relatively large influence on small effects). It is only the large sample that makes our estimate of the standard error very tiny, and as a consequence we get to conclude that there is a significant difference. However, for such tiny differences we cannot really know whether the difference is to be ascribed to a true effect that causes people in Berlin to be different from people in Paris, or whether it is actually due to some systematic effect in our experiment (for instance the sampling might be biased, with a different bias in Paris than in Berlin).
The difference of $173.31-173.09 = 0.22$ might, for some given experiment, be statistically significant if you just sample sufficiently (increase your ‘magnification’ or ‘research power’). But the difference between the populations is incredibly small, and as a result the simplifying assumptions about the experiment are not negligible anymore. True, when you just wish to compare means (about which you may wonder whether it is really the most useful thing to do, but hey, it is the thing that we can make significant), the sample means will approach a normal distribution, so assumptions about the underlying population distribution do not matter. However, when you get into these tiny effects, sampling and other systematic effects may become an issue.
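To put rough numbers on the example above, here is a minimal Python sketch that works from the quoted summary statistics only. It uses a normal approximation to Welch's t-test (reasonable at these sample sizes, but an assumption), and the function name `welch_z_from_summary` and the use of Cohen's d as an effect-size measure are illustrative choices, not part of the original example.

```python
import math


def welch_z_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Compare two means from summary statistics.

    Returns (standard error, z statistic, two-sided p-value, Cohen's d).
    The p-value uses a normal approximation to Welch's t-test.
    """
    # Standard error of the difference in means (unequal variances).
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2))
    # Standardized effect size: difference in units of the average spread.
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    d = (mean_a - mean_b) / pooled_sd
    return se, z, p, d


# Same means and standard deviations, increasing sample size per city.
for n in (100, 1_000, 10_000):
    se, z, p, d = welch_z_from_summary(173.31, 5.29, n, 173.09, 5.74, n)
    print(f"n={n:>6}  se={se:.3f}  z={z:.2f}  p={p:.4f}  d={d:.3f}")
```

With 10 thousand people per city the p-value drops below 0.05, while the standardized difference stays at about 0.04 standard deviations whatever the sample size.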
All models are wrong, but some are useful. Expressions about significance are estimates and typically wrong, but often not so bad and therefore still useful. They are not so bad because the assumption that sampling error dominates systematic error often holds, and the latter can be neglected. However, recently, these (previously) useful models for estimating errors are becoming less correct and less useful. This is because more and more research is able to zoom in on small effects occurring in populations with large variations. The small effects are being magnified by cranking up the sample size. But when we look at small effects and small sampling noise (due to large samples), the systematic error cannot be neglected anymore.
how we will go about selecting models?
If you measure tiny effects, and make them significant purely by increasing the sample size, then you are no longer certain that the determined effect is due to a discrepancy in the null model; it can also be due to the sampling procedure. (When a significance test rejects, we tend to say that the null hypothesis is falsified, but we should say that the null hypothesis plus the experiment is falsified. However, we do not normally say that, because for large enough effects we tend to ignore the systematic effects.)
So significance is often determined based only on the variance/residuals in sample data (by estimating the spread in the measurements within our single experiment). But it is false to assume that this is a good estimate for the error of the outcome (especially when determining small effects). We should also estimate/guess/assume the variation of our instruments/methods from experiment to experiment. That is actually how I learned it in my high school physics classes. There was no mention of formulas to compute standard deviations and obtain experiment-based estimates of the error; instead we had to make sane logical guesses about the error (e.g. when measuring some volume of water using volumetric glassware we used a rule of thumb, such as the error being 1/10 of the smallest division of the scale).
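One way to make this concrete is the sketch below, which adds a guessed between-experiment (systematic) uncertainty to the within-experiment standard error. Combining the two in quadrature assumes they are independent, and the 0.2 cm systematic uncertainty on the Paris/Berlin difference is a purely hypothetical number chosen for illustration; `total_standard_error` is likewise an illustrative name.

```python
import math


def total_standard_error(sampling_se, systematic_sd):
    """Combine two independent error sources in quadrature."""
    return math.sqrt(sampling_se ** 2 + systematic_sd ** 2)


diff = 173.31 - 173.09   # observed difference in means (cm)
systematic_sd = 0.2      # HYPOTHETICAL guess for between-experiment error (cm)

for n in (10_000, 1_000_000, 100_000_000):
    # Within-experiment standard error shrinks with the sample size n ...
    sampling_se = math.sqrt(5.29 ** 2 / n + 5.74 ** 2 / n)
    # ... but the total error cannot drop below the systematic part.
    total = total_standard_error(sampling_se, systematic_sd)
    print(f"n={n:>11}  sampling_se={sampling_se:.4f}  "
          f"total={total:.4f}  diff/total={diff / total:.2f}")
```

Under that assumed 0.2 cm, the ratio of the difference to the total error stays below the usual 1.96 threshold however large the sample becomes, because only the sampling part of the error shrinks.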
Significance is not really a tool in model selection. Significance is a tool in hypothesis testing and in verifying the (statistical) validity of conclusions that may stem from such a test (a conclusion should, with reasonable probability, not be due to random noise).
With significance testing you often have a preference for the null hypothesis/model. The goal of the experiment is not model selection, but instead model rejection. Significance testing is done to test whether the null hypothesis is correct (and often the test is designed with an alternative hypothesis in mind, such that the test has a high probability/power to reject the null hypothesis if the specific alternative is true).
In these kinds of trials you do get the situation that there might be multiple models against which the null hypothesis can be tested, and the idea might be to see which of these models makes most sense. This resembles model selection a lot, and the two concepts can be applied in a mixed way, but from my point of view they should not be conflated. E.g. one may test multiple factors and see whether any of them has a significant effect. You could see this as model selection, seeing which factor is the best model… However, it is in principle more like performing multiple null hypothesis tests (each hypothesis being that a specific factor has no effect).
Model selection is an optimization which can be worked out without significance (if you have an appropriate loss function). If you are doing some optimization, e.g. for prediction, then bootstrapping might indeed be a good way to test not only the variance of the estimates, but also the bias.
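As a minimal sketch of selection by a loss function rather than by significance, the Python example below compares a model without a slope parameter to one with it, using k-fold cross-validated mean squared error. It uses cross-validation (the option raised in the question) rather than the bootstrap; the simulated data, the two candidate models, and the names `fit_constant`, `fit_line` and `cv_mse` are all assumptions made for illustration.

```python
import random


def fit_constant(xs, ys):
    """Model without a slope: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Model with a slope: ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_mse(fit, xs, ys, k=5):
    """k-fold cross-validated mean squared prediction error."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Every k-th point is held out; the rest is used for fitting.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return total / n


# Simulated data (an assumption for illustration): a real slope plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

scores = {
    "constant": cv_mse(fit_constant, xs, ys),
    "line": cv_mse(fit_line, xs, ys),
}
best = min(scores, key=scores.get)
print(scores, "-> selected:", best)
```

The parameter is kept or dropped according to which model predicts held-out data better, so no p-value threshold enters the decision. The choice of squared error as the loss is itself a modelling decision and should match what the model is used for.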