What are the consequences of “copying” a data set for OLS?

Suppose I have a random sample $\{X_i, Y_i\}_{i=1}^n$. Assume this sample satisfies the Gauss-Markov assumptions, so that I can construct an OLS estimator.


Now suppose I take my data set and double it, meaning there is an exact copy of each of the $n$ pairs $(X_i, Y_i)$.

My Question

How does this affect my ability to use OLS? Is it still consistent and identified?


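Do you have a good reason for the doubling (or duplication)? It doesn't make much statistical sense, but it is still interesting to see what happens algebraically. In matrix form your linear model is

$$Y = X\beta + \varepsilon,$$

the least squares estimator is $\hat\beta_{\text{ols}} = (X^TX)^{-1}X^TY$, and the variance matrix is $V_{\hat\beta_{\text{ols}}} = \sigma^2 (X^TX)^{-1}$. "Doubling the data" means that $Y$ is replaced by $\begin{pmatrix} Y \\ Y \end{pmatrix}$ and $X$ is replaced by $\begin{pmatrix} X \\ X \end{pmatrix}$. The ordinary least squares estimator then becomes

$$\left( \begin{pmatrix} X \\ X \end{pmatrix}^T \begin{pmatrix} X \\ X \end{pmatrix} \right)^{-1} \begin{pmatrix} X \\ X \end{pmatrix}^T \begin{pmatrix} Y \\ Y \end{pmatrix} = (2X^TX)^{-1}\, 2X^TY = (X^TX)^{-1}X^TY = \hat\beta_{\text{ols}},$$

so the calculated estimator doesn't change at all. But the calculated variance matrix becomes wrong: using the same kind of algebra as above, we get the variance matrix $\frac{\sigma^2}{2}(X^TX)^{-1}$, half of the correct value. A consequence is that confidence intervals will shrink by a factor of $1/\sqrt{2}$.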
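The algebra can be checked numerically. The following is only a minimal sketch on simulated data: the sample size, coefficients, seed, and the helper name `ols` are illustrative assumptions, not part of the original question. It compares the fit on the original sample with the fit on the stacked (duplicated) sample.

```python
import numpy as np

# Simulated data (assumed for illustration): intercept plus one regressor.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients, the naive covariance s^2 (X'X)^{-1}, and (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (X.shape[0] - X.shape[1])  # residual variance estimate
    return beta, s2 * XtX_inv, XtX_inv

beta, cov, XtX_inv = ols(X, y)

# "Doubling the data": stack an exact copy under the original sample.
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])
beta2, cov2, XtX_inv2 = ols(X2, y2)

same_beta = np.allclose(beta, beta2)           # point estimates are identical
halved = np.allclose(XtX_inv2, XtX_inv / 2)    # (X'X)^{-1} is exactly halved
ratio = cov2[1, 1] / cov[1, 1]                 # ratio of naive slope variances
expected_ratio = (n - p) / (2 * n - p)         # what the naive formula implies

print(same_beta, halved, ratio, expected_ratio)
```

With a known $\sigma^2$ the naive variance is exactly halved. When $\sigma^2$ is replaced by the usual residual estimate, the residual sum of squares doubles while the degrees of freedom go from $n-p$ to $2n-p$, so the ratio of naive variances is $(n-p)/(2n-p)$: close to, but slightly below, one half.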

The reason is that we have calculated as if we still had iid data, which is untrue: each pair of duplicated values obviously has a correlation equal to 1.0. If we take this into account and use weighted least squares correctly, we will find the correct variance matrix.

From this, more consequences of the doubling are easy to find as an exercise; for instance, the value of R-squared will not change.
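The R-squared claim can also be checked with a few lines of code. This is a sketch under assumed simulated data; the helper `r_squared` is a hypothetical name introduced here. Duplicating the sample doubles both the residual and the total sum of squares and leaves the mean of $Y$ unchanged, so their ratio is the same.

```python
import numpy as np

def r_squared(X, y):
    """Plain (unadjusted) R^2 from a least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Simulated data (assumed for illustration).
rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, -1.5]) + rng.normal(size=n)

r2 = r_squared(X, y)
r2_dup = r_squared(np.vstack([X, X]), np.concatenate([y, y]))

print(np.isclose(r2, r2_dup))
```

Note that this applies to the unadjusted R-squared; the adjusted version depends on the sample size and would therefore change.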

Source : Link , Question Author : Stan Shunpike , Answer Author : kjetil b halvorsen
