# Confused about Autoregressive AR(1) process

I create an autoregressive process “from scratch” and I set the stochastic part (noise) equal to 0. In R:

Y <- vector()

c = 0.1
phi = 0.9

Y[1] = 5

for (i in 2:100) {
Y[i] = c + phi*Y[i-1]
}


Then I ask R to fit an AR(1) process using the arima() function :

ar <- arima(Y, order = c(1,0,0))


It estimates the ar1 coefficient to be ar1 = 0.9989 with standard error 0.0015.

Why is R not finding ar1 = 0.9 (= phi) with overwhelming small standard error?

You have two problems–and one of them is interesting.

Without a noise term, the series is no longer stationary. Its value is increasing asymptotically, but definitely, toward $$1:1:$$

ARIMA applies only to stationary models–and these data are obviously not from a stationary model. That’s not terribly interesting. What is interesting is that the problem persists even with noise!

What, then, happens when we add just a tiny bit of noise?

It’s still obviously not stationary–but the reason is that the initial values are inconsistent with everything that follows.

You need to remove a “burn-in period” during which the simulated values are starting to behave like the rest of the series will. Here’s what this one looks like when we strip out the first $$n0=30n_0=30$$ values:

What does arima return?

Coefficients:
ar1  intercept
0.9074     0.9872
s.e.  0.0309     0.0088


$$0.9074±0.03090.9074 \pm 0.0309$$ is a great estimate of $$ϕ=0.9.\phi=0.9.$$

I repeated this process 99 more times, producing 100 estimates of $$ϕ\phi$$ along with their standard errors. Here is a plot of those estimates and crude $$90%90\%$$ confidence limits (set at $$1.6451.645$$ standard errors above and below the estimates):

The horizontal gray line is located at $$ϕ=0.9\phi=0.9$$ for reference. The red confidence intervals are those that do not overlap the reference: there are $$1212$$ of them, indicating the confidence level is around $$88%,88\%,$$ agreeing (within sampling error) with the intended value of $$90%.90\%.$$ The horizontal black line is the average estimate. It’s a little lower than $$ϕ,\phi,$$ perhaps because after even $$200200$$ time steps the series still isn’t quite stationary. (One also doesn’t expect the distribution of the estimate to be symmetric: $$11$$ is an important boundary and will cause the distribution to be skewed toward the smaller values.)

Here is the same study but with $$20002000$$ time steps in each iteration:

The bias in the estimate has nearly disappeared.

Another solution is to start generating the series at its asymptotic mean value (equal to cnst/(1-phi) in the R code below). But that requires knowing the asymptote, which might be harder to come by in more complex models, so it’s good to know about the technique of discarding the initial segment of a simulated series.

BTW, here’s a reasonably efficient and compact way to generate these datasets:

phi <- 0.9    # AR(1) coefficient
n <- 200      # Total number of time steps after the initial value
cnst <- 0.1   # Intercept
sigma <- 0.01 # Error variance

Y <- Reduce(function(y, e) y * phi + e, rnorm(n, cnst, sigma), 0, accumulate=TRUE)
n0 <- which.max(abs(Y) >= quantile(abs(Y), 0.5)) # Estimate where Y levels off
Y <- Y[-seq_len(n0)]                             # Strip the initial values
plot(Y)                                          # LOOK at Y before doing anything else...