# Influence functions and OLS

I am trying to understand how influence functions work. Could someone explain them in the context of a simple OLS regression

$$y_{i}=\alpha+\beta x_{i}+\epsilon_{i}$$

where I want the influence function for $\beta$?

Influence functions are basically an analytical tool that can be used to assess the effect (or "influence") of removing an observation on the value of a statistic, without having to re-calculate that statistic. They can also be used to construct asymptotic variance estimates: if the influence function is $I$, then the asymptotic variance of the statistic is $\frac{1}{n}\int I^{2}\,dF$ — roughly, the average squared influence divided by $n$.
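
To make the variance-estimation use concrete, here is a small numerical sketch of my own (with simulated data, not part of the explanation below). The influence value used, $(x_{i}-\overline{x})\hat{\epsilon}_{i}/\widehat{var}(x)$, is the standard one for the OLS slope; the sketch compares the influence-based variance estimate with a Monte Carlo estimate of the slope's actual sampling variance:

```python
import numpy as np

# Influence-based variance estimate for the OLS slope:
# var(beta_hat) ~ mean(IF_i^2) / n, where IF_i = (x_i - xbar) * resid_i / var(x).
# (Illustration with simulated data; compared against a Monte Carlo estimate.)

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = x + rng.normal(size=n)          # true slope 1, error variance 1

def ols_slope(x, y):
    """OLS slope: cov(x, y) / var(x)."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sum(xc ** 2)

beta = ols_slope(x, y)
resid = (y - y.mean()) - beta * (x - x.mean())
IF = (x - x.mean()) * resid / np.var(x)     # empirical influence values
var_est = np.mean(IF ** 2) / n              # influence-based variance estimate

# Monte Carlo: sampling variance of the slope over repeated data sets
betas = []
for _ in range(2000):
    xs = rng.normal(size=n)
    ys = xs + rng.normal(size=n)
    betas.append(ols_slope(xs, ys))

print(var_est, np.var(betas))  # both should be close to 1/n = 0.002
```

Here $\frac{1}{n}$ times the average squared influence plays the role of the asymptotic variance.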

The way I understand influence functions is as follows. You have some sort of theoretical CDF, denoted by $F_{i}(y)=\Pr(Y_{i}<y)$. For simple OLS, you have

$$F_{i}(y)=\Pr(Y_{i}<y)=\Phi\left(\frac{y-\alpha-\beta x_{i}}{\sigma}\right)$$

Where $\Phi(z)$ is the standard normal CDF, and $\sigma^2$ is the error variance. Now you can show that any statistic will be a function of this CDF, hence the notation $S(F)$ (i.e. some function of $F$). Now suppose we change the function $F$ by a "little bit", to $F_{(i)}(z)=(1+\zeta)F(z)-\zeta \delta_{(i)}(z)$, where $\delta_{(i)}(z)=I(y_{i}\leq z)$ and $\zeta=\frac{1}{n-1}$. Thus $F_{(i)}$ represents the CDF of the data with the $i$th data point removed. We can do a Taylor series of $S[F_{(i)}(z,\zeta)]$ in $\zeta$ about $\zeta=0$. This gives:

$$S[F_{(i)}(z,\zeta)]=S[F_{(i)}(z,0)]+\zeta\left[\frac{\partial S[F_{(i)}(z,\zeta)]}{\partial \zeta}\right]_{\zeta=0}+R$$

Note that $F_{(i)}(z,0)=F(z)$ so we get:

$$S[F_{(i)}(z,\zeta)]=S[F(z)]+\zeta\left[\frac{\partial S[F_{(i)}(z,\zeta)]}{\partial \zeta}\right]_{\zeta=0}+R$$

The partial derivative here is called the influence function. So this represents an approximate "first order" correction to be made to a statistic due to deleting the $i$th observation. Note that in regression the remainder does not go to zero asymptotically, so this is an approximation to the changes you may actually get. Now write $\beta$ as:

$$\beta=\frac{cov(X,Y)}{var(X)}$$

Thus beta is a function of two statistics: the variance of $X$ and the covariance between $X$ and $Y$. These two statistics have representations in terms of the CDF as:

$$cov(X,Y)=\int\big(x-\mu_{x}(F)\big)\big(y-\mu_{y}(F)\big)\,dF$$

and

$$var(X)=\int\big(x-\mu_{x}(F)\big)^{2}\,dF$$

where

$$\mu_{x}(F)=\int x\,dF \qquad\text{and}\qquad \mu_{y}(F)=\int y\,dF$$

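
To see these representations in action, here is a quick sketch of my own (with simulated data): evaluating the integrals at the empirical CDF, which puts mass $\frac{1}{n}$ on each observation, turns each of them into a plain average, recovering the usual plug-in sample statistics.

```python
import numpy as np

# Evaluating the integral representations at the empirical CDF F_n, which
# places mass 1/n on each observation, turns each integral into an average.
# (My own sketch; these are the plug-in, 1/n-denominator statistics.)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

mu_x = np.mean(x)                          # mu_x(F) = integral of x dF -> sample mean
mu_y = np.mean(y)
var_x = np.mean((x - mu_x) ** 2)           # var(X) = integral of (x - mu_x)^2 dF
cov_xy = np.mean((x - mu_x) * (y - mu_y))  # cov(X,Y) = integral of (x - mu_x)(y - mu_y) dF

# These agree with numpy's own (ddof=0) versions:
assert np.isclose(var_x, np.var(x))
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])

beta = cov_xy / var_x                      # the OLS slope as S(F) = cov/var
```

With these in hand, $\beta = S(F)$ evaluated at the empirical CDF is just the fitted OLS slope.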
To remove the $i$th observation we replace $F\rightarrow F_{(i)}=(1+\zeta)F-\zeta \delta_{(i)}$ in both integrals to give:

$$var(X)_{(i)}=(1+\zeta)\int\big(x-\mu_{x(i)}\big)^{2}\,dF-\zeta\big(x_{i}-\mu_{x(i)}\big)^{2}$$

where $\mu_{x(i)}=(1+\zeta)\mu_{x}-\zeta x_{i}$.

Ignoring terms of order $\zeta^{2}$ and simplifying we get:

$$var(X)_{(i)}\approx var(X)+\zeta\left[var(X)-(x_{i}-\mu_{x})^{2}\right]$$

Similarly for the covariance:

$$cov(X,Y)_{(i)}\approx cov(X,Y)+\zeta\left[cov(X,Y)-(x_{i}-\mu_{x})(y_{i}-\mu_{y})\right]$$

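
These first-order updates can be checked numerically (a sketch of my own, with simulated data) by comparing them against the exact leave-one-out variance and covariance:

```python
import numpy as np

# Compare the first-order (zeta) updates for var(X) and cov(X,Y) with the
# exact values computed on the sample with observation i removed.

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
i, zeta = 0, 1.0 / (n - 1)

mu_x, mu_y = x.mean(), y.mean()
var_x = np.mean((x - mu_x) ** 2)               # plug-in variance
cov_xy = np.mean((x - mu_x) * (y - mu_y))      # plug-in covariance

# First-order approximations:
var_approx = var_x + zeta * (var_x - (x[i] - mu_x) ** 2)
cov_approx = cov_xy + zeta * (cov_xy - (x[i] - mu_x) * (y[i] - mu_y))

# Exact leave-one-out values:
xr, yr = np.delete(x, i), np.delete(y, i)
var_exact = np.mean((xr - xr.mean()) ** 2)
cov_exact = np.mean((xr - xr.mean()) * (yr - yr.mean()))

print(abs(var_approx - var_exact), abs(cov_approx - cov_exact))  # O(zeta^2) errors
```
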
So we can now express $\beta_{(i)}$ as a function of $\zeta$. This is:

$$\beta_{(i)}(\zeta)=\frac{cov(X,Y)+\zeta\left[cov(X,Y)-(x_{i}-\mu_{x})(y_{i}-\mu_{y})\right]}{var(X)+\zeta\left[var(X)-(x_{i}-\mu_{x})^{2}\right]}$$

We can now use the Taylor series:

$$\beta_{(i)}(\zeta)\approx\beta_{(i)}(0)+\zeta\left[\frac{\partial \beta_{(i)}(\zeta)}{\partial \zeta}\right]_{\zeta=0}$$

Simplifying this gives:

$$\beta_{(i)}\approx\beta-\frac{\zeta}{var(X)}\,(x_{i}-\mu_{x})\big[(y_{i}-\mu_{y})-\beta(x_{i}-\mu_{x})\big]$$

And plugging in the observed values of the statistics ($\mu_{y}\rightarrow\overline{y}$, $\mu_{x}\rightarrow\overline{x}$, $var(X)\rightarrow s_{x}^{2}$) and $\zeta=\frac{1}{n-1}$ we get:

$$\beta_{(i)}\approx\beta-\frac{(x_{i}-\overline{x})\big[(y_{i}-\overline{y})-\beta(x_{i}-\overline{x})\big]}{(n-1)\,s_{x}^{2}}$$

And you can see how the effect of removing a single observation can be approximated without having to re-fit the model. You can also see that an observation with $x_{i}$ equal to the average $\overline{x}$ has no influence on the slope of the line. Think about this and you will see how it makes sense. You can also write this more succinctly in terms of the standardised values $\tilde{x}_{i}=\frac{x_{i}-\overline{x}}{s_{x}}$ (similarly for $y$):

$$\beta_{(i)}\approx\beta-\frac{s_{y}}{s_{x}}\,\frac{\tilde{x}_{i}\big(\tilde{y}_{i}-r\,\tilde{x}_{i}\big)}{n-1}$$

where $r=\beta\frac{s_{x}}{s_{y}}$ is the sample correlation between $x$ and $y$.
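
Finally, a sanity check of the whole approximation (my own sketch, with simulated data): compare the one-step leave-one-out slopes with those obtained by actually refitting the regression for each deleted observation.

```python
import numpy as np

# Compare the one-step leave-one-out approximation for the slope with the
# slope from actually refitting the regression without observation i.

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def slope(x, y):
    """OLS slope: cov(x, y) / var(x)."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sum(xc ** 2)

beta = slope(x, y)
s2_x = np.var(x, ddof=1)   # sample variance s_x^2

# beta_(i) ~ beta - (x_i - xbar)[(y_i - ybar) - beta (x_i - xbar)] / ((n-1) s_x^2)
approx = beta - (x - x.mean()) * ((y - y.mean()) - beta * (x - x.mean())) / ((n - 1) * s2_x)

# Exact leave-one-out slopes, by brute-force refitting:
exact = np.array([slope(np.delete(x, i), np.delete(y, i)) for i in range(n)])

print(np.max(np.abs(approx - exact)))  # small: the neglected terms are higher order
```
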