I’ve got a worked example (in R) that I’m trying to understand further. I’m using limma to create a linear model, and I’m trying to understand step by step what happens in the fold change calculations, mostly how the coefficients are calculated. From what I can figure out, QR decomposition is used to get the coefficients, so I’m essentially looking for an explanation, a way to see step by step the equations being calculated, or the source code for qr() in R so I can trace it myself.
Using the following data:
expression_data <- c(1.27135202935009, 1.41816160331787, 1.2572772420417, 1.70943398046296, 1.30290218641586, 0.632660015122616, 1.73084258791384, 0.863826352944684, 0.62481665344628, 0.356064235030147, 1.31542028558644, 0.30549909383238, 0.464963176430548, 0.132181421105667, -0.284799809563931, 0.216198538884642, -0.0841133304341238, -0.00184472290008803, -0.0924271878885008, -0.340291804468472, -0.236829711453303, 0.0529690806587626, 0.16321956624511, -0.310513510587778, -0.12970035111176, -0.126398635780533, 0.152550803185228, -0.458542514769473, 0.00243517688116406, -0.0190192219685527, 0.199329876859774, 0.0493831375210439, -0.30903829000185, -0.289604319193543, -0.110019942085281, -0.220289950537685, 0.0680403723818882, -0.210977291862137, 0.253649629045288, 0.0740109953273042, 0.115109148186167, 0.187043445057404, 0.705155251555554, 0.105479342752451, 0.344672919872447, 0.303316487542805, 0.332595721664644, 0.0512213943473417, 0.440756755046719, 0.091642538588249, 0.477236022595909, 0.109140019847968, 0.685001267317616, 0.183154080053337, 0.314190891668279, -0.123285017407119, 0.603094973500324, 1.53723917249845, 0.180518835745199, 1.5520102749957, -0.339656677699664, 0.888791974821514, 0.321402618155527, 1.31133008668306, 0.287587853884556, -0.513896569786498, 1.01400498573403, -0.145552182640197, -0.0466811491949621, 1.34418631328095, -0.188666887863983, 0.920227741574566, -0.0182196762358299, 1.18398082848213, 0.0680539755381465, 0.389472802053599, 1.14920099633956, 1.35363045061024, -0.0400907708395635, 1.14405154287124, 0.365672853509181, -0.0742688460368051, 1.60927415300638, -0.0312210890874907, -0.302097025523754, 0.214897201115632, 2.029775196118, 1.46210810601113, -0.126836819148653, -0.0799005522761045, 0.958505775644153, -0.209758749029421, 0.273568395649965, 0.488150388217536, -0.230312627718208, -0.0115780974342431, 0.351708198671371, 0.11803520077305, -0.201488605868396, 0.0814169684941098, 1.32266103732873, 1.9077004570343, 1.34748531668521, 1.37847539147601, 1.85761827653095, 1.11327229058024, 1.21377936983249, 1.167867701785, 1.3119314966728, 1.01502530573911, 1.22109375841952, 1.23026951795161, 1.30638557237133, 1.02569437924906, 0.812852833149196)
treatment <- c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A', 'B', 'C', 'B', 'C', 'C', 'A', 'C', 'A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'C', 'A', 'C', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'C', 'C', 'B', 'B', 'C', 'C', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A')
variation <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
…and the following model design
design <- model.matrix(~0 + factor(treatment, levels=unique(treatment)) + factor(variation))
colnames(design) <- c(unique(treatment), paste0("b", unique(variation)[-1]))
# expression_data consists of more than the data given; the data given is just one row from the object
fit <- lmFit(expression_data, design)
cont_mat <- makeContrasts(B-A, levels=design)
fit2 <- contrasts.fit(fit, contrasts=cont_mat)
fit2 <- eBayes(fit2)
This gives me a fold change of -0.8709646.
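(For reference, that estimate can be pulled straight from the fitted contrast object, since the object returned by contrasts.fit and eBayes carries a coefficients component:)
fit2$coefficients  # the B - A contrast estimate, i.e. the -0.8709646 above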
Getting the coefficients can be done via:
qr.solve(design, expression_data)
Then it’s a simple case of B-A to get the fold change.
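For example, using the column names assigned to the design matrix above (qr.coef, which qr.solve calls internally, carries the column names over to the result):
coefs <- qr.solve(design, expression_data)
coefs["B"] - coefs["A"]
# -0.8709646, matching the fold change reported by limma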
Now the bit that’s perplexing me is how `qr.solve` actually works. It calls the `qr` function, but I can’t seem to find the source for that. Does anyone have a good explanation of QR decomposition, or a way for me to trace exactly what’s happening to derive the coefficients?
Thanks for any help!
Answer
The idea of the QR decomposition as a procedure to get OLS estimates is already explained in the post linked by @MatthewDrury.
The source code of the function `qr` is written in Fortran and may be hard to follow. Here I show a minimal implementation that reproduces the main results for a model fitted by OLS. Hopefully the steps are easier to follow.
Recap: The QR procedure is used to decompose the matrix of regressor variables $X$ into an orthonormal matrix $Q$ and a non-singular upper-triangular matrix $R$. Substituting $X = QR$ in the normal equations $X’X\hat\beta = X’y$ yields:
$$
R’Q’QR\hat\beta = R’Q’y \,.
$$
Premultiplying by $(R’)^{-1}$ and using the fact that $Q’Q$ is the identity matrix gives:
$$
R\hat\beta = Q’y \,. \tag 1
$$
The point of this result is that, since $R$ is an upper-triangular matrix, this equation is easy to solve for $\hat\beta$ by back-substitution.
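To see what back-substitution looks like, here is a tiny worked sketch of my own (not part of the original derivation): the last equation involves only the last coefficient, and each earlier coefficient follows by substituting the ones already found.
# 2x2 upper-triangular system R beta = Q'y
R <- matrix(c(2, 1,
              0, 3), nrow = 2, byrow = TRUE)
qty <- c(8, 6)
b2 <- qty[2] / R[2, 2]                   # last equation first: 6/3 = 2
b1 <- (qty[1] - R[1, 2] * b2) / R[1, 1]  # substitute upward: (8 - 1*2)/2 = 3
c(b1, b2)                                # identical to backsolve(R, qty)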
Now, how do we get the matrices $Q$ and $R$? We can use Householder transformations, Givens rotations, or the Gram-Schmidt procedure.
Below I use Householder transformations. See, for example, the details here. The code below is based on the Pascal code described in Pollock (1999), Chapters 7 and 8. The matrix of regressors is overwritten with the matrix $R$ of the QR decomposition, and the dependent variable $y$ is overwritten with $Q’y$ (the right-hand side of equation (1) above). Notice also that the residual sum of squares can be obtained from the last elements of this vector.
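As a warm-up, here is a standalone sketch of a single Householder reflection (my own illustration, not taken from Pollock): it reflects a vector x onto a multiple of the first unit vector, which is exactly the operation applied to each column of $X$ to zero it below the diagonal.
x <- c(3, 1, 2)
s <- sqrt(sum(x^2))
kappa <- if (x[1] < 0) s else -s   # sign opposite to x[1], for numerical stability
v <- x
v[1] <- v[1] - kappa               # the Householder vector
H <- diag(length(x)) - 2 * tcrossprod(v) / sum(v^2)
round(H %*% x, 10)                 # (kappa, 0, 0): everything below the first element is zeroed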
QR.regression <- function(y, X)
{
  nr <- length(y)
  nc <- NCOL(X)
  # Householder transformations: for each column j, reflect so that
  # the elements below the diagonal become zero
  for (j in seq_len(nc))
  {
    id <- seq.int(j, nr)
    sigma <- sum(X[id,j]^2)
    s <- sqrt(sigma)
    diag_ej <- X[j,j]
    gamma <- 1.0 / (sigma + abs(s * diag_ej))
    kappa <- if (diag_ej < 0) s else -s
    X[j,j] <- X[j,j] - kappa
    if (j < nc)
      for (k in seq.int(j+1, nc))
      {
        yPrime <- sum(X[id,j] * X[id,k]) * gamma
        X[id,k] <- X[id,k] - X[id,j] * yPrime
      }
    # apply the same reflection to the dependent variable,
    # accumulating Q'y in place
    yPrime <- sum(X[id,j] * y[id]) * gamma
    y[id] <- y[id] - X[id,j] * yPrime
    X[j,j] <- kappa
  } # end Householder
  # residual sum of squares: squared norm of the last nr - nc elements of Q'y
  rss <- sum(y[seq.int(nc+1, nr)]^2)
  # backsolve R beta = Q'y (equation (1)) by back-substitution
  beta <- rep(NA, nc)
  for (j in seq.int(nc, 1))
  {
    beta[j] <- y[j]
    if (j < nc)
      for (i in seq.int(j+1, nc))
        beta[j] <- beta[j] - X[j,i] * beta[i]
    beta[j] <- beta[j] / X[j,j]
  }
  # set zeros in the lower-triangular side of X (which stores the
  # Householder vectors); not really necessary, this is just to
  # return R for illustration
  for (i in seq_len(nc))
    X[seq.int(i+1, nr),i] <- 0
  list(R=X[1:nc,1:nc], y=y, beta=beta, rss=rss)
}
We can check that the same estimates as `lm` are obtained.
# benchmark results
fit <- lm(expression_data ~ 0+design)
# OLS by QR decomposition
y <- expression_data
X <- design
res <- QR.regression(y, X)
res$beta
# [1] 1.43235881 0.56139421 0.07744044 -0.15611038 -0.15021796
all.equal(res$beta, coef(fit), check.attributes=FALSE)
# [1] TRUE
all.equal(res$rss, sum(residuals(fit)^2))
# [1] TRUE
We can also get the matrix $Q$ and check that it is orthogonal:
Q <- X %*% solve(res$R)
round(crossprod(Q), 3)
# 1 2 3 4 5
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 1 0
# 5 0 0 0 0 1
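As a further check (my addition, not in the original answer), base R’s qr.Q should reproduce the same $Q$ up to the sign of each column, since the QR factorization is unique only up to those signs:
Q_base <- qr.Q(qr(design))
all.equal(abs(Q), abs(Q_base), check.attributes=FALSE)
# expected TRUE: each column of Q may differ from base R's by a sign flip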
The residuals can be obtained as `y - X %*% res$beta`.
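For instance, comparing against `lm` (my addition; drop() just turns the $n \times 1$ matrix into a plain vector for the comparison):
all.equal(drop(y - X %*% res$beta), unname(residuals(fit)))
# [1] TRUE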
References
D.S.G. Pollock (1999). A Handbook of Time Series Analysis, Signal Processing and Dynamics. Academic Press.
Attribution
Source: Link, Question Author: A_Skelton73, Answer Author: javlacalle