I am trying to compare computational complexity / estimation speed of three groups of methods for linear regression as distinguished in Hastie et al. “Elements of Statistical Learning” (2nd ed.), Chapter 3:
- Subset selection
- Shrinkage methods
- Methods using derived input directions (PCR, PLS)
The comparison can be very rough, just to give some idea. I gather that the answers may depend on the dimension of the problem and how that fits the computer architecture, so for a concrete example one may consider a sample size of 500 and 50 candidate regressors. I am mostly interested in motivation behind the computational complexity / estimation speed but not in how long it would take on a certain processor for the given example.
The complexity/speed of group 1. seems not too difficult to figure out if brute force algorithms are used (although there may be more efficient alternatives such as the “leaps and bounds” algorithm). For example, full subset selection will require 2K regressions to be fit given a pool of K candidate features. An OLS fit of one linear regression has the complexity of O(K2n) (as per this post) where n is the sample size. Hence, the total complexity of brute-force full subset selection should be O(2KK2n).
The complexity/speed of group 2. is discussed in sections 3.8 and 3.9 of the book. For example, ridge regression with a given penalty λ has the same computational complexity as a regular regression. Since λ needs to be found using cross validation, the computational load increases linearly in the number of data splits used in cross-validation (say, S). If the λ grid has L points, the total complexity of ridge regression with tuning the λ parameter will be O(LSK2n).
There is quite some talk about LASSO in the book, but I could not find quite what I need. However, I found on p. 443 of Efron et al. “Least Angle Regression” (2004) that LASSO complexity for a given λ is the same as the complexity of an OLS fit of linear regression if LARS method is used. Then the total complexity of LASSO with tuning the λ parameter will be O(LSK2n). (I did not read that paper carefully, so please correct me if I got this one wrong.)
Elastic net combines ridge and LASSO; the two have the same computational complexity; hence, the complexity of elastic net should be O(ALSK2n) where A is the grid size of the tuning parameter α that balances the weights of ridge versus LASSO.
I still miss any note on the complexity/speed for group 3. which consists of principal components regression (PCR) and partial least squares (PLS).