One of the motivations for the elastic net was the following limitation of LASSO:
In the p>n case, the lasso selects at most n variables before it
saturates, because of the nature of the convex optimization problem.
This seems to be a limiting feature for a variable selection method.
Moreover, the lasso is not well defined unless the bound on the
L1-norm of the coefficients is smaller than a certain value.
I understand that LASSO is a quadratic programming problem, but that it can also be solved via LARS or element-wise gradient descent. What I do not understand is where in these algorithms I would encounter a problem if p>n, where p is the number of predictors and n is the sample size. And why is this problem solved by the elastic net, where I augment the problem to p+n variables, which clearly exceeds p?
As said, this is not a property of an algorithm but of the optimization problem. The KKT conditions basically say that for a coefficient $\beta_j$ to be non-zero, its predictor must have a fixed absolute correlation with the residual, $|X_j^T(y - X\beta)| = \lambda$ (where $\lambda$ is the regularization parameter).
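To make the condition concrete, here is a sketch of the KKT conditions, assuming the lasso objective is scaled as $\tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (with a different scaling of the loss, $\lambda$ picks up a constant factor). The symbol $X_j$ denotes the $j$-th column of $X$:

$$
X_j^T (y - X\beta) = \lambda \,\operatorname{sign}(\beta_j) \quad \text{if } \beta_j \neq 0,
\qquad
|X_j^T (y - X\beta)| \le \lambda \quad \text{if } \beta_j = 0 .
$$

The first case is the equality quoted above; the second says that a predictor left out of the model can be correlated with the residual by at most $\lambda$ in absolute value.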
After resolving the various complications with absolute values etc., you are left with one linear equation for each non-zero coefficient. Since the rank of the matrix X is at most n when p>n, at most n of these equations can be linearly independent, and therefore there are at most n non-zeros (unless there are redundancies).
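The bound can be checked numerically. The following is only a minimal sketch, not any particular library's solver: it assumes the objective 0.5 * ||y - X b||^2 + lam * ||b||_1, uses plain cyclic coordinate descent on random Gaussian data, and the names `soft_threshold`, `lasso_cd` and `demo` are made up for this illustration. For a few values of lam it records which coefficients are non-zero and the absolute correlation of every predictor with the residual.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(cols, y, lam, max_sweeps=50000, tol=1e-12):
    """Minimise 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    `cols` holds the p columns of X, each a list of n floats.
    Returns the coefficients and the final residual y - X b.
    """
    p = len(cols)
    sq_norms = [sum(v * v for v in col) for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual for beta = 0, kept up to date below
    for _ in range(max_sweeps):
        max_change = 0.0
        for j, col in enumerate(cols):
            # x_j' (partial residual), i.e. with x_j's own contribution added back
            rho = sum(c * r for c, r in zip(col, resid)) + sq_norms[j] * beta[j]
            new = soft_threshold(rho, lam) / sq_norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:  # a full sweep changed (almost) nothing
            break
    return beta, resid


def demo(n=4, p=12, seed=0):
    """Fit the lasso with p > n at several penalties; report support and correlations."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # smallest penalty for which all coefficients are zero
    lam_max = max(abs(sum(c * v for c, v in zip(col, y))) for col in cols)
    results = []
    for frac in (0.5, 0.1, 0.02):
        lam = frac * lam_max
        beta, resid = lasso_cd(cols, y, lam)
        corr = [abs(sum(c * r for c, r in zip(col, resid))) for col in cols]
        active = [j for j in range(p) if beta[j] != 0.0]
        results.append((lam, active, corr))
    return results


if __name__ == "__main__":
    for lam, active, corr in demo():
        print(f"lam={lam:.4f}  non-zeros={len(active)}  active={active}")
```

If the solver has converged, the active predictors should all have absolute correlation equal to lam, the inactive ones at most lam, and the number of non-zeros should not exceed n = 4 however small lam is made, even though p = 12 predictors are available.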
By the way, this is true for any loss function, not only the standard lasso with L2 loss. So it is in fact a property of the lasso penalty. There are many papers that show this KKT view and the resulting conclusions; I can point to our paper:
Rosset and Zhu, Piecewise Linear Regularized Solution Paths, Annals of Statistics, 2007
and references therein.