Reversing ridge regression: given response matrix and regression coefficients, find suitable predictors

\newcommand{\Y}{\mathbf Y}\newcommand{\X}{\mathbf X}\newcommand{\B}{\boldsymbol\beta}\DeclareMathOperator*{\argmin}{argmin} Consider a standard OLS regression problem: I have matrices \Y and \X and I want to find \B to minimize L=\|\Y-\X\B\|^2.
The solution is given by \hat\B=\argmin_\B\{L\} = (\X^\top\X)^+\X^\top \Y.

I can also pose a “reverse” problem: given \Y and \B^*, find \hat\X that would yield \hat\B\approx \B^*, i.e. would minimize \|\argmin_\B\{L\}-\B^*\|^2. In words, I have the response matrix \Y and the coefficient matrix \B^* and I want to find the predictor matrix that would yield coefficients close to \B^*. This is, of course, also an OLS regression problem with solution \hat\X = \argmin_\X\Big\{\|\argmin_\B\{L\}-\B^*\|^2\Big\} = \Y\B^{*\top}(\B^*\B^{*\top})^{+}.

Clarification update: As @GeoMatt22 explained in his answer, if \Y is a vector (i.e. if there is only one response variable), then this \hat \X will be rank one, and the reverse problem is massively underdetermined. In my case, \Y is actually a matrix (i.e. there are many response variables, it is a multivariate regression). So \X is n\times p, \Y is n\times q and \B is p\times q.


I am interested in solving a “reverse” problem for ridge regression. Namely, my loss function now is L=\|\Y-\X\B\|^2+\mu\|\B\|^2
and the solution is \hat\B=\argmin_\B\{L\}=(\X^\top \X+\mu\mathbf I)^{-1}\X^\top \Y.

The “reverse” problem is to find \hat\X = \argmin_\X\Big\{\|\argmin_\B\{L\}-\B^*\|^2\Big\} = \;?

Again, I have a response matrix \Y and a coefficient matrix \B^*, and I want to find a predictor matrix that would yield coefficients close to \B^*.

Actually there are two related formulations:

  1. Find \hat\X given \Y and \B^* and \mu.
  2. Find \hat\X and \hat \mu given \Y and \B^*.

Does either of them have a direct solution?


Here is a brief Matlab excerpt to illustrate the problem:

% generate some data
n = 10; % number of samples
p = 20; % number of predictors
q = 30; % number of responses
Y = rand(n,q);
X = rand(n,p);
mu = 0;
I = eye(p);

% solve the forward problem: find beta given Y,X,mu
betahat = pinv(X'*X + mu*I) * X'*Y;

% backward problem: find X given Y,beta,mu
% this formula works correctly only when mu=0
Xhat =  Y*betahat'*pinv(betahat*betahat');

% verify if Xhat indeed yields betahat
betahathat = pinv(Xhat'*Xhat + mu*I)*Xhat'*Y;
max(abs(betahathat(:) - betahat(:)))

This code outputs zero if mu=0 but not otherwise.

Answer

Now that the question has converged on a more precise formulation of the problem of interest, I have found a solution for case 1 (known ridge parameter). This should also help for case 2 (not an analytical solution exactly, but a simple formula and some constraints).

Summary: Neither of the two inverse-problem formulations has a unique answer. In case 2, where the ridge parameter \mu\equiv\omega^2 is unknown, there are infinitely many solutions X_\omega, for \omega\in[0,\omega_\max]. In case 1, where \omega is given, there are finitely many solutions for X_\omega, due to a sign ambiguity in recovering the singular-value spectrum.

(The derivation is a bit lengthy, so TL;DR: there is working Matlab code at the end.)


Under-determined Case (“OLS”)

The forward problem is
\min_B\|XB-Y\|^2
where X\in\mathbb{R}^{n\times p}, B\in\mathbb{R}^{p\times q}, and Y\in\mathbb{R}^{n\times q}.

Based on the updated question, we will assume n<p<q, so B is under-determined given X and Y. As in the question, we will assume the "default" (minimum L_2-norm) solution
B=X^+Y
where X^+ is the pseudoinverse of X.

From the singular value decomposition (SVD) of X, given by*
X=USV^T=US_0V_0^T
the pseudoinverse can be computed as**
X^+=VS^+U^T=V_0S_0^{-1}U^T
(*The first expressions use the full SVD, while the second expressions use the reduced SVD. **For simplicity I assume X has full rank, i.e. S_0^{-1} exists.)

So the forward problem has solution
B\equiv X^+Y=\left(V_0S_0^{-1}U^T\right)Y
For future reference, I note that S_0=\mathrm{diag}(\sigma_0), where \sigma_0>0 is the vector of singular values.
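As a quick sanity check of this pseudoinverse formula, here is a small Matlab sketch (the sizes n and p below are just illustrative, and X is assumed to have full rank):

% check that pinv(X) equals V0*S0^{-1}*U' from the reduced SVD
n = 4; p = 7;
X = randn(n,p);
[U,S0,V0] = svd(X, 'econ');          % U is n-by-n, S0 is n-by-n, V0 is p-by-n
Xplus = V0 * diag(1./diag(S0)) * U'; % V0 * S0^{-1} * U'
P = pinv(X);
max(abs(Xplus(:) - P(:)))            % ~ 0 up to round-off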

In the inverse problem, we are given Y and B. We know that B came from the above process, but we do not know X. The task is then to determine the appropriate X.

As noted in the updated question, in this case we can recover X using essentially the same approach, i.e.
X_0=YB^+
now using the pseudoinverse of B.
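A minimal Matlab sketch of this recovery, with dimensions following the question's n<p<q setup (the specific sizes are arbitrary):

% forward OLS solution, then "reverse" recovery, then refit
n = 4; p = 7; q = 9;
X = randn(n,p); Y = randn(n,q);
B  = pinv(X) * Y;          % forward (minimum-norm) solution
X0 = Y * pinv(B);          % recovered predictors
B0 = pinv(X0) * Y;         % refit from the recovered predictors
max(abs(B0(:) - B(:)))     % ~ 0 up to round-off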


Over-determined Case (Ridge estimator)

In the "OLS" case, the under-determined problem was solved by choosing the minimum-norm solution, i.e. our "unique" solution was implicitly regularized.

Rather than choose the minimum norm solution, here we introduce a parameter \omega to control "how small" the norm should be, i.e. we use ridge regression.

In this case, we have a series of forward problems for \beta_k, k=1,\ldots,q, that are given by
\min_\beta\|X\beta-y_k\|^2+\omega^2\|\beta\|^2
Collecting the different left and right hand side vectors into
B_{\omega}=[\beta_1,\ldots,\beta_q] \quad,\quad Y=[y_1,\ldots,y_q]
this collection of problems can be reduced to the following "OLS" problem
\min_B\|\mathsf{X}_\omega B-\mathsf{Y}\|^2
where we have introduced the augmented matrices
\mathsf{X}_\omega=\begin{bmatrix}X \\ \omega I\end{bmatrix} \quad , \quad \mathsf{Y}=\begin{bmatrix}Y \\ 0 \end{bmatrix}

In this over-determined case, the solution is still given by the pseudo-inverse
B_\omega = \mathsf{X}_\omega^+\mathsf{Y}
but the pseudo-inverse is now changed, resulting in*
B_\omega = \left(V_0S_\omega^{-2}U^T\right) Y
where the new singular-value matrix S_\omega^2 (which enters inverted above) has diagonal entries**
\sigma_\omega^2 = \frac{\sigma_0^2+\omega^2}{\sigma_0}
(*The somewhat involved calculation required to derive this has been omitted for the sake of brevity. It is similar to the exposition here for the p\leq n case. **Here the entries of the \sigma_\omega vector are expressed in terms of the \sigma_0 vector, where all operations are entry-wise.)
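Here is a brief Matlab sketch checking these formulas against the standard ridge estimate (X^TX+\mu I)^{-1}X^TY with \mu=\omega^2 (the sizes and the value of w below are arbitrary choices):

% ridge estimate three ways: normal equations, augmented OLS, SVD filter
n = 4; p = 7; q = 9; w = 2;
X = randn(n,p); Y = randn(n,q);
Bridge = (X'*X + w^2*eye(p)) \ (X'*Y);           % (X'X + mu*I)^{-1} X'Y with mu = w^2
Baug   = pinv([X; w*eye(p)]) * [Y; zeros(p,q)];  % augmented "OLS" formulation
[U,S0,V0] = svd(X, 'econ'); s0 = diag(S0);
Bsvd   = V0 * diag(s0./(s0.^2 + w^2)) * (U'*Y);  % V0 * S_w^{-2} * U' * Y
max(abs(Baug(:) - Bridge(:))), max(abs(Bsvd(:) - Bridge(:)))  % both ~ 0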

Now in this problem we can still formally recover a "base solution" as
X_\omega=YB_\omega^+
but this is not a true solution anymore.

However, the analogy still holds in that this "solution" has SVD
X_\omega=US_\omega^2V_0^T
with the singular values \sigma_\omega^2 given above.
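This is easy to verify numerically; a sketch (illustrative sizes again, with both spectra sorted because (\sigma_0^2+\omega^2)/\sigma_0 need not preserve the ordering of \sigma_0):

% singular values of Y*pinv(Bw) vs. the predicted spectrum (s0.^2+w^2)./s0
n = 4; p = 7; q = 9; w = 2;
X = randn(n,p); Y = randn(n,q);
Bw = pinv([X; w*eye(p)]) * [Y; zeros(p,q)];   % forward ridge solution
Xw = Y * pinv(Bw);                            % formal "base solution"
s0 = svd(X);
sw2 = (s0.^2 + w^2) ./ s0;                    % predicted singular values of Xw
[sort(svd(Xw),'descend'), sort(sw2,'descend')]  % the two columns should agree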

So the relation \sigma_\omega^2\,\sigma_0=\sigma_0^2+\omega^2 gives a quadratic equation, \sigma_0^2-\sigma_\omega^2\,\sigma_0+\omega^2=0, relating the desired singular values \sigma_0 to the recoverable singular values \sigma_\omega^2 and the regularization parameter \omega. Its solution is
\sigma_0=\bar{\sigma} \pm \Delta\sigma \quad , \quad \bar{\sigma} = \tfrac{1}{2}\sigma_\omega^2 \quad , \quad \Delta\sigma = \sqrt{\left(\bar{\sigma}+\omega\right)\left(\bar{\sigma}-\omega\right)}


The Matlab demo below (tested online via Octave) shows that this solution method appears to work in practice as well as in theory. The last line shows that all the singular values of X are present in the reconstruction \bar{\sigma}\pm\Delta\sigma, but I have not completely figured out which root to take (sgn = + vs. -). For \omega=0 it is always the + root. This generally seems to hold for "small" \omega, whereas for "large" \omega the - root seems to take over. (The demo below is currently set to the "large" case.)

% Matlab demo of "Reverse Ridge Regression"
n = 3; p = 5; q = 8; w = 1*sqrt(1e+1); sgn = -1;
Y = rand(n,q); X = rand(n,p);
I = eye(p); Z = zeros(p,q);
err = @(a,b)norm(a(:)-b(:),Inf);

B = pinv([X;w*I])*[Y;Z];
Xhat0 = Y*pinv(B);
dBres0 = err( pinv([Xhat0;w*I])*[Y;Z] , B )

[Uw,Sw2,Vw0] = svd(Xhat0, 'econ');

sw2 = diag(Sw2); s0mid = sw2/2;
ds0 = sqrt(max( 0 , s0mid.^2 - w^2 ));
s0 = s0mid + sgn * ds0;
Xhat = Uw*diag(s0)*Vw0';

dBres = err( pinv([Xhat;w*I])*[Y;Z] , B )
dXerr = err( Xhat , X )
sigX = svd(X)', sigHat = [s0mid+ds0,s0mid-ds0]' % all there, but which sign?

I cannot say how robust this solution is, as inverse problems are generally ill-posed and analytical solutions can be very fragile. However, cursory experiments polluting B with Gaussian noise (i.e. so that it has full rank p rather than reduced rank n) seem to indicate that the method is reasonably well behaved.
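For reference, a sketch of that noise experiment, meant to be run after the demo above since it reuses B, Y, Z, w, I, err, sgn from there (the noise level 1e-3 is an arbitrary choice):

Bn = B + 1e-3*randn(size(B));            % polluted B: now full rank p
[Un,Sn2,Vn0] = svd(Y*pinv(Bn), 'econ');  % same reconstruction pipeline as above
snmid = diag(Sn2)/2;
dsn = sqrt(max( 0 , snmid.^2 - w^2 ));
Xn = Un*diag(snmid + sgn*dsn)*Vn0';
dBn = err( pinv([Xn;w*I])*[Y;Z] , Bn )   % how well the refit reproduces the polluted B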

As for problem 2 (i.e. \omega unknown), the above gives at least an upper bound on \omega. For the quadratic discriminant to be non-negative we must have
\omega \leq \omega_{\max} = \bar{\sigma}_n = \min[\tfrac{1}{2}\sigma_\omega^2]
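In code, this bound can be computed from Y and B alone (a one-line sketch, reusing the demo's Y and B):

wmax = min(svd(Y*pinv(B)))/2   % omega_max = min(sigma_w^2)/2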


For the quadratic-root sign ambiguity, the following code snippet shows that, independent of the sign, any \hat{X} will give the same forward ridge solution B, even when \sigma_0 differs from \mathrm{SVD}[X].

Xrnd=Uw*diag(s0mid+sign(randn(n,1)).*ds0)*Vw0'; % random signs
dBrnd=err(pinv([Xrnd;w*I])*[Y;Z],B) % B is always consistent ...
dXrnd=err(Xrnd,X) % ... even when X is not

Attribution
Source: Link. Question Author: amoeba. Answer Author: Community.
