# Reversing ridge regression: given response matrix and regression coefficients, find suitable predictors

Consider a standard OLS regression problem: I have a predictor matrix $\X$ and a response matrix $\Y$, and I want to find $\B$ minimizing the loss $L=\|\Y-\X\B\|^2$. The solution is given by
$$\hat\B=\argmin_\B\{L\}=(\X^\top\X)^+\X^\top\Y.$$

I can also pose a “reverse” problem: given $\Y$ and $\B^*$, find $\hat\X$ that would yield $\hat\B\approx \B^*$, i.e. would minimize $\|\argmin_\B\{L\}-\B^*\|^2$. In words, I have the response matrix $\Y$ and the coefficient matrix $\B^*$ and I want to find the predictor matrix that would yield coefficients close to $\B^*$. This is, of course, also an OLS regression problem with solution
$$\hat\X = \Y\B^{*\top}\big(\B^*\B^{*\top}\big)^+.$$

Clarification update: As @GeoMatt22 explained in his answer, if $\Y$ is a vector (i.e. if there is only one response variable), then this $\hat \X$ will be rank one, and the reverse problem is massively underdetermined. In my case, $\Y$ is actually a matrix (i.e. there are many response variables, it is a multivariate regression). So $\X$ is $n\times p$, $\Y$ is $n\times q$ and $\B$ is $p\times q$.

I am interested in solving a “reverse” problem for ridge regression. Namely, my loss function now is
$$L=\|\Y-\X\B\|^2+\mu\|\B\|^2,$$
and the solution is
$$\hat\B=\argmin_\B\{L\}=(\X^\top\X+\mu I)^{-1}\X^\top\Y.$$

The “reverse” problem is to find
$$\hat\X=\argmin_\X\Big\{\big\|\argmin_\B\{L\}-\B^*\big\|^2\Big\}.$$

Again, I have a response matrix $\Y$ and a coefficient matrix $\B^*$ and I want to find a predictor matrix that would yield coefficients close to $\B^*$.

Actually there are two related formulations:

1. Find $\hat\X$ given $\Y$ and $\B^*$ and $\mu$.
2. Find $\hat\X$ and $\hat \mu$ given $\Y$ and $\B^*$.

Does either of them have a direct solution?

Here is a brief Matlab excerpt to illustrate the problem:

```matlab
% generate some data
n = 10; % number of samples
p = 20; % number of predictors
q = 30; % number of responses
Y = rand(n,q);
X = rand(n,p);
mu = 0;
I = eye(p);

% solve the forward problem: find beta given Y, X, mu
betahat = pinv(X'*X + mu*I) * X'*Y;

% backward problem: find X given Y, beta, mu
% this formula works correctly only when mu = 0
Xhat = Y*betahat'*pinv(betahat*betahat');

% verify whether Xhat indeed yields betahat
betahathat = pinv(Xhat'*Xhat + mu*I)*Xhat'*Y;
max(abs(betahathat(:) - betahat(:)))
```

This code outputs zero if mu=0 but not otherwise.
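The same experiment can be sketched in NumPy (a translation of the Matlab snippet above, not part of the original post; the dimensions match the snippet, the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 10, 20, 30          # samples, predictors, responses
Y = rng.random((n, q))
X = rng.random((n, p))
I = np.eye(p)

def forward(Xmat, mu):
    """Ridge coefficients for given predictors and ridge parameter."""
    return np.linalg.pinv(Xmat.T @ Xmat + mu * I) @ Xmat.T @ Y

def backward(beta):
    """Naive reverse formula; exact only in the OLS case (mu = 0)."""
    return Y @ beta.T @ np.linalg.pinv(beta @ beta.T)

errs = {}
for mu in (0.0, 1.0):
    betahat = forward(X, mu)
    Xhat = backward(betahat)
    errs[mu] = np.abs(forward(Xhat, mu) - betahat).max()

print(errs)  # essentially zero for mu = 0, clearly nonzero for mu = 1
```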

Now that the question has converged on a more precise formulation of the problem of interest, I have found a solution for case 1 (known ridge parameter). This should also help for case 2 (not exactly an analytical solution, but a simple formula and some constraints).

Summary: Neither of the two inverse-problem formulations has a unique answer. In case 2, where the ridge parameter $\mu\equiv\omega^2$ is unknown, there are infinitely many solutions $X_\omega$ for $\omega\in[0,\omega_{\max}]$. In case 1, where $\omega$ is given, there are finitely many solutions for $X_\omega$, due to the sign ambiguity in the singular-value spectrum.

(The derivation is a bit lengthy, so TL;DR: there is working Matlab code at the end.)

## Under-determined Case (“OLS”)

The forward problem is
$$XB=Y,$$

where $X\in\mathbb{R}^{n\times p}$, $B\in\mathbb{R}^{p\times q}$, and $Y\in\mathbb{R}^{n\times q}$.

Based on the updated question, we will assume $n<p$, so $B$ is under-determined given $X$ and $Y$. As in the question, we will assume the “default” (minimum $L_2$-norm) solution
$$\hat{B}=X^+Y,$$

where $X^+$ is the pseudoinverse of $X$.

From the singular value decomposition (SVD) of $X$, given by*
$$X=USV^\top=U_0S_0V_0^\top,$$

the pseudoinverse can be computed as**
$$X^+=VS^+U^\top=V_0S_0^{-1}U_0^\top.$$

(*The first expressions use the full SVD, while the second expressions use the reduced SVD. **For simplicity I assume $X$ has full rank, i.e. $S_0^{-1}$ exists.)

So the forward problem has solution
$$\hat{B}=X^+Y=V_0S_0^{-1}U_0^\top Y.$$

For future reference, I note that $S_0=\mathrm{diag}(\sigma_0)$, where $\sigma_0>0$ is the vector of singular values.
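As a quick numerical sanity check (a NumPy sketch, not from the original answer; dimensions and seed are arbitrary), the reduced-SVD formula $X^+=V_0S_0^{-1}U_0^\top$ matches `numpy.linalg.pinv`:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 5                    # n < p; X almost surely has full rank n
X = rng.random((n, p))

# reduced SVD: X = U0 @ diag(s0) @ V0.T, with U0 (n x n) and V0 (p x n)
U0, s0, V0t = np.linalg.svd(X, full_matrices=False)

# pseudoinverse from the reduced SVD: X^+ = V0 S0^{-1} U0^T
Xplus = V0t.T @ np.diag(1.0 / s0) @ U0.T
err = np.abs(Xplus - np.linalg.pinv(X)).max()
```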

In the inverse problem, we are given $Y$ and $B$. We know that $B$ came from the above process, but we do not know $X$. The task is then to determine the appropriate $X$.

As noted in the updated question, in this case we can recover $X$ using essentially the same approach, i.e.
$$\hat{X}=YB^+,$$

now using the pseudoinverse of $B$.
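One way to see why $\hat{X}=YB^+$ recovers $X$ exactly (a short argument not spelled out above; it assumes $X$ has full rank $n$ and $Y$ has full row rank $n$, which holds almost surely for the random matrices in the question since $q>n$): because $X^+$ has full column rank and $Y$ has full row rank, the pseudoinverse of the product $B=X^+Y$ factors,

```latex
B^+ = (X^+ Y)^+ = Y^+ (X^+)^+ = Y^+ X ,
\qquad\text{so}\qquad
\hat{X} = Y B^+ = Y Y^+ X = X ,
```

using $YY^+=I_n$ for a full-row-rank $Y$.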

## Over-determined Case (Ridge estimator)

In the "OLS" case, the under-determined problem was solved by choosing the minimum-norm solution, i.e. our "unique" solution was implicitly regularized.

Rather than choose the minimum norm solution, here we introduce a parameter $\omega$ to control "how small" the norm should be, i.e. we use ridge regression.

In this case, we have a series of forward problems for $\beta_k$, $k=1,\ldots,q$, that are given by
$$\min_{\beta_k}\|X\beta_k-y_k\|^2+\omega^2\|\beta_k\|^2.$$

Collecting the different left and right hand side vectors into
$$B=[\,\beta_1,\ldots,\beta_q\,],\qquad Y=[\,y_1,\ldots,y_q\,],$$

this collection of problems can be reduced to the following “OLS” problem
$$\min_B\|X_\omega B-Y_0\|^2,$$

where we have introduced the augmented matrices
$$X_\omega=\begin{bmatrix}X\\ \omega I\end{bmatrix},\qquad Y_0=\begin{bmatrix}Y\\ 0\end{bmatrix}.$$

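As a consistency check (a standard identity, spelled out here for completeness), stacking the residual blocks shows that the augmented problem is exactly the ridge loss with $\mu=\omega^2$:

```latex
\|X_\omega B - Y_0\|^2
= \left\| \begin{bmatrix} X \\ \omega I \end{bmatrix} B
        - \begin{bmatrix} Y \\ 0 \end{bmatrix} \right\|^2
= \|XB - Y\|^2 + \omega^2 \|B\|^2 ,
```

with Frobenius norms throughout.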
In this over-determined case, the solution is still given by the pseudo-inverse
$$\hat{B}_\omega=X_\omega^+Y_0,$$

but the pseudo-inverse is now changed, resulting in*
$$\hat{B}_\omega=V_0S_\omega^{-1}U_0^\top Y,$$

where the new “singularity spectrum” matrix has (inverse) diagonal**
$$\sigma_\omega=\frac{\sigma_0^2+\omega^2}{\sigma_0}=\sigma_0+\frac{\omega^2}{\sigma_0}.$$

(*The somewhat involved calculation required to derive this has been omitted for the sake of brevity. It is similar to the exposition here for the $p\leq n$ case. **Here the entries of the $\sigma_\omega$ vector are expressed in terms of the $\sigma_0$ vector, where all operations are entry-wise.)
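For the curious, the omitted calculation amounts to substituting the SVD of $X$ into the normal-equations form of the ridge solution (a sketch, assuming $X$ has full rank):

```latex
\hat{B}_\omega
= (X^\top X + \omega^2 I)^{-1} X^\top Y
= V \,(S^\top S + \omega^2 I)^{-1} S^\top\, U^\top Y
= V_0 \,\mathrm{diag}\!\left(\frac{\sigma_0}{\sigma_0^2+\omega^2}\right) U_0^\top Y
= V_0\, S_\omega^{-1}\, U_0^\top Y .
```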

Now in this problem we can still formally recover a “base solution” as
$$\hat{X}_\omega=Y\hat{B}_\omega^+,$$

but this is not a true solution anymore.

However, the analogy still holds in that this “solution” has SVD
$$\hat{X}_\omega=U_0S_\omega V_0^\top,$$

with the singular values $\sigma_\omega$ given above.

So we can derive a quadratic equation relating the desired singular values $\sigma_0$ to the recoverable singular values $\sigma_\omega$ and the regularization parameter $\omega$:
$$\sigma_0^2-\sigma_\omega\sigma_0+\omega^2=0.$$
The solution is then
$$\sigma_0=\bar{\sigma}\pm\Delta\sigma,\qquad \bar{\sigma}=\tfrac{1}{2}\sigma_\omega,\qquad \Delta\sigma=\sqrt{\bar{\sigma}^2-\omega^2}.$$

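The relation $\sigma_\omega=\sigma_0+\omega^2/\sigma_0$ between the desired and the recoverable spectra can also be checked numerically. The following NumPy sketch (not part of the original answer; dimensions mirror the Matlab demo, the seed is arbitrary) compares the singular values of $Y\hat{B}_\omega^+$ against the predicted values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, w = 3, 5, 8, 0.5      # n < p, with q >= n so Y has full row rank
Y = rng.random((n, q))
X = rng.random((n, p))

# ridge coefficients via the augmented "OLS" system [X; w*I] B = [Y; 0]
Xw = np.vstack([X, w * np.eye(p)])
Y0 = np.vstack([Y, np.zeros((p, q))])
B = np.linalg.pinv(Xw) @ Y0

# formal "base solution" Xhat0 = Y B^+ and its singular values
Xhat0 = Y @ np.linalg.pinv(B)
s_w = np.linalg.svd(Xhat0, compute_uv=False)

# predicted relation: sigma_w = sigma_0 + w^2 / sigma_0
# (sorted before comparing, since the map is not monotonic)
s_0 = np.linalg.svd(X, compute_uv=False)
err = np.abs(np.sort(s_w) - np.sort(s_0 + w**2 / s_0)).max()
```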
The Matlab demo below (tested online via Octave) shows that this solution method appears to work in practice as well as in theory. The last line shows that all the singular values of $X$ are present in the reconstruction $\bar{\sigma}\pm\Delta\sigma$, but I have not completely figured out which root to take (sgn $=+$ vs. $-$). For $\omega=0$ it is always the $+$ root. This generally seems to hold for “small” $\omega$, whereas for “large” $\omega$ the $-$ root seems to take over. (The demo below is currently set to the “large” case.)

```matlab
% Matlab demo of "Reverse Ridge Regression"
n = 3; p = 5; q = 8; w = 1*sqrt(1e+1); sgn = -1;
Y = rand(n,q); X = rand(n,p);
I = eye(p); Z = zeros(p,q);
err = @(a,b)norm(a(:)-b(:),Inf);

% forward ridge problem via the augmented "OLS" system
B = pinv([X;w*I])*[Y;Z];

% formal "base solution" and its residual
Xhat0 = Y*pinv(B);
dBres0 = err( pinv([Xhat0;w*I])*[Y;Z] , B )

% correct the singular-value spectrum of the base solution
[Uw,Sw2,Vw0] = svd(Xhat0, 'econ');
sw2 = diag(Sw2); s0mid = sw2/2;
ds0 = sqrt(max( 0 , s0mid.^2 - w^2 ));
s0 = s0mid + sgn * ds0;
Xhat = Uw*diag(s0)*Vw0';

% verify the reconstruction
dBres = err( pinv([Xhat;w*I])*[Y;Z] , B )
dXerr = err( Xhat , X )
sigX = svd(X)', sigHat = [s0mid+ds0,s0mid-ds0]' % all there, but which sign?
```

I cannot say how robust this solution is, as inverse problems are generally ill-posed and analytical solutions can be very fragile. However, cursory experiments polluting $B$ with Gaussian noise (i.e. so that it has full rank $p$ rather than reduced rank $n$) seem to indicate that the method is reasonably well behaved.

As for problem 2 (i.e. $\omega$ unknown), the above gives at least an upper bound on $\omega$. For the quadratic discriminant to be non-negative we must have $\omega\leq\bar{\sigma}$ for every singular value, i.e.
$$\omega\;\leq\;\omega_{\max}=\tfrac{1}{2}\min_k\,\sigma_{\omega,k}.$$

For the quadratic-root sign ambiguity, the following code snippet shows that independent of sign, any $\hat{X}$ will give the same forward $B$ ridge-solution, even when $\sigma_0$ differs from $\mathrm{SVD}[X]$.

```matlab
Xrnd = Uw*diag(s0mid+sign(randn(n,1)).*ds0)*Vw0'; % random signs
dBrnd = err(pinv([Xrnd;w*I])*[Y;Z],B) % B is always consistent ...
dXrnd = err(Xrnd,X)                   % ... even when X is not
```