# Matrix notation for logistic regression

In linear regression (squared loss), matrix notation gives a very concise expression for the objective:

$$\text{minimize}~ \|Ax-b\|_2^2$$

where $A$ is the data matrix, $x$ is the coefficient vector, and $b$ is the response.

Is there a similar matrix notation for the logistic regression objective? All the notations I have seen cannot get rid of the sum over all data points (something like $\sum_{\text{data}} \text{L}_\text{logistic}(y,\beta^Tx)$).

EDIT: thanks to joceratops and AdamO for their great answers. They helped me realize that another reason linear regression has a more concise notation is the definition of the norm, which encapsulates both the square and the sum (that is, $e^\top e$). The logistic loss has no such definition, which makes the notation a little more complicated.

In linear regression, the maximum likelihood estimation (MLE) solution for estimating $x$ has the following closed-form solution (assuming that $A$ is a matrix with full column rank):

$$\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb$$

This is read as “find the $x$ that minimizes the objective function $\|Ax-b\|_2^2$”. The nice thing about representing the linear regression objective function in this way is that we can keep everything in matrix notation and solve for $\hat{x}_\text{lin}$ by hand. As Alex R. mentions, in practice we often don’t compute $(A^TA)^{-1}$ directly because it is computationally inefficient and $A$ often does not meet the full-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse. The details of computing the pseudoinverse can involve the Cholesky decomposition or the singular value decomposition.
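As a rough illustration of the point above, here is a minimal NumPy sketch comparing three ways of getting $\hat{x}_\text{lin}$. The synthetic data, the seed, and the variable names (`x_true`, `x_normal`, `x_pinv`, `x_lstsq`) are assumptions made up for this example; they are not from the question.

```python
import numpy as np

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))            # data matrix, full column rank here
x_true = np.array([1.0, -2.0, 0.5])     # hypothetical "true" coefficients
b = A @ x_true + 0.1 * rng.normal(size=50)

# Normal equations: solve (A^T A) x = A^T b without forming the inverse.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Moore-Penrose pseudoinverse (NumPy computes it via the SVD).
x_pinv = np.linalg.pinv(A) @ b

# Least-squares solver; it also copes with a rank-deficient A.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_pinv, x_lstsq)
```

When $A$ has full column rank, all three should agree up to floating-point error; the pseudoinverse and `lstsq` routes are the ones that still return an answer when it does not.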

Alternatively, the MLE solution for estimating the coefficients in logistic regression is:

$$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})$$

where (assuming each sample of data is stored row-wise):

$x$ is the vector of regression coefficients

$a^{(i)}$ is a vector representing the $i^{th}$ sample / row in the data matrix $A$

$y^{(i)}$ is a scalar in $\{0, 1\}$: the $i^{th}$ label, corresponding to the $i^{th}$ sample

$N$ is the number of data samples / number of rows in the data matrix $A$.

Again, this is read as “find the $x$ that minimizes the objective function”.
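To make the objective concrete, the sketch below evaluates it twice: once as the explicit sum over samples, and once in vector form, $y^T\log(1+e^{-Ax}) + (1-y)^T\log(1+e^{Ax})$ with the log and exponential applied elementwise. The function names `logistic_loss_sum` and `logistic_loss_vec` and the random data are hypothetical, chosen only for this example.

```python
import numpy as np

def logistic_loss_sum(x, A, y):
    """Objective written as an explicit sum over the N samples."""
    total = 0.0
    for a_i, y_i in zip(A, y):
        z = x @ a_i
        total += y_i * np.log1p(np.exp(-z)) + (1 - y_i) * np.log1p(np.exp(z))
    return total

def logistic_loss_vec(x, A, y):
    """Same objective with the sum replaced by inner products."""
    z = A @ x
    # np.logaddexp(0, t) computes log(1 + e^t) elementwise in a stable way.
    return y @ np.logaddexp(0, -z) + (1 - y) @ np.logaddexp(0, z)

# Random data, assumed for illustration only.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
x = rng.normal(size=3)

print(logistic_loss_sum(x, A, y), logistic_loss_vec(x, A, y))
```

The sum does not really disappear in the vector form; it is just hidden inside the inner product with $y$ and $1-y$, which is the same observation made about the norm in the linear case.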

If you wanted to, you could take it a step further and represent $\hat{x}_\text{log}$ in matrix notation as follows:

$$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \ \operatorname{tr}\left( \begin{bmatrix} y^{(1)} & (1-y^{(1)}) \\ \vdots & \vdots \\ y^{(N)} & (1-y^{(N)})\\\end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) & \dots & \log(1+e^{-x^Ta^{(N)}}) \\\log(1+e^{x^Ta^{(1)}}) & \dots & \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} \right)$$

but you don’t gain anything from doing this. Logistic regression does not have a closed-form solution and does not gain the same benefits as linear regression does from being represented in matrix notation. To solve for $\hat{x}_\text{log}$, iterative estimation techniques such as gradient descent and the Newton-Raphson method are used. With some of these techniques (e.g. Newton-Raphson), $\hat{x}_\text{log}$ is approximated step by step, and each update can be written in matrix notation (see link provided by Alex R.).
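For a sense of what the matrix form of Newton-Raphson looks like, here is a minimal sketch under the notation above. With $p = \sigma(Ax)$ applied elementwise and $W = \text{diag}(p(1-p))$, the gradient is $A^T(p - y)$, the Hessian is $A^TWA$, and each step is $x \leftarrow x - (A^TWA)^{-1}A^T(p-y)$. The function name `newton_logistic`, the iteration cap, the tolerance, and the simulated data are all assumptions for illustration; this is not necessarily the formulation in the linked reference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(A, y, n_iter=25, tol=1e-10):
    """Plain Newton-Raphson for the logistic objective (no line search)."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ x)                 # predicted probabilities
        grad = A.T @ (p - y)               # gradient: A^T (p - y)
        w = p * (1.0 - p)                  # diagonal of W
        hess = A.T @ (A * w[:, None])      # Hessian: A^T W A
        step = np.linalg.solve(hess, grad)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Simulated data (assumed for illustration; labels drawn from the model).
rng = np.random.default_rng(2)
A = rng.normal(size=(200, 3))
x_true = np.array([0.5, -1.0, 0.25])       # hypothetical coefficients
y = (rng.uniform(size=200) < sigmoid(A @ x_true)).astype(float)

x_hat = newton_logistic(A, y)
print(x_hat)
```

Each iteration solves a weighted least-squares-like system, which is why this method is often described as iteratively reweighted least squares. The sketch assumes the classes are not perfectly separable; if they are, the Hessian becomes ill-conditioned and the iterates diverge.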