In linear regression (squared loss), matrix notation gives us a very concise expression for the objective:

\text{minimize}~~ \|Ax-b\|^2

where A is the data matrix, x is the vector of coefficients, and b is the response.

Is there a similar matrix notation for the logistic regression objective? All the notations I have seen cannot get rid of the sum over all data points (something like \sum_{\text{data}} \text{L}_\text{logistic}(y,\beta^Tx)).
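To make the contrast concrete, here is a small NumPy sketch (the data and the names A, b, y, beta are made up for illustration): the squared loss collapses into a single matrix expression, while the logistic loss is written as the explicit per-sample sum above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))      # data matrix, one sample per row
b = rng.normal(size=100)           # continuous response for the linear model
y = rng.integers(0, 2, size=100)   # binary labels for the logistic model
beta = rng.normal(size=3)          # candidate coefficient vector

# Squared loss: one matrix expression; the norm hides the sum and the square.
linear_obj = np.linalg.norm(A @ beta - b) ** 2

# Logistic loss: the standard notation keeps the explicit sum over data points.
logistic_obj = sum(
    yi * np.log(1 + np.exp(-ai @ beta)) + (1 - yi) * np.log(1 + np.exp(ai @ beta))
    for ai, yi in zip(A, y)
)
```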

EDIT: thanks to joceratops and AdamO for their great answers. Their answers helped me realize that another reason linear regression has a more concise notation is the definition of the norm, which encapsulates the square and the sum via e^\top e. With the logistic loss there is no such definition, which makes the notation a little more complicated.
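Spelled out for the residual vector e = Ax - b, the norm bundles both operations into one product:

\|e\|_2^2 = e^\top e = \sum_{i=1}^{N} e_i^2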

**Answer**

In linear regression, the Maximum Likelihood Estimation (MLE) solution for estimating x has the following closed-form expression (assuming A is a matrix with full column rank):

\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb

This is read as “find the x that minimizes the objective function \|Ax-b\|_2^2”. The nice thing about representing the linear regression objective this way is that we can keep everything in matrix notation and solve for \hat{x}_\text{lin} by hand. As Alex R. mentions, in practice we often don’t form (A^TA)^{-1} directly because it is computationally inefficient and A often does not meet the full-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse. Computing the pseudoinverse can involve the Cholesky decomposition or the Singular Value Decomposition.
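As a brief NumPy sketch of these options (made-up data; both np.linalg.pinv and np.linalg.lstsq go through the SVD internally):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))   # full column rank with probability 1 here
b = rng.normal(size=100)

# Textbook closed form: fine on paper, numerically fragile, and
# undefined when A is rank-deficient.
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# Moore-Penrose pseudoinverse (computed via the SVD).
x_pinv = np.linalg.pinv(A) @ b

# Usually preferred in practice: a dedicated least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_normal, x_pinv) and np.allclose(x_pinv, x_lstsq)
```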

By contrast, the MLE solution for the coefficients in logistic regression is:

\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})

where (assuming each sample of data is stored row-wise):

x is the vector of regression coefficients

a^{(i)} is the vector representing the i^{th} sample/row of the data matrix A

y^{(i)} \in \{0, 1\} is the label corresponding to the i^{th} sample

N is the number of data samples (the number of rows of the data matrix A).

Again, this is read as “find the x that minimizes the objective function”.
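As a sanity check on the formula, here is a small NumPy sketch (hypothetical data; np.logaddexp(0, t) is just a numerically stable way to evaluate \log(1+e^{t})):

```python
import numpy as np

def logistic_nll(x, A, y):
    """The objective above: sum_i y_i log(1+e^{-x^T a_i}) + (1-y_i) log(1+e^{x^T a_i})."""
    z = A @ x  # z_i = x^T a^(i), one entry per sample/row
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                   # N = 100 samples stored row-wise
y = rng.integers(0, 2, size=100).astype(float)  # labels in {0, 1}
print(logistic_nll(rng.normal(size=3), A, y))
```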

If you wanted to, you could take it a step further and represent \hat{x}_\text{log} in matrix notation as follows:

\hat{x}_\text{log} = \underset{x}{\text{argmin}}~ \operatorname{tr}\left( \begin{bmatrix} y^{(1)} & (1-y^{(1)}) \\ \vdots & \vdots \\ y^{(N)} & (1-y^{(N)}) \end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) & \dots & \log(1+e^{-x^Ta^{(N)}}) \\ \log(1+e^{x^Ta^{(1)}}) & \dots & \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} \right)

where the trace collects the per-sample terms on the diagonal of the N \times N product, but you don’t gain anything from doing this. Logistic regression does not have a closed-form solution and does not reap the same benefits from matrix notation that linear regression does. To solve for \hat{x}_\text{log}, estimation techniques such as gradient descent and the Newton-Raphson method are used. With some of these techniques (e.g., Newton-Raphson), \hat{x}_\text{log} is approximated iteratively, and each update can itself be written in matrix notation (see the link provided by Alex R.).
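As a minimal Newton-Raphson sketch (an illustrative implementation with made-up data, not the one from the linked answer): the estimate has no closed form, but each update is itself a matrix expression.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(A, y, n_iter=25):
    """Newton-Raphson for the logistic NLL above: with p = sigmoid(A x),
    gradient = A^T (p - y) and Hessian = A^T diag(p (1 - p)) A."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ x)
        grad = A.T @ (p - y)                     # gradient in matrix notation
        H = A.T @ (A * (p * (1 - p))[:, None])   # Hessian A^T W A
        x = x - np.linalg.solve(H, grad)         # Newton update
    return x

rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(500, 3))
y = (rng.random(500) < sigmoid(A @ x_true)).astype(float)
print(logistic_newton(A, y))  # should land near x_true
```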

**Attribution**
*Source: Link, Question Author: Haitao Du, Answer Author: Community*