# Is it correct to say that neural networks are an alternative way of performing Maximum Likelihood Estimation? If not, why? [duplicate]

We often say that minimizing the (negative) cross-entropy error is the same as maximizing the likelihood. So can we say that NNs are just an alternative way of performing Maximum Likelihood Estimation? If not, why?

There seems to be a misunderstanding concerning the actual question being asked. There are two questions that the OP possibly wants to ask:

1. Given some other fixed, parametrized model class that is formulated in a probabilistic way, can we somehow use NNs to very concretely optimize the likelihood of the parameters? Then, as @Cliff AB posted: this seems strange and unnatural to me. NNs are there for approximating functions. However, I strongly believe that this was not the question.

2. Given a concrete dataset consisting of 'real' answers $y^{(i)}$ and real $d$-dimensional data vectors $x^{(i)} = (x^{(i)}_1, ..., x^{(i)}_d)$, and given a fixed architecture of a NN, we can use the cross-entropy function in order to find the best parameters. Question: is this the same as maximizing the likelihood of some probabilistic model? (This is the question in the post linked in the comments by @Sycorax.)

Since the answer in the linked thread is also somewhat missing insight, let me try to answer that again. We are going to consider the following very simple neural network with just one node and sigmoid activation function (and no bias term), i.e. the weights $w = (w_1, ..., w_d)$ are the parameters and the function is:
$$f_w(x) = \sigma\left(\sum_{j=1}^d w_j x_j\right)$$
The cross entropy loss function is
$$l(\hat{y}, y) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

So given the dataset $y^{(i)}, x^{(i)}$ as above, we form
$$\sum_{i=1}^n l(y^{(i)}, f_w(x^{(i)}))$$
and minimize that in order to find the parameters $w$ for the neural network. Let us put that aside for a moment and go for a completely different model.
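As a quick sanity check, this objective is easy to write out directly. Here is a minimal sketch, where the arrays `X`, `y` and the weight vector `w` are hypothetical toy values chosen purely for illustration:

```python
import numpy as np

# Hypothetical toy dataset: n=4 points with d=2 features each, binary labels.
X = np.array([[0.5, 1.0],
              [1.5, -0.5],
              [-1.0, 0.5],
              [2.0, 1.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_w(w, x):
    # One-node network with no bias: sigma(sum_j w_j x_j)
    return sigmoid(x @ w)

def total_cross_entropy(w, X, y):
    # sum_i l(y_i, f_w(x_i)) with l the cross-entropy loss from the text
    y_hat = f_w(w, X)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w = np.array([0.3, -0.2])  # an arbitrary candidate weight vector
print(total_cross_entropy(w, X, y))
```

Minimizing `total_cross_entropy` over `w` (e.g. by gradient descent) is exactly the training problem described above.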

We assume that there are random variables $(X^{(i)}, Y^{(i)})_{i=1,...,n}$ such that the $(X^{(i)}, Y^{(i)})$ are i.i.d. and such that
$$P[Y^{(i)}=1|X^{(i)}=x^{(i)}] = f_w(x^{(i)})$$
where again, $\theta = w = (w_1,...,w_d)$ are the parameters of the model. Let us set up the likelihood: put $Y = (Y^{(1)}, ..., Y^{(n)})$, $X = (X^{(1)}, ..., X^{(n)})$, $y = (y^{(1)}, ..., y^{(n)})$ and $x = (x^{(1)}, ..., x^{(n)})$. Since the $Z^{(i)} = (X^{(i)}, Y^{(i)})$ are independent,
$$\begin{align*} P[Y=y|X=x] &= \prod_{i=1}^n P[Y^{(i)}=y^{(i)}|X^{(i)}=x^{(i)}] \\ &= \prod_{\{i : y^{(i)}=1\}} P[Y^{(i)}=1|X^{(i)}=x^{(i)}] \prod_{\{i:y^{(i)}=0\}} \left(1 - P[Y^{(i)}=1|X^{(i)}=x^{(i)}]\right) \\ &= \prod_{\{i : y^{(i)}=1\}} f_w(x^{(i)}) \prod_{\{i:y^{(i)}=0\}} \left(1 - f_w(x^{(i)})\right) \\ &= \prod_{i=1}^n \left(f_w(x^{(i)})\right)^{y^{(i)}} \left(1 - f_w(x^{(i)})\right)^{1 - y^{(i)}} \end{align*}$$
So this is the likelihood. We would need to maximize that, i.e. most probably we need to compute some gradients of that expression with respect to $w$. Uuuh, there is an ugly product in front... The rule $(fg)' = f'g + fg'$ does not look very appealing. Hence we do the following (usual) trick: we do not maximize the likelihood itself, but we compute its log and maximize that instead. For technical reasons we actually compute $-\log(\text{likelihood})$ and minimize it. So let us compute $-\log(\text{likelihood})$: using $\log(ab) = \log(a) + \log(b)$ and $\log(a^b) = b\log(a)$ we obtain

$$\begin{align*} -\log(\text{likelihood}) &= -\log \left( \prod_{i=1}^n \left(f_w(x^{(i)})\right)^{y^{(i)}} \left(1 - f_w(x^{(i)})\right)^{1 - y^{(i)}} \right) \\ &= - \sum_{i=1}^n \left[ y^{(i)} \log(f_w(x^{(i)})) + (1-y^{(i)}) \log(1-f_w(x^{(i)})) \right] \end{align*}$$
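This identity is also easy to verify numerically: the negative log of the product-form likelihood agrees with the summed cross-entropy loss, up to floating-point error. A small sketch with hypothetical random data and weights (the one-node sigmoid model from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n=50 points, d=3 features, arbitrary weights and labels.
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)
y = rng.integers(0, 2, size=50).astype(float)

# Model predictions y_hat = f_w(x) = sigmoid(x . w)
y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))

# Likelihood: product over i of y_hat^y * (1 - y_hat)^(1 - y)
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Summed cross-entropy loss from the NN formulation
loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(np.allclose(-np.log(likelihood), loss))  # True: the two coincide
```

(For large $n$ the product itself underflows to zero in floating point, which is another practical reason to work with the log-likelihood directly.)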

and if you now compare carefully to the NN model above, you will see that this is actually nothing other than $\sum_{i=1}^n l(y^{(i)}, f_w(x^{(i)}))$.

So yes, in this case these two concepts (maximizing the likelihood of a probabilistic model and minimizing the loss function w.r.t. the model parameters) actually coincide. This is a more general pattern that occurs with other models as well. The connection is always

$$-\log(\text{likelihood}) = \text{loss function}$$
and
$$e^{-\text{loss function}} = \text{likelihood}$$

In that sense, statistics and machine learning are the same thing, just reformulated in a quirky way. Another example would be linear regression: there also exists a precise mathematical description of the probabilistic model behind it; see for example Likelihood in Linear Regression.

Notice that it may be pretty hard to figure out a natural explanation for the probabilistic version of a model. For example, in the case of SVMs, the probabilistic description seems to be Gaussian processes: see here.

The case above, however, was simple, and what I have basically shown you is logistic regression (because a NN with one node and a sigmoid output function is exactly logistic regression!). It may be a lot harder to interpret complicated architectures (with tweaks like CNNs, etc.) as a probabilistic model.
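To close the loop on the logistic-regression remark: training the one-node network by gradient descent on the cross-entropy loss is the same computation as fitting logistic regression by maximum likelihood. A minimal sketch, where the true weights `w_true` and the simulated dataset are hypothetical (for the sigmoid, the loss gradient takes the familiar logistic-regression form $X^\top(\hat{y} - y)$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground truth: labels drawn from the probabilistic model
# P[Y=1 | X=x] = sigmoid(x . w_true)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
p = 1.0 / (1.0 + np.exp(-(X @ w_true)))
y = (rng.random(500) < p).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the (mean) cross-entropy loss = maximizing the likelihood
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y) / len(y)  # gradient of mean cross-entropy
    w -= lr * grad

print(w)  # should land near w_true, up to sampling noise
```

The recovered `w` is the maximum-likelihood estimate of the logistic-regression coefficients; with 500 samples it should be close to `w_true`, though not exactly equal due to sampling noise.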