Problem is:

Derive the gradient with respect to the input layer for a single hidden-layer neural network using a sigmoid for input -> hidden, a softmax for hidden -> output, and a cross-entropy loss.

I can get through most of the derivation using the chain rule, but I am uncertain about how to actually “chain” them together.

Define some notation:

\boldsymbol{r} = \boldsymbol{x} W_1 + \boldsymbol{b_1}

\boldsymbol{h} = \sigma(\boldsymbol{r}), where \sigma is the sigmoid function

\boldsymbol{\theta} = \boldsymbol{h} W_2 + \boldsymbol{b_2}

\hat{\boldsymbol{y}} = S(\boldsymbol{\theta}), where S is the softmax function

J(\hat{\boldsymbol{y}}) = -\sum_i y_i \log \hat{y}_i, where \boldsymbol{y} is the true label as a one-hot vector
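To make the notation concrete, here is a minimal NumPy sketch of this forward pass. The dimensions D_x, D_h, D_y, the random weights, and the label index are hypothetical, chosen only so the shapes can be checked.

```python
import numpy as np

# A minimal forward pass matching the notation above.
# D_x, D_h, D_y and the random weights are hypothetical, chosen only to make shapes concrete.
D_x, D_h, D_y = 4, 5, 3
rng = np.random.default_rng(0)

x  = rng.normal(size=(1, D_x))                    # input as a row vector
W1 = rng.normal(size=(D_x, D_h)); b1 = rng.normal(size=(1, D_h))
W2 = rng.normal(size=(D_h, D_y)); b2 = rng.normal(size=(1, D_y))
y  = np.eye(D_y)[[1]]                             # one-hot true label, shape (1, D_y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

r     = x @ W1 + b1                  # r = x W1 + b1,      shape (1, D_h)
h     = sigmoid(r)                   # h = sigma(r),       shape (1, D_h)
theta = h @ W2 + b2                  # theta = h W2 + b2,  shape (1, D_y)
y_hat = softmax(theta)               # y_hat = S(theta),   shape (1, D_y)
J     = -np.sum(y * np.log(y_hat))   # cross-entropy loss
```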

Then by the chain rule,

\frac{\partial J}{\partial \boldsymbol{x}} = \frac{\partial J}{\partial \boldsymbol{\theta}} \cdot \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{r}} \cdot \frac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}}

The individual gradients are:

\frac{\partial J}{\partial \boldsymbol{\theta}} = \left( \hat{\boldsymbol{y}} - \boldsymbol{y} \right)

\frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{h}} = \frac{\partial}{\partial \boldsymbol{h}} \left[ \boldsymbol{h}W_2 + \boldsymbol{b_2}\right] = W_2^T

\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{r}} = \boldsymbol{h} \cdot \left(1-\boldsymbol{h}\right)

\frac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}} = \frac{\partial}{\partial \boldsymbol{x}} \left[ \boldsymbol{x}W_1 + \boldsymbol{b_1}\right] = W_1^T
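As a quick sanity check of the first of these, here is a sketch (reusing the variables from the forward pass above) that compares \hat{\boldsymbol{y}} - \boldsymbol{y} to a finite-difference estimate of \frac{\partial J}{\partial \boldsymbol{\theta}}:

```python
# Finite-difference check of dJ/dtheta = y_hat - y, reusing the forward-pass sketch above.
def loss_from_theta(th):
    return -np.sum(y * np.log(softmax(th)))

eps = 1e-6
num_grad_theta = np.zeros_like(theta)
for i in range(theta.shape[1]):
    tp, tm = theta.copy(), theta.copy()
    tp[0, i] += eps
    tm[0, i] -= eps
    num_grad_theta[0, i] = (loss_from_theta(tp) - loss_from_theta(tm)) / (2 * eps)

print(np.allclose(num_grad_theta, y_hat - y, atol=1e-6))   # expect True
```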

Now we have to chain the definitions together. In the single-variable case this is easy: we just multiply everything together. With vectors, I’m not sure whether to use element-wise multiplication or matrix multiplication.

\frac{\partial J}{\partial \boldsymbol{x}} = \left( \hat{\boldsymbol{y}} - \boldsymbol{y} \right) * W_2^T \cdot \left[\boldsymbol{h} \cdot \left(1-\boldsymbol{h}\right)\right] * W_1^T

Where \cdot is element-wise multiplication of vectors, and * is a matrix multiply. This combination of operations is the only way I could seem to string these together to get a 1 \times D_x dimensional vector, which I know \frac{\partial J}{\partial \boldsymbol{x}} has to be.
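As a numerical sanity check (a sketch reusing the forward-pass variables from above), this exact combination of matrix and element-wise multiplies can be compared against a finite-difference estimate of \frac{\partial J}{\partial \boldsymbol{x}}:

```python
# Backprop for dJ/dx exactly as written above, reusing the forward-pass sketch:
# matrix multiplies through W2^T and W1^T, an element-wise multiply through the sigmoid.
dJ_dtheta = y_hat - y                   # shape (1, D_y)
dJ_dh     = dJ_dtheta @ W2.T            # matrix multiply,        shape (1, D_h)
dJ_dr     = dJ_dh * (h * (1 - h))       # element-wise multiply,  shape (1, D_h)
dJ_dx     = dJ_dr @ W1.T                # matrix multiply,        shape (1, D_x)

# Compare against a finite-difference estimate of dJ/dx.
def loss_from_x(xv):
    hh = sigmoid(xv @ W1 + b1)
    return -np.sum(y * np.log(softmax(hh @ W2 + b2)))

eps = 1e-6
num_grad_x = np.zeros_like(x)
for i in range(x.shape[1]):
    xp, xm = x.copy(), x.copy()
    xp[0, i] += eps
    xm[0, i] -= eps
    num_grad_x[0, i] = (loss_from_x(xp) - loss_from_x(xm)) / (2 * eps)

print(np.allclose(num_grad_x, dJ_dx, atol=1e-6))   # expect True
```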

My question is: what is the principled way for me to figure out which operator to use? I’m specifically confused by the need for the element-wise one between W_2^T and h.

Thanks!

**Answer**

I believe that the key to answering this question is to point out that the element-wise multiplication is actually **shorthand** and therefore when you derive the equations you *never* actually use it.

The actual operation is not an element-wise multiplication but instead a standard matrix multiplication of a gradient with a Jacobian, **always**.

In the case of the nonlinearity, the Jacobian of the vector output of the nonlinearity with respect to its vector input happens to be a diagonal matrix. It is therefore true that multiplying the gradient by this matrix is equivalent to element-wise multiplying the gradient of the loss with respect to the output of the nonlinearity by a vector containing all the partial derivatives of the nonlinearity with respect to its input, but this *follows* from the Jacobian being diagonal. You must pass through the Jacobian step to get to the element-wise multiplication, which might explain your confusion.

In math, we have some nonlinearity s, a loss L, and an input to the nonlinearity x \in \mathbb{R}^{n \times 1} (this could be any tensor). The output of the nonlinearity has the same dimension, s(x) \in \mathbb{R}^{n \times 1}, since, as @Logan says, the activation function is applied element-wise.

We want \nabla_{x}L=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L

Where \dfrac{\partial s(x)}{\partial x} is the Jacobian of s. Expanding this Jacobian, we get

\begin{bmatrix}
\dfrac{\partial{s(x_{1})}}{\partial{x_1}} & \dots & \dfrac{\partial{s(x_{1})}}{\partial{x_{n}}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial{s(x_{n})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{n})}}{\partial{x_{n}}}
\end{bmatrix}

We see that it is zero everywhere except on the diagonal. We can make a vector of all its diagonal elements, Diag\left(\dfrac{\partial s(x)}{\partial x}\right), and then use the element-wise operator:

\nabla_{x}L =\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L =Diag\left(\dfrac{\partial s(x)}{\partial x}\right) \circ \nabla_{s(x)}L
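Here is a minimal, self-contained NumPy sketch of this equivalence; the choice of sigmoid, the dimension n, and the random upstream gradient are illustrative assumptions only.

```python
import numpy as np

# Self-contained check: for an element-wise nonlinearity s, multiplying the upstream
# gradient by the (diagonal) Jacobian equals the element-wise shorthand.
# The sigmoid, the dimension n, and the random upstream gradient are illustrative assumptions.
rng = np.random.default_rng(1)
n = 5
x = rng.normal(size=n)                   # input to the nonlinearity
g = rng.normal(size=n)                   # some upstream gradient, nabla_{s(x)} L

s       = 1.0 / (1.0 + np.exp(-x))       # s(x), sigmoid applied element-wise
s_prime = s * (1 - s)                    # the diagonal entries of the Jacobian ds(x)/dx

jacobian = np.diag(s_prime)              # full (n, n) Jacobian: zero off the diagonal

print(np.allclose(jacobian.T @ g, s_prime * g))   # expect True
```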

*Attribution: Source: Link, Question Author: amatsukawa, Answer Author: Leonard2*