I think I have some fundamental confusion about how the functions in Logistic regression work (or maybe just functions as a whole).

How is it that the function h(x) produces the curve seen in the left of the image?

I see that this is a plot of two variables, but these two variables (x1 & x2) are also arguments of the function itself. I know standard functions of one variable map to one output, but this function clearly isn’t doing that, and I’m not totally sure why.

My intuition is that the blue/pink curve isn’t really plotted on this graph but rather is a representation (circles and X’s) that get mapped to values in the next dimension (3rd) of the graph. Is this reasoning faulty and am I just missing something? Thanks for any insight/intuition.

**Answer**

This is an example of overfitting from the Coursera course on ML by Andrew Ng, for a classification model with two features $(x_1, x_2)$, in which the true labels are symbolized by $\color{red}{\large \times}$ and $\color{blue}{\large\circ},$ and the decision boundary is tailored precisely to the training set through the use of high-order polynomial terms.

The problem it illustrates is that, although the decision boundary (the curvilinear line in blue) doesn’t misclassify any training examples, its ability to generalize beyond the training set will be compromised. Andrew Ng goes on to explain that regularization can mitigate this effect, and draws the magenta curve as a decision boundary that fits the training set less tightly and is more likely to generalize.

With regards to your specific question:

> My intuition is that the blue/pink curve isn’t really plotted on this graph but rather is a representation (circles and X’s) that get mapped to values in the next dimension (3rd) of the graph.

There is no height (third dimension): there are two categories ($\large\times$ and $\large\circ$), and the decision line shows how the model is separating them. In the simpler model

$$h_\theta(x)=g\left(\theta_0 + \theta_1 \, x_1 + \theta_2 \, x_2 \right)$$

the decision boundary will be linear.
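To make the linear case concrete, here is a minimal Python sketch, assuming $g$ is the logistic sigmoid. The parameter values in `theta` are made up for illustration and do not come from the course. Points on the line $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ get an output of exactly $0.5$, and points on either side fall above or below it.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x1, x2):
    """Hypothesis h_theta(x) = g(theta_0 + theta_1*x1 + theta_2*x2)."""
    theta0, theta1, theta2 = theta
    return sigmoid(theta0 + theta1 * x1 + theta2 * x2)

# Hypothetical parameters: the decision boundary is the line -3 + x1 + x2 = 0.
theta = (-3.0, 1.0, 1.0)

on_boundary = h(theta, 1.0, 2.0)  # lies on the line x1 + x2 = 3
above = h(theta, 3.0, 3.0)        # on the side where x1 + x2 > 3
below = h(theta, 0.0, 0.0)        # on the side where x1 + x2 < 3

print(on_boundary, above, below)
```

The function still maps each pair $(x_1, x_2)$ to a single number; the line drawn in the feature plane is just the set of inputs where that number equals the threshold.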

Perhaps you have in mind something like this, for example:

$$5 + 2 x - 1.3 x^2 - 1.2 x^2 y + 1 x^2 y^2 + 3 x^2 y^3$$

However, notice that there is a $g(\cdot)$ function in the hypothesis: the logistic activation in your initial question. So for every value of $x_1$ and $x_2$ the polynomial function undergoes an “activation” (often a non-linear function such as the sigmoid in the OP, although other choices exist, e.g. ReLU). Because its output is bounded, the sigmoid activation lends itself to a probabilistic interpretation: the idea in a classification model is that at a given threshold the output will be labeled $\large \times$ (or $\large \circ$). Effectively, a continuous output will be squashed into a binary $(1,0)$ output.
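As a rough sketch of this squashing step in Python, the example polynomial above can be used as the raw score, reading $x$ as $x_1$ and $y$ as $x_2$ (an assumption made only for illustration). The threshold of $0.5$ and the mapping of $1$ to $\large\times$ and $0$ to $\large\circ$ are also illustrative choices.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def score(x, y):
    """The example polynomial from the text, used here as the raw score."""
    return (5 + 2 * x - 1.3 * x**2 - 1.2 * x**2 * y
            + 1 * x**2 * y**2 + 3 * x**2 * y**3)

def predict_label(x, y, threshold=0.5):
    """Squash the score into (0, 1), then threshold it into a label 1 or 0."""
    return 1 if sigmoid(score(x, y)) >= threshold else 0

label_a = predict_label(0.0, 0.0)   # score is 5, so the sigmoid is close to 1
label_b = predict_label(2.0, -2.0)  # score is strongly negative, sigmoid near 0

print(label_a, label_b)
```

Every point of the plane receives one of the two labels this way, which is why the picture only needs two axes plus a marker for the class.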

Depending on the weights (or parameters) and the activation function, each point $(x_1,x_2)$ in the feature plane will be mapped to either the category $\large \times$ or $\large \circ$. This labeling may or may not be correct: the labels are correct when the sample points drawn as $\color{red}{\large \times}$ and $\color{blue}{\large \circ}$ on the plane in the picture in the OP match the predicted labels. The boundaries between regions of the plane labeled $\large \times$ and adjacent regions labeled $\large \circ$ are the decision boundaries. They can be a single line, or multiple lines isolating “islands” (see for yourself by playing with this app by Tony Fischetti, part of this blog entry on R-bloggers).

Notice the entry in Wikipedia on decision boundary:

> In a statistical-classification problem with two classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class. A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous.

There is no need for a height component to graph the actual boundary. If, on the other hand, you are plotting the sigmoid activation value (continuous, with values in $(0,1)$), then you do need a third (“height”) component to visualize the graph.
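The difference between the two views can be sketched in Python on a small grid. This is only an illustration: the linear model and the parameters in `theta` are hypothetical, and the grid is printed as text instead of being plotted. `heights` holds the continuous sigmoid value at each grid point (what a 3D surface would show), while `labels` holds the thresholded class (what the 2D picture shows).

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for a linear model: boundary at x1 + x2 = 3.
theta = (-3.0, 1.0, 1.0)
grid = [0.0, 1.0, 2.0, 3.0]

# heights[i][j] is the sigmoid output at x2 = grid[i], x1 = grid[j].
heights = [
    [sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2) for x1 in grid]
    for x2 in grid
]

# labels[i][j] is the class after thresholding at 0.5: no height needed.
labels = [[1 if p >= 0.5 else 0 for p in row] for row in heights]

# Print the label map with x2 increasing upwards, like the usual scatter plot.
for row in reversed(labels):
    print(" ".join("x" if v else "o" for v in row))
```

The printed map is flat: each cell carries a category, and the boundary is simply where the symbol changes.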

If you want to introduce a 3D visualization for the decision surface, check this slide from an online course on neural networks by Hugo Larochelle, representing the activation of a neuron:

where $y_1 = h_\theta(x)$, and $\mathbf W$ is the weight vector $(\Theta)$ in the example in the OP. Most interesting is the fact that $\Theta$ is orthogonal to the separating “ridge” in the classifier: effectively, if the ridge is a (hyper-)plane, the vector of weights or parameters is the normal vector.
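The orthogonality claim can be checked numerically with a small Python sketch. The parameters are hypothetical, and "weight vector" here means $(\theta_1, \theta_2)$ without the bias term $\theta_0$. Any two points on the boundary define a direction along it, and the dot product of that direction with the weights should be zero.

```python
# Hypothetical parameters: the boundary is the line -3 + 1*x1 + 2*x2 = 0.
theta0, theta1, theta2 = -3.0, 1.0, 2.0

def raw_score(x1, x2):
    """Linear score theta_0 + theta_1*x1 + theta_2*x2 (zero on the boundary)."""
    return theta0 + theta1 * x1 + theta2 * x2

# Two points chosen by hand to lie on the boundary.
p = (3.0, 0.0)
q = (1.0, 1.0)

# Direction along the boundary, and its dot product with the weights.
direction = (q[0] - p[0], q[1] - p[1])
dot = theta1 * direction[0] + theta2 * direction[1]

print(raw_score(*p), raw_score(*q), dot)
```

Moving along the boundary leaves the score unchanged, so the change in score, which is exactly this dot product, has to be zero.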

Joining multiple neurons, these separating hyperplanes can be added and subtracted to end up with capricious shapes:
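As a toy illustration of adding and subtracting such units, the Python sketch below combines four sigmoid neurons into an "island". The weights, the steepness `k`, and the output threshold of $1.5$ are hand-picked assumptions, not trained values. Subtracting two shifted sigmoids gives a bump along one axis, and adding the bumps for both axes gives a value that is only high inside a square.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bump(x, lo=1.0, hi=2.0, k=20.0):
    """Difference of two steep sigmoids: close to 1 for lo < x < hi, else near 0."""
    return sigmoid(k * (x - lo)) - sigmoid(k * (x - hi))

def in_island(x1, x2, threshold=1.5):
    """Sum the two bumps; only points inside the square exceed the threshold."""
    return bump(x1) + bump(x2) > threshold

inside = in_island(1.5, 1.5)   # centre of the square [1, 2] x [1, 2]
beside = in_island(1.5, 0.0)   # only one of the two bumps is active
outside = in_island(3.0, 3.0)  # neither bump is active

print(inside, beside, outside)
```

With more units and learned weights, the same kind of combination can carve out much less regular regions.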

This links to the universal approximation theorem.

**Attribution**
*Source: Link, Question Author: muZero, Answer Author: Antoni Parellada*