I don’t have any background in math, but I understand how the simple Perceptron works and I think I grasp the concept of a hyperplane (I imagine it geometrically as a plane in 3D space which separates two point clouds, just as a line separates two point clouds in 2D space).

But I don’t understand how one plane or one line could separate three different point clouds in 3D space or in 2D space, respectively – this is geometrically not possible, is it?

I tried to understand the corresponding section in the Wikipedia article, but already failed miserably at the sentence “Here, the input x and the output y are drawn from arbitrary sets”. Could somebody explain the multiclass perceptron to me and how it goes with the idea of the hyperplane, or maybe point me to a not-so-mathematical explanation?

**Answer**

Suppose we have data (x1,y1),…,(xk,yk) where xi∈Rn are input vectors and yi∈{red, blue, green} are the classifications.

We know how to build a classifier for binary outcomes, so we do this three times, each time treating one class against the other two grouped together: {red} vs. {blue or green}, {blue} vs. {red or green}, and {green} vs. {red or blue}.

Each model takes the form of a function f:Rn→R; call them fR, fB, fG respectively. Each maps an input vector to its signed distance from the hyperplane associated with that model, where positive distance corresponds to a prediction of red for fR, blue for fB, and green for fG. Basically, the more positive fG(x) is, the more the model thinks that x is green, and vice versa. We don’t need the output to be a probability, we just need to be able to measure how confident the model is.

Given an input x, we classify it according to argmaxc fc(x), so if fG(x) is the largest amongst {fG(x),fB(x),fR(x)} we would predict green for x.

This strategy is called “one-vs-all” (also known as “one-vs-rest”).
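To make this concrete, here is a minimal sketch of the one-vs-all scheme using plain perceptrons in NumPy. This is not the code from the original answer; the function names (`train_perceptron`, `train_one_vs_all`, `predict`) and the toy three-cluster dataset are purely illustrative:

```python
import numpy as np

def train_perceptron(X, y, epochs=50):
    """Basic binary perceptron for labels y in {-1, +1}.
    Returns weights w and bias b; the raw score of x is w @ x + b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:  # misclassified -> standard update
                w += yi * xi
                b += yi
    return w, b

def train_one_vs_all(X, labels, classes):
    """Train one binary perceptron per class: that class (+1) vs. the rest (-1)."""
    return {c: train_perceptron(X, np.where(labels == c, 1.0, -1.0))
            for c in classes}

def score(model, x):
    """Signed distance of x from the model's hyperplane (its 'confidence')."""
    w, b = model
    return (w @ x + b) / np.linalg.norm(w)

def predict(models, x):
    """argmax_c f_c(x): pick the class whose model is most confident."""
    return max(models, key=lambda c: score(models[c], x))

# Toy data: three well-separated Gaussian clusters in R^2.
rng = np.random.default_rng(0)
centers = {"red": (0.0, 0.0), "blue": (10.0, 0.0), "green": (0.0, 10.0)}
X = np.vstack([rng.normal(mu, 0.5, size=(20, 2)) for mu in centers.values()])
labels = np.repeat(list(centers), 20)

models = train_one_vs_all(X, labels, list(centers))
```

Note that no single model’s hyperplane separates all three clouds; each hyperplane only separates one class from the rest, and the argmax over the three signed distances does the final three-way classification.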

**Attribution**: *Source: Link, Question Author: grssnbchr, Answer Author: gung – Reinstate Monica*