# How can a multiclass perceptron work?

I don’t have any background in math, but I understand how the simple Perceptron works and I think I grasp the concept of a hyperplane (I imagine it geometrically as a plane in 3D space which separates two point clouds, just as a line separates two point clouds in 2D space).

But I don’t understand how one plane or one line could separate three different point clouds in 3D space or in 2D space, respectively – this is geometrically not possible, is it?

I tried to understand the corresponding section in the Wikipedia article, but already failed miserably at the sentence “Here, the input x and the output y are drawn from arbitrary sets”. Could somebody explain the multiclass perceptron to me and how it goes with the idea of the hyperplane, or maybe point me to a not-so-mathematical explanation?

Suppose we have data $(x_1, y_1), \dots, (x_k,y_k)$ where $x_i \in \mathbb{R}^n$ are input vectors and $y_i \in \{\text{red, blue, green} \}$ are the classifications.
We know how to build a classifier for binary outcomes, so we do this three times, each time pitting one class against the other two: $\{\text{red}\}$ vs. $\{\text{blue, green}\}$, $\{\text{blue}\}$ vs. $\{\text{red, green}\}$, and $\{\text{green}\}$ vs. $\{\text{red, blue}\}$ (the "one-vs-rest" scheme).
Each model takes the form of a function $f: \mathbb{R}^n \to \mathbb{R}$; call them $f_R, f_B, f_G$ respectively. Each one maps an input vector to its signed distance from the hyperplane associated with that model, where a positive distance corresponds to a prediction of red for $f_R$, blue for $f_B$, and green for $f_G$. Basically, the more positive $f_G(x)$ is, the more the model thinks that $x$ is green, and vice versa. We don’t need the output to be a probability; we just need to be able to measure how confident the model is.
Given an input $x$, we classify it according to $\text{argmax}_{c} \ f_c(x)$, so if $f_G(x)$ is the largest amongst $\{f_G(x), f_B(x), f_R(x) \}$ we would predict green for $x$.
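The one-vs-rest scheme above can be sketched in a few lines of numpy. This is an illustrative sketch, not a reference implementation: the function names and the toy three-cluster data are my own, and the score $w \cdot x + b$ is used directly as the confidence $f_c(x)$ (it is proportional to, not equal to, the signed distance unless $w$ is normalized).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_binary_perceptron(X, targets, epochs=20, lr=1.0):
    """Classic perceptron rule; targets are +1 (this class) / -1 (the rest)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            if t * (w @ x + b) <= 0:  # misclassified: nudge hyperplane toward x
                w += lr * t * x
                b += lr * t
    return w, b

# Toy data (assumed, for illustration): three well-separated 2-D clusters.
classes = ["red", "blue", "green"]
centers = {"red": (0, 0), "blue": (5, 0), "green": (0, 5)}
X = np.vstack([rng.normal(centers[c], 0.3, size=(30, 2)) for c in classes])
y = np.array([c for c in classes for _ in range(30)])

# One binary classifier per class: "this class" vs. "the other two".
models = {c: train_binary_perceptron(X, np.where(y == c, 1, -1))
          for c in classes}

def predict(x):
    # argmax over the three scores, i.e. f_R(x), f_B(x), f_G(x)
    scores = {c: w @ x + b for c, (w, b) in models.items()}
    return max(scores, key=scores.get)
```

Note that each binary model only needs its own hyperplane to separate one cloud from the other two; no single hyperplane ever has to separate all three clouds at once, which resolves the geometric puzzle in the question.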