Say I want to train a classifier that assigns an image of a person to one of the classes

`young`, `middle-aged`, or `old`. A simple way would be to treat the classes as independent categories and train a classifier. But there is clearly some relationship between the classes; how can I make use of it to do better?

I’m thinking maybe I can do

1) Change the loss, e.g. increase the loss for predicting `young` as `old` or `old` as `young`.

2) Turn it into a regression problem, where `young`, `middle-aged`, and `old` are represented as, say, 0, 1 and 2.

**Answer**

I had a look at this recently with a convolutional neural network classifier working with six ordinal classes. I tried three different methods:

### Method 1: Standard independent classification

This is what you mentioned as a baseline in the question, with the mapping:

```
class 0 -> [1, 0, 0, 0, 0, 0]
class 1 -> [0, 1, 0, 0, 0, 0]
class 2 -> [0, 0, 1, 0, 0, 0]
class 3 -> [0, 0, 0, 1, 0, 0]
class 4 -> [0, 0, 0, 0, 1, 0]
class 5 -> [0, 0, 0, 0, 0, 1]
```

We would typically use softmax activation, and categorical crossentropy loss with this.
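As a minimal sketch of this encoding and the corresponding prediction rule (the helper names here are mine, not from the original answer):

```python
import numpy as np

def one_hot_encode(labels, num_classes=6):
    """Map integer class labels to one-hot target vectors."""
    labels = np.asarray(labels)
    targets = np.zeros((len(labels), num_classes))
    targets[np.arange(len(labels)), labels] = 1
    return targets

def predict_class(probs):
    """Predict the class with the highest softmax probability."""
    return np.argmax(probs, axis=-1)
```

For example, `one_hot_encode([2])` gives `[[0, 0, 1, 0, 0, 0]]`, matching the mapping above.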

However, as you say, this does not take into account the relationship between the classes: the loss depends only on whether you hit the right class, not on how close you come.

### Method 2: Ordinal target function

This is an approach published by Cheng et al. (2008), which has also been referred to on StackExchange here and here. The mapping is now:

```
class 0 -> [0, 0, 0, 0, 0]
class 1 -> [1, 0, 0, 0, 0]
class 2 -> [1, 1, 0, 0, 0]
class 3 -> [1, 1, 1, 0, 0]
class 4 -> [1, 1, 1, 1, 0]
class 5 -> [1, 1, 1, 1, 1]
```

This is used with a sigmoid activation and binary crossentropy loss. This target function means that the loss is smaller the closer you get to the right class.

You can predict a class from the output $\{y_k\}$ of this classifier by finding the first index $k$ where $y_k < 0.5$. $k$ then gives you the predicted class.
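A minimal sketch of this target encoding and decoding rule, again with helper names of my own choosing; note that the decoder below counts the outputs above the threshold, which agrees with "first index below 0.5" whenever the outputs decrease monotonically, and is a common robust variant otherwise:

```python
import numpy as np

def ordinal_encode(labels, num_classes=6):
    """Map class k to a target whose first k entries are 1."""
    targets = np.zeros((len(labels), num_classes - 1))
    for i, k in enumerate(labels):
        targets[i, :k] = 1
    return targets

def ordinal_decode(outputs, threshold=0.5):
    """Predicted class = number of outputs above the threshold."""
    return np.sum(np.asarray(outputs) > threshold, axis=-1)
```

For example, `ordinal_encode([3])` gives `[[1, 1, 1, 0, 0]]`, and decoding the sigmoid outputs `[0.9, 0.8, 0.3, 0.2, 0.1]` yields class 2.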

### Method 3: Turning classification into regression

This is the same idea as your second one. The mapping here would be:

```
class 0 -> [0]
class 1 -> [1]
class 2 -> [2]
class 3 -> [3]
class 4 -> [4]
class 5 -> [5]
```

I used a linear activation and mean-squared-error loss with this. Like the previous approach, the loss here is smaller the closer your prediction is to the correct class.

When predicting a class based on the output of this, you can simply round the output to the nearest integer.
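The rounding step can be sketched as follows; the clipping to the valid class range is my addition, to guard against regression outputs falling outside [0, 5]:

```python
import numpy as np

def regression_decode(outputs, num_classes=6):
    """Round regression outputs to the nearest valid class index."""
    return np.clip(np.rint(outputs), 0, num_classes - 1).astype(int)
```

For example, an output of 3.4 decodes to class 3, while an out-of-range output such as 7.2 is clipped to class 5.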

### Some example results

I evaluated the different methods with the same data set. The metrics were precise accuracy (hitting the correct class) and adjacent accuracy (hitting the correct class or one of its neighbours), in class-unbalanced and class-balanced versions. Each metric value shown below is found as the average of three runs.

For Method 1 / Method 2 / Method 3, the metrics gave:

- Unbalanced precise accuracy: 0.582 / **0.606** / 0.564
- Balanced precise accuracy: 0.460 / 0.499 / **0.524**
- Unbalanced adjacent accuracy: 0.827 / 0.835 / **0.855**
- Balanced adjacent accuracy: 0.827 / 0.832 / **0.859**

Thus, for my particular dataset and network setup, the regression approach generally does the best, and the standard approach with independent classes generally does the worst. I don’t know how well these results generalise to other cases, but it should not be that difficult to adapt any ordinal classifier to be able to use all three methods so that you can test for yourself.

**Attribution**
*Source: Link, Question Author: dontloo, Answer Author: Erlend Magnus Viggen*