How does the back-propagation work in a siamese neural network?

I have been studying the architecture of the siamese neural network introduced by Yann LeCun and his colleagues in 1994 for signature verification (“Signature verification using a siamese time delay neural network”, NIPS 1994).

I understand the general idea of this architecture, but I cannot work out how backpropagation works in this case.
In particular, I cannot see what the target values of the network are, that is, the values that allow backpropagation to set the weights of each neuron properly.

(Image from “Probabilistic Siamese Network for Learning Representations” by Chen Liu, University of Toronto, 2013.)

In this architecture, the algorithm computes the cosine similarity between the final representations of the two neural networks.
The paper states: “The desired output is for a small angle between the outputs of the two subnetworks (f1 and f2) when two genuine signatures are presented, and a large angle if one of the signatures is a forgery”.

I cannot really understand how they could use a binary function (cosine similarity between two vectors) as a target to run backpropagation.

How is the backpropagation computed in the siamese neural networks?


Both networks share the same architecture and are constrained to have the same weights, as the publication describes in Section 4 [1].

Their goal is to learn features that maximize the cosine similarity between their output vectors (a small angle) when both signatures are genuine, and minimize it (a large angle) when one of them is forged. This is the goal of backpropagation as well, although the actual loss function is not presented.

The cosine similarity cos(A,B)=\frac{A \cdot B}{\|A\| \|B\|} of two vectors A, B is a measure of similarity that gives you the cosine of the angle between them. Its output is therefore not binary but a continuous value between -1 and 1, and it is differentiable with respect to both vectors. If your concern is how you can backpropagate through a function that outputs either true or false, think of the case of binary classification.
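As a quick illustration of the point that the output is continuous rather than binary, here is a minimal sketch in plain Python of the usual definition (dot product divided by the product of the norms). The function name cosine_similarity and the sample vectors are only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
print(cosine_similarity([1.0, 0.2], [0.8, 0.9]))   # somewhere in between
```

Because the value changes smoothly as either vector changes, it can be differentiated like any other layer output.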

You shouldn’t change the output layer: it consists of trained neurons with linear values, and it is a higher-level abstraction of your input. The whole network should be trained together. Both outputs O_1 and O_2 are passed through a cos(O_1,O_2) function that outputs their cosine similarity (close to 1 if they point in the same direction, 0 if they are orthogonal, and down to -1 if they point in opposite directions). Given that, writing cos(x_A,x_B) as shorthand for the cosine similarity of the two outputs produced for the input pair (x_A,x_B), and given two sets of input pairs X_{Forged} and X_{Genuine}, an example of the simplest possible loss function you could train against is:

\mathcal{L}=\sum_{(x_A,x_B) \in X_{Forged}} \cos(x_A,x_B) - \sum_{(x_C,x_D) \in X_{Genuine}} \cos(x_C,x_D)
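To make this concrete, here is a rough sketch of that loss in plain Python. Everything in it is an assumption for illustration: the sub-network is reduced to a single 2x2 linear map (embed), the two toy pairs are invented, and the gradient is estimated by finite differences rather than by real backpropagation, which a framework would do with automatic differentiation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embed(w, x):
    # Hypothetical shared sub-network: a 2x2 linear map with flat weights w.
    return [w[0] * x[0] + w[1] * x[1], w[2] * x[0] + w[3] * x[1]]

def loss(w, forged, genuine):
    # Sum of similarities over forged pairs minus sum over genuine pairs.
    # Both inputs of every pair go through the SAME weights w.
    forged_term = sum(cosine(embed(w, a), embed(w, b)) for a, b in forged)
    genuine_term = sum(cosine(embed(w, a), embed(w, b)) for a, b in genuine)
    return forged_term - genuine_term

def numerical_grad(w, forged, genuine, eps=1e-6):
    # Central finite differences, standing in for backpropagation.
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((loss(up, forged, genuine) - loss(down, forged, genuine)) / (2 * eps))
    return grad

# Invented toy data: one genuine pair and one forged pair.
genuine = [([1.0, 0.2], [0.9, 0.3])]
forged = [([1.0, 0.2], [0.8, 0.9])]

w = [1.0, 0.0, 0.0, 1.0]  # start from the identity map
before = loss(w, forged, genuine)
for _ in range(20):
    g = numerical_grad(w, forged, genuine)
    w = [wi - 0.05 * gi for wi, gi in zip(w, g)]  # plain gradient descent step
after = loss(w, forged, genuine)

print(before, after)  # the loss should go down
```

The point is only that the loss is an ordinary scalar function of the shared weights, so there is no separate per-neuron target: the gradient of this scalar is what gets propagated back.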

After you have trained your network, you just input the two signatures, take the two outputs, pass them to the cos(O_1,O_2) function, and check their similarity.

Finally, there are several ways to keep the network weights identical (they are used in Recurrent Neural Networks too); a common approach is to average the gradients of the two networks before performing the gradient descent update step, so that both copies receive exactly the same update.
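The following sketch illustrates why combining the two branch gradients is the right thing to do. It treats the two branches as if they had separate weights w1 and w2, and checks numerically that the gradient with respect to a single shared weight vector equals the sum of the two branch gradients. Averaging instead of summing only rescales the step by one half, which can be absorbed into the learning rate. The 2x2 linear embed function, the weights, and the inputs are all hypothetical, and finite differences stand in for backpropagation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embed(w, x):
    # Hypothetical sub-network: a 2x2 linear map with flat weights w.
    return [w[0] * x[0] + w[1] * x[1], w[2] * x[0] + w[3] * x[1]]

def similarity(w1, w2, xa, xb):
    # Branch 1 uses w1, branch 2 uses w2; in a siamese network w1 == w2.
    return cosine(embed(w1, xa), embed(w2, xb))

def grad(f, w, eps=1e-6):
    # Central finite-difference gradient of f at w.
    g = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g

w = [0.5, -0.3, 0.8, 0.1]
xa, xb = [1.0, 0.2], [0.8, 0.9]

# Gradient through each branch alone, holding the other branch fixed.
grad_branch1 = grad(lambda v: similarity(v, w, xa, xb), w)
grad_branch2 = grad(lambda v: similarity(w, v, xa, xb), w)
# Gradient when both branches really share the same weights.
grad_shared = grad(lambda v: similarity(v, v, xa, xb), w)

summed = [g1 + g2 for g1, g2 in zip(grad_branch1, grad_branch2)]
averaged = [0.5 * s for s in summed]  # same direction, half the length

print(grad_shared)
print(summed)
```

Applying the same combined gradient to both copies keeps them identical after every update, which is what the weight constraint requires.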


Answer Author: Yannis Assael
