In machine learning blogs I frequently encounter the word “vanilla”, as in “Vanilla Gradient Descent” or “Vanilla method”. I have never seen this term in any optimization textbook.

For instance, in this post, it says:

> This is the simplest form of gradient descent technique. Here, vanilla means pure / without any adulteration. Its main feature is that we take small steps in the direction of the minima by taking gradient of the cost function.

Pray tell, what does “adulteration” mean in this context? The author goes further by contrasting vanilla gradient descent with gradient descent with momentum. So in this case vanilla gradient descent is just another name for gradient descent.

In another post, it says:

> Vanilla gradient descent, aka batch gradient descent, …

Sadly I have never heard of batch gradient descent either. Oh boy.

Can someone clarify what “vanilla” means and whether there is a firmer mathematical definition behind it?

**Answer**

“Vanilla” means the standard, usual, or unmodified version of something. Vanilla gradient descent is the basic gradient descent algorithm without any bells or whistles.

There are many variants of gradient descent. In the usual form (also known as batch gradient descent or vanilla gradient descent), the gradient is computed as the average of the gradients of the individual datapoints:

$$\nabla f = \frac{1}{n}\sum_{i=1}^{n} \nabla \operatorname{loss}(x_i)$$
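As a minimal sketch (not from the original answer), this is what the batch update looks like in code. The function `grad_loss(w, x)`, the learning rate, and the step count are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(grad_loss, X, w, lr=0.1, n_steps=100):
    """Vanilla (batch) gradient descent: every step uses the average
    of the per-datapoint gradients over the ENTIRE dataset X."""
    for _ in range(n_steps):
        # Average gradient over all n datapoints, matching the formula above.
        grad = np.mean([grad_loss(w, x) for x in X], axis=0)
        w = w - lr * grad
    return w
```

For example, with the squared loss $\tfrac{1}{2}(w - x_i)^2$, whose per-datapoint gradient is $w - x_i$, this converges to the mean of the data.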

In stochastic gradient descent with a batch size of one, we instead estimate the gradient as

$$\nabla f \approx \nabla \operatorname{loss}(x^*),$$

where $x^*$ is randomly sampled from the entire dataset. This is a variant of normal gradient descent, so it wouldn’t be called vanilla gradient descent. However, since even stochastic gradient descent has many variants, you might call this “vanilla stochastic gradient descent” when comparing it to fancier SGD variants, for example, SGD with momentum.
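To make the contrast concrete, here is a sketch (my own illustration, not from the original answer) of batch-size-one SGD where setting `momentum=0.0` gives the “vanilla” variant and `momentum > 0` gives SGD with momentum; the function name and parameters are assumptions:

```python
import numpy as np

def sgd(grad_loss, X, w, lr=0.1, momentum=0.0, n_steps=1000, seed=0):
    """SGD with batch size one. momentum=0.0 is 'vanilla' SGD;
    momentum>0 adds a velocity term that smooths successive steps."""
    rng = np.random.default_rng(seed)
    v = 0.0
    for _ in range(n_steps):
        x_star = X[rng.integers(len(X))]   # one randomly sampled datapoint x*
        g = grad_loss(w, x_star)           # noisy one-sample gradient estimate
        v = momentum * v - lr * g          # velocity update (v = -lr*g when momentum=0)
        w = w + v
    return w
```

Because each step sees only one sample, the iterates fluctuate around the batch-gradient-descent solution rather than converging to it exactly at a fixed learning rate.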

**Attribution**
*Source: Link, Question Author: Fraïssé, Answer Author: shimao*