I know that KL divergence is not symmetric, so it cannot strictly be considered a metric. Given that, why is it used when JS divergence is symmetric (and its square root even satisfies all the properties of a metric)?
Are there scenarios where KL divergence can be used but not JS divergence, or vice versa?
I found a very thorough answer on Quora and am reposting it here for anyone searching for this:
The Kullback-Leibler divergence has a few nice properties, one of them
being that $KL[q;p]$ kind of abhors regions where $q(x)$ has
non-zero mass and $p(x)$ has zero mass. This might look like a bug,
but it's actually a feature in certain situations.
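For reference, these are the standard definitions at play, written in the answer's $KL[q;p]$ ordering, with $m$ the equal mixture of the two distributions:

$$KL[q;p] = \int q(x)\,\log\frac{q(x)}{p(x)}\,dx, \qquad JS[p;q] = \frac{1}{2}KL[p;m] + \frac{1}{2}KL[q;m], \quad m(x) = \frac{p(x)+q(x)}{2}.$$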
If you're trying to approximate a complex (intractable) distribution
$p(x)$ with a (tractable) approximate distribution $q(x)$,
you want to be absolutely sure that any $x$ that would be very
improbable to draw from $p(x)$ is also very improbable to draw
from $q(x)$. That KL has this property is easily shown: there's
a $q(x)\log[q(x)/p(x)]$ term in the integrand. When $q(x)$ is small
but $p(x)$ is not, that's fine. But when $p(x)$ is small, this term grows very
rapidly if $q(x)$ isn't also small. So, if you're choosing $q(x)$ to
minimize $KL[q;p]$, it's very improbable that $q(x)$ will assign
much mass to regions where $p(x)$ is near zero.
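You can see this numerically with a minimal NumPy sketch. The three-point support and the distributions `p`, `q_good`, `q_bad` below are made up purely for illustration:

```python
import numpy as np

def kl(q, p, eps=1e-12):
    """Discrete KL[q; p] = sum_x q(x) * log(q(x) / p(x)); eps avoids log(0)."""
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

# p puts essentially zero mass on the last outcome.
p = np.array([0.5, 0.5 - 1e-6, 1e-6])

# q_good avoids p's near-zero region; q_bad puts real mass there.
q_good = np.array([0.6, 0.399, 0.001])
q_bad  = np.array([0.4, 0.3, 0.3])

print(kl(q_good, p))  # ~0.03: small, q stays out of p's null region
print(kl(q_bad, p))   # ~3.5: the q(x) log[q(x)/p(x)] term blows up
```

An optimizer minimizing $KL[q;p]$ is therefore pushed toward `q_good`-like solutions.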
The Jensen-Shannon divergence doesn't have this property. It is well
behaved whether $p(x)$ or $q(x)$ is small. This means it won't
penalize as heavily a distribution $q(x)$ from which you can sample
values that are impossible under $p(x)$.
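Running the same experiment with JS makes the contrast concrete. This sketch is self-contained and reuses the same made-up `p` and `q_bad` as above:

```python
import numpy as np

def kl(q, p, eps=1e-12):
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

def js(p, q, eps=1e-12):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m, eps) + 0.5 * kl(q, m, eps)

p     = np.array([0.5, 0.5 - 1e-6, 1e-6])  # ~zero mass on the last outcome
q_bad = np.array([0.4, 0.3, 0.3])          # real mass where p is ~zero

print(js(p, q_bad))   # ~0.12, and JS can never exceed log(2) ~ 0.693,
                      # so q_bad is penalized far less than under KL[q; p]
```

Because JS is bounded above by $\log 2$, even a $q(x)$ that samples outcomes impossible under $p(x)$ incurs only a modest penalty, whereas $KL[q;p]$ can grow without bound.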