Artificial neural networks have a bad reputation for being black boxes. Moreover, when we do have some prior knowledge about the domain of a particular supervised learning problem, it is not obvious how to introduce it into the model.
On the other hand, Bayesian models (and the state of the art among them, Bayesian networks) solve this problem naturally. But these models have their own known limitations.
Is it possible to get the best of both kinds of models? Is there any theory, or are there practical success stories, of combining the two into some kind of hybrid?
And, in general, what are the known strategies for incorporating prior knowledge into a neural network model (feed-forward or recurrent)?
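Actually, there are many ways to incorporate prior knowledge into neural networks. The simplest and most commonly used type of prior knowledge is weight decay. Weight decay assumes the weights come from a normal distribution with zero mean and some fixed variance. This type of prior is added as an extra term to the loss function, having the form:
$$\tilde{E}(w) = E(w) + \frac{\lambda}{2}\|w\|_2^2,$$
where $E(w)$ is the data term (e.g. an MSE loss) and $\lambda$ controls the relative importance of the two terms; it is also inversely proportional to the prior variance. Up to an additive constant, this corresponds to the negative logarithm of the following posterior probability:
$$p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w),$$
where $\mathcal{D}$ is the training data, $p(w) = \mathcal{N}(w \mid 0, \lambda^{-1} I)$ and $-\log p(w) = -\log \exp\left(-\frac{\lambda}{2}\|w\|_2^2\right) + \text{const} = \frac{\lambda}{2}\|w\|_2^2 + \text{const}$. This is the same as the Bayesian (maximum a posteriori) approach to modeling prior knowledge.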
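To make the correspondence concrete, here is a minimal sketch in plain Python that checks numerically that the weight decay penalty and the negative log of a zero-mean Gaussian prior with variance $\lambda^{-1}$ differ only by a constant that does not depend on $w$. The function names and the value `lam = 0.1` are just illustrative assumptions, not part of any library.

```python
import math
import random

def weight_decay_penalty(w, lam):
    """L2 penalty (lam / 2) * ||w||_2^2."""
    return 0.5 * lam * sum(wi * wi for wi in w)

def neg_log_gaussian_prior(w, lam):
    """-log N(w | 0, (1 / lam) * I), evaluated from the density itself."""
    std = math.sqrt(1.0 / lam)  # prior standard deviation
    total = 0.0
    for wi in w:
        density = math.exp(-0.5 * (wi / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
        total -= math.log(density)
    return total

rng = random.Random(0)
lam = 0.1  # illustrative value
w1 = [rng.gauss(0.0, 1.0) for _ in range(5)]
w2 = [rng.gauss(0.0, 1.0) for _ in range(5)]

# The gap between the two quantities is the log normalising constant,
# (d / 2) * log(2 * pi / lam), which is the same for every w.
gap1 = neg_log_gaussian_prior(w1, lam) - weight_decay_penalty(w1, lam)
gap2 = neg_log_gaussian_prior(w2, lam) - weight_decay_penalty(w2, lam)
print(gap1, gap2)
```

Since the gap is constant in $w$, minimizing the penalized loss and maximizing the posterior select the same weights.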
However, there are also other, less straightforward methods to incorporate prior knowledge into neural networks. They are very important: prior knowledge is what really bridges the gap between huge neural networks and (relatively) small datasets. Some examples are:
Data augmentation: By training the network on data perturbed by various class-preserving transformations, you are incorporating your prior knowledge about the domain, namely the transformations that the network should be invariant to.
Network architecture: One of the most successful neural network techniques of the past decades is the convolutional network. Its architecture, which shares kernels with a limited field of view across spatial locations, brilliantly exploits our knowledge about data in image space. This is also a form of prior knowledge incorporated into the model.
Regularization loss terms: Similar to weight decay, it is possible to construct other loss terms which penalize mappings contradicting our domain knowledge.
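The three examples above can each be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the horizontal flip stands in for whatever class-preserving transformation fits your domain, and the monotonicity penalty is a hypothetical piece of domain knowledge ("the output should not decrease as the input grows") chosen for the example. All function names are made up here.

```python
def hflip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def augment(dataset):
    """Data augmentation: add a flipped copy of each (image, label) pair.
    Assumes a flip does not change the class, so the label is reused."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((hflip(image), label))
    return out

def conv1d(signal, kernel):
    """Architecture prior: the same small kernel is applied at every
    position (weight sharing with a limited field of view)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def monotonicity_penalty(model, grid):
    """Regularization term: penalise every decrease of model(x) along an
    increasing grid of inputs (hypothetical domain knowledge)."""
    outputs = [model(x) for x in grid]
    return sum(max(0.0, a - b) ** 2 for a, b in zip(outputs, outputs[1:]))

# Augmentation doubles the dataset while keeping the labels.
augmented = augment([([[1, 0], [1, 0]], 1)])

# Shifting the input shifts the convolution output by the same amount.
kernel = [1, -1]
out = conv1d([0, 0, 1, 2, 0, 0, 0], kernel)
out_shifted = conv1d([0, 0, 0, 1, 2, 0, 0], kernel)

# An increasing model incurs no penalty, a decreasing one does.
grid = [0.0, 0.5, 1.0, 1.5]
penalty_ok = monotonicity_penalty(lambda x: 2.0 * x, grid)
penalty_bad = monotonicity_penalty(lambda x: -x, grid)
print(len(augmented), out, out_shifted, penalty_ok, penalty_bad)
```

In a real training setup the penalty would be added to the data term with its own weighting coefficient, in the same way as the weight decay term.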
For an in-depth analysis and overview of these methods, I can point you to my article Regularization for Deep Learning: A Taxonomy. I also recommend looking into Bayesian neural networks, meta-learning (finding meaningful prior information from other tasks in the same domain, see e.g. (Baxter, 2000)), and possibly also one-shot learning (e.g. (Lake et al., 2015)).