From Bayesian Networks to Neural Networks: how multivariate regression can be transposed to a multi-output network

I’m dealing with a Bayesian Hierarchical Linear Model; here is the network describing it.

Graphical Model describing the problem

$Y$ represents the daily sales of a product in a supermarket (observed).

X is a known matrix of regressors, including prices, promotions, day of the week, weather, holidays.

$S$ is the unknown latent inventory level of each product, which causes the most problems. I consider it a vector of binary variables, one for each product, with 1 indicating a stockout and therefore the unavailability of the product.
Even if unknown in theory, I estimated it through an HMM for each product, so it is to be considered as known, like $X$; I just decided to leave it unshaded for proper formalism. (A sketch of such a per-product HMM follows below.)
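
As a side note, here is a minimal sketch of how such a per-product stockout indicator could be estimated, assuming a 2-state Gaussian HMM from the hmmlearn package. The post does not specify the actual HMM used, so the emission model and the labeling rule below are illustrative assumptions:

```python
# Illustrative sketch: infer a binary stockout indicator for one product
# from its daily sales series with a 2-state HMM (hmmlearn assumed; the
# post's actual HMM specification may differ).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def estimate_stockout(sales):
    """sales: 1-D array of one product's daily sales counts."""
    X = np.asarray(sales, dtype=float).reshape(-1, 1)
    hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
    hmm.fit(X)
    states = hmm.predict(X)
    # Call the state with the lower mean sales 'stockout' (1).
    stockout_state = int(np.argmin(hmm.means_.ravel()))
    return (states == stockout_state).astype(int)
```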

$\eta$ is a mixed-effects parameter for each single product, where the mixed effects considered are the product's price, promotions and stockout.

$\beta$ is the vector of fixed regression coefficients, while $b_1$ and $b_2$ are the vectors of mixed-effects coefficients.
One group indicates brand and the other indicates flavour (this is an example; in reality I have many groups, but I report just 2 here for clarity).

$\Sigma_\eta$, $\Sigma_{b_1}$ and $\Sigma_{b_2}$ are hyperparameters over the mixed effects.

Since I have count data, let's say that I treat each product's sales as Poisson distributed conditional on the regressors (even if for some products the linear approximation holds and for others a zero-inflated model is better).
In such a case I would have, for a product $Y$ (this is just for whoever is interested in the Bayesian model itself; skip to the question if you find it uninteresting or non-trivial 🙂):

$\Sigma_\eta \sim IW(\alpha_0, \gamma_0)$

$\Sigma_{b_1} \sim IW(\alpha_1, \gamma_1)$

$\Sigma_{b_2} \sim IW(\alpha_2, \gamma_2)$, with $\alpha_0, \gamma_0, \alpha_1, \gamma_1, \alpha_2, \gamma_2$ known.

$\eta \sim N(0, \Sigma_\eta)$

$b_1 \sim N(0, \Sigma_{b_1})$

$b_2 \sim N(0, \Sigma_{b_2})$

$\beta \sim N(0, \Sigma_\beta)$, with $\Sigma_\beta$ known.

$\lambda_{tijk} = \beta X_{ti} + \eta_i X^{pps}_{ti} + b_{1j} Z_{tj} + b_{2k} Z_{tk}$

$Y_{tijk} \sim \mathrm{Poi}(\exp(\lambda_{tijk}))$

$i \in 1, \dots, N$, $j \in 1, \dots, m_1$, $k \in 1, \dots, m_2$

$Z_i$ is the matrix of mixed effects for the 2 groups, and $X^{pps}_i$ indicates the price, promotion and stockout of the product considered. $IW$ denotes the inverse-Wishart distribution, usually used for the covariance matrices of multivariate normal priors, but that's not important here. An example of a possible $Z_i$ could be the matrix of all the prices, or we could even say $Z_i = X_i$. As regards the priors for the mixed-effects variance-covariance matrices, I would just try to preserve the correlation between the entries, so that $\sigma_{ij}$ would be positive if $i$ and $j$ are products of the same brand or of the same flavour.
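
To make the specification concrete, here is a minimal NumPy/SciPy simulation of the generative process above. All sizes, group assignments and hyperparameter scales are toy values chosen for illustration (scipy's `invwishart` stands in for $IW$, and $Z_i = X^{pps}_i$ as suggested above):

```python
# Toy simulation of the generative model above (NumPy/SciPy assumed).
# Sizes and hyperparameter scales are illustrative, not from the post.
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
T, N, m1, m2 = 100, 5, 2, 3   # days, products, brands, flavours
p, q = 4, 3                   # fixed regressors; price/promo/stockout

# Inverse-Wishart hyperpriors for the mixed-effect covariances
# (small scales keep exp(lambda) in a sensible range).
Sigma_eta = invwishart.rvs(df=q + 2, scale=0.1 * np.eye(q), random_state=0)
Sigma_b1 = invwishart.rvs(df=q + 2, scale=0.1 * np.eye(q), random_state=1)
Sigma_b2 = invwishart.rvs(df=q + 2, scale=0.1 * np.eye(q), random_state=2)

beta = rng.multivariate_normal(np.zeros(p), 0.1 * np.eye(p))    # fixed effects
eta = rng.multivariate_normal(np.zeros(q), Sigma_eta, size=N)   # per product
b1 = rng.multivariate_normal(np.zeros(q), Sigma_b1, size=m1)    # per brand
b2 = rng.multivariate_normal(np.zeros(q), Sigma_b2, size=m2)    # per flavour

X = rng.normal(size=(T, N, p))      # fixed regressors
Xpps = rng.normal(size=(T, N, q))   # price / promotion / stockout
Z = Xpps                            # here Z_i = X_i, as suggested above
brand = rng.integers(0, m1, size=N)     # group membership of each product
flavour = rng.integers(0, m2, size=N)

# lambda_tijk = beta'X_ti + eta_i'Xpps_ti + b1_j'Z_t + b2_k'Z_t
lam = (X @ beta
       + np.einsum('tnq,nq->tn', Xpps, eta)
       + np.einsum('tnq,nq->tn', Z, b1[brand])
       + np.einsum('tnq,nq->tn', Z, b2[flavour]))
Y = rng.poisson(np.exp(lam))        # observed daily sales, T x N
```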

The intuition behind this model is that the sales of a given product depend on its price and its availability, but also on the prices and the stockouts of all the other products. Since I don't want to have the same model (read: the same regression curve) for all the products, I introduced mixed effects which exploit some groups I have in my data, through parameter sharing.

My questions are:

  1. Is there a way to transpose this model to a neural network architecture? I know there are many questions looking at the relationships between Bayesian networks, Markov random fields, Bayesian hierarchical models and neural networks, but I didn't find anything going from the Bayesian hierarchical model to neural nets.
    I ask about neural networks since, given the high dimensionality of my problem (consider that I have 340 products), parameter estimation through MCMC takes weeks (I tried with just 20 products, running parallel chains in runjags, and it took days). But I don't want to go in blind and just feed the data to a neural network as a black box. I would like to exploit the dependence/independence structure of my network.

Here I just sketched a neural network. As you can see, the regressors at the top ($P_i$ and $S_i$ indicate respectively the price and the stockout of product $i$) are fed into the hidden layer, as are the product-specific ones (here I considered prices and stockouts). (Blue and black edges have no particular meaning; they were just to make the figure clearer.) Furthermore, $Y_1$ and $Y_2$ could be highly correlated while $Y_3$ could be a totally different product (think of 2 orange juices and red wine), but I don't use this information in neural networks. I wonder if the grouping information is used just in weight initialization or if one could customize the network to the problem. (A minimal sketch of such a network follows the figure below.)

puppet example of a neural net
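
For reference, here is a minimal PyTorch sketch of the architecture drawn above: the prices and stockouts of all products feed a shared hidden layer, and each output unit predicts one product's log Poisson rate, matching $Y \sim \mathrm{Poi}(\exp(\lambda))$. The layer sizes and the training data below are placeholders:

```python
# Sketch of the drawn architecture: a shared hidden layer over all
# products' prices and stockouts, one Poisson-rate output per product.
import torch
import torch.nn as nn

class MultiOutputSales(nn.Module):
    def __init__(self, n_products, hidden=64):
        super().__init__()
        in_dim = 2 * n_products       # price P_i and stockout S_i per product
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_products)   # one log-rate per Y_i

    def forward(self, x):
        return self.head(self.body(x))   # log lambda for every product

model = MultiOutputSales(n_products=340)
loss_fn = nn.PoissonNLLLoss(log_input=True)   # matches Y ~ Poi(exp(log-rate))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on placeholder data.
x = torch.randn(32, 2 * 340)                 # batch of regressor vectors
y = torch.poisson(torch.ones(32, 340))       # fake sales counts
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```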

Edit, my idea:

Possible initialization?

My idea would be something like this: as before, $Y_1$ and $Y_2$ are correlated products, while $Y_3$ is a totally different one. Knowing this a priori, I do 2 things:

  1. I preallocate some neurons in the hidden layer to each group I have; in this case I have 2 groups, $\{(Y_1, Y_2), (Y_3)\}$.
  2. I initialize high weights between the inputs and the allocated nodes (the bold edges), and of course I build other hidden nodes to capture the remaining ‘randomness’ in the data (see the sketch after this list).
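
A minimal sketch of what this preallocation and initialization could look like in PyTorch, continuing the network above. The group definitions, the number of units reserved per group and the boost factor are all arbitrary illustrative choices:

```python
# Sketch of group-aware preallocation: boost the initial weights from a
# group's inputs to hidden units reserved for that group; all remaining
# units keep their default random initialization.
import torch
import torch.nn as nn

def grouped_init(linear, groups, n_products, boost=2.0, units_per_group=4):
    """linear: the input->hidden nn.Linear, inputs laid out as
    [P_1..P_n, S_1..S_n]; groups: list of product-index lists,
    e.g. [[0, 1], [2]] for {(Y1, Y2), (Y3)}."""
    with torch.no_grad():
        unit = 0
        for members in groups:
            for _ in range(units_per_group):   # units reserved for this group
                for i in members:
                    linear.weight[unit, i] *= boost                # price P_i
                    linear.weight[unit, n_products + i] *= boost   # stockout S_i
                unit += 1
        # Units from `unit` onward stay random, capturing the remaining
        # 'randomness' in the data.

n_products = 3
hidden = nn.Linear(2 * n_products, 16)
grouped_init(hidden, groups=[[0, 1], [2]], n_products=n_products)
```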

Thank you in advance for your help

Answer

For the record, I don’t view this as an answer, but just as a long comment!
The PDE (heat equation) that is used to model the flow of heat through a metal rod can also be used to model option pricing. No one that I know of has ever tried to suggest a connection between option pricing and heat flow per se. I think the quote from Danilov’s link is saying the same thing. Both Bayesian graphs and neural nets use the language of graphs to express the relations between their internal pieces. However, a Bayesian graph tells one about the correlation structure of the input variables, while the graph of a neural net tells one how to build the prediction function from the input variables. These are very different things.
Various methods used in DL attempt to ‘choose’ the most important variables, but that is an empirical issue. It also doesn’t tell one about the correlation structure of either the entire set of variables or the remaining variables. It merely suggests that the surviving variables will be best for prediction.
For example, if one looks at neural nets, one will be led to the German credit data set, which has, if I recall correctly, 2000 data points and 5 dependent variables. Through trial and error, I think you will find that a net with only 1 hidden layer, using only 2 of the variables, gives the best results for prediction. However, this can only be discovered by building all the models and testing them on an independent testing set.
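
To illustrate that trial-and-error process, here is a sketch of exhaustively comparing variable subsets with a single-hidden-layer net on a held-out test set, using scikit-learn. The data below is a random stand-in; one would load the actual German credit data instead:

```python
# Sketch of the trial-and-error subset search: fit a 1-hidden-layer net
# on every variable subset and compare on a held-out test set.
# Random stand-in data; load the real German credit data instead.
from itertools import combinations
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # placeholder features
y = rng.integers(0, 2, size=1000)     # placeholder binary target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

best_acc, best_cols = 0.0, None
for k in range(1, X.shape[1] + 1):
    for cols in combinations(range(X.shape[1]), k):
        idx = list(cols)
        net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500,
                            random_state=0)
        net.fit(X_tr[:, idx], y_tr)
        acc = net.score(X_te[:, idx], y_te)
        if acc > best_acc:
            best_acc, best_cols = acc, cols

print(f"best test accuracy {best_acc:.3f} with variables {best_cols}")
```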

Attribution
Source : Link , Question Author : Tommaso Guerrini , Answer Author : meh
