From Bayesian Networks to Neural Networks: how multivariate regression can be transposed to a multi-output network

I’m dealing with a Bayesian hierarchical linear model; here is the network describing it.

$Y$ represents the daily sales of a product in a supermarket (observed).

$X$ is a known matrix of regressors, including prices, promotions, day of the week, weather, and holidays.

$S$ is the unknown latent inventory level of each product, which causes the most problems. I treat it as a vector of binary variables, one per product, with $1$ indicating a stockout and therefore the unavailability of the product.
Even though it is unknown in theory, I estimated it through an HMM for each product, so it is to be considered as known as $X$; I just decided to leave it unshaded for proper formalism.

$\eta$ is a mixed-effects parameter for each single product, where the effects considered are the product’s price, promotions, and stockout.

$\beta$ is the vector of fixed regression coefficients, while $b_1$ and $b_2$ are the vectors of mixed-effects coefficients.
One group indicates brand and the other indicates flavour (this is an example; in reality I have many groups, but I report just two here for clarity).

$\Sigma_{\eta}$, $\Sigma_{b_1}$ and $\Sigma_{b_2}$ are hyperparameters (the prior covariance matrices) of the mixed effects.

Since I have count data, let’s say that I treat each product’s sales as Poisson distributed conditional on the regressors (even though for some products the linear approximation holds and for others a zero-inflated model is better).
In such a case I would have, for a product $Y$ (this part is just for whoever is interested in the Bayesian model itself; skip to the question if you find it uninteresting or trivial 🙂):

$\Sigma_{\eta} \sim IW(\alpha_0,\gamma_0)$

$\Sigma_{b_1} \sim IW(\alpha_1,\gamma_1)$

$\Sigma_{b_2} \sim IW(\alpha_2,\gamma_2)$, $\alpha_0,\gamma_0,\alpha_1,\gamma_1,\alpha_2,\gamma_2$ known.

$\eta \sim N(\mathbf{0},\Sigma_{\eta})$

$b_1 \sim N(\mathbf{0},\Sigma_{b_1})$

$b_2 \sim N(\mathbf{0},\Sigma_{b_2})$

$\beta \sim N(\mathbf{0},\Sigma_{\beta})$, $\Sigma_{\beta}$ known.

$\lambda_{tijk} = \beta^\top X_{ti} + \eta_i^\top X_{pps,ti} + b_{1j}^\top Z_{tj} + b_{2k}^\top Z_{tk}$,

$Y_{tijk} \sim \operatorname{Poi}(\exp(\lambda_{tijk}))$

$i \in \{1,\dots,N\}$, $j \in \{1,\dots,m_1\}$, $k \in \{1,\dots,m_2\}$

$Z_i$ is the matrix of mixed-effects regressors for the two groups, and $X_{pps,i}$ contains the price, promotion, and stockout of the product considered. $IW$ denotes the inverse-Wishart distribution, commonly used as a prior for the covariance matrix of a multivariate normal, but that is not important here. An example of a possible $Z_i$ could be the matrix of all the prices, or we could even set $Z_i = X_i$. As regards the priors on the mixed-effects variance-covariance matrices, I would just try to preserve the correlation between the entries, so that $\sigma_{ij}$ is positive if $i$ and $j$ are products of the same brand or of the same flavour.
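To make the specification concrete, here is a minimal sketch of the model in NumPyro. It is a sketch under assumptions, not the original model code: the inverse-Wishart priors are swapped for an LKJ-correlation-plus-scales parameterization (the usual choice for gradient-based samplers), $\Sigma_\beta$ is taken as the identity, and all array names and shapes (`X`, `X_pps`, `Z`, `brand`, `flavour`) are illustrative.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def sales_model(X, X_pps, Z, brand, flavour, m1, m2, y=None):
    # X: (T, N, p) fixed-effect regressors; X_pps: (T, N, d) price/promo/stockout;
    # Z: (T, q) mixed-effect regressors; brand, flavour: (N,) integer group labels
    T, N, p = X.shape
    d, q = X_pps.shape[2], Z.shape[1]

    # beta ~ N(0, Sigma_beta), with Sigma_beta taken as the identity here
    beta = numpyro.sample("beta", dist.Normal(0.0, 1.0).expand([p]).to_event(1))

    # Sigma_eta, Sigma_b1, Sigma_b2 via LKJ correlation + half-normal scales
    # (a stand-in for the inverse-Wishart that suits gradient-based samplers)
    def cov_chol(name, dim):
        s = numpyro.sample(f"s_{name}",
                           dist.HalfNormal(1.0).expand([dim]).to_event(1))
        L = numpyro.sample(f"L_{name}", dist.LKJCholesky(dim, concentration=1.0))
        return s[:, None] * L            # Cholesky factor of the covariance

    chol_eta = cov_chol("eta", d)
    chol_b1 = cov_chol("b1", q)
    chol_b2 = cov_chol("b2", q)

    with numpyro.plate("products", N):
        eta = numpyro.sample("eta", dist.MultivariateNormal(
            jnp.zeros(d), scale_tril=chol_eta))
    with numpyro.plate("brands", m1):
        b1 = numpyro.sample("b1", dist.MultivariateNormal(
            jnp.zeros(q), scale_tril=chol_b1))
    with numpyro.plate("flavours", m2):
        b2 = numpyro.sample("b2", dist.MultivariateNormal(
            jnp.zeros(q), scale_tril=chol_b2))

    # lambda_tijk = beta'X_ti + eta_i'X_pps,ti + b1_j'Z_t + b2_k'Z_t
    lam = (jnp.einsum("tnp,p->tn", X, beta)
           + jnp.einsum("tnd,nd->tn", X_pps, eta)
           + Z @ b1[brand].T + Z @ b2[flavour].T)

    numpyro.sample("Y", dist.Poisson(jnp.exp(lam)), obs=y)

# usage with real arrays (Y_obs: (T, N) counts; m1, m2: number of brands/flavours):
# mcmc = MCMC(NUTS(sales_model), num_warmup=500, num_samples=500)
# mcmc.run(random.PRNGKey(0), X, X_pps, Z, brand, flavour, m1, m2, y=Y_obs)
```

On models of this size, NUTS in NumPyro (especially on a GPU) is often considerably faster than Gibbs sampling in JAGS, which may already help with the runtimes mentioned in the question below.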

The intuition behind this model is that the sales of a given product depend on its price and on its availability, but also on the prices and stockouts of all the other products. Since I don’t want the same model (read: the same regression curve) for all the coefficients, I introduced mixed effects that exploit some groups present in my data, through parameter sharing.

My questions are:

1. Is there a way to transpose this model into a neural network architecture? I know there are many questions about the relationships between Bayesian networks, Markov random fields, Bayesian hierarchical models, and neural networks, but I found nothing going from a Bayesian hierarchical model to neural nets.
I ask about neural networks because, given the high dimensionality of my problem (consider that I have 340 products), parameter estimation through MCMC takes weeks (I tried with just 20 products, running parallel chains in runJags, and it took days). But I don’t want to go in blind and just feed data to a neural network as a black box: I would like to exploit the dependence/independence structure of my network.

Here I have just sketched a neural network. As you can see, the regressors at the top ($P_i$ and $S_i$ indicate the price and stockout of product $i$, respectively) are fed into the hidden layer, as are the product-specific ones (here I considered prices and stockouts). (The blue and black edges have no particular meaning; they were just meant to make the figure clearer.) Furthermore, $Y_1$ and $Y_2$ could be highly correlated while $Y_3$ could be a totally different product (think of two orange juices and a red wine), but I don’t use this information in neural networks. I wonder whether the grouping information is only used in weight initialization, or whether one could customize the network to the problem.
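For concreteness, a plain (unstructured) version of the sketched network could look like the following PyTorch snippet: all prices and stockouts feed one shared hidden layer, and each output unit emits the log-rate $\lambda_i$ of one product’s Poisson sales, so the Poisson negative log-likelihood keeps the link to the likelihood above. Layer sizes and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

N = 340                     # number of products
input_dim = 2 * N           # concatenated prices P_1..P_N and stockouts S_1..S_N
hidden_dim = 128            # illustrative

net = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, N),        # one log-rate lambda_i per product
)

# matches Y_i ~ Poi(exp(lambda_i)): PoissonNLLLoss with log_input=True
loss_fn = nn.PoissonNLLLoss(log_input=True)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

def training_step(x, y):
    """x: (batch, 2N) regressors, y: (batch, N) observed sales counts."""
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)        # Poisson NLL on the predicted log-rates
    loss.backward()
    optimizer.step()
    return loss.item()
```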

Edit, my idea:

My idea would be something like this: as before, $Y_1$ and $Y_2$ are correlated products, while $Y_3$ is a totally different one. Knowing this a priori, I do two things (see the code sketch after the list):

1. I preallocate some neurons in the hidden layer for each group I have; in this case there are two groups: {($Y_1,Y_2$), ($Y_3$)}.
2. I initialize high weights between the inputs and the allocated nodes (the bold edges), and of course I build other hidden nodes to capture the remaining ‘randomness’ in the data.
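Here is a minimal sketch of this idea, assuming the group structure is known up front: each group gets a dedicated slice of hidden units, the “bold edges” are realized by scaling up the initial weights inside each group, and cross-group connections can optionally be masked out entirely. All names, sizes, and the boost factor are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupStructuredLayer(nn.Module):
    """Hidden layer with neurons preallocated per product group."""

    def __init__(self, input_dim, groups, units_per_group=4, free_units=8,
                 boost=3.0, hard_mask=False):
        # groups: list of index lists; groups[g] holds the input columns
        # belonging to group g (e.g. the price/stockout columns of Y1 and Y2)
        super().__init__()
        hidden_dim = units_per_group * len(groups) + free_units
        self.linear = nn.Linear(input_dim, hidden_dim)
        self.hard_mask = hard_mask

        mask = torch.zeros(hidden_dim, input_dim)
        if free_units:
            mask[-free_units:, :] = 1.0          # free units see every input
        with torch.no_grad():
            for g, idx in enumerate(groups):
                lo, hi = g * units_per_group, (g + 1) * units_per_group
                mask[lo:hi, idx] = 1.0
                # the "bold edges": boost initial within-group weights
                self.linear.weight[lo:hi, idx] *= boost
        self.register_buffer("mask", mask)

    def forward(self, x):
        # with hard_mask, cross-group weights are zeroed on every forward pass
        w = self.linear.weight * self.mask if self.hard_mask else self.linear.weight
        return torch.tanh(F.linear(x, w, self.linear.bias))

# two groups {(Y1, Y2), (Y3)}; assumed column order [P1, P2, P3, S1, S2, S3]
layer = GroupStructuredLayer(input_dim=6, groups=[[0, 1, 3, 4], [2, 5]])
h = layer(torch.randn(32, 6))                    # -> (32, 16)
```

With `hard_mask=True` the grouping becomes a strict architectural constraint; with `hard_mask=False` the network keeps full connectivity and the grouping only biases the optimization through initialization, which corresponds to the “weight initialization only” option mentioned in the question.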