# Feature selection on a Bayesian hierarchical generalized linear model

I want to estimate a hierarchical GLM with feature selection that determines which covariates are relevant enough at the population level to include.

Suppose I have $G$ groups, each with $N$ observations, and $K$ candidate covariates.
That is, I have a design matrix of covariates $\boldsymbol{x}_{(N\cdot G) \times K}$ and outcomes $\boldsymbol{y}_{(N\cdot G) \times 1}$. The coefficients on these covariates are $\beta_{K \times 1}$.

Suppose $Y \sim \text{Bernoulli}(p(x,\beta))$.

The model below is a standard hierarchical Bayesian GLM with a logit sampling model and normally distributed group-level coefficients.

$${\cal L}\left(\boldsymbol{y}\mid\boldsymbol{x},\beta_{1},\ldots,\beta_{G}\right)\propto\prod_{g=1}^{G}\prod_{t=1}^{N}\left(\Pr\{y_{g,t}=1\mid x_{g,t},\beta_{g}\}\right)^{y_{g,t}}\left(1-\Pr\{y_{g,t}=1\mid x_{g,t},\beta_{g}\}\right)^{1-y_{g,t}}$$
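
For concreteness, assuming the usual logistic link for the logit model, and writing $x_{g,t}$ for the covariate row of observation $t$ in group $g$ and $\beta_{g}$ for that group's coefficient vector (notation introduced here only for illustration), each success probability would be:

$$p(x_{g,t},\beta_{g})=\frac{\exp\left(x_{g,t}^{\top}\beta_{g}\right)}{1+\exp\left(x_{g,t}^{\top}\beta_{g}\right)}$$

so each factor in the likelihood depends on the covariates only through the linear index $x_{g,t}^{\top}\beta_{g}$.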

$$\beta_{1},\ldots,\beta_{G}\mid\mu,\Sigma\overset{iid}{\sim}{\cal N}_{K}\left(\mu,\Sigma\right)$$

$$\mu|\Sigma\sim{\cal N}\left(\mu_{0},a^{-1}\Sigma\right)$$
$$\Sigma\sim{\cal IW}\left(v_{0},V_{0}^{-1}\right)$$
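
To make the generative structure concrete, here is a minimal NumPy sketch that draws one dataset from the hierarchy above. It is only an illustration under assumptions: the function name, the default hyperparameters ($\mu_0 = 0$, $V_0 = I$, $a = 1$, integer $v_0 = K + 2$), the standard-normal covariates, and reading the second inverse-Wishart argument as a scale matrix are all choices made for the example, not part of the model as stated.

```python
import numpy as np

def simulate_hierarchical_logit(G=5, N=200, K=3, a=1.0, v0=None, seed=0):
    """Draw one dataset from the hierarchical logit model (prior predictive draw)."""
    rng = np.random.default_rng(seed)
    v0 = K + 2 if v0 is None else v0   # assumed integer degrees of freedom
    mu0 = np.zeros(K)                  # assumed prior mean
    V0 = np.eye(K)                     # assumed inverse-Wishart scale matrix

    # Sigma ~ IW(v0, V0): invert a Wishart(v0, V0^{-1}) draw, which is the
    # sum of v0 outer products of N(0, V0^{-1}) vectors.
    L = np.linalg.cholesky(np.linalg.inv(V0))
    Z = rng.standard_normal((v0, K)) @ L.T      # rows ~ N(0, V0^{-1})
    Sigma = np.linalg.inv(Z.T @ Z)
    Sigma = 0.5 * (Sigma + Sigma.T)             # remove round-off asymmetry

    # mu | Sigma ~ N(mu0, Sigma / a)
    mu = rng.multivariate_normal(mu0, Sigma / a)

    # beta_g | mu, Sigma ~ iid N_K(mu, Sigma), one row per group
    beta = rng.multivariate_normal(mu, Sigma, size=G)   # shape (G, K)

    # Covariates (assumed standard normal) and the linear index x' beta_g
    x = rng.standard_normal((G, N, K))
    eta = np.einsum("gnk,gk->gn", x, beta)              # shape (G, N)

    # Logistic link, written with tanh for numerical stability
    prob = 0.5 * (1.0 + np.tanh(0.5 * eta))
    y = rng.binomial(1, prob)                           # shape (G, N)
    return x, y, beta, mu, Sigma
```

Calling `simulate_hierarchical_logit(G=4, N=50, K=3)` returns covariates of shape `(4, 50, 3)`, binary outcomes of shape `(4, 50)`, and one coefficient row per group. Simulated data like this can be handy for checking whether a selection procedure recovers coefficients that were deliberately set to zero.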

I want to modify this model (or find a paper or other work that does so) so that it performs sharp feature selection, as in LASSO, on the dimension of $\beta$.

(1) The simplest, most direct way would be to regularize at the population level, so that we essentially restrict the dimension of $\mu$ and all the $\beta_g$ share the same dimension.

(2) The more nuanced model would have shrinkage at the group level, where the dimension of $\beta_g$ depends on the hierarchical unit.

I am interested in solving both (1) and (2), but (1) is much more important.

The way I’d tackle (1) would involve a spike-and-slab model, something like:

$\beta_{g,k} = z_{k}m_{g,k}$

$z_k \sim Bern(p)$

$m_{g} = (m_{g,1},\ldots,m_{g,K}) \sim N_K(\mu, \Sigma)$

$\mu, \Sigma \sim NIW_{v_0}(\mu_0, V_0^{-1})$

This:

• Retains the flexibility on the $\beta$’s from the NIW prior on $\mu, \Sigma$.
• Models selection of variables for all groups at once.
• Is easily extended by adding a group sub-index to the indicator, giving $z_{g,k}$, with a common beta prior for each position $k$.
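
As a rough illustration of the population-level version, the sketch below draws coefficients from the spike-and-slab prior and shows that a covariate with $z_k = 0$ is switched off for every group at once. It is a sketch under assumptions: $\mu$ and $\Sigma$ are held fixed instead of being drawn from the NIW prior, and the function name, the argument `p_incl` (the inclusion probability $p$), and the default values are hypothetical.

```python
import numpy as np

def draw_spike_slab_coefficients(G=4, K=6, p_incl=0.5, mu=None, Sigma=None, seed=0):
    """Draw beta_{g,k} = z_k * m_{g,k} with one indicator z_k shared by all groups."""
    rng = np.random.default_rng(seed)
    # Assumed fixed slab parameters; in the full model they would get the NIW prior.
    mu = np.ones(K) if mu is None else np.asarray(mu, dtype=float)
    Sigma = np.eye(K) if Sigma is None else np.asarray(Sigma, dtype=float)

    # z_k ~ Bern(p): one inclusion indicator per covariate
    z = rng.binomial(1, p_incl, size=K)                 # shape (K,)

    # m_g ~ N_K(mu, Sigma): slab draw for each group
    m = rng.multivariate_normal(mu, Sigma, size=G)      # shape (G, K)

    # Broadcasting z over groups zeroes out whole columns of beta
    beta = z * m                                        # shape (G, K)
    return z, m, beta

z, m, beta = draw_spike_slab_coefficients()
print("included covariates:", np.flatnonzero(z))
print("all-zero columns:   ", np.flatnonzero(np.all(beta == 0, axis=0)))
```

In this sketch, the group-level variant in the last bullet would amount to drawing the indicators with `size=(G, K)` instead of `size=K`, so that each group gets its own zero pattern.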

Of course, I think this is the kind of problem where there are a number of valid approaches.