What is bits per dimension (bits/dim) exactly (in pixel CNN papers)?

I apologize in advance if I simply haven't searched hard enough, but I couldn’t find an explicit definition of bits per dimension (bits/dim).

The first mention of a definition I found was in ‘Pixel Recurrent Neural Networks’, but it is still quite unclear to me, so let me ask.

Defining the 256-softmax output of an image $$\boldsymbol{x}$$ as $$\boldsymbol{y} \in \mathbb{R}^{32 \times 32 \times 256}$$, the negative log-likelihood, to my understanding, is
$$- \mathbb{E}_{\boldsymbol{x}} \ln p(\boldsymbol{y}|\boldsymbol{x}).$$
(Note that we are assuming here that the image has a single channel, so its size is $$32 \times 32 \times 1$$.)

According to the above paper (and possibly other materials), it seems to me that the definition of bits/dim is
$$\text{bit/dim} = \dfrac{- \mathbb{E}_{\boldsymbol{x}} \log_2 p(\boldsymbol{y}|\boldsymbol{x})}{32\cdot 32\cdot 1}$$
because it says ‘The total discrete log-likelihood is normalized by the dimensionality of the images’.

Questions.

1) Is the above definition correct?

2) Or should I replace $$\mathbb{E}_{\boldsymbol{x}}$$ by $$\sum_{\boldsymbol{x}}$$?

It is explained in great detail on page 12 here, and is also discussed here, although in less detail:

Compute the negative log likelihood in base e, apply change of base
for converting log base e to log base 2, then divide by the number of
pixels (e.g. 3072 pixels for a 32×32 rgb image).

To change base for the log, just divide the log base e value by log(2)
— e.g. in python it’s like: (nll_val / num_pixels) / numpy.log(2)
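As a minimal sketch of the recipe quoted above: the helper name `bits_per_dim` and its arguments are made up for illustration, and it assumes you already have the total negative log-likelihood of each image in nats and that the reported number is the mean over images (an expectation, not a sum over the dataset).

```python
import numpy as np

def bits_per_dim(nll_nats_per_image, num_dims):
    """Convert per-image total NLL (in nats) to average bits per dimension.

    nll_nats_per_image: scalar or array of total negative log-likelihoods,
                        one per image, in base e (hypothetical input).
    num_dims:           dimensions per image, e.g. 32 * 32 * 3 = 3072.
    """
    nll = np.asarray(nll_nats_per_image, dtype=np.float64)
    # Average over images (assumption: the metric is a mean, not a sum),
    # normalize by dimensionality, then change base from e to 2.
    return float(nll.mean() / num_dims / np.log(2))

# Sanity check: a model that puts uniform probability 1/256 on each of the
# 3072 values of a 32x32 RGB image costs log(256) nats per dimension.
num_dims = 32 * 32 * 3
uniform_nll = np.full(4, num_dims * np.log(256))  # four identical "images"
print(bits_per_dim(uniform_nll, num_dims))  # approximately 8.0
```

The uniform case is a handy check because it should come out at 8 bits/dim, which is the cost of storing raw 8-bit values.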

and

As noted by DWF, the continuous log-likelihood is not directly
comparable to discrete log-likelihood. Values in the PixelRNN paper
for NICE’s bits/pixel were computed after correctly accounting for the
discrete nature of pixel values in the relevant datasets. In the case
of the number in the NICE paper, you’d have to subtract log(128) from
the log-likelihood of each pixel (this is to account for data
scaling).

I.e. -((5371.78 / 3072.) - 4.852) / np.log(2.) = 4.477
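The arithmetic in that last line can be reproduced step by step. This is only a sketch of the quoted calculation; the variable names are mine, and 4.852 is taken to be log(128) as the quote says.

```python
import numpy as np

log_likelihood_nats = 5371.78  # continuous log-likelihood quoted above
num_dims = 3072                # 32 * 32 * 3

# Per-dimension log-likelihood in nats.
per_dim = log_likelihood_nats / num_dims

# Subtract log(128) per dimension to account for the data scaling,
# as described in the quoted comment (log(128) is about 4.852).
discrete_per_dim = per_dim - np.log(128)

# Negate to get a negative log-likelihood and convert nats to bits.
bits_dim = -discrete_per_dim / np.log(2)
print(round(bits_dim, 3))  # 4.477
```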