Although a ReLU activation function can handle arbitrary real-valued inputs, I have found that scaling the dataset to the range [0, 1] (min-max scaling) before feeding it to the neural network is more effective. On the other hand, batch normalization (BN) also normalizes data before it is passed to the non-linearity (activation function). I was wondering whether min-max scaling is still needed when BN is applied. Can we use min-max scaling and BN together? It would be nice if someone could guide me to a better understanding.
As mentioned, it's best to use [-1, 1] min-max scaling or zero-mean, unit-variance standardization. Scaling your data into [0, 1] leaves every input non-negative rather than centered around zero, which tends to slow learning.
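For concreteness, here is a minimal sketch of the two preprocessing options on a single feature, using only the Python standard library. The function names `min_max_scale` and `standardize` and the sample values are made up for illustration; in practice you would probably use a library scaler and fit it on the training set only.

```python
from statistics import fmean, pstdev

def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: map everything to the midpoint of the target range.
        return [(low + high) / 2.0 for _ in values]
    return [low + (high - low) * (v - v_min) / span for v in values]

def standardize(values):
    """Shift to zero mean and rescale to unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant feature: nothing to rescale, so just center it.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

feature = [2.0, 4.0, 6.0, 8.0]             # illustrative values for one input feature
scaled = min_max_scale(feature)            # endpoints land on -1 and 1
scaled_01 = min_max_scale(feature, 0.0, 1.0)  # the [0, 1] variant from the question
standardized = standardize(feature)        # zero mean, unit variance
```

Note that the [0, 1] variant keeps every value non-negative, so its mean is positive, while the [-1, 1] and standardized versions are centered around zero for this symmetric sample.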
To answer your question: Yes, you should still standardize your inputs to a network that uses Batch Normalization. This will ensure that inputs to the first layer have zero mean and come from the same distribution, while Batch Normalization on subsequent layers will ensure that inputs to those layers have zero mean in expectation and that their distributions do not drift over time.
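To make the division of labor concrete, here is a rough sketch for a single input feature and a single hidden unit, again in plain Python. Everything named here (`fit_standardizer`, `apply_standardizer`, `batch_norm`, the weight `w`, the bias `b`, and the data) is hypothetical, and the batch norm shown is only the training-mode forward pass without running statistics.

```python
from statistics import fmean, pvariance

def fit_standardizer(train_values):
    """Compute mean and std on the training set only."""
    mean = fmean(train_values)
    std = pvariance(train_values) ** 0.5
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(values, mean, std):
    """Reuse the training statistics for any split (train, validation, test)."""
    return [(v - mean) / std for v in values]

def batch_norm(pre_activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch norm for one unit over one mini-batch."""
    mu = fmean(pre_activations)
    var = pvariance(pre_activations)
    return [gamma * (z - mu) / (var + eps) ** 0.5 + beta for z in pre_activations]

def relu(z):
    return max(0.0, z)

# Step 1: standardize the raw inputs once, before the network sees them.
train = [10.0, 20.0, 30.0, 40.0]           # illustrative raw feature values
mean, std = fit_standardizer(train)
x = apply_standardizer(train, mean, std)

# Step 2: inside the network, BN re-centers the pre-activations of each layer.
w, b = 3.0, 5.0                            # hypothetical weight and bias
z = [w * xi + b for xi in x]               # pre-activations have mean b, not zero
z_bn = batch_norm(z)                       # zero mean again before the non-linearity
h = [relu(zi) for zi in z_bn]
```

The input standardization is a fixed transform fitted once on the training data, while BN recomputes its statistics per mini-batch inside the network, so the two steps complement each other.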
The reasons we want zero-mean inputs and a stable input distribution are discussed further in Section 4.3 of Efficient BackProp (LeCun et al., 1998).