# Linear regression on a sample spanning many orders of magnitude

Beer’s law from chemistry says that the absorbance of a liquid $A$ is proportional to the concentration $C$, so:

The standard thing to do then is to prepare a set of solutions with known concentrations, measure the absorbance to form a ‘standard curve’ (a calibration curve basically), and do a simple linear regression on that data to get the proportionality (which can then be used to predict the concentrations of unknown solutions).

One easy way to do this is to start with a known concentration and perform a serial dilution, which would get you 2x dilution, 4x, 8x, 16x….etc. I.e. if you start with a solution of $100\mu \mathrm g/\mathrm {mL}$ you’d get solutions with $50\mu \mathrm g/\mathrm {mL}$, $25\mu \mathrm g/\mathrm {mL}$, $12.5\mu \mathrm g/\mathrm {mL}$ etc…

Now when you do the linear regression, you have a data set with lots of data points at low concentrations and very few at higher concentrations. It seems much more natural to represent this problem on a log scale. My question is then, should I be doing a linear regression of $A$ vs. $C$ or $\log A$ vs. $\log C$? When I compare the models, they seem to give answers that are on the same order of magnitude but on the order of 30% different.  Let the physics (of the experiment and the measuring apparatus) guide you.

Ultimately, absorption is determined by measuring amounts of radiation passing through the medium and those measurements come down to counting photons. When the medium is macroscopic, thermodynamic fluctuations in concentration are negligible so the principal source of error lies in the counting. This error (or “shot noise”) has a Poisson distribution. This implies the error is relatively large at high concentrations when little radiation is passing through.

With sufficient care in the laboratory, concentrations typically are measured extremely accurately, so I will not worry about errors in concentrations.

The absorbance itself is directly related to the logarithm of the measured radiation. Taking the logarithm evens out the amount of error across the entire possible range of concentrations. For this reason alone, it is best to analyze the absorbance in terms of its usual values rather than re-expressing them. In particular, we should avoid taking logs of absorbance, even though that would simplify the expression of the Beer-Lambert law.

We should also be alert to possible non-linearities. The derivation of the Beer-Lambert Law suggests the absorbance vs concentration curve will become nonlinear at high concentrations. Some way to detect or test this is needed.

These considerations suggest a simple procedure to analyze a series of $(C_i, A_i)$ pairs of concentrations and measured absorbances:

• Estimate the coefficient $\kappa$ as the arithmetic mean of $A/C$, $\hat{\kappa} = \sum_i \frac{A_i}{C_i}$.

• Predict the absorbance at each concentration in terms of the estimated coefficient: $\hat{A}(C) = \hat{\kappa}C.$

• Check the additive residuals $A_i - \hat{A_i}$ for nonlinear trends in $C_i$.

Of course all this is theoretical and somewhat speculative–we haven’t any actual data to analyze–but it is a reasonable place to start. If repeated laboratory experience suggests the data depart from the statistical behaviors described here, then some modifications of these procedures would be called for.

To illustrate these ideas, I have created a simulation that implements the key aspects of the measurement, including the Poisson noise and possibly nonlinear responses. By running it many times, we can observe the kind of variation that is likely to be encountered in the laboratory. Here are the results of one simulation run. (Other simulations can be carried out simply by changing the starting seed in the code below and modifying various parameters as desired.) This simulated experiment measured absorbance at concentrations of $1$ down to $1/32$. The vertical spreads in values apparent in the scatterplot show the effects of (a) shot noise in the transmission measurements and (b) shot noise in the initial transmission measurement at zero concentration. (Notice how this actually creates some negative absorbance values.) Although the resulting errors are not going to have exactly the same distributions at each concentration, the roughly equal spreads are empirical evidence that the distributions are close enough to being the same that we needn’t worry about that. In other words, there is no need to weight the absorbances according to the concentrations.

The red diagonal line has been estimated from all 50 simulations. It has a slope of $\hat{\kappa}=2.13$, which differs slightly from the physically correct slope of $2$ that was used in the simulations. This deviation is so large because I assumed there was very little radiation to measure; the maximum photon count was only $1000$. In practice, maximum counts could be many orders of magnitude greater than this, leading to highly precise slope estimates–but then we would not learn much from this figure!

The histogram of residuals does not look good: it is skewed to the right. This indicates some kind of trouble. That trouble does not come from asymmetry in the residuals at each concentration; rather, it comes from a lack of fit. That is evident in the boxplots at the right: although the first five of them line up almost horizontally, the last one–at the highest concentration–clearly differs in location (it is too high) and scale (it is too long). This results from a nonlinear response I built into the simulation. Although the nonlinearity is present throughout the full range of concentrations, it has an appreciable effect only at the very highest concentrations. This is more or less what would happen in the laboratory, too. However, with only one calibration run available we could not draw such boxplots. Consider analyzing multiple independent runs if nonlinearity might be a problem.

The simulation was performed in R. The calculations with actual data, though, are simple to conduct by hand or with a spreadsheet: just make sure to check the residuals for nonlinearity.

#
# Simulate instrument responses:
#   concentration is an array of concentrations to use.
#   kappa is the Beer-Lambert law coefficient.
#   n.0   is the largest  expected photon count (at 0 concentration).
#   start is a tiny positive value used to avoid logs of zero.
#   beta  is the amount of nonlinearity (it is a quadratic perturbation
#           of the Beer-Lambert law).
# The return value is a parallel array of measured absorbances; it is subject
# to random fluctuations.
#
observe <- function(concentration, kappa=1, n.0=10^3, start=1/6, beta=0.2) {
transmission <- exp(-kappa * concentration - beta * concentration^2)
transmission.observed <- start + rpois(length(transmission), transmission * n.0)
absorbance <- -log(transmission.observed / rpois(1, n.0))
return(absorbance)
}
#
# Perform a set of simulations.
#
concentration <- 2^(-(0:5)) # Concentrations to use
n.iter <- 50                # Number of iterations
set.seed(17)                # Make the results reproducible
absorbance <- replicate(n.iter, observe(concentration, kappa=2))
#
# Put the results into a data frame for further analysis.
#
a.df <- data.frame(absorbance = as.vector(absorbance))
a.df$concentration <- concentration # ($ interferes with TeX processing on this site)
#
# Create the figures.
#
par(mfrow=c(1,3))
#
# Set up a region for the scatterplot.
#
plot(c(min(concentration), max(concentration)),
c(min(absorbance), max(absorbance)), type="n",
xlab="Concentration", ylab="Absorbance",
main=paste("Scatterplot of", n.iter, "iterations"))
#
# Make the scatterplot.
#
invisible(apply(absorbance, 2,
function(a) points(concentration, a, col="#40404080")))
slope <- mean(a.df$absorbance / a.df$concentration)
abline(c(0, slope), col="Red")
#
# Show the residuals.
#
a.df$residuals <- a.df$absorbance - slope * a.df$concentration #$
hist(a.df$residuals, main="Histogram of residuals", xlab="Absorbance Difference") #$
#
# Study the residual distribution vs. concentration.
#
boxplot(a.df$residuals ~ a.df$concentration, main="Residual distributions",
xlab="Concentration")