I am working on an R script for core records. I have a data frame whose first column is depth and whose remaining columns hold the elemental compositions at each depth. I want to run a PCA on it, and I am unsure which standardization method to choose.

Has any of you used `clr()` to prepare your data for `prcomp()`? Or does it distort my results? I have tried applying `clr()` to the data before calling `prcomp()`, in addition to using the `scale.` argument of `prcomp()`:

`data_f_clr <- clr(data_f)`
`data_pca <- prcomp(data_f, center = TRUE, scale. = TRUE)`

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html

`scale.` is documented as scaling the variables so that they have unit variance. Since my variables are on very different scales, I think that is what I want. The problem is that I get a different solution depending on whether I use the code above or skip `clr()` (skipping it gives the result I prefer). Why does `clr()` interfere in this case?

**Answer**

You might experience some issues with vanilla PCA on CLR coordinates. There are two major problems with compositional data:

- they are strictly non-negative
- they have a sum constraint

Various compositional transforms address one or both of these issues. In particular, CLR transforms your data by taking the log of the ratio between observed frequencies x and their geometric mean G(x), i.e.

$$\hat{x} = \left\{\log\frac{x_1}{G(x)}, \dots, \log\frac{x_n}{G(x)}\right\} = \left\{\log(x_1) - \log(G(x)), \dots, \log(x_n) - \log(G(x))\right\}$$

Now, consider that

$$\log(G(x)) = \log\left(\exp\left[\frac{1}{n}\sum_{i=1}^{n}\log(x_i)\right]\right) = E[\log(x)]$$

This effectively means that

$$\sum \hat{x} = \sum\left[\log(x) - E[\log(x)]\right] = 0$$
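The zero-sum identity is easy to check numerically. The following is a minimal sketch in plain Python; the three-part composition is made up for illustration, and `clr` here is a hand-written helper, not the R function from the question.

```python
import math

def clr(composition):
    """Centered log-ratio: each log minus the mean of the logs."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean G(x)
    return [value - mean_log for value in logs]

# Hypothetical three-part composition, chosen only for illustration.
composition = [0.2, 0.3, 0.5]
coords = clr(composition)
total = sum(coords)

print(coords)
print(total)  # zero up to floating-point rounding
```

Because only ratios enter the transform, rescaling the input (for example passing `[2.0, 3.0, 5.0]`) should give the same coordinates.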

In other words, CLR removes the value-range restriction (which is good for some applications) but does not remove the sum constraint. The result is a singular covariance matrix, which effectively breaks (M)ANOVA, linear regression and similar methods, and makes PCA sensitive to outliers (because robust covariance estimation requires a full-rank matrix). As far as I know, of all compositional transforms only ILR addresses both issues without any major underlying assumptions.

The situation is a bit more complicated, though. SVD of CLR coordinates gives you an orthogonal basis in the ILR space (ILR coordinates span a hyperplane in CLR space), so your variance estimates will not differ between ILR and CLR (which follows from both ILR and CLR being isometries on the simplex). There are, however, methods for robust covariance estimation on ILR coordinates [2].
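These claims can be checked on synthetic data. The sketch below is stdlib-only and rests on several assumptions: the samples are made-up uniform draws (not real compositions, which is harmless here because both transforms ignore overall scale), and `ilr` uses one particular orthonormal Helmert-type basis out of many valid choices. It checks that every row of the CLR covariance matrix sums to zero (so the vector of ones lies in its null space and the matrix is singular), that the ILR covariance has a positive determinant, and that total variance is the same in both coordinate systems.

```python
import math
import random

def clr(x):
    """Centered log-ratio coordinates (D values that sum to zero)."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def ilr(x):
    """ILR coordinates (D - 1 values) in one Helmert-type orthonormal basis."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(1, len(logs)):
        mean_first = sum(logs[:i]) / i  # mean log of the first i parts
        coords.append(math.sqrt(i / (i + 1)) * (mean_first - logs[i]))
    return coords

def covariance(rows):
    """Sample covariance matrix of a list of equally long rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(d)] for j in range(d)]

# Hypothetical data: 200 samples of three positive parts (seed is arbitrary).
rng = random.Random(0)
samples = [[rng.uniform(1.0, 10.0) for _ in range(3)] for _ in range(200)]

cov_clr = covariance([clr(s) for s in samples])
cov_ilr = covariance([ilr(s) for s in samples])

row_sums = [sum(row) for row in cov_clr]  # each ~0: covariance is singular
det_ilr = cov_ilr[0][0] * cov_ilr[1][1] - cov_ilr[0][1] * cov_ilr[1][0]
trace_clr = sum(cov_clr[i][i] for i in range(3))
trace_ilr = sum(cov_ilr[i][i] for i in range(2))

print(row_sums)
print(det_ilr)
print(trace_clr, trace_ilr)
```

The matching traces are the numerical counterpart of the isometry remark: switching between CLR and ILR rotates the coordinates without changing the total variance.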

**Update I**

Just to illustrate that CLR is not valid for correlation and location-dependent methods: let's assume we sample a community of three independent, normally distributed components 100 times. For the sake of simplicity, let all components have equal expectations (100) and variances (100):

```
In [1]: import numpy as np
In [2]: from scipy.stats import linregress
In [3]: from scipy.stats.mstats import gmean
In [4]: def clr(x):
   ...:     # x has one row per component; gmean runs over axis 0, i.e. per sample
   ...:     return np.log(x) - np.log(gmean(x))
   ...:
In [5]: nsamples = 100
In [6]: samples = np.random.multivariate_normal(
   ...:     mean=[100]*3, cov=np.eye(3)*100, size=nsamples
   ...: ).T
In [7]: transformed = clr(samples)
In [8]: np.corrcoef(transformed)
Out[8]:
array([[ 1.        , -0.59365113, -0.49087714],
       [-0.59365113,  1.        , -0.40968767],
       [-0.49087714, -0.40968767,  1.        ]])
In [9]: linregress(transformed[0], transformed[1])
Out[9]:
LinregressResult(
    slope=-0.5670, intercept=-0.0027, rvalue=-0.5936,
    pvalue=7.5398e-11, stderr=0.0776
)
```
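For contrast, it may help to compare the correlation before and after the transform on the same draws. The following is a stdlib-only sketch of the same setup (so it runs without NumPy or SciPy); the seed, the larger sample size and the helper `pearson` are illustrative choices, not part of the session above. With three independent, equal-variance components the raw correlation should be close to zero, while the CLR correlation should land near -1/(D-1) = -0.5, which is in line with the values in `Out[8]`.

```python
import math
import random

def clr(x):
    """Centered log-ratio coordinates of one sample."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Assumed setup: three independent N(100, 10^2) components, arbitrary seed.
rng = random.Random(42)
n_samples = 2000
raw = [[rng.gauss(100.0, 10.0) for _ in range(3)] for _ in range(n_samples)]
transformed = [clr(row) for row in raw]

r_raw = pearson([r[0] for r in raw], [r[1] for r in raw])
r_clr = pearson([t[0] for t in transformed], [t[1] for t in transformed])

print(r_raw)  # independent by construction
print(r_clr)  # negative correlation induced by the sum constraint
```

The negative correlation is introduced by the transform itself, so it says nothing about how the underlying components relate to each other.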

**Update II**

Considering the responses I've received, I find it necessary to point out that at no point in my answer have I said that PCA doesn't work on CLR-transformed data. I've stated that CLR can break PCA in **subtle** ways, which might not be important for dimensionality reduction, but are important for exploratory data analysis. The paper cited by @Archie covers microbial ecology. In that field of computational biology, PCA or PCoA on various distance matrices are used to explore sources of variation in the data. My answer should only be considered in this context. Moreover, this is highlighted in the paper itself:

> … The compositional biplot [note: referring to PCA] has several advantages over the principal co-ordinate (PCoA) plots for β-diversity analysis. The results obtained are very stable when the data are subset (Bian et al., 2017), meaning that *exploratory analysis* is not driven simply by the presence absence relationships in the data nor by excessive sparsity (Wong et al., 2016; Morton et al., 2017).

**Update III**

Additional references to published research (I thank @Nick Cox for the recommendation to add more references):

**Attribution** *Source: Link, Question Author: T.rex, Answer Author: Eli Korvigo*