# Does Boruta feature selection (in R) take into account the correlation between variables?

I am a bit of a novice in R and feature selection, and I have tried the Boruta package to reduce my number of variables (n = 40). I thought that this method also took the possible correlation between variables into account. However, two of the 20 selected variables are highly correlated, and two others are completely correlated. Is this normal? Shouldn’t the Boruta method have classified one of the two as unimportant?

> … , two (of the 20 variables selected) are highly correlated, and two
> others are completely correlated. Is this normal? Shouldn’t the Boruta
> method have classified one of the two as unimportant?

Yes, it is normal. Boruta aims to find all features that are relevant to the response variable $y$, not a minimal non-redundant subset. Rigorously speaking, a predictor variable $x_i$ is said to be relevant to $y$ if $x_i$ and $y$ are not conditionally independent given some other predictor variables (or given nothing, which would simply mean that $x_i$ and $y$ are not independent).
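The definition above can be written compactly. The notation in this sketch is not in the original answer and is only assumed for illustration: $p$ is the number of predictors, $S$ is a (possibly empty) set of predictor indices, $x_S$ is the corresponding group of predictors, and $\perp$ denotes (conditional) independence.

$$
x_i \text{ is relevant to } y
\iff
\exists\, S \subseteq \{1,\dots,p\}\setminus\{i\}
\ \text{ such that }\
x_i \not\perp y \mid x_S ,
$$

where $S=\varnothing$ covers the unconditional case $x_i \not\perp y$.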

Consider this simple example:

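    set.seed(666)                     # make the simulation reproducible
    n <- 100                          # number of observations
    x1 <- rnorm(n)                    # independent standard normal predictor
    x2 <- x1 + rnorm(n, sd = 0.5)     # x2 = x1 + noise, so strongly correlated with x1
    x3 <- rnorm(n)                    # pure noise, unrelated to everything else
    y <- x2 + rnorm(n)                # response depends directly on x2 only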


You see that $y=x_2+\text{noise}$, so $x_2$ is relevant to $y$, because $y$ and $x_2$ are not independent. You also see that $x_2=x_1+\text{noise}$, so $y$ is not independent of $x_1$ either, and $x_1$ is relevant to $y$ as well. The only variable not relevant to $y$ is $x_3$, because:

• $y$ and $x_3$ are independent
• $y$ and $x_3$ are conditionally independent given $x_1$
• $y$ and $x_3$ are conditionally independent given $(x_1,x_2)$
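These (in)dependence claims can be checked empirically on the simulated data. The following is only a rough sketch: `partial_cor` is a hypothetical helper written for this illustration (it is not part of Boruta), and it uses partial correlation from linear-regression residuals as a stand-in for conditional dependence. That stand-in is reasonable here because all variables are jointly Gaussian; in general it only detects linear dependence.

```r
# Re-create the toy data from the example (same seed, same model).
set.seed(666)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
y  <- x2 + rnorm(n)

# Marginal dependence: plain correlation of each predictor with y.
marginal <- c(x1 = cor(x1, y), x2 = cor(x2, y), x3 = cor(x3, y))
print(round(marginal, 2))

# Hypothetical helper (not a Boruta function): partial correlation of a and b
# given z, obtained by correlating the residuals of a ~ z and b ~ z.
partial_cor <- function(a, b, z) {
  z <- as.matrix(z)
  cor(resid(lm(a ~ z)), resid(lm(b ~ z)))
}

# Conditional dependence, approximated by partial correlations.
conditional <- c(
  x3_given_x1    = partial_cor(x3, y, x1),
  x3_given_x1_x2 = partial_cor(x3, y, cbind(x1, x2)),
  x1_given_x2    = partial_cor(x1, y, x2)
)
print(round(conditional, 2))
```

Up to sampling noise, the values for $x_3$ should be close to zero both marginally and conditionally, while $x_1$ and $x_2$ should be clearly correlated with $y$. The partial correlation of $x_1$ and $y$ given $x_2$ should also be near zero, yet $x_1$ still counts as relevant under the definition above because it is dependent on $y$ when conditioning on nothing.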

Then Boruta finds the expected result:

    > library(Boruta)
    > Boruta(data.frame(x1,x2,x3), y)
    Boruta performed 30 iterations in 2.395286 secs.
    2 attributes confirmed important: x1, x2.
    1 attributes confirmed unimportant: x3.


There is a high correlation between $x_1$ and $x_2$, but Boruta does not take that into account:

    > cor(x1,x2)
    [1] 0.896883