I am a bit of a novice in R and feature selection, and have tried the Boruta package to reduce my number of variables (n = 40). I thought that this method also took into account possible correlation between variables; however, two of the 20 selected variables are highly correlated, and two others are completely correlated. Is this normal? Shouldn’t the Boruta method have classified one of the two as unimportant?

**Answer**

> …, two (of the 20 variables selected) are highly correlated, and two others are completely correlated. Is this normal? Shouldn’t the Boruta method have classified one of the two as unimportant?

Yes, it is normal. Boruta aims to find all features relevant to the response variable $y$. Rigorously speaking, a predictor variable $x_i$ is said to be relevant to $y$ if $x_i$ and $y$ are not conditionally independent given some other predictor variables (or given nothing, which would simply mean that $x_i$ and $y$ are not independent).

Consider this simple example:

```
set.seed(666)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n,sd=0.5)
x3 <- rnorm(n)
y <- x2 + rnorm(n)
```

You can see that $y=x_2+\text{noise}$, so $x_2$ is relevant to $y$, because $y$ and $x_2$ are not independent. You can also see that $x_2=x_1+\text{noise}$, so $y$ is not independent of $x_1$ either, and $x_1$ is relevant to $y$ as well. The only variable not relevant to $y$ is $x_3$, because:

- $y$ and $x_3$ are independent
- $y$ and $x_3$ are conditionally independent given $x_1$
- $y$ and $x_3$ are conditionally independent given $(x_1,x_2)$
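These statements can be checked informally on the simulated data. The sketch below uses Pearson correlation tests and the coefficient tests of a linear model as stand-ins for independence and conditional independence tests. That substitution is an assumption that only suits this linear Gaussian example, and a non-significant test is not a proof of independence.

```r
# Re-create the simulated data from the example above
set.seed(666)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
y <- x2 + rnorm(n)

# Marginal dependence: p-value of a correlation test of each predictor with y
marginal_p <- sapply(
  list(x1 = x1, x2 = x2, x3 = x3),
  function(x) cor.test(x, y)$p.value
)
print(marginal_p)

# Conditional dependence (linear proxy): each coefficient is tested
# given the other two predictors
fit <- lm(y ~ x1 + x2 + x3)
print(coef(summary(fit)))
```

In the joint model, $x_1$ should look much weaker than it does marginally, since by construction it only affects $y$ through $x_2$. It still counts as relevant under the definition above, because it is not independent of $y$ when nothing is conditioned on.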

Then Boruta finds the expected result:

```
> library(Boruta)
> Boruta(data.frame(x1,x2,x3), y)
Boruta performed 30 iterations in 2.395286 secs.
2 attributes confirmed important: x1, x2.
1 attributes confirmed unimportant: x3.
```

There is a high correlation between $x_1$ and $x_2$, but Boruta does not care about that:

```
> cor(x1,x2)
[1] 0.896883
```

**Attribution** *Source: Link, Question Author: Charlotte, Answer Author: Stéphane Laurent*