Why does the implementation of t-SNE in R default to the removal of duplicates?

Specifically, the R implementation of t-SNE in the Rtsne package has a “check_duplicates” argument, and the documentation suggests that “it is best to make sure there are no duplicates present and set this option to FALSE, especially for large datasets”.

Further, if you attempt to run t-SNE on a dataset in R that does have duplicates, you get the error message: “Error in [command snipped by user]: Remove duplicates before running TSNE.”

So, why does this behavior occur? I have a dataset in which multiple samples coincidentally have the same measurements.

Is it simply a case of “duplicate data points will map to the same points after reduction anyway, so don’t waste processing power”? Or does the presence of duplicates affect the algorithm’s calculations?

Answer

The algorithm is designed to handle datasets without duplicate observations, so the package performs a check before applying the technique. The authors suggest that you remove duplicates yourself and set check_duplicates = FALSE to get a performance improvement.
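For example, here is a minimal sketch of that recommended workflow (the iris data, which happens to contain a duplicated row, and the perplexity value are only for illustration):

library(Rtsne)

# Deduplicate first, then skip the (now redundant) duplicate check.
X <- unique(as.matrix(iris[, 1:4]))   # iris contains one duplicated row
out <- Rtsne(X, perplexity = 30, check_duplicates = FALSE)
plot(out$Y, pch = 19)                 # the 2-D embedding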

The relevant check in the R implementation is:

if (check_duplicates & !is_distance) {
  if (any(duplicated(X))) { stop("Remove duplicates before running TSNE.") }
}

The default values are check_duplicates = TRUE and is_distance = FALSE.
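So the check is simply base R’s duplicated() applied to the rows of X. A small illustration (the toy matrix is made up):

X <- rbind(c(1, 2, 3),
           c(4, 5, 6),
           c(1, 2, 3))   # row 3 duplicates row 1
any(duplicated(X))       # TRUE, so Rtsne() would stop with the error above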

For anyone who wants to understand more about the method, the paper is here.

Attribution
Source : Link , Question Author : tluh , Answer Author : cdutra
