Choosing the hyperparameters using T-SNE for classification

In as specific problem that I work with (a competition) I have the follwoing setting: 21 features (numerical on [0,1]) and a binary output. I have approx 100 K rows. The setting seems to be very noisy.

Me and other participants apply feature generation for a while and t-distributed stochastic neighbor embedding turned out to be rather powerful in this setting.

I stumbled upon this post “How to Use t-SNE Effectively” but still I can not really conclude on how to choose the hyperparameters best in my setting of classifcation.

Are there any rules of thumb (number of features, dimension of embedding -> choice of perplexity)?

I just apply ad-hoc settings at the moment as it takes too long to iterate various settings.
Thanks for any comments.

Answer

I routinely use t-SNE (alongside clustering techniques – more on this in the end) to recognise/assess the presence of clusters in my data. Unfortunately to my knowledge there is no standard way to choose the correct perplexity aside looking at the produced reduced dimension dataset and then assessing if it is meaningful. There are some general facts, eg. distances between clusters are mostly meaningless, small perplexity values encourage small clot-like structures but that’s about it.

A very rough rule-of-thumb is to check what is the error value associated with each reconstruction. t-SNE is trying to minimise the sum of the Kullback-Leibler divergences between the distribution of the distances between the data in the original domain and the distribution of distances between the data in the reduced dimension domain (actually the target distributions are the distributions of the probabilities that a point will pick another point as its neighbour but these are directly proportional to the distance between the two points). It could be argued that smaller values of KL-divergence show better results. This idea does not work very well in practice but it would theoretically help to exclude some ranges of the perplexity values as well as some runs of the algorithm that are clearly suboptimal. I explain why this heuristic is far from a panacea and how it could though be mildly useful: The perplexity parameter increases monotonically with the variance of the Gaussian used to calculate the distances/probabilities. Therefore as you increase the perplexity parameter as a whole you will get smaller distances in absolute terms and subsequent KL-divergence values. Nevertheless if you have 20 runs with the same perplexity and you cannot (do not want to) look at them you can always pick the one with the smallest variable hoping it retains the original distances more accurately. The same goes for the θ, the approximation parameter for the Barnes-Hut approximation, assuming perplexity is fixed changing θ and then checking the resulting costs should be somewhat informative. In the end of the day, lower costs are associated with more faithful reconstructions. All is not lost though…

For your particular use case, a trick to mildly automate the procedure of picking a good perplexity value is the following: Run a small clustering procedure (say a k-means or DBSCAN) on the reduced dimensionality dataset and then assess the quality of that clustering using some sort of index (Cohen’s k, Rand index, Fowlkes-Mallows, etc.) against what you try to predict. The idea here is that for your task at hand the correct representation of the data (the perplexity dependant t-SNE results) should give the most informative representation (in the form of one of the metrics mentioned) in terms of their alignment with the property you try to predict. This is why t-SNE was used in the first place after all, if the resulting representation is uninformative for the properties we are investigating then it is simply no good despite its low reconstruction error, visual appeal, etc. etc.

Let me point out that what I describe are heuristics. As mentioned in the beginning of my post, manually inspecting the results is an indispensable way of assessing the quality of the resulting dimensionality reduction/clustering.

Attribution
Source : Link , Question Author : Richi W , Answer Author : Hagbard

Leave a Comment