## Given a t-SNE plot, how can I infer the "most correct" labels? How does one understand its structure?


Let's say I begin with an exceptionally large dataframe (e.g. imported/munged from tsv files). Several of these columns are categorical labels.

(As a more concrete example, let's imagine a group of students in a school district, pre-school to high-school).

Now, I begin using sklearn and instantiate a t-SNE model, similar to the example here:

http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

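    import numpy as np
    from sklearn.manifold import TSNE

    # X is my data: a numeric array of shape (n_samples, n_features)
    np.set_printoptions(suppress=True)
    model = TSNE(n_components=2, random_state=0)
    X_embedded = model.fit_transform(X)  # 2-D coordinates, shape (n_samples, 2)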


We then plot the result. The plot might look something like this: http://imgur.com/a/3amkJ

Here's my problem: with real datasets, after using t-SNE to learn/cluster, I end up with a number of "clusters". Then, using the categorical labels, I go through each of these and try to figure out what structure the t-SNE plot is giving me.

For our school example, I'd get the t-SNE output, then I would label the datapoints. (Let's assume that the clusters are actually representative of age/classroom, e.g. the first-graders group together, the second-graders are a group, etc.)

If I try to color this plot with "grades", I'll see that grades do not really explain the structure of this plot. (Why? Because every class level has students with As, Bs, Cs, etc.) Then I might try height... that does pretty well (because there's a correlation: short students --> pre-school, tall students --> high school seniors).

How does one use a t-SNE plot to infer the "most correct" labels of the data? How does one use t-SNE plots to explain (and further explore) the plot structure?

The tSNE authors, if I recall correctly, advise against using tSNE for clustering or other analysis. They consider it to be only suited for presentation, because it may pull similar objects too far apart. – Has QUIT--Anony-Mousse – 2017-01-28T20:00:30.693

@Anony-Mousse What do you mean, "presentation"? – ShanZhengYang – 2017-01-29T11:13:32.723

As in "visualization of results". – Has QUIT--Anony-Mousse – 2017-01-29T12:34:45.760

@Anony-Mousse Just visualizing data? That's a bit odd---the authors directly contrast t-SNE with PCA... – ShanZhengYang – 2017-01-29T19:50:44.090

The paper is very clear about the use case (first word is 'visualizing'!) - and it's fairly easy to see that tSNE does not preserve density by design. But density is what makes up clusters (even in methods not considered to be density-based). – Has QUIT--Anony-Mousse – 2017-01-29T20:26:33.993

PCA is of course also a popular choice for visualization, so it makes sense to compare the visualization results. – Has QUIT--Anony-Mousse – 2017-01-29T21:21:08.960

@Anony-Mousse So how do I interpret the clusters within t-SNE? I realize that cluster sizes and distances between clusters mean nothing. – ShanZhengYang – 2017-02-02T14:36:02.590

You don't 'interpret clusters of tSNE'. It is a visualization; if you have clusters you can e.g. color points to see if the projection preserves the clusters somewhat; if they are completely random, something is broken. But remember that tSNE does not show how well clusters are separated in the original space; it only tries to put close neighbors somewhat close if possible. – Has QUIT--Anony-Mousse – 2017-02-02T22:07:32.137

@Anony-Mousse I'm still confused. Let's say we have a t-SNE plot: certain points are closer together in data space, as shown in a t-SNE plot. Other points are far away from one another in a t-SNE plot. What is the correct conclusion from the viewer seeing this t-SNE plot? – ShanZhengYang – 2017-02-03T23:00:17.660

There is only a good probability (but no guarantees) that the nearest p neighbors are preserved. Not so much their distances. An outlier will still try to be close to its p neighbors, even if they originally were very far away. – Has QUIT--Anony-Mousse – 2017-02-04T00:57:16.157

4

With t-SNE, no input feature is weighted more heavily than any other. The differences you want to see, like students forming islands by grade level, may therefore not appear, because there is so much other data present to pull those students/data points in different directions.

I highly encourage you to have a specific question in mind and tailor your input categories so that your question can be answered by the t-sne map.

You might try asking a specific question and changing which input categories you ask t-SNE to look at. For example, do taller students get better grades? Feed in height and grades categories while leaving out grade level and age. This is a silly example but I hope it gives an idea of how you can use t-sne to help you learn about your data.

You might also find that there are categories that mask meaningful findings. Height might not be very useful for pulling out meaningful information and since the height range is going to be much larger than the A-F grade range, it will likely influence the t-sne map more.
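One way to keep a wide-ranging column like height from dominating is to rescale the chosen columns before running t-SNE. Here is a minimal sketch of that idea, assuming synthetic data and made-up column names (`height_cm`, `grade_points`); `StandardScaler` and `TSNE` are the scikit-learn classes, everything else is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_students = 120

# Synthetic stand-ins for two input columns (hypothetical, for illustration only).
height_cm = rng.uniform(100, 190, size=n_students)   # spans roughly 90 units
grade_points = rng.integers(0, 5, size=n_students)    # F=0 ... A=4, spans 4 units

# Keep only the columns relevant to the question being asked.
X = np.column_stack([height_cm, grade_points]).astype(float)

# Rescale each column to zero mean and unit variance so that
# neither column dominates the pairwise distances t-SNE works from.
X_scaled = StandardScaler().fit_transform(X)

embedding = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(X_scaled)
print(embedding.shape)
```

Whether standardizing is the right choice depends on the question you are asking; it is just one common way to put columns on a comparable footing.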

Looking at the data as you described on a color scale for each category is a good place to start with each new t-sne run.
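If eyeballing the colors gets tedious, you could also put a rough number on how well each label lines up with the map. The sketch below is only one possible heuristic, not something t-SNE itself provides: for each point it looks at the nearest neighbors in the 2-D embedding and measures how often they share that point's label. The helper `neighbor_label_agreement` and the synthetic `class_level` / `letter_grade` labels are assumptions made up for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def neighbor_label_agreement(embedding, labels, k=10):
    """Mean fraction of each point's k nearest embedded neighbors sharing its label."""
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
    _, idx = nn.kneighbors(embedding)
    # Column 0 is the query point itself (assuming no exact duplicates), so drop it.
    neighbor_labels = labels[idx[:, 1:]]
    return float((neighbor_labels == labels[:, None]).mean())


rng = np.random.default_rng(0)

# Synthetic example: three class levels that really do differ in the features,
# and a letter grade that is unrelated to the features.
class_level = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 5)) + 6.0 * class_level[:, None]
letter_grade = rng.integers(0, 5, size=150)

embedding = TSNE(n_components=2, random_state=0).fit_transform(X)

score_level = neighbor_label_agreement(embedding, class_level)
score_grade = neighbor_label_agreement(embedding, letter_grade)
print("class level:", score_level, "letter grade:", score_grade)
```

A label with a higher score is a better candidate for explaining the local structure of the map. This only looks at near neighbors on purpose, since distances between far-apart groups in a t-SNE plot are not reliable.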

It is okay to run multiple t-snes with different parameters. To be sure you are getting a meaningful answer, I recommend running t-sne multiple times with the same parameters as well.
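As a rough sketch of what repeating a run could look like: keep every parameter fixed, change only the random seed, and check each result. Here I use scikit-learn's `trustworthiness` function as the check (it scores how well near neighbors in the map reflect near neighbors in the original data); that choice, the synthetic `X`, and the explicit `init="random"` (so that the seed actually changes the starting layout) are my assumptions, not the only way to do it.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))  # synthetic placeholder for the real feature matrix

scores = {}
for seed in (0, 1, 2):
    # Same parameters every time; only the random seed changes.
    emb = TSNE(n_components=2, perplexity=30, init="random",
               random_state=seed).fit_transform(X)
    # Score in [0, 1]; higher means local neighborhoods are better preserved.
    scores[seed] = trustworthiness(X, emb, n_neighbors=10)

for seed, score in scores.items():
    print(seed, round(score, 3))
```

If a pattern shows up in one run but disappears in the others, I would be cautious about reading much into it.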

If you have enough students, it would be great to include half in a training set where you explore and figure out what questions to ask. Then, when you think you've found something meaningful, apply those conditions to the other students in the testing set to see if it holds true.
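A minimal sketch of that split, assuming synthetic data and a made-up `labels` column. Note that scikit-learn's `TSNE` has no `transform` method for new points, so here the held-out half is simply embedded in its own run with the same settings:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # synthetic placeholder features
labels = rng.integers(0, 3, size=200)  # hypothetical categorical label

# Half for exploring, half held back to check whether findings hold up.
X_explore, X_holdout, y_explore, y_holdout = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

# Settings chosen while exploring are reused unchanged on the held-out half.
params = dict(n_components=2, perplexity=30, random_state=0)
emb_explore = TSNE(**params).fit_transform(X_explore)
emb_holdout = TSNE(**params).fit_transform(X_holdout)

print(emb_explore.shape, emb_holdout.shape)
```

The two maps will not line up point for point, so the comparison is about whether the same kind of pattern (for example, the same label explaining the groups) shows up in both.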

"For example, do taller students get better grades? Feed in height and grades categories while leaving out grade level and age. This is a silly example but I hope it gives an idea of how you can use t-sne to help you learn about your data." This is a good suggestion. Thank you. It's a good method for seeing the structure of t-SNE. – ShanZhengYang – 2017-01-27T20:50:57.377