Let's say I begin with an exceptionally large dataframe (e.g. imported/munged from TSV files). Several of its columns are categorical labels.
(As a more concrete example, let's imagine a group of students in a school district, pre-school to high-school).
Now, I begin using sklearn and instantiate a t-SNE model, similar to the example here:
    import numpy as np
    from sklearn.manifold import TSNE

    X  # my data
    model = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    model.fit_transform(X)
I then plot the resulting 2-D embedding. The plot might look something like this: http://imgur.com/a/3amkJ
Here's my problem: with real datasets, after using t-SNE to embed the data, the plot will show a number of "clusters". Then, using the categorical labels, I go through each of them and try to figure out what structure the t-SNE plot is showing me.
For our school example, I'd get the t-SNE output, then I would label the datapoints. (Let's assume that the clusters are actually representative of age/classroom, e.g. the first-graders group together, the second-graders are a group, etc.)
If I try to color this plot by "grades" (i.e. letter grades), I'll see that grades do not really explain the structure of the plot. (Why? Because every class level has students with As, Bs, Cs, etc.) Then I might try height, which does pretty well (because there's a correlation: short students tend to be in pre-school, tall students tend to be high-school seniors).
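To make the "try each label in turn" procedure concrete, here is a rough sketch of a numeric version of coloring the plot by each candidate label. Everything in it is an assumption for illustration: the data is synthetic (`make_blobs` stands in for the school dataset), the label names `class_level` and `letter_grade` are made up, and scoring each label with the silhouette coefficient on the embedding is just one possible way to quantify "this label explains the structure", not an established method for reading t-SNE plots.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the school data (an assumption, not real data):
# four well-separated groups play the role of class levels.
X, class_level = make_blobs(n_samples=200, centers=4, n_features=5,
                            cluster_std=1.0, random_state=0)

# Hypothetical candidate labels. "letter_grade" is assigned at random,
# so it should be unrelated to the group structure.
rng = np.random.RandomState(0)
candidate_labels = {
    "class_level": class_level,
    "letter_grade": rng.choice(list("ABCD"), size=len(X)),
}

# Same t-SNE settings as in the snippet above.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)

# Score each label by how well it separates points in the 2-D embedding.
# A higher silhouette means points sharing a label sit close together.
scores = {name: silhouette_score(embedding, values)
          for name, values in candidate_labels.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

In this toy setup the label that generated the groups should score well above the random one, which mirrors what I see by eye when coloring the plot. It only handles discrete labels, though; a continuous column such as height would need to be binned first or scored some other way.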
How does one use a t-SNE plot to infer the "most correct" labels of the data? How does one use t-SNE plots to explain (and further explore) the plot structure?