How to more simply see overlapping data for dozens of variables?



I'm trying to find the best way to see how multiple variables (about 40) describing a very large user base interact with one another.

As an analogy, imagine I had survey data of superheroes liked by thousands of school children, such that Adam Agner only likes Batman, Brian Agosto likes Superman and Batman, Cian Ailiff likes Wonder Woman, Superman and the Flash, etc. The index would be the list of children's names (tens of thousands), and the variables would be the superheroes (from a list of 50), each with a True or False value.

Rather than just the number of children who like each superhero, I'd like to see the overlap information in some easy way. I might then find that an unusually large number only like Batman, or that most of those who like Superman also like the Flash.

The easiest would be to do it visually, but Venn diagrams wouldn't be practical with a large number (50 here) of variables. So it would seem like a correlation matrix with a heat-map base would be the way to go, with something like this:


I can imagine that a heat map with superheroes plotted on both axes could work for spotting interesting matches of one variable against each of the others. It would still take some less-than-ideal extra steps to see when three or more superheroes match, such as checking that the colours for superhero1 against superhero2 and against superhero3 are both high. That is guesswork, though, because the number of children who actually like all three could still be low.
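To make the pairwise-versus-triple concern concrete, here is a minimal sketch with made-up data (the `likes` table and its six rows are hypothetical, not real survey results). Every pair of heroes overlaps for two children, yet no child likes all three, so exact combinations need to be counted directly rather than read off a pairwise heat map.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data: one row per child, one boolean column per hero.
likes = pd.DataFrame(
    {
        "Batman":   [True,  True,  False, False, True,  True],
        "Superman": [True,  True,  True,  True,  False, False],
        "Flash":    [False, False, True,  True,  True,  True],
    }
)

# Pairwise overlap: children who like both heroes of a pair.
batman_superman = int((likes["Batman"] & likes["Superman"]).sum())
superman_flash = int((likes["Superman"] & likes["Flash"]).sum())
batman_flash = int((likes["Batman"] & likes["Flash"]).sum())

# Triple overlap: children who like all three heroes.
all_three = int(likes[["Batman", "Superman", "Flash"]].all(axis=1).sum())

# Exact combinations: the precise set of heroes each child likes.
combos = Counter(frozenset(likes.columns[row]) for row in likes.to_numpy())

print(batman_superman, superman_flash, batman_flash, all_three)
for heroes, count in combos.most_common():
    print(sorted(heroes), count)
```

With 50 heroes the number of possible combinations is huge, but only the combinations that actually occur are stored, so `combos.most_common(20)` would be one way to list the largest exact overlaps for further investigation.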

At the moment, the best solution I can think of is to try to reduce my variables down into categories, so the example could be Marvel, DC, male, female, but that loses some potential data that could be helpful from overlaps within those categories.

Maybe if I did something like the image above, the circle size would be the number for that overlap, and the colour could be number matches with other variables (without listing them, but I could further investigate). That kind of coding is a little out of my comfort zone, but I could try.

I'd ideally do this in matplotlib or by some other means within Python, but if I have to use MATLAB or another tool then I will consider that. Ideas or suggestions appreciated! Hopefully this was clear, thank you!

Benny Lewis

Posted 2017-06-09T21:14:49.103

Reputation: 113



For this example specifically, I would suggest visualizing the data using a Chord Diagram.

[Figure: example chord diagram]

A chord diagram would allow you to see interactions and co-occurrences of likes between each of the heroes directly, including relative magnitude of the effects intuitively and immediately. You could also include other properties of the character (universe of origin, gender, time period of introduction etc) by use of color and/or positioning of the hero in question on the circumference of the diagram.

As chord diagrams are a graph-based approach, you'd want to transform the individual observations you have into what would be a (this has to be a first for this term) hero-like-cofrequency matrix formatted as follows:

|           | Batman | Spiderman | Superman |
| Batman    | 0      | 2         | 3        |
| Spiderman | 2      | 0         | 1        |
| Superman  | 3      | 1         | 0        |

Note that this represents an undirected graph, so the matrix is symmetric about its main diagonal.
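As a rough sketch of that transformation, assuming the observations sit in a pandas DataFrame of True/False values with one row per child (the six children below are invented so that the counts match the table above), the co-frequency matrix is the product of the transposed 0/1 matrix with itself, with the diagonal zeroed out:

```python
import numpy as np
import pandas as pd

# Hypothetical observations: one row per child, one boolean column per hero.
likes = pd.DataFrame(
    {
        "Batman":    [True,  True,  True,  True,  True,  False],
        "Spiderman": [True,  True,  False, False, False, False],
        "Superman":  [True,  False, True,  True,  False, True],
    },
    index=["child_1", "child_2", "child_3", "child_4", "child_5", "child_6"],
)

# Convert to 0/1 integers; X.T @ X counts, for each pair of heroes,
# the children who like both.
X = likes.to_numpy().astype(int)
counts = X.T @ X

# The diagonal holds each hero's total number of likes; zero it so the
# matrix only describes links between different heroes.
np.fill_diagonal(counts, 0)

cofrequency = pd.DataFrame(counts, index=likes.columns, columns=likes.columns)
print(cofrequency)
```

The same matrix can be passed to `plt.imshow` in matplotlib for a quick heat map before investing time in a chord diagram.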

As you're using Python, you may want to look into using Bokeh to create the chord diagram. One tutorial for doing so is available here.

Best of luck, true believer!

Thomas Cleberg

Posted 2017-06-09T21:14:49.103

Reputation: 1 437

Excellent!! That is visually simple to interpret, and yet it shows the relationships between all values, making it easy to notice any that stand out. I'll read up on all the links you've provided and attempt to implement this version, as it seems the most intuitive and straightforward. Bounty well-earned! – Benny Lewis – 2017-06-13T16:01:29.263


t-Distributed Stochastic Neighbor Embedding (t-SNE) might be exactly what you need. The linked page has code and links to various blogs describing its use.
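As an illustration only, here is a minimal sketch using scikit-learn's `TSNE` class on randomly generated like data (the data, the sizes and the perplexity value are all assumptions chosen to keep the example small). It embeds the heroes rather than the children, so heroes liked by similar sets of children should tend to land near each other in the 2-D layout.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical random data: 200 children x 12 heroes, True means "likes".
rng = np.random.default_rng(0)
n_children, n_heroes = 200, 12
likes = rng.random((n_children, n_heroes)) < 0.3

# Describe each hero by the vector of children who like it.
hero_vectors = likes.T.astype(float)

# Perplexity must be smaller than the number of points being embedded
# (12 heroes here), so a small value is used for this toy example.
tsne = TSNE(n_components=2, perplexity=4, init="random", random_state=0)
embedding = tsne.fit_transform(hero_vectors)

print(embedding.shape)
```

Each row of `embedding` is one hero's 2-D position and can be drawn with a matplotlib scatter plot. Note that t-SNE layouts depend heavily on the perplexity and the random seed, so distances between far-apart groups should not be over-interpreted.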


Posted 2017-06-09T21:14:49.103

Reputation: 214