I'm trying to find the best way to see how multiple variables (about 40), recorded for a very large userbase, interact with one another.

As an analogy, imagine I had survey data on the superheroes liked by thousands of school children, such that *Adam Agner* only likes Batman, *Brian Agosto* likes Superman and Batman, *Cian Ailiff* likes Wonder Woman, Superman and the Flash, etc. The index would be the list of children's names (tens of thousands), and the variables would be the superheroes (from a list of 50), each with a True or False value.

Rather than just the number of children who like each superhero, if I could see the overlap information in some easy way, I might find that an unusually large number like only Batman, or that most of those who like Superman also like the Flash.
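To make the kind of overlap information I mean concrete, here is a minimal sketch using pandas on just the three children named above. The frame name `likes` and the result names `cooc`, `cond` and `only_one` are made up for illustration; the real data would have tens of thousands of rows and 50 columns.

```python
import pandas as pd

# Toy data from the example: one row per child, one boolean column per superhero.
likes = pd.DataFrame(
    {
        "Batman":       [True,  True,  False],
        "Superman":     [False, True,  True],
        "Wonder Woman": [False, False, True],
        "Flash":        [False, False, True],
    },
    index=["Adam Agner", "Brian Agosto", "Cian Ailiff"],
)

X = likes.astype(int)

# Co-occurrence counts: entry (a, b) is the number of children who like both a and b.
# The diagonal holds the plain per-hero totals.
cooc = X.T.dot(X)

# Conditional share: row a, column b is the fraction of a's fans who also like b.
cond = cooc.div(X.sum(axis=0), axis=0)

# Children who like exactly one hero, counted per hero.
only_one = X[X.sum(axis=1) == 1].sum(axis=0)

print(cooc)
print(cond.round(2))
print(only_one)
```

Here `cond.loc["Superman", "Flash"]` would answer "what share of Superman fans also like the Flash", and `only_one["Batman"]` counts the children who like Batman and nothing else.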

The easiest would be to do it visually, but Venn diagrams wouldn't be practical with this many variables (50 here). So a correlation matrix drawn as a heat map seems like the way to go, with something like this:

I can imagine a heat map with superheroes plotted on both axes working for spotting interesting matches of one variable against each of the others. But it would still take some less-than-ideal extra steps to see when three or more superheroes match, such as checking that the colours for superhero1 with superhero2 and for superhero1 with superhero3 are both high. That's guesswork, because the number of children who actually like all three could still be low.
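The guesswork could be avoided by counting the three-way overlaps directly rather than reading them off pairwise colours. A rough sketch, assuming the data fits in a boolean NumPy array; `exact_overlaps` is a hypothetical helper and the data is randomly generated, not real survey results. With 50 columns there are 19,600 triples, so brute force should still be manageable.

```python
from itertools import combinations

import numpy as np


def exact_overlaps(matrix, names, k=3, top=5):
    """Count rows where all k chosen columns are True, for every k-subset of columns."""
    matrix = np.asarray(matrix, dtype=bool)
    counts = {}
    for cols in combinations(range(matrix.shape[1]), k):
        # all(axis=1) is True only for children who like every hero in the subset
        n = int(matrix[:, list(cols)].all(axis=1).sum())
        counts[tuple(names[c] for c in cols)] = n
    # Largest overlaps first
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical random data standing in for the real survey
rng = np.random.default_rng(0)
names = ["Batman", "Superman", "Wonder Woman", "Flash", "Aquaman"]
data = rng.random((1000, len(names))) < 0.3
data[:200, :3] = True  # plant one genuine three-way overlap

result = exact_overlaps(data, names, k=3, top=3)
for heroes, count in result:
    print(heroes, count)
```

Changing `k` gives overlaps of any size, although the number of subsets grows quickly as `k` increases.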

At the moment, the best solution I can think of is to reduce my variables to categories, which in the example could be *Marvel*, *DC*, *male*, *female*, but that loses potentially helpful information about overlaps within those categories.
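For reference, the category reduction could look roughly like this in pandas. The `publisher` mapping, the child labels and the extra heroes are assumptions for the sake of the example.

```python
import pandas as pd

likes = pd.DataFrame(
    {
        "Batman":     [True,  True,  False, False],
        "Superman":   [False, True,  True,  False],
        "Spider-Man": [False, False, True,  True],
        "Iron Man":   [False, False, False, True],
    },
    index=["child_1", "child_2", "child_3", "child_4"],
)

# Hypothetical mapping from each hero to its category
publisher = {
    "Batman": "DC",
    "Superman": "DC",
    "Spider-Man": "Marvel",
    "Iron Man": "Marvel",
}

# Group the hero columns by category; a child counts for a category
# if they like at least one hero in it.
by_publisher = likes.T.groupby(publisher).any().T

print(by_publisher)
```

After the reduction, `child_1` (Batman only) and `child_2` (Batman and Superman) look identical, which is exactly the within-category overlap that gets lost.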

Maybe if I did something like the image above, the circle size would be the count for that overlap, and the colour could be the number of matches with other variables (without listing them, but I could investigate further). That kind of coding is a little out of my comfort zone, but I could try.
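A first attempt at that circle-and-colour idea in matplotlib might look like the sketch below. It rests on one interpretation: circle size is the number of children who like both heroes, and colour is how many of those children also like at least one further hero. `bubble_matrix` is a made-up name, and the data is the three-child toy example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np


def bubble_matrix(data, names, max_area=800):
    data = np.asarray(data, dtype=int)
    n = data.shape[1]
    both = data.T @ data              # pairwise overlap counts
    per_child = data.sum(axis=1)      # number of heroes each child likes
    extra = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            in_pair = (data[:, i] == 1) & (data[:, j] == 1)
            heroes_in_pair = 1 if i == j else 2
            # children in this overlap who like something beyond the pair
            extra[i, j] = int((per_child[in_pair] > heroes_in_pair).sum())

    xs, ys = np.meshgrid(np.arange(n), np.arange(n))
    sizes = max_area * both / max(both.max(), 1)
    fig, ax = plt.subplots(figsize=(6, 5))
    points = ax.scatter(xs.ravel(), ys.ravel(), s=sizes.ravel(),
                        c=extra.ravel(), cmap="viridis")
    ax.set_xticks(range(n))
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticks(range(n))
    ax.set_yticklabels(names)
    fig.colorbar(points, ax=ax, label="children in the overlap who like other heroes too")
    fig.tight_layout()
    return fig, both, extra


names = ["Batman", "Superman", "Wonder Woman", "Flash"]
data = [
    [1, 0, 0, 0],  # Adam Agner
    [1, 1, 0, 0],  # Brian Agosto
    [0, 1, 1, 1],  # Cian Ailiff
]
fig, both, extra = bubble_matrix(data, names)
```

Calling `fig.savefig("overlaps.png")` or `plt.show()` afterwards would render the figure; with 50 heroes the figure size and `max_area` would probably need tuning.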

I'd ideally do this in matplotlib or by some other means within Python, but if I have to use MATLAB or another tool, I will consider that. Ideas or suggestions appreciated! Hopefully this was clear, thank you!

Excellent!! That is visually simple to interpret, and yet does correlate between all values, making it easy to notice any that stand out. I'll read up on all the links you've provided and attempt to implement this version then, as it seems the most intuitive and straightforward. Bounty well-earned! – Benny Lewis – 2017-06-13T16:01:29.263