Thought process behind data analysis

0

I am looking for books or tutorials that help you gain insight into the thought process behind data analysis.

Most of the books I've read were essentially documentation: the author shows you a function and some data to apply it to. They also show how to use graphs and how histograms, boxplots, etc. work, and they walk through popular libraries such as numpy and pandas.

I am interested in where people get their ideas about what to apply to a dataset, for example in Kaggle kernels on the Titanic dataset. Those people know which columns to plot against each other, when to plot histograms, when to plot density functions, and so on.

I am somewhat experienced in machine learning, where it's pretty obvious which algorithm can be applied to which situation. Data exploration, by contrast, seems like a very ambiguous exercise with many possible solutions and ideas.

Another way to put it: where do the ideas for data exploration come from?

dydokamil

Posted 2017-06-17T15:28:01.133

Reputation: 101

Answers

1

I guess it is more about "What would be the most informative (and most easily readable) way to visualize one specific problem?". With a specific question in mind, you can browse the example galleries of plotting libraries, for instance seaborn. You could also explore other kernels to see how people visualized the same thing. Some questions have fairly trivial solutions. Others can be visualized in many different ways, and what counts is how readable and informative your plot is.

In the case of Titanic, imagine three questions that you want to visualize.

  1. What if the survival rate depends somehow on age and/or sex?
    For this you might choose a density plot or histogram with two curves on the plot: one for survivors and one for non-survivors. You may also want to split the plot into two panels, one for female and one for male passengers.

  2. What if the survival rate depends on Pclass? Or maybe on Embarked as well? And maybe sex also plays an important role?
    I am a fan of factor plots, as you can put a lot of information into them and they are still understandable at first sight. Such a plot shows that 1st and 2nd class had a higher survival rate compared to 3rd class. (The next question can arise: for each port of embarkation, what is the relative proportion of each Pclass? What would be the better visualization?)

  3. What if the survival rate depends on Pclass and age? Additionally, you could look at the age distribution within each class.
    Violin plots might be good for that.

Jekaterina Kokatjuhha

Posted 2017-06-17T15:28:01.133

Reputation: 283