Prediction vs causation in an ML project

3

I am performing a classification task and was able to identify significant predictors (important features using Random Forest) that can help separate the classes or influence the outcome.

But I read online that prediction models are not causal models.

Let's say my prediction model says that Age is one of the significant factors that influence the outcome (death). How can I prove that Age is a cause of death?

I read that an intervention/change on the strong predictors of a model will not necessarily impact the outcome.

How can I find out the list of factors that really cause change in the outcome?

Currently what I do is run a RF model to identify the important features and communicate that these are the top 5 features that seem to influence the outcome.

How can I prove that it is causation and not just correlation?

The Great

Posted 2019-12-26T06:32:27.160

Reputation: 1 539

Answers

3

ML questions are mainly concerned with prediction, but we can extrapolate (in a certain sense) to causality.

First of all, these are two different modelling approaches:

Causal inference is focused on knowing what happens to $Y$ when you change $X$. Prediction is focused on knowing the next $Y$ given $X$.
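To make the distinction concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the hidden factor `z`, the coefficients, the helper `slope`): a hidden variable drives both $X$ and $Y$, so $X$ predicts $Y$ well in observational data, yet setting $X$ by hand does nothing to $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: a hidden factor z drives both x and y.
z = rng.normal(size=n)
x_obs = z + rng.normal(scale=0.5, size=n)        # x is caused by z
y_obs = 2.0 * z + rng.normal(scale=0.5, size=n)  # y is caused by z only, not by x


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Prediction: in observational data, x is a strong predictor of y.
slope_observed = slope(x_obs, y_obs)

# Intervention: we set x ourselves, independently of z.
# y is generated exactly as before, because x never entered its equation.
x_do = rng.normal(size=n)
y_do = 2.0 * z + rng.normal(scale=0.5, size=n)
slope_intervened = slope(x_do, y_do)

print(f"slope in observational data: {slope_observed:.2f}")
print(f"slope after intervening on x: {slope_intervened:.2f}")
```

Under these assumed coefficients the observational slope is about 1.6 in expectation, while the slope under intervention is close to zero. A feature can therefore rank high in a predictive model without being a lever you can pull.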

Some of the current causal approaches are randomised testing, do-calculus, etc.

So how can we extrapolate from standard predictive ML models to causal inference?

Counterfactual explanations: we can simulate counterfactuals for the predictions of a machine learning model by changing the feature values of an instance before making the prediction and analyzing how the prediction changes. Read more about it here; there is also a Python library called alibi that implements it.
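The "change a feature value and look at the prediction" idea can be sketched by hand as below. This is a manual what-if check, not alibi's counterfactual search; the toy data, the feature names `age` and `marker`, and the helper `what_if` are all hypothetical. Note that the result describes how the model responds, which is not automatically how the real world would respond.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Hypothetical toy data: the outcome depends on age and on a second feature.
age = rng.uniform(20, 90, size=n)
marker = rng.normal(size=n)
logit = 0.08 * (age - 55) + 1.0 * marker
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([age, marker])

model = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=20, random_state=0
)
model.fit(X, y)


def what_if(model, instance, feature_idx, new_value):
    """Return predicted P(class 1) before and after changing one feature."""
    original = np.asarray(instance, dtype=float).reshape(1, -1)
    changed = original.copy()
    changed[0, feature_idx] = new_value
    p_before = model.predict_proba(original)[0, 1]
    p_after = model.predict_proba(changed)[0, 1]
    return p_before, p_after


# Same instance, but with age changed from 80 to 40 before predicting.
p_before, p_after = what_if(model, [80.0, 0.0], feature_idx=0, new_value=40.0)
print(f"P(outcome) at age 80: {p_before:.2f}, at age 40: {p_after:.2f}")
```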

Noah Weber

Posted 2019-12-26T06:32:27.160

Reputation: 4 932

Thanks for the response. Upvoted. Can PFI help us with causal inference, as it is also about feature values (shuffling)? – The Great – 2019-12-26T10:30:15.137

sure, yes it can – Noah Weber – 2019-12-26T10:38:38.267

Hi @Noah Weber, can you help me with this post? https://datascience.stackexchange.com/questions/69384/how-to-trust-the-labels-generated-using-ml-models – The Great – 2020-03-09T09:08:50.773

1

I would agree with your assessment of things. ML is much more concerned with making predictions, whereas a discipline like econometrics or statistics strives to find causation between variables.

ML excels at finding patterns in data and using these patterns for classification and prediction. Econometrics shares machine learners' interest in classification and prediction, as well as statisticians' concern with sample representativeness and sampling variance. As an aside, the discipline of statistics was born out of a desire to work with data efficiently, primarily by drawing relatively small samples from larger populations of interest instead of collecting data on everyone. As you know, people in the ML world will try to consume as much data as possible, whereas people in the statistics world take samples of populations, with the understanding that a small subset is representative of the entire population and that this is good enough for the analysis being done.

Now, back to your question about proving causation. Correlation is a statistical measure that tells us how strongly a pair of variables are linearly related and change together. Causation goes a step further than correlation: it says that a change in the value of one variable will cause a change in the value of another variable, which means one variable makes the other happen. This is referred to as cause and effect. Essentially, we can infer causality from a well-designed randomized controlled experiment. Random assignment breaks any link between the treatment and the other factors that affect the outcome, so a difference between the groups can be attributed to the treatment.

Randomized and controlled don't intuitively belong together, but it's a complex dynamic. Think of the predator-prey model. As the number of prey increases, more predators can exist, but too many predators will decimate the prey population, so the number of predators will diminish, and then the number of prey will increase. This cycle continues over and over.
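As a rough illustration of why the randomized controlled experiment mentioned above matters, here is a small simulation. All numbers are assumptions chosen for the example: a hidden `severity` variable, a `true_effect` of -1 (the treatment lowers the outcome score), and an observational setting in which sicker patients are more likely to be treated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_effect = -1.0  # assumed: treatment lowers the outcome score by 1

severity = rng.normal(size=n)  # hidden confounder, never shown to the analyst


def outcome(treated, severity, rng):
    """Outcome score: driven by severity, shifted by treatment, plus noise."""
    noise = rng.normal(size=len(severity))
    return 2.0 * severity + true_effect * treated + noise


def diff_in_means(y, treated):
    """Mean outcome of the treated group minus mean of the untreated group."""
    return y[treated == 1].mean() - y[treated == 0].mean()


# Observational data: sicker patients are more likely to receive treatment.
p_treat = 1 / (1 + np.exp(-2.0 * severity))
treated_obs = (rng.uniform(size=n) < p_treat).astype(int)
naive_estimate = diff_in_means(outcome(treated_obs, severity, rng), treated_obs)

# Randomized experiment: a coin flip decides who is treated.
treated_rct = rng.integers(0, 2, size=n)
rct_estimate = diff_in_means(outcome(treated_rct, severity, rng), treated_rct)

print(f"true effect:           {true_effect:.2f}")
print(f"observational estimate: {naive_estimate:.2f}")
print(f"randomized estimate:    {rct_estimate:.2f}")
```

In this setup the observational comparison even gets the sign wrong, because treated patients were sicker to begin with, while the randomized comparison lands close to the assumed true effect.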

I did a quick Google search and came up with a couple of links, which seem to make a decent comparison between the two disciplines.

https://towardsdatascience.com/why-correlation-does-not-imply-causation-5b99790df07e

https://medium.com/causal-data-science/if-correlation-doesnt-imply-causation-then-what-does-c74f20d26438

Hope that helps!!!

ASH

Posted 2019-12-26T06:32:27.160

Reputation: 341

Hi, Thanks for your response. Upvoted – The Great – 2020-02-29T03:06:02.870