## Determine highly correlated segments

5

Given a dataset that has a binary (0/1) dependent variable and a large collection of continuous and categorical independent variables, is there a process, and ideally an R package, that can find combinations/subsets/segments of the IVs that are highly correlated with the DV?

Simple example: DV: college education (0/1), and IVs: age (20 to 120), income (0 to 1 million), race (white, black, hispanic etc), gender (0/1), state, etc.

The goal is to find correlations for combinations and subsets of IVs (e.g. women between 30 and 50 with incomes over 100k are highly positively correlated with the DV), and then to compare those combinations (e.g. to find out whether women between 30 and 40 with incomes over 100k have a higher correlation than women between 40 and 50 with incomes over 100k).

Maybe I tend to always want to keep it simple, but why not just use Logistic Regression with some interaction terms? – None – 2015-03-01T03:58:26.763
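The suggestion in this comment could look roughly like the sketch below. It is only an illustration on simulated data: the column names (`age`, `income_k`, `female`, `college`) and the effect sizes are assumptions, not anything from the asker's dataset. It uses base R's `glm` with an `income_k * female` interaction.

```r
set.seed(3)
n <- 5000
# Simulated data; all names and coefficients are hypothetical.
d <- data.frame(
  age      = sample(20:90, n, replace = TRUE),
  income_k = runif(n, 0, 200),          # income in thousands
  female   = rbinom(n, 1, 0.5)
)
# Assumed "true" model: income raises the odds more strongly for women.
eta <- -1.5 + 0.005 * d$income_k + 0.015 * d$income_k * d$female
d$college <- rbinom(n, 1, plogis(eta))

# Logistic regression with an income-by-gender interaction term.
fit <- glm(college ~ age + income_k * female, data = d, family = binomial)
print(summary(fit)$coefficients)

# Compare predicted probabilities for two segments:
# women vs. men, both aged 40 with an income of 150k.
new <- data.frame(age = c(40, 40), income_k = c(150, 150), female = c(1, 0))
pred <- predict(fit, newdata = new, type = "response")
print(pred)
```

A model like this compares segments that you specify in advance; it does not discover the segment boundaries on its own.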

## Answers

4

The idea you have in mind is called "feature selection" or "attribute selection". The fact that you have a categorical dependent variable and continuous independent variables is mostly irrelevant because you're expected to use an algorithm or statistical method that is suitable for your requirements.

As for feature selection methods, there are several options:

1. Find the subset of features that achieves the best performance (usually measured by cross-validation)

2. Find the subset of features that correlate strongly with the target variable and weakly with each other (although other criteria can be used)

3. Use an algorithm that includes a built-in feature selection mechanism (e.g. decision trees, hierarchical Bayesian methods)

Furthermore, there are several search methods that aim for a good compromise between a thorough search and a reasonable execution time (e.g. best-first search, steepest-ascent search).

This question in particular provides very good suggestions for R packages.
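As a rough illustration of option 3, a decision tree splits the IVs into segments on its own, which is close to what the question asks for. The sketch below is an assumption-laden example, not a recommendation of a specific workflow: the data are simulated, the column names and the planted segment are made up, and the `cp` and `minbucket` values are arbitrary. It uses `rpart`, a recommended package bundled with standard R installations, and fits a regression tree to the 0/1 outcome so that each leaf's mean is the share of ones in that segment.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(42)
n <- 5000
# Simulated data; every column name and effect size here is hypothetical.
d <- data.frame(
  age    = sample(20:90, n, replace = TRUE),
  income = runif(n, 0, 200000),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
# Plant one high-rate segment: women aged 30 to 50 with income over 100k.
planted <- d$gender == "female" & d$age >= 30 & d$age <= 50 & d$income > 100000
d$college <- rbinom(n, 1, ifelse(planted, 0.9, 0.2))

# Regression tree on the 0/1 outcome: each leaf's mean is the share of ones.
fit <- rpart(college ~ age + income + gender, data = d, method = "anova",
             control = rpart.control(cp = 0.005, minbucket = 50))

# Summarise the leaves (segments) by size and DV rate.
leaf <- fit$where
segments <- data.frame(
  n            = as.vector(tapply(d$college, leaf, length)),
  college_rate = as.vector(tapply(d$college, leaf, mean))
)
segments <- segments[order(-segments$college_rate), ]
print(fit)       # the printed splits describe each segment
print(segments)
```

The printed tree lists the split rules (for example on `age`, `income` and `gender`) that define each leaf, and the `segments` table lets you compare leaves by their DV rate.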

2

I would suggest considering latent variable modeling (LVM) or the closely related structural equation modeling (SEM) as an approach to this problem. This approach is based on recognizing and analyzing latent variables: constructs (factors) that are not measured directly, but through sets of measured variables (indicators). Note that the closely related term "latent feature" is frequently used in the machine learning domain. It seems to me that latent variables resemble what you call "combinations/subsets/segments of the IVs".

Given a hypothesized latent structure of factors, usually based on theory or domain knowledge, LVM or SEM can automatically confirm or reject those hypotheses. This is done by using a combination of exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) (see my answer). While EFA is frequently performed on its own (and maybe that's enough for your purposes), doing it along with CFA represents a large part of the LVM/SEM methodology, which is usually completed by path analysis, which is concerned with relationships between latent variables.

The R ecosystem offers a variety of packages for performing LVM/SEM in its entirety or for performing EFA, CFA and path analysis. The most popular ones for EFA are psych, GPArotation and Hmisc. The most popular packages for CFA, path analysis and LVM are sem (the first R package for SEM), lavaan, OpenMx, semPLS, plspm. Various supplementary SEM-focused packages are also available.

1

I am no expert in this particular case, but after a bit of research, it seems that the measure you want to construct is called the "point-biserial correlation coefficient", i.e. the correlation between a continuous variable $X$ and a dichotomous variable $Y$, e.g. $Y \in \{0, 1\}$. See a related question on Cross Validated SE.

And yes, there is an R package for that :)
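For what it's worth, no extra package is strictly needed: the point-biserial coefficient equals the ordinary Pearson correlation when the binary variable is coded 0/1, so base R's `cor` computes it. The sketch below checks this against the textbook formula on simulated data; the variable names (`income`, `college`) and the numbers are assumptions chosen for illustration.

```r
set.seed(1)
n <- 1000
# Simulated data; names and effect sizes are hypothetical.
college <- rbinom(n, 1, 0.4)                                      # binary DV
income  <- rnorm(n, mean = 50000 + 20000 * college, sd = 15000)  # continuous IV

# Point-biserial correlation = Pearson correlation with the DV coded 0/1.
r_pb <- cor(income, college)

# The same value from the textbook formula
# r = (M1 - M0) / s_n * sqrt(p * (1 - p)), with s_n the population SD.
m1  <- mean(income[college == 1])
m0  <- mean(income[college == 0])
p   <- mean(college)
s_n <- sqrt(mean((income - mean(income))^2))
r_formula <- (m1 - m0) / s_n * sqrt(p * (1 - p))

print(c(cor = r_pb, formula = r_formula))
print(cor.test(income, college))  # adds a significance test and an interval
```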

I have done point-biserial correlations before, but I don't think this works for what I want to achieve. As you wrote - it calculates correlation between variables - in their entirety. I am looking for correlation for parts/segments of a variable, not all values, e.g. when continuous variable X > 3000, not for all values of variable X. Basically something that automatically, or via some machine learning algorithm, "discovers"/data-mines different combinations of variable segments that highly correlate to the DV. Maybe I can try to make my question clearer? – Bryan – 2015-01-14T20:30:21.177

0

I have recently been working on a similar analysis. I wrote some functions to test all possible combinations of variables, but they are specific to my own data set, which is certainly different from yours.

This is a fairly small job, so I can't point to any package that handles such tests. You have already worked out some combinations yourself, so I would keep going and write a function of your own; it could probably be done in a couple of days.

Here is a link that partially answers your question and includes code: https://stats.stackexchange.com/questions/4040/r-compute-correlation-by-group
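A hand-rolled version of this idea might look like the sketch below. It is only one possible approach under stated assumptions: the data are simulated, the column names and band boundaries are made up, and the bins are fixed in advance rather than discovered. Each segment is a combination of gender, age band and income band, and for each one the code computes the DV rate and the correlation between segment membership (0/1) and the DV, so segments can be ranked against each other.

```r
set.seed(7)
n <- 20000
# Simulated data; all names, bands and effect sizes are hypothetical.
d <- data.frame(
  age    = sample(20:90, n, replace = TRUE),
  income = runif(n, 0, 200000),
  gender = sample(c("female", "male"), n, replace = TRUE)
)
# Plant one high-rate segment: women aged 30 to 39 with income over 100k.
planted <- d$gender == "female" & d$age >= 30 & d$age < 40 & d$income > 100000
d$college <- rbinom(n, 1, ifelse(planted, 0.8, 0.3))

# Bin the continuous IVs and build one segment label per combination.
d$age_band    <- cut(d$age, breaks = c(20, 30, 40, 50, 65, Inf), right = FALSE)
d$income_band <- cut(d$income, breaks = c(0, 50000, 100000, Inf), right = FALSE)
d$segment     <- interaction(d$gender, d$age_band, d$income_band, drop = TRUE)

# For each segment: size, DV rate, and correlation of membership with the DV.
res <- do.call(rbind, lapply(levels(d$segment), function(s) {
  member <- as.integer(d$segment == s)
  data.frame(segment = s,
             n       = sum(member),
             rate    = mean(d$college[member == 1]),
             phi     = cor(member, d$college))
}))
res <- res[order(-res$phi), ]
print(head(res, 5))
```

With many IVs the number of combinations grows quickly, so small segments and multiple comparisons become a concern; treat high-ranking segments as candidates to check on held-out data.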