I have a dataset of approximately 48,000 rows, each one a click on an article; some of these clicks were also comments. For each article I have the country and subject of the article and the name of the person who clicked on it. These are all categorical variables, each with its own set of levels. I want to determine which of these categorical variables has the highest association or correlation with comments, i.e. based on the dataset, which variable (country, subject, or person) best predicts whether or not there will be a comment on an article.

Which statistical method would you use to determine correlation and how would you implement it in R?

Possible duplicate of Determine highly correlated segments

– Spacedman – 2016-12-29T08:29:23.527

Is the `comments` variable unstructured text or a binary variable (whether the person has commented or not)? – Ricardo Cruz – 2016-12-29T10:50:22.173

I am guessing it is a binary variable. Why not just use a logistic regression to find which variable is associated with your target variable? Then look at the coefficients. Use loss=L1 with high regularization to ensure the coefficients are non-collinear. – Ricardo Cruz – 2016-12-29T10:57:39.200

Thanks Ricardo, the comment is a binary value (the DV). With the glm function or some other package in R, how do I set loss=L1 with high regularization for my dataset? – Jonathan Dine – 2016-12-29T16:44:35.887
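
For what it's worth, base R's `glm` has no penalty argument, so the L1-penalised logistic regression suggested above is usually fitted with the `glmnet` package instead (`alpha = 1` selects the L1/lasso penalty, and `lambda` controls how strong it is). Below is a minimal sketch under stated assumptions: the data frame `clicks`, its column names `country`, `subject`, `person` and `comment`, and the simulated values are all hypothetical stand-ins, not the asker's data.

```r
# Sketch only: the data frame, its column names and the simulated values
# are assumptions for illustration, not the asker's dataset.
library(glmnet)

set.seed(1)
n <- 2000
clicks <- data.frame(
  country = factor(sample(c("US", "UK", "DE"), n, replace = TRUE)),
  subject = factor(sample(c("politics", "sports", "tech"), n, replace = TRUE)),
  person  = factor(sample(paste0("user", 1:20), n, replace = TRUE))
)
# Toy binary outcome: comments are made more likely on "politics" articles.
p <- ifelse(clicks$subject == "politics", 0.4, 0.1)
clicks$comment <- rbinom(n, 1, p)

# glmnet needs a numeric matrix, so expand the factors into dummy columns
# and drop the intercept column (glmnet fits its own intercept).
x <- model.matrix(comment ~ country + subject + person, data = clicks)[, -1]
y <- clicks$comment

# alpha = 1 selects the L1 (lasso) penalty; cv.glmnet chooses the penalty
# strength lambda by cross-validation.
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# lambda.1se is the more heavily regularised of the two standard choices.
coefs <- coef(fit, s = "lambda.1se")
coef_vec <- as.matrix(coefs)[, 1]

# Dummy columns whose coefficients survived the penalty.
print(names(coef_vec)[coef_vec != 0])
```

Each surviving coefficient belongs to one level of one factor, so to compare `country`, `subject` and `person` as whole variables you would look at which factor the non-zero dummy columns come from. Note that `model.matrix` drops one reference level per factor, so coefficients are relative to that level. If you want heavier regularisation than cross-validation picks, you can pass a larger value to `s` in `coef()`.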