How can I determine the relationship between spam and weekdays?


I am trying to check whether there is a correlation between spam emails and weekdays. My dataset looks as follows:

  Spam? Day
0   1.0 Saturday
1   1.0 Saturday
3   0.0 Saturday
5   1.0 Saturday
7   0.0 Friday
... ... ...
346 0.0 Friday
348 1.0 Friday
361 0.0 Saturday
383 1.0 Thursday
387 1.0 Friday

where 1 means spam and 0 not spam.

I have tried the following:

corr = numpy.corrcoef(df['Spam?'], df['Day'])

I do not know how to explain a possible relationship between these two variables, or whether a plot could help to visualise the data and the relationship.


Posted 2020-09-11T18:30:42.457


Do you mean to see if there is a difference between the volume of spam received on weekdays compared to the weekend? – Dave – 2020-09-11T19:58:29.010

Yes, I am trying to analyse some aspects of spam sending and to understand what could be interesting to investigate. So I wanted to determine whether there is a correlation between the volume of spam received on weekdays compared to the weekend. Other suggestions would be greatly appreciated. – LdM – 2020-09-11T22:03:28.677



(started as a comment but it turned out to be longer than expected)

With a dataset like this, a simple barplot could be very insightful: the days of the week on the X axis, the frequency on the Y axis, and two bars per day (spam/not spam, in different colors). A slightly more advanced version: two boxplots of the daily counts, one for weekdays and the other for weekends. A boxplot is arguably overkill for only 5 (Mon-Fri) and 2 (Sat-Sun) values, but it's easy to do and shows the big picture.
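The grouped barplot described above could be sketched roughly as follows. This is only an illustration: the DataFrame is rebuilt from the ten rows visible in the question (not the full dataset), and the names `day_order` and `counts` are made up for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: only the ten rows shown in the question, not the full dataset.
df = pd.DataFrame({
    "Spam?": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "Day": ["Saturday", "Saturday", "Saturday", "Saturday", "Friday",
            "Friday", "Friday", "Saturday", "Thursday", "Friday"],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count spam / not spam per day, then put the days in calendar order.
counts = pd.crosstab(df["Day"], df["Spam?"])
counts = counts.rename(columns={0.0: "not spam", 1.0: "spam"})
counts = counts.reindex(day_order, fill_value=0)

# One pair of bars (not spam / spam) per day of the week.
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Day of week")
ax.set_ylabel("Number of emails")
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session.
```

Days with no emails in the sample appear as empty slots, which makes gaps in the data visible rather than hiding them.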

To test whether any difference (e.g. weekdays vs weekends) is significant, I think this is a good case for a chi-square test.
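A minimal sketch of such a test with `scipy.stats.chi2_contingency`, assuming the weekday/weekend split mentioned above. The data are again just the ten rows visible in the question, which is far too few for the test to be reliable (the expected counts are below 5); the point is only to show the mechanics. The names `is_weekend` and `table` are made up for the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumption: only the ten rows shown in the question, not the full dataset.
df = pd.DataFrame({
    "Spam?": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "Day": ["Saturday", "Saturday", "Saturday", "Saturday", "Friday",
            "Friday", "Friday", "Saturday", "Thursday", "Friday"],
})

# Collapse the seven days into weekday (False) vs weekend (True).
is_weekend = df["Day"].isin(["Saturday", "Sunday"]).rename("weekend")

# 2x2 contingency table: rows = weekday/weekend, columns = not spam/spam.
table = pd.crosstab(is_weekend, df["Spam?"])

# Null hypothesis: being spam is independent of weekday vs weekend.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would suggest that the spam proportion differs between weekdays and weekends. Passing `pd.crosstab(df["Day"], df["Spam?"])` instead tests all seven days at once.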


Posted 2020-09-11T18:30:42.457



numpy.corrcoef gives you the Pearson correlation, but your features are categorical.
You should calculate Cramér's V instead.

You can find details and code in this answer on DS.SE, since the two questions are quite similar.
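For reference, Cramér's V is defined as V = sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-square statistic of the r x c contingency table and n is the number of observations. A rough sketch, not taken from the linked answer: the helper name `cramers_v` is hypothetical, the data are only the ten rows visible in the question, and both variables are assumed to have at least two distinct values.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical Series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # correction=False: use the plain chi-square statistic, without Yates' correction.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Assumption: only the ten rows shown in the question, not the full dataset.
df = pd.DataFrame({
    "Spam?": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "Day": ["Saturday", "Saturday", "Saturday", "Saturday", "Friday",
            "Friday", "Friday", "Saturday", "Thursday", "Friday"],
})

v = cramers_v(df["Day"], df["Spam?"])
print(f"Cramér's V = {v:.3f}")
```

Unlike Pearson's coefficient, V has no sign: it only measures the strength of the association, not its direction.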

On the plot
What Erwan has suggested seems good.
Also, try a line plot of the spam/total ratio against the day of the week (i.e. standardizing for total volume), since a single figure is easier to comprehend.
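One possible way to draw that ratio plot is sketched below. Because `Spam?` is coded 0/1, its mean per day is exactly the spam/total ratio. The data are only the ten rows visible in the question, and the names `day_order` and `ratio` are made up for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: only the ten rows shown in the question, not the full dataset.
df = pd.DataFrame({
    "Spam?": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "Day": ["Saturday", "Saturday", "Saturday", "Saturday", "Friday",
            "Friday", "Friday", "Saturday", "Thursday", "Friday"],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Mean of a 0/1 column = share of spam among all emails received that day.
# Days without any email in the sample become NaN after reindexing.
ratio = df.groupby("Day")["Spam?"].mean().reindex(day_order)

fig, ax = plt.subplots()
ax.plot(range(len(day_order)), ratio.to_numpy(), marker="o")
ax.set_xticks(range(len(day_order)))
ax.set_xticklabels(day_order, rotation=45)
ax.set_ylim(0, 1.05)
ax.set_ylabel("Spam / total emails")
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session.
```

With so few rows the ratios are very noisy (Thursday rests on a single email), so on the real data it is worth checking the per-day totals next to the ratios.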


Posted 2020-09-11T18:30:42.457


thanks for your answer 10xAl – LdM – 2020-09-13T02:10:18.047