## Data processing, correlation calculation


I have product purchase count data which looks like this:

user item1 item2
a     2     4
b     1     3
c     5     6
...   ...   ...


The data is imported into Python using numpy.genfromtxt. Now I want to process it to get the correlation between the item1 purchase amount and the item2 purchase amount. Basically, for each value x of item1, I want to find all the users who bought item1 in quantity x, then average item2 over those same users. What is the best way to do this? I can do it with for loops, but I thought there might be something more efficient. Thanks!

Use one of Pandas' built-in functions: http://pandas.pydata.org/pandas-docs/stable/computation.html#correlation

– Mieszko – 2014-10-08T22:12:31.823


Pandas is the best thing since sliced bread (for data science, at least).

An example:

import pandas as pd

In [23]: df
Out[23]:
  user  item1  item2
0    a      2      4
1    b      1      3
2    c      5      6

In [24]: df.columns
Out[24]: Index([u'user ', u'item1 ', u'item2'], dtype='object')

In [25]: df.corr()
Out[25]:
          item1     item2
item1  1.000000  0.995871
item2  0.995871  1.000000

In [26]: df.cov()
Out[26]:
          item1     item2
item1  4.333333  3.166667
item2  3.166667  2.333333


Bingo!
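
For the per-quantity average the question describes (for each value of item1, the mean of item2 over the users who bought that quantity), a groupby may be a closer fit than corr(). Here is a minimal sketch, assuming the data is already in a DataFrame with the column names from the question. The variable names `numeric`, `corr` and `mean_item2_by_item1` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; column names follow its header row.
df = pd.DataFrame({
    "user": ["a", "b", "c"],
    "item1": [2, 1, 5],
    "item2": [4, 3, 6],
})

# Select only the numeric columns before corr(), because "user" holds strings.
numeric = df[["item1", "item2"]]
corr = numeric.corr()
print(corr)

# For each distinct item1 quantity, average item2 over the users
# who bought item1 in exactly that quantity.
mean_item2_by_item1 = df.groupby("item1")["item2"].mean()
print(mean_item2_by_item1)
```

With only three users every group contains a single row, so each average is just that user's item2 value. On the full data set, each item1 quantity would be averaged over all matching users, with no explicit for loop.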