data processing, correlation calculation



I have product purchase count data which looks like this:

user item1 item2
   a     2     4
   b     1     3
   c     5     6
   ...   ...   ...

The data are imported into Python using numpy.genfromtxt. Now I want to process them to get the relationship between the item1 purchase amount and the item2 purchase amount: for each value x of item1, I want to find all the users who bought item1 in quantity x, then average item2 over those same users. What is the best way to do this? I can do it with for loops, but I thought there might be something more efficient. Thanks!


Posted 2014-10-08T10:42:41.833

Reputation: 123

Use one of Pandas' built-in functions:

– Mieszko – 2014-10-08T22:12:31.823



Pandas is the best thing since sliced bread (for data science, at least).

An example:

import pandas as pd
In [22]: df = pd.read_csv('yourexample.csv')

In [23]: df
   user   item1   item2
0     a        2      4
1     b        1      3
2     c        5      6

In [24]: df.columns
Out[24]: Index([u'user ', u'item1 ', u'item2'], dtype='object')

In [25]: df.corr()
          item1      item2
item1   1.000000  0.995871
item2   0.995871  1.000000

In [26]: df.cov()
          item1      item2
item1   4.333333  3.166667
item2   3.166667  2.333333
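Note that corr() and cov() give one summary number per pair of columns. For the per-value average described in the question (the mean of item2 over all users who bought item1 in quantity x), a groupby is probably closer to what is wanted. A minimal sketch, assuming a whitespace-separated file with the columns shown above; the inline data and the extra user d are made up for illustration so that one item1 value has more than one user:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; user "d" is invented so that
# item1 == 2 is shared by two users.
raw = io.StringIO(
    "user item1 item2\n"
    "a 2 4\n"
    "b 1 3\n"
    "c 5 6\n"
    "d 2 6\n"
)

# Split on any run of whitespace, then strip stray spaces from the
# column names (the df.columns output above shows trailing spaces).
df = pd.read_csv(raw, sep=r"\s+")
df.columns = df.columns.str.strip()

# For each distinct item1 quantity, average item2 over the users
# who bought item1 in that quantity.
avg_item2 = df.groupby("item1")["item2"].mean()
print(avg_item2)
```

The result is a Series indexed by the distinct item1 values, so avg_item2.loc[2] is the average item2 count among users who bought two of item1. This avoids the explicit for loop over item1 values.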


Adriano Almeida

Posted 2014-10-08T10:42:41.833

Reputation: 146