How to summarize data by group in R?

160

101

I have R data frame like this:

        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...

I need to get data frame in the following form:

group mean     sd
1     34.5     5.6
2     32.3     4.2
...

Group number may vary, but their names and quantity could be obtained by calling levels(factor(data$group))

What manipulations should be done with the data to get the result?

Yuriy Petrovskiy

Posted 2011-03-13T12:02:38.533

Reputation: 2 017

Question was closed 2015-09-11T23:23:19.400

the commas in the result data frame mean something special, or is it just the decimal point? – mpiktas – 2011-03-13T12:46:04.723

@mpiktas Thank you for noting. Corrected. These were locale issues (I am russian) - we we use comma for decimal separation. – Yuriy Petrovskiy – 2011-03-13T12:59:05.770

3

I suspected that. All of the Europe uses comma except the British.

– mpiktas – 2011-03-13T13:04:26.597

4Despite not being British, I prefer dot for decimal separator. – Roman Luštrik – 2011-03-14T12:17:54.077

1

See aggregate, tapply, and then http://stackoverflow.com for any subsequent coding questions of this type.

– conjugateprior – 2015-09-11T23:26:40.743

Answers

126

Here is the plyr one line variant using ddply:

dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))

Here is another one line variant using new package data.table.

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)
dt[,list(mean=mean(age),sd=sd(age)),by=group]

This one is faster, though this is noticeable only on table with 100k rows. Timings on my Macbook Pro with 2.53 Ghz Core 2 Duo processor and R 2.11.1:

> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
utilisateur     système      écoulé 
      0.513       0.180       0.692 
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé 
      0.087       0.018       0.103 

Further savings are possible if we use setkey:

> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé 
      0.040       0.007       0.048 

mpiktas

Posted 2011-03-13T12:02:38.533

Reputation: 27 571

1@chl, it gave me a chance to try out this new data.table package. It looks really promising. – mpiktas – 2011-03-15T12:54:27.890

Simple typo: I think you meant dt &lt;- data.table(dtf) instead of dt &lt;- data.table(dt) in the second code block. That way, you are creating the data table from a data frame instead of from the dt function from the stats package. I tried editing it, but I cannot do edits under six characters. – Christopher Bottoms – 2014-10-24T18:50:57.693

In my (not humble in this case) opinion data.table is the best way to aggregate data and this answer is great, but still only scratches the surface. Aside from being syntactically superior, it's also extremely flexible and has many advanced features that involve joins and internal mechanics. Check out the FAQ, github page, or course for more info. – geneorama – 2014-10-29T03:47:57.217

6+6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds. – atomicules – 2011-09-22T15:22:22.837

84

One possibility is to use the aggregate function. For instance,

aggregate(data$age, by=list(data$group), FUN=mean)[2]

gives you the second column of the desired result.

ocram

Posted 2011-03-13T12:02:38.533

Reputation: 14 118

1Don't link to your local help server :-) +1 but see my comments to @steffen's response. – chl – 2011-03-13T12:26:37.863

Done the thing by calling data.frame(group=levels(factor(data$group)),mean=(aggregate(data$age, by=list(data$group), FUN=mean)$x),sd=(aggregate(data$age, by=list(data$group), FUN=sd)$x)) but I am not shure it is correct way. I am not sure what will happen then the results of binded columns will be in different order (I think it is possible). What is your oppinion? – Yuriy Petrovskiy – 2011-03-13T12:46:40.693

8@Yuriy The rows should not be out of order, but here is a way to do it one call to aggregate(): aggregate(age ~ group, data=dat, FUN = function(x) c(M=mean(x), SD=sd(x))) – lockedoff – 2011-03-14T16:51:56.437

@lockedoff: Thank you for having completed my answer! – ocram – 2011-03-15T10:22:20.997

25

Since you are manipulating a data frame, the dplyr package is probably the faster way to do it.

library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
grp <- group_by(dt, group)
summarise(grp, mean=mean(age), sd=sd(age))

or equivalently, using the dplyr/magrittr pipe operator:

library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
group_by(dt, group) %>%
 summarise(mean=mean(age), sd=sd(age))

EDIT full use of pipe operator:

library(dplyr)
data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T)) %>%
  group_by(group) %>%
  summarise(mean=mean(age), sd=sd(age))

Bastiaan Quast

Posted 2011-03-13T12:02:38.533

Reputation: 392

3+1 for dplyr. It has made so many R tasks simple and many of these methods obsolete. – gregmacfarlane – 2014-07-15T13:02:03.113

The full use of pipe operator version does not work for me unfortunately – dagcilibili – 2017-09-10T05:41:03.380

did you load dplyr or magrittr ? – Bastiaan Quast – 2017-09-10T09:42:05.440

thank you very much @bquast for pointing out towards the solution, summarise function was called from plyr instead of dplyr which was causing the problem. – dagcilibili – 2017-09-10T21:31:20.207

11

Great, thanks bquast for adding the dplyr solution!

Turns out that then, dplyr and data.table are very close:

library(plyr)
library(dplyr)
library(data.table)
library(rbenchmark)

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)

setkey(dt,group)

a<-benchmark(ddply(dtf,~group,plyr:::summarise,mean=mean(age),sd=sd(age)),
         dt[,list(mean=mean(age),sd=sd(age)),by=group],
         group_by(dt, group) %>% summarise(mean=mean(age),sd=sd(age) ),
         group_by(dtf, group) %>% summarise(mean=mean(age),sd=sd(age) )
)

a[, c(1,3,4)]

data.table is still the fastest, by followed very closely by dplyr(), which interestingly seems faster on the data.frame than the data.table:

                                                              test elapsed relative
1 ddply(dtf, ~group, plyr:::summarise, mean = mean(age), sd = sd(age))   1.689    4.867
2               dt[, list(mean = mean(age), sd = sd(age)), by = group]   0.347    1.000
4   group_by(dtf, group) %>% summarise(mean = mean(age), sd = sd(age))   0.369    1.063
3    group_by(dt, group) %>% summarise(mean = mean(age), sd = sd(age))   0.580    1.671

Matifou

Posted 2011-03-13T12:02:38.533

Reputation: 1 717

At first I thought you needed to move setkey into the benchmark, but turns out that takes almost no time at all. – kasterma – 2014-10-16T14:27:44.393

9

I have found the function summaryBy in the doBy package to be the most convenient for this:

library(doBy)

age    = c(23.0883, 25.8344, 29.4648, 32.7858, 33.6372,
           34.935,  35.2115, 35.2115,  5.2115, 36.7803)
group  = c(1, 1, 1, 2, 1, 1, 2, 2, 2, 1)
dframe = data.frame(age=age, group=group)

summaryBy(age~group, data=dframe, FUN=c(mean, sd))
# 
#   group age.mean    age.sd
# 1     1 30.62333  5.415439
# 2     2 27.10507 14.640441

gung

Posted 2011-03-13T12:02:38.533

Reputation: 96 740

9

Use the sqldf package. This allows you now to use SQL to summarize the data. Once you load it you can write something like -

sqldf('  select group,avg(age) from data group by group  ')

KalEl

Posted 2011-03-13T12:02:38.533

Reputation: 473

9

In addition to existing suggestions, you might want to check out the describe.by function in the psych package.

It provides a number of descriptive statistics including the mean and standard deviation based on a grouping variable.

Jeromy Anglim

Posted 2011-03-13T12:02:38.533

Reputation: 30 906

its nice, but somewhat tricky to export to LaTeX IME. – richiemorrisroe – 2011-03-14T10:08:56.933

8

Edited: According to chl's suggestions

The function you are looking for is called "tapply" which applies a function per group specified by a factor.

# create some artificial data
set.seed(42)
groups <- 5

agedat <- c()
groupdat <- c()

for(group in 1:groups){
    agedat <- c(agedat,rnorm(100,mean=0 + group,1/group))
    groupdat <- c(groupdat,rep(group,100))
}
dat <- data.frame("age"=agedat,"group"=factor(groupdat))

# calculate mean and stdev age per group
res <- rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x)))))
names(res) <- paste("group",1:5)
row.names(res)[2:3] <- c("mean","sd")

I really suggest to work through a basic R tutorial explaining all commonly used datastructures and methods. Otherwise you will get stuck every inch during programming. See this question for a collection of free available resources.

steffen

Posted 2011-03-13T12:02:38.533

Reputation: 7 426

2

@steffen +1 but there's no need for a for loop here, you can contruct your dataframe inline, IMO. For the tapply call, use function(x) c(mean(x),sd(x))) and cbind the result as the OP asked for both statistics. Also, ddply from the plyr package could do this smoothly.

– chl – 2011-03-13T12:24:41.290

@steffen The problem is I need the exactly the table structure I described. There is no problem with getting means and sd. The problem is with stucture. – Yuriy Petrovskiy – 2011-03-13T12:35:28.860

@chl: Thank you for your comment, did not know about plyr :). I added cbind, but left the rest untouched. May another one take the credit, this answer shall remain as a less optimal example. – steffen – 2011-03-13T12:35:58.383

@Yuriy: Added cbind. If you already knew how to apply functions per group, you may reformulate your question (just for clarity ;)). – steffen – 2011-03-13T12:37:47.197

@steffen cbind("mean"=mperage,"stdev"=stperage) gives no 'group' column. Will be joining bycbind(group=levels(factor(data$group)),"mean"=mperage,"stdev"=stperage)` correct? – Yuriy Petrovskiy – 2011-03-13T12:51:09.647

@Yuriy: The row names are the group-names .... Nevertheless I added a separate column (now using data.frame) to show this. – steffen – 2011-03-13T12:51:20.330

@steffen Are you sure there will be no ordering issues when joining columns? – Yuriy Petrovskiy – 2011-03-13T12:53:52.140

@Yuriy, tapply orders the results by the order of the levels, so there ordering issues will not arise. @steffen, I upvoted the answer for the effort, before I knew plyr, I used to solve such problems in similar way. – mpiktas – 2011-03-13T13:24:59.323

@steffen Try to replace your last three lines of code with: t(rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x)))))) (+ rename the last two columns as mean and sd). – chl – 2011-03-13T15:36:16.773

6

Here is an example with the function aggregates() I did myself some time ago:

# simulates data
set.seed(666)
( dat <- data.frame(group=gl(3,6), level=factor(rep(c("A","B","C"), 6)), 
                    y=round(rnorm(18,10),1)) )

> dat
   group level    y
1      1     A 10.8
2      1     B 12.0
3      1     C  9.6
4      1     A 12.0
5      1     B  7.8
6      1     C 10.8
7      2     A  8.7
8      2     B  9.2
9      2     C  8.2
10     2     A 10.0
11     2     B 12.2
12     2     C  8.2
13     3     A 10.9
14     3     B  8.3
15     3     C 10.1
16     3     A  9.9
17     3     B 10.9
18     3     C 10.3

# aggregates() function
aggregates <- function(formula, data=NULL, FUNS){ 
    if(class(FUNS)=="list"){ 
        f <- function(x) sapply(FUNS, function(fun) fun(x)) 
    }else{f <- FUNS} 
    temp <- aggregate(formula, data, f) 
    out <- data.frame(temp[,-ncol(temp)], temp[,ncol(temp)]) 
    colnames(out)[1] <- colnames(temp)[1] 
return(out) 
} 

# example 
FUNS <- function(x) c(mean=round(mean(x),0), sd=round(sd(x), 0)) 
( ag <- aggregates(y~group:level, data=dat, FUNS=FUNS) ) 

It gives the following result:

> ag
  group level mean sd
1     1     A   11  1
2     2     A    9  1
3     3     A   10  1
4     1     B   10  3
5     2     B   11  2
6     3     B   10  2
7     1     C   10  1
8     2     C    8  0
9     3     C   10  0

Maybe you can get the same result starting from the R function split():

> with(dat, sapply( split(y, group:level), FUNS ) )
     1:A 1:B 1:C 2:A 2:B 2:C 3:A 3:B 3:C
mean  11  10  10   9  11   8  10  10  10
sd     1   3   1   1   2   0   1   2   0

Let me come back to the output of the aggregates function. You can transform it in a beautiful table using reshape(), xtabs() and ftable():

rag <- reshape(ag, varying=list(3:4), direction="long", v.names="y") 
rag$time <- factor(rag$time) 
ft <- ftable(xtabs(y~group+level+time, data=rag)) 
attributes(ft)$col.vars <- list(c("mean","sd")) 

This gives:

> ft 
             mean sd
group level         
1     A        11  1
      B        10  3
      C        10  1
2     A         9  1
      B        11  2
      C         8  0
3     A        10  1
      B        10  2
      C        10  0

Beautiful, isn't it? You can export this table to a pdf with the textplot() function of the gplots package.

See here for others' solutions.

Stéphane Laurent

Posted 2011-03-13T12:02:38.533

Reputation: 11 101