## How to summarize data by group in R?

160

101

I have R data frame like this:

``````        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...
``````

I need to get data frame in the following form:

``````group mean     sd
1     34.5     5.6
2     32.3     4.2
...
``````

Group number may vary, but their names and quantity could be obtained by calling `levels(factor(data\$group))`

What manipulations should be done with the data to get the result?

Question was closed 2015-09-11T23:23:19.400

the commas in the result data frame mean something special, or is it just the decimal point? – mpiktas – 2011-03-13T12:46:04.723

@mpiktas Thank you for noting. Corrected. These were locale issues (I am russian) - we we use comma for decimal separation. – Yuriy Petrovskiy – 2011-03-13T12:59:05.770

3

I suspected that. All of the Europe uses comma except the British.

– mpiktas – 2011-03-13T13:04:26.597

4Despite not being British, I prefer dot for decimal separator. – Roman Luštrik – 2011-03-14T12:17:54.077

1

See `aggregate`, `tapply`, and then http://stackoverflow.com for any subsequent coding questions of this type.

– conjugateprior – 2015-09-11T23:26:40.743

## Answers

126

Here is the plyr one line variant using ddply:

``````dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))
``````

Here is another one line variant using new package data.table.

``````dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)
dt[,list(mean=mean(age),sd=sd(age)),by=group]
``````

This one is faster, though this is noticeable only on table with 100k rows. Timings on my Macbook Pro with 2.53 Ghz Core 2 Duo processor and R 2.11.1:

``````> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
utilisateur     système      écoulé
0.513       0.180       0.692
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé
0.087       0.018       0.103
``````

Further savings are possible if we use `setkey`:

``````> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé
0.040       0.007       0.048
``````

1@chl, it gave me a chance to try out this new data.table package. It looks really promising. – mpiktas – 2011-03-15T12:54:27.890

Simple typo: I think you meant `dt &lt;- data.table(dtf)` instead of `dt &lt;- data.table(dt)` in the second code block. That way, you are creating the data table from a data frame instead of from the `dt` function from the `stats` package. I tried editing it, but I cannot do edits under six characters. – Christopher Bottoms – 2014-10-24T18:50:57.693

In my (not humble in this case) opinion `data.table` is the best way to aggregate data and this answer is great, but still only scratches the surface. Aside from being syntactically superior, it's also extremely flexible and has many advanced features that involve joins and internal mechanics. Check out the FAQ, github page, or course for more info. – geneorama – 2014-10-29T03:47:57.217

6+6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds. – atomicules – 2011-09-22T15:22:22.837

84

One possibility is to use the aggregate function. For instance,

``````aggregate(data\$age, by=list(data\$group), FUN=mean)[2]
``````

gives you the second column of the desired result.

1Don't link to your local help server :-) +1 but see my comments to @steffen's response. – chl – 2011-03-13T12:26:37.863

Done the thing by calling `data.frame(group=levels(factor(data\$group)),mean=(aggregate(data\$age, by=list(data\$group), FUN=mean)\$x),sd=(aggregate(data\$age, by=list(data\$group), FUN=sd)\$x))` but I am not shure it is correct way. I am not sure what will happen then the results of binded columns will be in different order (I think it is possible). What is your oppinion? – Yuriy Petrovskiy – 2011-03-13T12:46:40.693

8@Yuriy The rows should not be out of order, but here is a way to do it one call to `aggregate()`: `aggregate(age ~ group, data=dat, FUN = function(x) c(M=mean(x), SD=sd(x)))` – lockedoff – 2011-03-14T16:51:56.437

@lockedoff: Thank you for having completed my answer! – ocram – 2011-03-15T10:22:20.997

25

Since you are manipulating a data frame, the `dplyr` package is probably the faster way to do it.

``````library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
grp <- group_by(dt, group)
summarise(grp, mean=mean(age), sd=sd(age))
``````

or equivalently, using the `dplyr`/`magrittr` pipe operator:

``````library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
group_by(dt, group) %>%
summarise(mean=mean(age), sd=sd(age))
``````

EDIT full use of pipe operator:

``````library(dplyr)
data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T)) %>%
group_by(group) %>%
summarise(mean=mean(age), sd=sd(age))
``````

3+1 for `dplyr`. It has made so many R tasks simple and many of these methods obsolete. – gregmacfarlane – 2014-07-15T13:02:03.113

The full use of pipe operator version does not work for me unfortunately – dagcilibili – 2017-09-10T05:41:03.380

did you load dplyr or magrittr ? – Bastiaan Quast – 2017-09-10T09:42:05.440

thank you very much @bquast for pointing out towards the solution, summarise function was called from `plyr` instead of `dplyr` which was causing the problem. – dagcilibili – 2017-09-10T21:31:20.207

11

Great, thanks bquast for adding the dplyr solution!

Turns out that then, dplyr and data.table are very close:

``````library(plyr)
library(dplyr)
library(data.table)
library(rbenchmark)

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)

setkey(dt,group)

a<-benchmark(ddply(dtf,~group,plyr:::summarise,mean=mean(age),sd=sd(age)),
dt[,list(mean=mean(age),sd=sd(age)),by=group],
group_by(dt, group) %>% summarise(mean=mean(age),sd=sd(age) ),
group_by(dtf, group) %>% summarise(mean=mean(age),sd=sd(age) )
)

a[, c(1,3,4)]
``````

data.table is still the fastest, by followed very closely by dplyr(), which interestingly seems faster on the data.frame than the data.table:

``````                                                              test elapsed relative
1 ddply(dtf, ~group, plyr:::summarise, mean = mean(age), sd = sd(age))   1.689    4.867
2               dt[, list(mean = mean(age), sd = sd(age)), by = group]   0.347    1.000
4   group_by(dtf, group) %>% summarise(mean = mean(age), sd = sd(age))   0.369    1.063
3    group_by(dt, group) %>% summarise(mean = mean(age), sd = sd(age))   0.580    1.671
``````

At first I thought you needed to move setkey into the benchmark, but turns out that takes almost no time at all. – kasterma – 2014-10-16T14:27:44.393

9

I have found the function `summaryBy` in the doBy package to be the most convenient for this:

``````library(doBy)

age    = c(23.0883, 25.8344, 29.4648, 32.7858, 33.6372,
34.935,  35.2115, 35.2115,  5.2115, 36.7803)
group  = c(1, 1, 1, 2, 1, 1, 2, 2, 2, 1)
dframe = data.frame(age=age, group=group)

summaryBy(age~group, data=dframe, FUN=c(mean, sd))
#
#   group age.mean    age.sd
# 1     1 30.62333  5.415439
# 2     2 27.10507 14.640441
``````

9

Use the `sqldf` package. This allows you now to use SQL to summarize the data. Once you load it you can write something like -

``````sqldf('  select group,avg(age) from data group by group  ')
``````

9

In addition to existing suggestions, you might want to check out the `describe.by` function in the `psych` package.

It provides a number of descriptive statistics including the mean and standard deviation based on a grouping variable.

its nice, but somewhat tricky to export to LaTeX IME. – richiemorrisroe – 2011-03-14T10:08:56.933

8

Edited: According to chl's suggestions

The function you are looking for is called "tapply" which applies a function per group specified by a factor.

``````# create some artificial data
set.seed(42)
groups <- 5

agedat <- c()
groupdat <- c()

for(group in 1:groups){
agedat <- c(agedat,rnorm(100,mean=0 + group,1/group))
groupdat <- c(groupdat,rep(group,100))
}
dat <- data.frame("age"=agedat,"group"=factor(groupdat))

# calculate mean and stdev age per group
res <- rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x)))))
names(res) <- paste("group",1:5)
row.names(res)[2:3] <- c("mean","sd")
``````

I really suggest to work through a basic R tutorial explaining all commonly used datastructures and methods. Otherwise you will get stuck every inch during programming. See this question for a collection of free available resources.

2

@steffen +1 but there's no need for a `for` loop here, you can contruct your dataframe inline, IMO. For the `tapply` call, use `function(x) c(mean(x),sd(x)))` and `cbind` the result as the OP asked for both statistics. Also, `ddply` from the plyr package could do this smoothly.

– chl – 2011-03-13T12:24:41.290

@steffen The problem is I need the exactly the table structure I described. There is no problem with getting means and sd. The problem is with stucture. – Yuriy Petrovskiy – 2011-03-13T12:35:28.860

@chl: Thank you for your comment, did not know about plyr :). I added cbind, but left the rest untouched. May another one take the credit, this answer shall remain as a less optimal example. – steffen – 2011-03-13T12:35:58.383

@Yuriy: Added cbind. If you already knew how to apply functions per group, you may reformulate your question (just for clarity ;)). – steffen – 2011-03-13T12:37:47.197

@steffen `cbind("mean"=mperage,"stdev"=stperage) gives no 'group' column. Will be joining by`cbind(group=levels(factor(data\$group)),"mean"=mperage,"stdev"=stperage)` correct? – Yuriy Petrovskiy – 2011-03-13T12:51:09.647

@Yuriy: The row names are the group-names .... Nevertheless I added a separate column (now using data.frame) to show this. – steffen – 2011-03-13T12:51:20.330

@steffen Are you sure there will be no ordering issues when joining columns? – Yuriy Petrovskiy – 2011-03-13T12:53:52.140

@Yuriy, tapply orders the results by the order of the levels, so there ordering issues will not arise. @steffen, I upvoted the answer for the effort, before I knew plyr, I used to solve such problems in similar way. – mpiktas – 2011-03-13T13:24:59.323

@steffen Try to replace your last three lines of code with: `t(rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x))))))` (+ rename the last two columns as `mean` and `sd`). – chl – 2011-03-13T15:36:16.773

6

Here is an example with the function `aggregates()` I did myself some time ago:

``````# simulates data
set.seed(666)
( dat <- data.frame(group=gl(3,6), level=factor(rep(c("A","B","C"), 6)),
y=round(rnorm(18,10),1)) )

> dat
group level    y
1      1     A 10.8
2      1     B 12.0
3      1     C  9.6
4      1     A 12.0
5      1     B  7.8
6      1     C 10.8
7      2     A  8.7
8      2     B  9.2
9      2     C  8.2
10     2     A 10.0
11     2     B 12.2
12     2     C  8.2
13     3     A 10.9
14     3     B  8.3
15     3     C 10.1
16     3     A  9.9
17     3     B 10.9
18     3     C 10.3

# aggregates() function
aggregates <- function(formula, data=NULL, FUNS){
if(class(FUNS)=="list"){
f <- function(x) sapply(FUNS, function(fun) fun(x))
}else{f <- FUNS}
temp <- aggregate(formula, data, f)
out <- data.frame(temp[,-ncol(temp)], temp[,ncol(temp)])
colnames(out)[1] <- colnames(temp)[1]
return(out)
}

# example
FUNS <- function(x) c(mean=round(mean(x),0), sd=round(sd(x), 0))
( ag <- aggregates(y~group:level, data=dat, FUNS=FUNS) )
``````

It gives the following result:

``````> ag
group level mean sd
1     1     A   11  1
2     2     A    9  1
3     3     A   10  1
4     1     B   10  3
5     2     B   11  2
6     3     B   10  2
7     1     C   10  1
8     2     C    8  0
9     3     C   10  0
``````

Maybe you can get the same result starting from the R function split():

``````> with(dat, sapply( split(y, group:level), FUNS ) )
1:A 1:B 1:C 2:A 2:B 2:C 3:A 3:B 3:C
mean  11  10  10   9  11   8  10  10  10
sd     1   3   1   1   2   0   1   2   0
``````

Let me come back to the output of the `aggregates` function. You can transform it in a beautiful table using `reshape()`, `xtabs()` and `ftable()`:

``````rag <- reshape(ag, varying=list(3:4), direction="long", v.names="y")
rag\$time <- factor(rag\$time)
ft <- ftable(xtabs(y~group+level+time, data=rag))
attributes(ft)\$col.vars <- list(c("mean","sd"))
``````

This gives:

``````> ft
mean sd
group level
1     A        11  1
B        10  3
C        10  1
2     A         9  1
B        11  2
C         8  0
3     A        10  1
B        10  2
C        10  0
``````

Beautiful, isn't it? You can export this table to a pdf with the `textplot()` function of the `gplots` package.

See here for others' solutions.