R - Per-group operation on multiple columns of a data.frame

April 15, 2012

A general problem I have quite often: I want to perform some operation on a data.frame where, for each factor level, I produce one number, and to produce it I use information from multiple columns. How do I write this in R?

I considered these functions:

tapply - doesn't operate on multiple columns
aggregate - the function is given the columns separately
ave - the result has the same number of rows as the input, not the number of factor levels
by - this was the hottest candidate, but I hate the format it returns: a list. I want a data.frame as the result. I know I can convert it, but that is ugly, so I'd prefer another solution!
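For reference, the 'by' route that the question calls ugly might look roughly like the sketch below. The sample data, the fixed seed, and the 'xy_ratio' statistic are illustrative assumptions and are not part of the original question; the point is only that each group's piece is a small data frame and the pieces are bound together afterwards.

```r
# Hypothetical sample data (an assumption, shaped like the data used in the answer).
set.seed(1)
data <- data.frame(
  id      = 1:50,
  group   = sample(letters[1:3], 50, replace = TRUE),
  x_value = sample(1:500, 50),
  y_value = sample(2:5, 50, replace = TRUE) * 100
)

# by() hands each group's rows to the function as a data frame,
# so the function can use several columns at once.
pieces <- by(data, data$group, function(df) {
  data.frame(
    group    = df$group[1],
    xy_ratio = sum(df$x_value) / sum(df$y_value)  # illustrative statistic
  )
})

# by() returns a list-like object; bind the per-group pieces into one data frame.
result <- do.call(rbind, pieces)
print(result)
```

This works with base R alone, but the explicit binding step is the conversion the question would rather avoid.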

The OP is asking for a general answer, so I think the 'plyr' package is the most appropriate. The 'plyr' package has some limitations when approaching large data sets, but for everyday use (as implied in the original post), the 'plyr' functions are wonderful assets for any R user.

Setup: here is a quick data sample for us to work with.

# 50 rows: an id, a random group (a, b, or c), and two numeric columns
data <- data.frame(
    id=1:50,
    group=sample(letters[1:3], 50, replace=TRUE),
    x_value=sample(1:500, 50),
    y_value=sample(2:5, 50, replace=TRUE)*100)

How to use plyr: I'm just going to address the basic uses here, as an example to get things started. First, load the package.

library(plyr)

Now, let's start calculating things. With 'plyr' functions, you choose the first two letters of the function name based on the input and the output. In this example, we are inputting a data frame (d) and outputting a data frame (d), so we use the 'ddply' function.
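As a small sketch of that naming convention: 'dlply' takes a data frame and returns a list, and 'ldply' takes a list and returns a data frame. The sample data and the fixed seed here are assumptions added so the block runs on its own.

```r
library(plyr)

# Hypothetical sample data (an assumption, same shape as the setup data).
set.seed(1)
data <- data.frame(
  id      = 1:50,
  group   = sample(letters[1:3], 50, replace = TRUE),
  x_value = sample(1:500, 50),
  y_value = sample(2:5, 50, replace = TRUE) * 100
)

# d -> l: data frame in, list out (one row count per group)
counts_list <- dlply(data, .(group), nrow)

# l -> d: list in, data frame out
counts_df <- ldply(counts_list, function(n) data.frame(n = n))

print(counts_df)
```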

The 'ddply' function uses this syntax:

ddply(
    data_source,
    .(grouping_variables),
    function,
    column_definitions)

First, let's find out how many entries belong to groups a, b, and c:

ddply(
    data,
    .(group),
    summarize,
    n=length(id))
#   group  n
# 1     a 17
# 2     b 16
# 3     c 17

Here, we specified the data source first, and then specified that we wanted to group the rows by the 'group' variable. We use the 'summarize' function to trash all of the columns except those in our grouping_variables and column_definitions. We are using the 'length' function as a count for this purpose.

Now, let's add columns to the data that show the group means of the x and y values.

ddply(
    data,
    .(group),
    mutate,
    group_mean_x=mean(x_value),
    group_mean_y=mean(y_value))
#    id group x_value y_value group_mean_x group_mean_y
# 1   8     a     301     300     218.7059     394.1176
# 2  13     a      38     500     218.7059     394.1176
# 3  14     a     425     300     218.7059     394.1176
# .....................................................
# 17 47     a     191     300     218.7059     394.1176
# 18  5     b     411     500     235.1875     325.0000
# 19  6     b     121     400     235.1875     325.0000
# 20 11     b     151     200     235.1875     325.0000
# .....................................................
# 33 49     b     354     200     235.1875     325.0000
# 34  1     c     482     400     246.1765     400.0000
# 35  2     c      43     300     246.1765     400.0000
# .....................................................
# 50 50     c     248     500     246.1765     400.0000

I've truncated the results to make them shorter. Here, we used the same data source and grouping variable, but the 'mutate' function preserves all of the data in the data source while adding columns.

Now, let's make a two-step effort on the previous data. Let's show the means and the difference between the x and y mean values in a summary table.

ddply(
    data,
    .(group),
    summarize,
    group_mean_x=mean(x_value),
    group_mean_y=mean(y_value),
    difference=group_mean_x - group_mean_y)
#   group group_mean_x group_mean_y difference
# 1     a     218.7059     394.1176  -175.4118
# 2     b     235.1875     325.0000   -89.8125
# 3     c     246.1765     400.0000  -153.8235

I show this example because there is something important going on: we're using columns we have just defined as part of a different column's definition. This is very, very useful when creating summary tables.
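To tie this back to the original question (one number per group, computed from several columns at once), a minimal sketch might look like the following. The correlation and weighted mean are illustrative choices, and the sample data and fixed seed are assumptions. The second call passes an anonymous function that receives each group's rows as a data frame, which is an alternative to 'summarize'.

```r
library(plyr)

# Hypothetical sample data (an assumption, same shape as the setup data).
set.seed(1)
data <- data.frame(
  id      = 1:50,
  group   = sample(letters[1:3], 50, replace = TRUE),
  x_value = sample(1:500, 50),
  y_value = sample(2:5, 50, replace = TRUE) * 100
)

# One row per group; each statistic reads two columns at once.
per_group <- ddply(
  data,
  .(group),
  summarize,
  xy_cor          = cor(x_value, y_value),
  weighted_mean_x = weighted.mean(x_value, y_value)
)

# Same weighted mean, written as a function of each group's data frame.
per_group_fn <- ddply(data, .(group), function(df) {
  data.frame(weighted_mean_x = weighted.mean(df$x_value, df$y_value))
})

print(per_group)
print(per_group_fn)
```

The function form is handy when the per-group computation does not fit neatly into a single column definition.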

Finally, let's group by two factors: the group and the digit in the 10^2 place of the x value. Let's create a summary table that shows the mean x and y values for each group and 10^2 digit of the x value.

ddply(
    data,
    .(group, x_100=as.integer(x_value/100)),
    summarize,
    mean_x=mean(x_value),
    mean_y=mean(y_value))
#    group x_100   mean_x   mean_y
# 1      a     0  20.0000 425.0000
# 2      a     1 145.6667 333.3333
# 3      a     2 272.0000 400.0000
# 4      a     3 328.6667 433.3333
# 5      a     4 427.5000 350.0000
# 6      b     0  37.0000 200.0000
# 7      b     1 148.6667 383.3333
# 8      b     2 230.0000 325.0000
# 9      b     3 363.0000 200.0000
# 10     b     4 412.5000 400.0000
# 11     c     0  55.6000 360.0000
# 12     c     1 173.5000 350.0000
# 13     c     2 262.5000 450.0000
# 14     c     3 355.6667 400.0000
# 15     c     4 481.0000 433.3333

This example is important because it shows two things: we can create grouping columns using vectorized statements, and we can group by more than one column by separating the list of columns with a comma.

This quick set of examples should be enough to get started using the 'plyr' package. More details can be found in help(plyr).
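Any vectorized expression can serve as a computed grouping column. As a sketch, integer division with '%/%' yields the same hundreds bucket as 'as.integer(x_value/100)' for these positive values. The sample data, the fixed seed, and the 'n' count column are assumptions for illustration.

```r
library(plyr)

# Hypothetical sample data (an assumption, same shape as the setup data).
set.seed(1)
data <- data.frame(
  id      = 1:50,
  group   = sample(letters[1:3], 50, replace = TRUE),
  x_value = sample(1:500, 50),
  y_value = sample(2:5, 50, replace = TRUE) * 100
)

# Grouping column computed with as.integer(), as in the example above.
by_int <- ddply(
  data, .(group, x_100 = as.integer(x_value / 100)),
  summarize, n = length(id)
)

# Same buckets computed with integer division.
by_div <- ddply(
  data, .(group, x_100 = x_value %/% 100),
  summarize, n = length(id)
)

print(by_div)
```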
