R - Per-group operation on multiple columns of a data.frame
This is a general problem I have often: I want to perform an operation on a data.frame where each factor level produces one number, and that number uses information from multiple columns. How do I write this in R?
I considered these functions:
- tapply - doesn't operate on multiple columns
- aggregate - the function is given the columns separately
- ave - the result has the same number of rows as the input, not the number of factor levels
- by - this is the hottest candidate, but I hate the format it returns: a list. I want a data.frame as the result. I know I can convert it, but that is ugly, so I'd prefer another solution!
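To make the complaint about by concrete, here is a minimal base R sketch. The data frame df and its columns g, x and w are made up for illustration and are not from the question; the weighted mean is just one example of an operation that needs two columns at once.

```r
# Hypothetical sample data; names are illustrative only.
set.seed(1)
df <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 4)),
  x = runif(12),
  w = runif(12)
)

# by() passes each group's rows to the function as a data.frame,
# so the function can use several columns together.
res <- by(df, df$g, function(d) weighted.mean(d$x, d$w))

# The result is a "by" object, not a data.frame, so it has to be
# converted by hand. This is the "ugly" step the question mentions.
out <- data.frame(g = names(res), wmean = as.numeric(res))
```

The conversion works, but it has to be rewritten whenever the function returns more than one value per group.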
Since the OP is asking for a general answer, I think the 'plyr' package is appropriate. The 'plyr' package has limitations when approaching large data sets, but for everyday use (as implied in the original post), the 'plyr' functions are wonderful assets for any R user.
Setup: Here is a quick data sample for us to work with.
data <- data.frame(id=1:50, group=sample(letters[1:3], 50, replace=TRUE), x_value=sample(1:500, 50), y_value=sample(2:5, 50, replace=TRUE)*100)
How to use plyr: I'm only going to address the basic uses here, as an example to get things started. First, load the package.
library(plyr)
Now, let's start calculating things. With 'plyr' functions, you choose the first two letters of the function based on the input and the output. In this example, we are inputting a data frame (d) and outputting a data frame (d), so we use the 'ddply' function.
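As a rough illustration of the naming rule, the sketch below runs three members of the family on a tiny made-up data frame (toy and its columns are assumptions for this example only). The letter "l" stands for list.

```r
library(plyr)

# Tiny hypothetical data set, for illustration only.
toy <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 10))

# d -> d: data frame in, data frame out
as_df <- ddply(toy, .(group), summarize, total = sum(value))

# d -> l: data frame in, list out (one element per group)
as_list <- dlply(toy, .(group), function(d) sum(d$value))

# l -> d: list in, data frame out (one row per list element here)
from_list <- ldply(list(a = 1:3, b = 4:6), function(v) data.frame(total = sum(v)))
```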
The 'ddply' function uses this syntax:
ddply( data_source, .(grouping_variables), function, column_definitions)
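The function slot does not have to be 'summarize' or 'mutate'. As a hedged sketch, assuming a made-up data frame sample_df with the same column names as the setup data, you can pass your own function. It receives each group's rows as a data frame and can combine any columns it likes.

```r
library(plyr)

# Hypothetical data with the same column names as the setup above.
set.seed(42)
sample_df <- data.frame(
  group   = rep(c("a", "b", "c"), each = 5),
  x_value = sample(1:500, 15),
  y_value = sample(2:5, 15, replace = TRUE) * 100
)

# The anonymous function is called once per group. Returning a
# one-row data frame yields one output row per group.
ratio_by_group <- ddply(sample_df, .(group), function(d) {
  data.frame(xy_ratio = sum(d$x_value) / sum(d$y_value))
})
```

This is the shape the question asks for: one number per factor level, computed from more than one column, returned as a data.frame.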
First, let's find out how many entries belong to groups a, b, and c:
ddply( data, .(group), summarize, n=length(id))
#   group  n
# 1     a 17
# 2     b 16
# 3     c 17
Here, we specified the data source first, and then specified that we wanted to group the lines by the 'group' variable. We use the 'summarize' function to trash all of the columns except those in our grouping_variables and column_definitions. We are using the 'length' function as a count for this purpose.
Now, let's add columns to the data that show the group means of the x and y values.
ddply( data, .(group), mutate, group_mean_x=mean(x_value), group_mean_y=mean(y_value))
#    id group x_value y_value group_mean_x group_mean_y
# 1   8     a     301     300     218.7059     394.1176
# 2  13     a      38     500     218.7059     394.1176
# 3  14     a     425     300     218.7059     394.1176
# .....................................................
# 17 47     a     191     300     218.7059     394.1176
# 18  5     b     411     500     235.1875     325.0000
# 19  6     b     121     400     235.1875     325.0000
# 20 11     b     151     200     235.1875     325.0000
# .....................................................
# 33 49     b     354     200     235.1875     325.0000
# 34  1     c     482     400     246.1765     400.0000
# 35  2     c      43     300     246.1765     400.0000
# .....................................................
# 50 50     c     248     500     246.1765     400.0000
I've truncated the results to make this shorter. Here, we used the same data source and grouping variable, but the 'mutate' function preserves all of the data in the data source while adding the new columns.
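Because 'mutate' keeps every row, the group statistic can feed a per-row calculation in the same call. A minimal sketch, assuming a made-up data frame scores and a hypothetical column name x_centered:

```r
library(plyr)

# Hypothetical data, for illustration only.
set.seed(7)
scores <- data.frame(
  group   = rep(c("a", "b", "c"), each = 4),
  x_value = sample(1:500, 12)
)

# mutate evaluates its columns in order, so x_centered can refer to
# the group_mean_x column defined just before it.
centered <- ddply(scores, .(group), mutate,
  group_mean_x = mean(x_value),
  x_centered   = x_value - group_mean_x
)
```

The output has the same 12 rows as the input, plus the two new columns.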
Now, let's make a two-step effort with the previous data. Let's show the means and the difference between the x and y mean values in a summary table.
ddply( data, .(group), summarize, group_mean_x=mean(x_value), group_mean_y=mean(y_value), difference=group_mean_x - group_mean_y)
#   group group_mean_x group_mean_y difference
# 1     a     218.7059     394.1176  -175.4118
# 2     b     235.1875     325.0000   -89.8125
# 3     c     246.1765     400.0000  -153.8235
I show this example because there is something important going on: we're using columns we just defined as part of a different column's definition. This is very, very useful when creating summary tables.
Finally, let's group by two factors: the group and the digit in the 10^2 (hundreds) place of the x value. Let's create a summary table that shows the mean x and y values for each group and each hundreds digit of the x value.
ddply( data, .(group, x_100=as.integer(x_value/100)), summarize, mean_x=mean(x_value), mean_y=mean(y_value))
#    group x_100   mean_x   mean_y
# 1      a     0  20.0000 425.0000
# 2      a     1 145.6667 333.3333
# 3      a     2 272.0000 400.0000
# 4      a     3 328.6667 433.3333
# 5      a     4 427.5000 350.0000
# 6      b     0  37.0000 200.0000
# 7      b     1 148.6667 383.3333
# 8      b     2 230.0000 325.0000
# 9      b     3 363.0000 200.0000
# 10     b     4 412.5000 400.0000
# 11     c     0  55.6000 360.0000
# 12     c     1 173.5000 350.0000
# 13     c     2 262.5000 450.0000
# 14     c     3 355.6667 400.0000
# 15     c     4 481.0000 433.3333
This example is important because it shows two things: you can create grouping columns using vectorized statements, and you can group by more than one column by separating the list of columns with a comma.
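Any vectorized expression should work as a grouping column, not just integer division. As a rough sketch on made-up data (demo, x_band and the break points are assumptions for illustration), base R's cut can bin x_value into ranges:

```r
library(plyr)

# Hypothetical data with the same column names as the setup above.
set.seed(3)
demo <- data.frame(
  group   = sample(letters[1:3], 30, replace = TRUE),
  x_value = sample(1:500, 30),
  y_value = sample(2:5, 30, replace = TRUE) * 100
)

# Group by 'group' and by a derived band: (0,250] or (250,500].
banded <- ddply(demo,
  .(group, x_band = cut(x_value, breaks = c(0, 250, 500))),
  summarize,
  n      = length(x_value),
  mean_y = mean(y_value)
)
```

Combinations of group and band that have no rows are left out of the result by default.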
This quick set of examples should be enough to get you started using the 'plyr' package. More details can be found in help(plyr).