How do you subset a data frame in R based on a minimum sample size?
Let's say you have a data frame with 2 levels of factors that looks like this:
factor1 factor2 value
a       1       0.75
a       1       0.34
a       2       1.21
a       2       0.75
a       2       0.53
b       1       0.42
b       2       0.21
b       2       0.18
b       2       1.42
etc.
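For anyone who wants to reproduce the example, the rows shown could be built roughly as follows. This is only a sketch: it assumes the first level of factor1 is "a", that factor1 is stored as character rather than as a factor, and it uses the name mydf that the answer uses.

```r
# Sketch of the example data; factor1 is kept as character on purpose
# (stringsAsFactors = FALSE), which is an assumption about the original data.
mydf <- data.frame(
  factor1 = rep(c("a", "b"), times = c(5, 4)),
  factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = FALSE
)

# Inspect the column types before subsetting
str(mydf)
```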
How can you subset the data frame ("df", if you will) based on the condition that the combination of factor1 and factor2 (fact1*fact2) has more than, say, 2 observations? Can you use the length argument in subset to do this?
Assuming your data.frame is called mydf, you can use ave to create a logical vector to help with the subset:
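# Note: the argument must be spelled FUN (capitals); a lowercase "fun" would be
# swallowed by "..." and treated as another grouping variable.
mydf[with(mydf, as.logical(ave(factor1, factor1, factor2, FUN = function(x) length(x) > 2))), ]
#   factor1 factor2 value
# 3       a       2  1.21
# 4       a       2  0.75
# 5       a       2  0.53
# 7       b       2  0.21
# 8       b       2  0.18
# 9       b       2  1.42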
Here's ave counting up your combinations. Notice that ave returns an object of the same length as the number of rows in your data.frame (this makes it convenient for subsetting).
> with(mydf, ave(factor1, factor1, factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"
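As a cross-check on these per-row counts, the size of each factor1/factor2 combination can also be tabulated with table. This is just a sketch; it rebuilds the example data under the assumption that the first level of factor1 is "a", and the name counts is made up for illustration.

```r
# Example data (assumed reconstruction of the question's data frame)
mydf <- data.frame(
  factor1 = rep(c("a", "b"), times = c(5, 4)),
  factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = FALSE
)

# One cell per factor1/factor2 combination, holding the number of rows in it
counts <- with(mydf, table(factor1, factor2))
print(counts)
```

Unlike ave, table returns one value per combination rather than one per row, so it is handy for checking the counts but cannot be used directly as a row index.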
The next step is to compare that length against your threshold. For that, we need an anonymous function for our FUN argument.
> with(mydf, ave(factor1, factor1, factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "FALSE" "TRUE"  "TRUE"  "TRUE"
Almost there... but since the first item passed to ave is a character vector, our output is also a character vector. We want as.logical so that we can use the result directly for subsetting.
ave doesn't work on objects of class factor (the results cannot be stored back into a factor), in which case you'll need to do something like:
mydf[with(mydf, as.logical(ave(as.character(factor1), factor1, factor2, FUN = function(x) length(x) > 2))), ]
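A possible variation, offered only as a sketch: if the first argument to ave is an integer vector such as seq_along(factor1), the counts come back as integers, so neither as.character nor as.logical should be needed, and it makes no difference whether factor1 is a factor. The names n_per_group and result are made up for illustration, and the data are rebuilt under the assumption that the first level of factor1 is "a".

```r
# Example data, this time with factor1 deliberately stored as a factor
mydf <- data.frame(
  factor1 = rep(c("a", "b"), times = c(5, 4)),
  factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = TRUE
)

# Count rows per factor1/factor2 combination; x is an integer vector,
# so the group sizes are returned as integers, one per row
n_per_group <- with(mydf, ave(seq_along(factor1), factor1, factor2, FUN = length))

# Keep only the rows whose combination has more than 2 observations
result <- mydf[n_per_group > 2, ]
print(result)
```

The threshold comparison happens outside ave here, which keeps the FUN argument a plain length and makes the cutoff easy to change.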