How do you subset a data frame in R based on a minimum sample size?
Let's say I have a data frame with two factors that looks like this:
    factor1 factor2 value
    a       1       0.75
    a       1       0.34
    a       2       1.21
    a       2       0.75
    a       2       0.53
    b       1       0.42
    b       2       0.21
    b       2       0.18
    b       2       1.42
    etc.
How can I subset this data frame ("df", if you will) based on the condition that the combination of factor1 and factor2 (fact1*fact2) has more than, say, 2 observations? Can you use the length argument in subset to do this?
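For anyone who wants to run the code in the answer, here is a minimal sketch that rebuilds the sample data. It is named mydf to match the answer; the level name "a" for the first five rows and the choice to store factor1 as character rather than factor are assumptions, since the question only shows the printed table.

```r
# Sample data from the question. The level name "a" and the use of a
# character column (not a factor) for factor1 are assumptions.
mydf <- data.frame(
  factor1 = rep(c("a", "b"), times = c(5, 4)),
  factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two factors to see how many observations
# each combination has.
counts <- table(mydf$factor1, mydf$factor2)
print(counts)
```

Only the a/2 and b/2 combinations have more than 2 observations, so those are the rows the subset should keep.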
Assuming your data.frame is called mydf, you can use ave to create a logical vector to subset with:

    mydf[with(mydf, as.logical(ave(factor1, factor1, factor2, FUN = function(x) length(x) > 2))), ]
    #   factor1 factor2 value
    # 3       a       2  1.21
    # 4       a       2  0.75
    # 5       a       2  0.53
    # 7       b       2  0.21
    # 8       b       2  0.18
    # 9       b       2  1.42

Here's ave counting the combinations. Notice that ave returns an object with the same length as the number of rows in the data.frame (this makes it convenient for subsetting).

    > with(mydf, ave(factor1, factor1, factor2, FUN = length))
    [1] "2" "2" "3" "3" "3" "1" "3" "3" "3"

The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument. Note that the argument name must be written in upper case: ave treats any other named or unnamed argument as a grouping variable.

    > with(mydf, ave(factor1, factor1, factor2, FUN = function(x) length(x) > 2))
    [1] "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "FALSE" "TRUE"  "TRUE"  "TRUE"

Almost there... but since the first item is a character vector, our output is also a character vector. We want as.logical so we can directly use it for subsetting.
ave doesn't work on objects of class factor, in which case you'll need to do something like:

    mydf[with(mydf, as.logical(ave(as.character(factor1), factor1, factor2, FUN = function(x) length(x) > 2))), ]
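Another way to sidestep the class issue, sketched here as an alternative rather than as part of the original answer, is to pass a numeric column as the first argument to ave. The counts then come back as numbers, so neither as.character nor as.logical is needed. The data below repeats the question's sample with factor1 stored as a factor; the level name "a" and the variable names min_n, n_per_group and big_groups are assumptions made for illustration.

```r
# Sample data with factor1 as a factor (level name "a" is assumed).
mydf <- data.frame(
  factor1 = factor(rep(c("a", "b"), times = c(5, 4))),
  factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

min_n <- 2  # keep combinations with MORE than this many observations

# ave() returns one group size per row. Because the first argument
# is numeric, the result is numeric as well.
n_per_group <- ave(mydf$value, mydf$factor1, mydf$factor2, FUN = length)

# Compare against the threshold and subset directly.
big_groups <- mydf[n_per_group > min_n, ]
print(big_groups)
```

The values in the first argument are never used here, only their group sizes, so any numeric column (or seq_len(nrow(mydf))) would do.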