ffdfdply, splitting and memory limit in R


I'm getting an "Error: cannot allocate vector of size ... Mb" problem when using the ff/ffdf packages and the ffdfdply function.

I'm trying to use the ff and ffdf packages to process a large amount of data that has been keyed into groups. The data (in ffdf table format) looks like this:

    x =
      id_1    id_2    month    year    amount    key
         1      13        1    2013      -200     11
         1      13        2    2013       300     54
         2      19        1    2013       300     82
         3      33        2    2013       300     70
      .... (10+ million rows)

The unique keys are created using something like:

    x$key = as.ff(as.integer(ikey(x[c("id_1","id_2","month","year")])))

To summarise by grouping on the key variable, I have this command:

    summary = ffdfdply(x=x, split=x$key, FUN=function(df) {
      df = data.table(df)
      df = df[, list(id_1 = id_1[1], withdraw = sum(amount*(amount>0), na.rm=TRUE)), by = "key"]
      df
    }, trace=TRUE)

This uses data.table's excellent grouping feature (idea taken from this discussion). In the real code there are more functions applied to the amount variable, but even with this I cannot process the full ffdf table (a smaller subset of the table works fine).
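To make the grouping step concrete, here is a minimal sketch on a tiny in-memory table. The values are made up for illustration, and the table is built through data.frame() first because data.table() itself has an argument named key.

```r
library(data.table)

# Toy stand-in for one chunk of the ffdf (values are made up).
# Built via data.frame() because data.table() has its own `key` argument.
df <- as.data.table(data.frame(
  id_1   = c(1, 1, 2, 2, 3),
  amount = c(-200, 300, 300, NA, 50),
  key    = c(11L, 11L, 82L, 82L, 70L)
))

# (amount > 0) is TRUE/FALSE, which R treats as 1/0 in arithmetic,
# so amount * (amount > 0) zeroes out the negative amounts before summing.
res <- df[, list(
  id_1     = id_1[1],
  withdraw = sum(amount * (amount > 0), na.rm = TRUE)
), by = "key"]

print(res)
```

Each group keeps its first id_1 and the sum of its positive amounts only; missing amounts are dropped by na.rm = TRUE.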

It seems that ffdfdply is using a huge amount of RAM, giving:

    Error: cannot allocate vector of size 64 Mb

Also, BATCHBYTES does not seem to help. Can anyone with experience with ffdfdply recommend another way to go about this, without pre-splitting the ffdf table into chunks?
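For reference, this is roughly how BATCHBYTES is passed, sketched on a tiny made-up ffdf and assuming the ffdfdply signature documented in ffbase. The value 5000 and the aggregate() call are illustrative only, not a recommendation.

```r
library(ffbase)  # also attaches ff

# Small made-up ffdf standing in for the real 10+ million row table.
x <- as.ffdf(data.frame(
  key    = factor(rep(c("a", "b", "c"), each = 4)),
  amount = c(-200, 300, 300, 300, 50, -10, 20, 5, 0, 1, -1, 2)
))

# BATCHBYTES limits how many bytes of x are pulled into RAM per chunk;
# rows sharing the same split value are kept together in one chunk.
res <- ffdfdply(
  x = x,
  split = x$key,
  FUN = function(df) {
    # Sum only the positive amounts per key.
    aggregate(amount ~ key, data = df, FUN = function(a) sum(a[a > 0]))
  },
  BATCHBYTES = 5000,  # illustrative value only
  trace = FALSE
)

out <- res[, ]  # the result is small, so pulling it into RAM is safe
print(out)
```

The result is itself an ffdf, so it only lands in RAM when it is explicitly indexed with res[, ].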

The most difficult part of using ff/ffbase is making sure your data stays in ff and is not accidentally put in RAM. Once you have put your data in RAM (mostly due to a misunderstanding of when data is put in RAM and when it is not), it is hard to get that RAM back from R, and if you are working at your RAM limit, even a small additional request for RAM will give you the 'Error: cannot allocate vector of size' message.

Now, I think you misspecified the input to ikey. If you look at ?ikey, it requires an ffdf as its input argument, not several ff vectors. That has probably put your data in RAM, while you probably wanted to use ikey(x[c("id_1","id_2","month","year")]).
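Whether an expression hands back an ffdf or an ordinary RAM object can be checked directly. A minimal sketch on a made-up four-row ffdf, relying on the indexing rules described in the ff documentation for ffdf objects:

```r
library(ff)

# Tiny made-up ffdf with the same kind of columns as in the question.
x <- as.ffdf(data.frame(
  id_1   = c(1L, 1L, 2L, 3L),
  id_2   = c(13L, 13L, 19L, 33L),
  amount = c(-200, 300, 300, 300)
))

on_disk <- x[c("id_1", "id_2")]    # single index: still an ffdf
in_ram  <- x[, c("id_1", "id_2")]  # row/column index: ordinary data.frame

is.ffdf(on_disk)        # the column selection stayed in ff
is.data.frame(in_ram)   # this one was copied into RAM
is.ff(x$amount)         # a column accessed with $ is an ff vector
is.ff(x$amount[])       # adding [] reads the whole column into RAM
```

On a table with millions of rows, the second and fourth forms are the ones that quietly eat memory.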

I simulated some data on my computer as follows to get an ffdf with 24 million rows, and the following does not give me RAM trouble (it uses approx. 3.5 GB of RAM on my machine):

    require(ffbase)
    require(data.table)
    x <- expand.ffgrid(id_1 = ffseq(1, 1000), id_2 = ffseq(1, 1000), year = as.ff(c(2012,2013)), month = as.ff(1:12))
    x$amount <- ffrandom(nrow(x), rnorm, mean = 10, sd = 5)
    x$key <- ikey(x[c("id_1","id_2","month","year")])
    x$key <- as.character(x$key)
    summary <- ffdfdply(x, split=x$key, FUN=function(df) {
      df <- data.table(df)
      df <- df[, list(
        id_1 = id_1[1],
        id_2 = id_2[1],
        month = month[1],
        year = year[1],
        withdraw = sum(amount*(amount>0), na.rm=TRUE)
      ), by = key]
      df
    }, trace=TRUE)

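Another reason might be that you have other data in RAM that you are not telling us about. Also note that in ff all your factor levels are kept in RAM, which might be an issue if you are working with a lot of character/factor data. In that case you need to ask yourself whether you really need these data in your analysis or not.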
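To get a feel for how much RAM a large set of factor levels can take, here is a rough base R sketch. The 100,000 distinct keys are made up, and the sizes reported will vary by platform.

```r
# Made-up stand-in: 100,000 distinct keys stored as text.
keys <- as.character(seq_len(1e5))
f <- factor(keys)

# The integer codes are the part ff can keep on disk;
# the levels are the part that has to live in RAM.
codes_bytes  <- object.size(as.integer(f))
levels_bytes <- object.size(levels(f))

format(codes_bytes, units = "MB")
format(levels_bytes, units = "MB")
```

With one level per row, as with a unique key converted to character, the levels cost more memory than the codes themselves.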

