hadoop - How to control the number of records stored in a part file by a Pig job or a MapReduce job?
Is there a way to control the number of records stored in each part file?
Thanks.
Not directly (if at all). The number of part files in the output is determined by the parallelism of the script, and the data is split non-deterministically across those part files. The only way I can think of is something like:
A = FOREACH output GENERATE 1 AS num;
B = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS totaloutputlines;
-- store both output and B
Then, within a Python wrapper, use totaloutputlines to set the parallelism of the script the wrapper runs:

PARALLEL = number of lines in B / number of lines you want per file

Hopefully, that gives you approximate control over the number of records per part file.
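As a sketch of the arithmetic such a wrapper would do (the function name, the numbers, and the example launch command in the comments are illustrative, not from the original answer):

```python
import math

def compute_parallelism(total_lines, lines_per_file):
    """Reducer count so that each part file holds ~lines_per_file records."""
    # At least one reducer, even for tiny (or empty) outputs.
    return max(1, math.ceil(total_lines / lines_per_file))

# The wrapper would read totaloutputlines from the stored B relation,
# then relaunch the main script with the computed value, for example:
#   pig -param PAR=<computed value> main_script.pig
# where the script applies PARALLEL $PAR on its reduce-side operator.
print(compute_parallelism(1_000_000, 250_000))
```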
Alternatively, you may be able to get close to what you want with MultiStorage, which splits the output into one file per value of the field you choose.
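A minimal sketch of the MultiStorage approach (the input path, schema, and output directory are assumptions; MultiStorage ships in Piggybank, so the jar must be registered):

```pig
REGISTER piggybank.jar;

-- Assumed input: tab-separated (user, value) records.
data = LOAD 'input' AS (user:chararray, value:int);

-- Write one subdirectory per distinct value of field 0 (user).
STORE data INTO 'output'
    USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
```

Note this splits by field value, not by record count, so it only helps if your data has a field whose value distribution matches the file sizes you want.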