hadoop - How to control the number of records stored in a part file by a Pig job or a MapReduce job?
Is there a way to control the number of records stored in each part file?
Thanks.
Not directly (if at all). The number of part files in the output is determined by the parallelism of the script, and the data is split non-deterministically across those part files. The only way I can think of is something like:
A = FOREACH output GENERATE 1 AS num;
B = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS totaloutputlines;
-- store both output and B
Then, within a Python wrapper, use totaloutputlines to set the parallelism of the script the wrapper runs:

PARALLEL = number of lines in B / number of lines you want per file

Hopefully, that gives you approximate control over the number of records per part file.
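As a sketch of the arithmetic such a wrapper would do (the function name, the numbers, and the example launch command in the comments are illustrative, not from the original answer):

```python
import math

def compute_parallelism(total_lines, lines_per_file):
    """Reducer count so that each part file holds ~lines_per_file records."""
    # At least one reducer, even for tiny (or empty) outputs.
    return max(1, math.ceil(total_lines / lines_per_file))

# The wrapper would read totaloutputlines from the stored B relation,
# then relaunch the main script with the computed value, for example:
#   pig -param PAR=<computed value> main_script.pig
# where the script applies PARALLEL $PAR on its reduce-side operator.
print(compute_parallelism(1_000_000, 250_000))
```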
Alternatively, you may be able to get close to what you want with MultiStorage, which splits the output into one file per value of the field you choose.
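A minimal sketch of the MultiStorage approach (the input path, schema, and output directory are assumptions; MultiStorage ships in Piggybank, so the jar must be registered):

```pig
REGISTER piggybank.jar;

-- Assumed input: tab-separated (user, value) records.
data = LOAD 'input' AS (user:chararray, value:int);

-- Write one subdirectory per distinct value of field 0 (user).
STORE data INTO 'output'
    USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
```

Note this splits by field value, not by record count, so it only helps if your data has a field whose value distribution matches the file sizes you want.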