hadoop - How to control the number of records stored in a part file by a pig job or a mapreduce job? -


Is there a way to control the number of records stored in a part file?

Thanks.

Not easily (if at all). The number of part files in the output is determined by the parallelism of the script, and the data is split non-deterministically across the part files. The only way I can think of is to do something like:

a = FOREACH output GENERATE 1 AS num;
b = FOREACH (GROUP a ALL) GENERATE COUNT(a) AS totalOutputLines;
-- store both output and b

Then, within a Python wrapper, use totalOutputLines to set the parallelism of the script the wrapper runs, with PARALLEL = (number of lines in b) / (number of lines you want per file). Hopefully, this lets you approximately control the number of records per part file.
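As a minimal sketch of that wrapper idea: the function and script names below (`compute_parallel`, `myscript.pig`, the `PAR` parameter) are hypothetical, and in a real run the total line count would be read back from the HDFS output of `b` rather than hard-coded.

```python
import math

def compute_parallel(total_lines, lines_per_file):
    """Return the PARALLEL value so each reducer writes
    roughly lines_per_file records (at least 1)."""
    return max(1, math.ceil(total_lines / lines_per_file))

def build_pig_command(script, parallel):
    # Pass the computed value in via Pig parameter substitution;
    # the script would then use it as "... PARALLEL $PAR;"
    return ["pig", "-p", "PAR={}".format(parallel), script]

# Example: b reported 1,000,000 total lines, we want ~250,000 per part file
par = compute_parallel(1_000_000, 250_000)   # -> 4
cmd = build_pig_command("myscript.pig", par)
```

The command list could then be handed to `subprocess.run(cmd)` to launch the second, correctly-parallelized pass of the script.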

Alternatively, you may be able to get close to what you want with MultiStorage, which splits the output into one file per value of a field you choose.
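For reference, MultiStorage lives in piggybank; a sketch of its use (the paths and field index here are placeholders, and the second argument is the zero-based index of the field to split on):

```
REGISTER /path/to/piggybank.jar;
-- write one subdirectory/file per distinct value of field 0
STORE output INTO '/out'
    USING org.apache.pig.piggybank.storage.MultiStorage('/out', '0');
```

Note this controls how records are grouped into files by value, not the exact record count per file.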
