hadoop - How to control the number of records stored in a part file by a Pig job or a MapReduce job?


Is there a way to control the number of records stored in each part file?

Thanks.

Not directly, if at all. The number of part files in the output is determined by the parallelism of the script, and the data is split non-deterministically across those part files. The only workaround I can think of is something like:

A = FOREACH output GENERATE 1 AS num;
B = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS totalOutputLines;
-- store both output and B

Then, within a Python wrapper, use totalOutputLines to set the parallelism of the script the wrapper is running: PARALLEL = (number of lines in B) / (number of lines you want per file). Hopefully, that will approximately control the number of records per part file.
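The wrapper logic can be sketched in Python as follows. Note that the file paths, the `PAR` parameter name, and the helper names here are illustrative assumptions, not part of the original answer:

```python
import math
import subprocess


def parallelism_for(total_lines, lines_per_file):
    """Derive a PARALLEL value so each part file holds roughly
    lines_per_file records (never fewer than one reducer)."""
    return max(1, math.ceil(total_lines / lines_per_file))


def run_with_parallelism(count_path, script_path, lines_per_file):
    """Hypothetical wrapper: read the single count written by
    relation B, then launch the real Pig script with a matching
    parallelism passed in as a parameter."""
    with open(count_path) as f:
        total = int(f.read().strip())
    par = parallelism_for(total, lines_per_file)
    # Assumes the Pig script references $PAR in its PARALLEL clauses.
    subprocess.run(["pig", "-param", f"PAR={par}", script_path], check=True)
```

For example, with 1,000,000 total records and a target of 250,000 records per file, `parallelism_for` yields 4 reducers, so the store step would produce roughly four part files of the desired size.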

Alternatively, you might get close to what you want with MultiStorage, which splits the output into one file per value of the field you choose.

