hadoop - Duplicate value in part-r-00000
While processing an XML file (https://github.com/studhadoop/xml/blob/master/rpt), I am getting duplicate values in the output.
bin/hadoop fs -text /user/root/t-output1/part-r-00000
st17925 1.02
st17925 1.02
st17926 3.00
st17926 3.00
st17927 3.00
st17927 3.00
My MapReduce code: https://github.com/studhadoop/xml/blob/master/xmlparser11.java
Why is this so? Does it depend on the size of the XML file? When I have a large XML file, I get duplicated values. With a small XML file, the output is OK.

Update 1

One more doubt. Instead of the listing above, I want output like this:

studentid grade
st17925 1.02
st17926 3.00
st17927 3.00

What change should I make in the program?
Update 2

How do I make the output in CSV format?
Because in your reducer implementation, you write the key for every value to the output collector:

    for (Text value : values) {
        context.write(key, value);
    }
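To see why that loop yields repeated lines, here is a minimal plain-Java sketch of the same behaviour. It does not use any Hadoop classes: the class and method names are made up for illustration, String stands in for Text, and the input list assumes the mapper emitted the same record twice for one key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the reduce step; no Hadoop classes are used.
class PerValueWriteSketch {

    // Mimics `for (Text value : values) context.write(key, value);`
    // by returning one "key<TAB>value" line per incoming value.
    static List<String> reducePerValue(String key, List<String> values) {
        List<String> lines = new ArrayList<>();
        for (String value : values) {
            lines.add(key + "\t" + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical input: the same value arrived twice for this key.
        List<String> values = Arrays.asList("1.02", "1.02");
        for (String line : reducePerValue("st17925", values)) {
            System.out.println(line);
        }
    }
}
```

Every value that reaches the reducer becomes its own output line, so a key that receives two values appears twice in part-r-00000.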
What you wanted is this:

    StringBuilder sb = new StringBuilder();
    for (Text value : values) {
        sb.append(value.toString());
        sb.append(" ");
    }
    context.write(key, new Text(sb.toString()));

which generates a space-separated list of every value per key.
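As a rough illustration of the one-line-per-key idea, the sketch below joins the values in plain Java without Hadoop. The class name, method name and the sep parameter are hypothetical, String stands in for Text, and String.join is swapped in for the StringBuilder loop so that no trailing separator is left behind. Passing "," as the separator gives a comma-separated line, which may be one way to approach the CSV question.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the reduce step; no Hadoop classes are used.
class JoinValuesSketch {

    // Builds one output line per key: the key, then every value,
    // all joined by the given separator.
    static String reduceJoined(String key, List<String> values, String sep) {
        return key + sep + String.join(sep, values);
    }

    public static void main(String[] args) {
        // Hypothetical input: two values arrived for the same key.
        List<String> values = Arrays.asList("1.02", "1.02");
        System.out.println(reduceJoined("st17925", values, " "));
        System.out.println(reduceJoined("st17925", values, ","));
    }
}
```

Note that joining does not remove repeated values, it only places them on a single line for the key. In a real job the key and value are written separately, and TextOutputFormat puts a tab between them by default; the property mapreduce.output.textoutputformat.separator is the usual way to change that, though the exact property name depends on the Hadoop version.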