hadoop - Pig group by and average function
I have data that looks like this:
stn--- wban yearmoda temp dewp slp stp visib wdsp mxspd gust max min prcp sndp frshtt
030050 99999 19291029 46.7 4 42.0 4 990.9 4 9999.9 0 10.9 4 13.0 4 13.0 999.9 46.9* 44.1 99.99 999.9 010000
030050 99999 19291030 43.5 4 33.5 4 1015.4 4 9999.9 0 12.4 4 14.3 4 18.1 999.9 46.9 42.1 0.00i 999.9 000000
030050 99999 19291031 43.7 4 37.3 4 1026.8 4 9999.9 0 12.4 4 4.5 4 8.9 999.9 46.9* 37.9 0.00i 999.9 000000
030050 99999 19291101 49.2 4 45.5 4 1019.9 4 9999.9 0 6.2 4 8.2 4 13.0 999.9 51.1* 46.0 99.99 999.9 010000
030050 99999 19291102 47.0 4 44.5 4 1013.6 4 9999.9 0 7.8 4 6.2 4 8.9 999.9 51.1 44.1 0.00i 999.9 000000
030050 99999 19291103 44.0 4 36.0 4 1009.2 4 9999.9 0 10.9 4 8.0 4 8.9 999.9 50.0 42.1 0.00i 999.9 000000
I want the average temp for each month, in this case months 10 and 11.
First I load the data using:
raw_logs = LOAD 'data' AS (line:chararray);
Then I separate the data into different variables using a regex:
logs_base = FOREACH raw_logs GENERATE FLATTEN( REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$') ) AS ( stn: int, wban: int, year: int, month: int, day: int, temp: float );
Next I get rid of the top tuple, which contained the header data:
no_nulls = FILTER logs_base BY stn IS NOT NULL;
Then I group the data by stn, wban, year, and month:
grouped = GROUP no_nulls BY stn..month;
Then I try to generate the average and run into an error:
c = FOREACH grouped GENERATE AVG(logs_base.temp);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: <line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
I think the error may be that the regex is returning temp as a string even though I am telling Pig it is a float, but I could be wrong.
Edit: I changed c to:
c = FOREACH grouped GENERATE AVG(no_nulls.temp);
and now I get this error:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-04-20 19:55:25 2013-04-20 19:57:21 GROUP_BY,FILTER

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_201304201942_0001 c,logs_base,raw_logs,grouped,no_nulls GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,
The log has a bit more info:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
    ... 19 more

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias c. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.PigServer.openIterator(PigServer.java:890)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:500)
    at org.apache.pig.Main.main(Main.java:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
    at org.apache.pig.PigServer.storeEx(PigServer.java:995)
    at org.apache.pig.PigServer.store(PigServer.java:962)
    at org.apache.pig.PigServer.openIterator(PigServer.java:875)
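My guess is that it's because grouped doesn't contain logs_base; it contains no_nulls. After a GROUP, the bag inside each group is named after the relation that was grouped. Try making it
c = FOREACH grouped GENERATE AVG(no_nulls.temp);
and see if that fixes it.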
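If the job then fails at run time with the ClassCastException from the log (java.lang.String cannot be cast to java.lang.Float), explicit casts are worth trying, as the first error message itself suggests. The following is only a sketch under an assumption: that REGEX_EXTRACT_ALL hands back chararray fields and that the AS clause on FLATTEN declares types without converting the values. The aliases extracted and typed and the field name avg_temp are made up for illustration.

```pig
raw_logs = LOAD 'data' AS (line:chararray);

-- Name the captured groups but keep them as chararray, which is
-- (by assumption) what REGEX_EXTRACT_ALL actually returns.
extracted = FOREACH raw_logs GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line,
        '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$'))
    AS (stn:chararray, wban:chararray, year:chararray,
        month:chararray, day:chararray, temp:chararray);

-- Drop lines the regex did not match (the header row).
no_nulls = FILTER extracted BY stn IS NOT NULL;

-- Convert the strings explicitly before any arithmetic.
typed = FOREACH no_nulls GENERATE
    (int)stn AS stn, (int)wban AS wban, (int)year AS year,
    (int)month AS month, (int)day AS day, (float)temp AS temp;

grouped = GROUP typed BY (stn, wban, year, month);

-- The bag inside each group is named after the grouped relation (typed).
c = FOREACH grouped GENERATE FLATTEN(group), AVG(typed.temp) AS avg_temp;
DUMP c;
```

For the six sample rows, the arithmetic mean of temp is about 44.63 for month 10 and about 46.73 for month 11, which gives something to compare the output against.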
If that doesn't work, try adding dump raw_logs after the first line and commenting everything else out. Make sure the output looks good, then uncomment the second line, change the dump to dump logs_base, and repeat for the rest of the lines. Sanity check each piece of your Pig script.
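The step-by-step check could look roughly like the sketch below. It uses LIMIT so that each DUMP prints only a few tuples, and DESCRIBE to print the schema Pig has on record for a relation. The aliases first_raw and first_base are made up for illustration.

```pig
-- Step 1: check the raw load on its own.
raw_logs = LOAD 'data' AS (line:chararray);
first_raw = LIMIT raw_logs 5;
DUMP first_raw;

-- Step 2: once step 1 looks right, uncomment the next statement from
-- the script and inspect it the same way, then move on to the next one.
-- logs_base = FOREACH raw_logs GENERATE FLATTEN( REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$') ) AS ( stn: int, wban: int, year: int, month: int, day: int, temp: float );
-- DESCRIBE logs_base;
-- first_base = LIMIT logs_base 5;
-- DUMP first_base;
```

Keep in mind that DESCRIBE reports the declared schema, so it may well show temp as float even if the values arriving at run time are still strings. A check on the actual data, such as the DUMP output or a small computation on the field, is the more reliable signal.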