hadoop - Pig group by and average function


I have data that looks like this:

stn--- wban   yearmoda    temp       dewp      slp        stp       visib      wdsp     mxspd   gust    max     min   prcp   sndp   frshtt
030050 99999  19291029    46.7  4    42.0  4   990.9  4  9999.9  0   10.9  4   13.0  4   13.0  999.9    46.9*   44.1  99.99  999.9  010000
030050 99999  19291030    43.5  4    33.5  4  1015.4  4  9999.9  0   12.4  4   14.3  4   18.1  999.9    46.9    42.1   0.00i 999.9  000000
030050 99999  19291031    43.7  4    37.3  4  1026.8  4  9999.9  0   12.4  4    4.5  4    8.9  999.9    46.9*   37.9   0.00i 999.9  000000
030050 99999  19291101    49.2  4    45.5  4  1019.9  4  9999.9  0    6.2  4    8.2  4   13.0  999.9    51.1*   46.0  99.99  999.9  010000
030050 99999  19291102    47.0  4    44.5  4  1013.6  4  9999.9  0    7.8  4    6.2  4    8.9  999.9    51.1    44.1   0.00i 999.9  000000
030050 99999  19291103    44.0  4    36.0  4  1009.2  4  9999.9  0   10.9  4    8.0  4    8.9  999.9    50.0    42.1   0.00i 999.9  000000

I want the average temperature for each month, in this case months 10 and 11.

First I load the data using:

raw_logs = LOAD 'data' AS (line:chararray);

Then I separate the data into different variables using a regex:

logs_base = FOREACH raw_logs GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
    ) AS (
        stn: int,
        wban: int,
        year: int,
        month: int,
        day: int,
        temp: float
    );

Next I get rid of the top tuple, which contained the header data:

no_nulls = FILTER logs_base BY stn IS NOT NULL;

Then I group the data by stn, wban, year, and month:

grouped = GROUP no_nulls BY stn..month;

Then I try to generate the average and run into this error:

c = FOREACH grouped GENERATE AVG(logs_base.temp);

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: <line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.

I think the error may be that the regex is returning temp as a string even though I am declaring it as a float, but I could be wrong.

Edit: I changed c to:

c = FOREACH grouped GENERATE AVG(no_nulls.temp);

and now I get this error:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.3   0.9.2-amzn      hadoop  2013-04-20 19:55:25     2013-04-20 19:57:21     GROUP_BY,FILTER

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201304201942_0001   c,logs_base,raw_logs,grouped,no_nulls   GROUP_BY,COMBINER       Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,

The log has a bit more info:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
    ... 19 more

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias c. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.PigServer.openIterator(PigServer.java:890)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:500)
    at org.apache.pig.Main.main(Main.java:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
    at org.apache.pig.PigServer.storeEx(PigServer.java:995)
    at org.apache.pig.PigServer.store(PigServer.java:962)
    at org.apache.pig.PigServer.openIterator(PigServer.java:875)

My guess is that this happens because grouped doesn't contain logs_base; it contains no_nulls. Try making it

c = FOREACH grouped GENERATE AVG(no_nulls.temp);

and see if that fixes it.

If that doesn't work, try adding DUMP raw_logs; after the first line and commenting everything else out, to make sure the output looks good. Then uncomment the second line, change the dump to DUMP logs_base;, and repeat for the rest of the lines. Sanity check each piece of your Pig script.
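For the ClassCastException shown in the question's edit (java.lang.String cannot be cast to java.lang.Float), one plausible reading is that REGEX_EXTRACT_ALL hands back strings and that, on this Pig version, the types in the AS clause only label the fields without converting them. The sketch below is an untested variant of the question's script under that assumption: it declares the extracted fields as chararray, casts them explicitly, and checks each step with DESCRIBE or DUMP as suggested above. The aliases fields, matched, typed and monthly_avg and the column name avg_temp are made up for illustration.

```pig
-- Sketch only: assumes the same 'data' file and regex as in the question.
raw_logs = LOAD 'data' AS (line:chararray);
-- DUMP raw_logs;   -- step 1: confirm the lines load as expected

-- Declare the captured groups as chararray, since a regex yields strings.
fields = FOREACH raw_logs GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
    ) AS (stn:chararray, wban:chararray, year:chararray,
          month:chararray, day:chararray, temp:chararray);
-- DUMP fields;     -- step 2: the header line should show up without values

-- Drop lines the regex did not match (the header) before casting.
matched = FILTER fields BY stn IS NOT NULL;

-- Cast explicitly instead of relying on the AS clause to convert.
typed = FOREACH matched GENERATE
    (int)stn    AS stn,
    (int)wban   AS wban,
    (int)year   AS year,
    (int)month  AS month,
    (int)day    AS day,
    (float)temp AS temp;
DESCRIBE typed;     -- step 3: temp should now be reported as float

grouped = GROUP typed BY (stn, wban, year, month);
DESCRIBE grouped;   -- step 4: the bag inside grouped is named typed

-- The bag is referenced by the alias that was grouped, here typed.
monthly_avg = FOREACH grouped GENERATE FLATTEN(group), AVG(typed.temp) AS avg_temp;
DUMP monthly_avg;
```

If DESCRIBE typed already reports float for temp and the job still fails, the cause is probably elsewhere, and dumping the intermediate aliases one at a time should narrow it down.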

