hadoop - How to run a large Mahout fuzzy k-means clustering without running out of memory?
I'm running Mahout 0.7 fuzzy k-means clustering on Amazon's EMR (AMI 2.3.1) and I'm running out of memory.

My overall question: how do I get this working most easily?

Here is the invocation:
./bin/mahout fkmeans \
  --input s3://.../foo/vectors.seq \
  --output s3://.../foo/fuzzyk2 \
  --numClusters 128 \
  --clusters s3://.../foo/initial_clusters/ \
  --maxIter 20 \
  --m 2 \
  --method mapreduce \
  --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure
More detailed questions:

How can I tell how much memory I'm using? I'm on c1.xlarge instances. If I believe the AWS docs, that sets mapred.child.java.opts=-Xmx512m. (One way to check is sketched below.)

How can I tell how much memory I need? I can try different sizes, but that gives me no idea of the size of problem I can handle.

How do I change my memory usage? Start a different workflow with a different class of machine? Try setting mapred.child.java.opts?

My dataset does not seem that large. Is it?

vectors.seq is a collection of 50225 sparse vectors (50225 things related to 124420 others), with a total of 1.2M relationships.

This post says to set --method mapreduce, which I am, and which is the default anyway.

This post says all the clusters are held in memory on every mapper and reducer. That would be 4 * 124420 = ~498k things, which doesn't seem too bad.
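As a quick check of what the task JVMs are actually given (my own approach, assuming SSH access to a core node while the job runs; this is not from the AWS docs):

  # list running task JVMs and the heap (-Xmx) each was launched with
  ps aux | grep '[o]rg.apache.hadoop.mapred.Child'

The -Xmx that appears on each child's command line is the heap the task really has; the same value should show up in the job's job.xml in the JobTracker web UI.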
Here's the stack:
13/04/19 18:12:53 INFO mapred.JobClient: Job complete: job_201304161435_7034
13/04/19 18:12:53 INFO mapred.JobClient: Counters: 7
13/04/19 18:12:53 INFO mapred.JobClient:   Job Counters
13/04/19 18:12:53 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=28482
13/04/19 18:12:53 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/19 18:12:53 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/19 18:12:53 INFO mapred.JobClient:     Rack-local map tasks=4
13/04/19 18:12:53 INFO mapred.JobClient:     Launched map tasks=4
13/04/19 18:12:53 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/19 18:12:53 INFO mapred.JobClient:     Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing s3://.../foo/fuzzyk2/clusters-1
	at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
	at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:288)
	at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:221)
	at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:110)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.main(FuzzyKMeansDriver.java:52)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
And here's part of the log of the mapper:
2013-04-19 18:10:38,734 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Received IOException while reading '.../foo/vectors.seq', attempting to reopen.
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
	at com.sun.net.ssl.internal.ssl.InputRecord.readV3Record(InputRecord.java:405)
	at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:360)
	at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:798)
	at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:755)
	at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:187)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:164)
	at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
	at java.io.FilterInputStream.read(FilterInputStream.java:116)
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:291)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at java.io.DataInputStream.readFully(DataInputStream.java:178)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2060)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2194)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:540)
	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
2013-04-19 18:10:38,737 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key '.../foo/vectors.seq' seeking to position '62584'
2013-04-19 18:10:42,619 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-04-19 18:10:42,730 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-04-19 18:10:42,730 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation
2013-04-19 18:10:42,733 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
	at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
	at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
	at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:560)
	at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:253)
	at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:241)
	at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:37)
	at org.apache.mahout.clustering.classify.ClusterClassifier.train(ClusterClassifier.java:158)
	at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:55)
	at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
Yes, you're running out of memory. As far as I know, that "memory intensive workload" bootstrap action is long since deprecated, so it may do nothing. See the note on that page.
A c1.xlarge should use 384MB per mapper by default. When you subtract out JVM overhead, room for splits and combining, etc., you don't have a whole lot left.
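To put rough numbers on it (my own back-of-the-envelope, assuming the per-cluster running-sum vectors densify toward the full 124420-dimension cardinality as 8-byte doubles):

  128 clusters * 124420 dimensions * 8 bytes/double ≈ 127MB per set of sum vectors

Each cluster keeps two such running sums (S1 and S2) while training, so that alone can approach the whole default heap. Note that your OutOfMemoryError is thrown from AbstractCluster.observe filling exactly those per-cluster maps.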
You set Hadoop params in a bootstrap action. Choose the "Configure Hadoop" action instead if you're using the console, and set --site-key-value mapred.map.child.java.opts=-Xmx1g
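From the command line it would look roughly like this (a sketch against the old Ruby elastic-mapreduce client; the instance count here is a placeholder, not from your question):

  elastic-mapreduce --create --alive \
    --instance-type c1.xlarge \
    --num-instances 4 \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "--site-key-value,mapred.map.child.java.opts=-Xmx1g"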
(If you're doing this programmatically and having trouble, contact me offline; I can provide snippets from Myrrix, since it heavily tunes EMR clusters for speed in its recommend/clustering jobs.)
You can set mapred.map.child.java.opts instead to control mappers separately from reducers. You can also turn down the number of mappers per machine to make more room, or choose a high-memory instance. I find m1.xlarge optimal on EMR given its price-to-I/O ratio, and because most jobs end up being I/O-bound.
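Turning down mappers per machine works through the same "Configure Hadoop" mechanism; a hedged example (4 slots is an illustrative value, not a recommendation, and the right number depends on the instance type):

  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--site-key-value,mapred.tasktracker.map.tasks.maximum=4"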