hadoop - How to run large Mahout fuzzy k-means clustering without running out of memory?


I'm running Mahout 0.7 fuzzy k-means clustering on Amazon's EMR (AMI 2.3.1), and I'm running out of memory.

  • My overall question: how do I get this working most easily?

Here is my invocation:

./bin/mahout fkmeans \
  --input s3://.../foo/vectors.seq \
  --output s3://.../foo/fuzzyk2 \
  --numClusters 128 \
  --clusters s3://.../foo/initial_clusters/ \
  --maxIter 20 \
  --m 2 \
  --method mapreduce \
  --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure

More detailed questions:

  • How can I tell how much memory I'm using? (See the sketch after this list.) I'm on c1.xlarge instances. If I believe the AWS docs, that sets mapred.child.java.opts=-Xmx512m.

  • How can I tell how much memory I need? I can try different sizes, but that gives me no idea of the size of problem I can handle.

  • How do I change my memory usage? Start a different workflow with a different class of machine? Try setting mapred.child.java.opts?

  • My dataset doesn't seem that large. Is it?
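A quick way to answer the first question on a live cluster, as a minimal sketch assuming the stock EMR AMI layout (/home/hadoop/conf is where these AMIs keep the Hadoop config, and the task JVMs run the org.apache.hadoop.mapred.Child class visible in the stack traces below):

# On the master node: the heap configured for task JVMs
grep -A 1 'mapred.child.java.opts' /home/hadoop/conf/mapred-site.xml

# On a node running tasks: the -Xmx actually passed to the child JVMs
ps aux | grep '[C]hild'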

vectors.seq is a collection of 50,225 sparse vectors (50,225 things related to 124,420 others), a total of 1.2M relationships.

This post says to set --method mapreduce, which I am doing, and which is the default anyway.

This post says all the clusters are held in memory on every mapper and reducer. That would be 4 * 124,420 = ~498K things, which doesn't seem that bad.

Here's the stack:

13/04/19 18:12:53 INFO mapred.JobClient: Job complete: job_201304161435_7034
13/04/19 18:12:53 INFO mapred.JobClient: Counters: 7
13/04/19 18:12:53 INFO mapred.JobClient:   Job Counters
13/04/19 18:12:53 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=28482
13/04/19 18:12:53 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/19 18:12:53 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/19 18:12:53 INFO mapred.JobClient:     Rack-local map tasks=4
13/04/19 18:12:53 INFO mapred.JobClient:     Launched map tasks=4
13/04/19 18:12:53 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/19 18:12:53 INFO mapred.JobClient:     Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing s3://.../foo/fuzzyk2/clusters-1
        at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:288)
        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:221)
        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:110)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.main(FuzzyKMeansDriver.java:52)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

And here's part of the log of the mapper:

2013-04-19 18:10:38,734 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Received IOException while reading '.../foo/vectors.seq', attempting to reopen.
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
        at com.sun.net.ssl.internal.ssl.InputRecord.readV3Record(InputRecord.java:405)
        at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:360)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:798)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:755)
        at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:187)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:164)
        at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:291)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2060)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2194)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:540)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
2013-04-19 18:10:38,737 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key '.../foo/vectors.seq' seeking to position '62584'
2013-04-19 18:10:42,619 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-04-19 18:10:42,730 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-04-19 18:10:42,730 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation
2013-04-19 18:10:42,733 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
        at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
        at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:560)
        at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:253)
        at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:241)
        at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:37)
        at org.apache.mahout.clustering.classify.ClusterClassifier.train(ClusterClassifier.java:158)
        at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:55)
        at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

Yes, you're running out of memory. As far as I know, the "memory intensive workload" bootstrap action is long since deprecated, so it may do nothing. See the note on that page.

A c1.xlarge should use 384MB per mapper by default. When you subtract out the JVM overhead, room for splits and combining, etc., you don't have a whole lot left.
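For a rough sense of scale (my own back-of-envelope, not from the original answer): the OutOfMemoryError above is thrown while updating cluster centroids in AbstractCluster.observe, and fuzzy k-means updates every centroid with every input vector, so all 128 centroids can drift toward dense across the full 124,420 dimensions:

# Illustrative estimate of centroid storage alone, assuming roughly
# 12 bytes per int->double entry in OpenIntDoubleHashMap:
echo $((128 * 124420 * 12))   # 191109120 bytes, about 180MB

That is a large fraction of even a 512MB heap, before counting the input split, Hadoop's buffers, and JVM overhead.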

You can set Hadoop params in a bootstrap action. Choose the "Configure Hadoop" action instead if you're using the console, and set --site-key-value mapred.child.java.opts=-Xmx1g
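With the old elastic-mapreduce Ruby CLI, that looks roughly like this sketch (the configure-hadoop script path is the standard one in the public elasticmapreduce S3 bucket; the instance count and heap size here are placeholders):

elastic-mapreduce --create --alive \
  --num-instances 4 --instance-type c1.xlarge \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--site-key-value,mapred.child.java.opts=-Xmx1g"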

(If you're doing this programmatically and having trouble, contact me offline; I can provide snippets from Myrrix, since it heavily tunes EMR clusters for speed in its recommend/clustering jobs.)

You can set mapred.map.child.java.opts instead to control mappers separately from reducers. You can also turn down the number of mappers per machine to make more room, or choose a high-memory instance. I find m1.xlarge optimal on EMR given its price-to-I/O ratio, and because most jobs end up being I/O-bound.
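Both of those knobs go through the same configure-hadoop bootstrap action. A sketch, assuming its -m flag writes keys into mapred-site.xml as on the stock AMIs (the heap size and slot count are arbitrary examples):

# Bigger heap for mappers only, plus fewer map slots per node so the
# larger JVMs still fit in RAM:
elastic-mapreduce --create --alive \
  --num-instances 4 --instance-type m1.xlarge \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.map.child.java.opts=-Xmx2g,-m,mapred.tasktracker.map.tasks.maximum=4"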

