multithreading - Haskell: sub-optimal parallel GC work balance, no speedup in parallel execution
The description of my problem is practically the same as in this post, and although I think I can understand the corresponding solution, I cannot see how to apply it to my problem, if at all.
Here is an example program:
    {-# LANGUAGE BangPatterns #-}
    import System.Random (randoms, mkStdGen)
    import Control.Parallel.Strategies
    import Control.DeepSeq (NFData)
    import Data.List

    data Point = Point !Double !Double

    -- Floating-point modulus that maps a into [0, b).
    fmod :: Double -> Double -> Double
    fmod a b | a < 0     = b - fmod (abs a) b
             | otherwise = if a < b then a
                           else let q = a / b
                                in b * (q - fromIntegral (floor q :: Int))

    standardMap :: Double -> Point -> Point
    standardMap k (Point q p) =
      Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))

    -- Like iterate, but forces each point before producing it.
    iterate' gen !p = p : (iterate' gen $ gen p)

    -- Sample the orbit of p after each block of dn iterations.
    iterateN :: (Point -> Point) -> [Int] -> Point -> [Point]
    iterateN _ [] p = [p]
    iterateN gen (dn:dns) p = p : (iterateN gen dns $ (head . drop dn) $ iterate' gen p)

    ensemble :: [Point]
    ensemble = zipWith Point qs ps
      where qs = randoms (mkStdGen 42)
            ps = randoms (mkStdGen 21)

    main = let dns = take 100 $ repeat 10000
               ens = take 1000 ensemble
               obs = \(Point p q) -> p^2 - q^2
               work = map obs . (iterateN (standardMap 7.0) dns)
               ps = parMap rdeepseq work ens
           in putStrLn $ show (foldl' (+) 0 $ map (foldl' (+) 0) ps)

The problem is that the program does not scale with the number of threads. For example, on Debian 3.2.46-1 x86_64 with GHC 7.4.1 I get
    $ ghc -O3 --make stmap.hs -threaded

    $ time ./stmap +RTS -N1
    real    1m9.791s
    user    1m9.448s
    sys     0m0.208s

    $ time ./stmap +RTS -N2
    real    0m36.981s
    user    1m13.113s
    sys     0m0.656s

    $ time ./stmap +RTS -N4
    real    0m23.110s
    user    1m31.310s
    sys     0m0.792s

    $ time ./stmap +RTS -N8
    real    0m20.537s
    user    2m21.921s
    sys     0m21.017s

These numbers may fluctuate a lot. The one indicator I have found of what the problem might be is the suboptimal parallel GC work balance, for example:
    $ ./stmap +RTS -N8 -sstderr 1>/dev/null
     112,032,905,392 bytes allocated in the heap
          59,112,296 bytes copied during GC
             971,520 bytes maximum residency (35 sample(s))
              96,416 bytes maximum slop
                   8 MB total memory in use (1 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     27032 colls, 27031 par    6.49s    0.81s     0.0000s    0.0015s
      Gen  1        35 colls,    35 par    0.39s    0.05s     0.0014s    0.0028s

      Parallel GC work balance: 4.05 (6799831 / 1680927, ideal 8)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :   14.81s    ( 14.84s)       0.96s    (  0.97s)
      Task  1 (worker) :    0.00s    ( 15.81s)       0.00s    (  0.00s)
      Task  2 (bound)  :    0.03s    ( 15.80s)       0.01s    (  0.01s)
      Task  3 (worker) :   14.72s    ( 14.82s)       0.98s    (  0.99s)
      Task  4 (worker) :   14.70s    ( 14.84s)       0.96s    (  0.97s)
      Task  5 (worker) :   14.69s    ( 14.82s)       0.98s    (  0.99s)
      Task  6 (worker) :   14.69s    ( 14.82s)       0.98s    (  0.99s)
      Task  7 (worker) :   14.72s    ( 14.81s)       0.99s    (  1.00s)
      Task  8 (worker) :   14.76s    ( 14.83s)       0.97s    (  0.98s)
      Task  9 (worker) :   14.76s    ( 14.81s)       1.00s    (  1.00s)

      SPARKS: 1000 (1000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time  118.87s  ( 14.95s elapsed)
      GC      time    6.87s  (  0.86s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time  125.74s  ( 15.81s elapsed)

      Alloc rate    942,488,358 bytes per MUT second

      Productivity  94.5% of total user, 751.8% of total elapsed

    gc_alloc_block_sync: 1130880
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 175

where the parallel GC work balance is ~4 (ideal 8). In the next run it is worse, ~2:
    $ ./stmap +RTS -N8 -sstderr
    60364.38698300099
     112,033,885,088 bytes allocated in the heap
       4,626,963,592 bytes copied during GC
           2,101,264 bytes maximum residency (1846 sample(s))
             652,528 bytes maximum slop
                  13 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     25497 colls, 25496 par   29.42s    3.70s     0.0001s    0.0022s
      Gen  1      1846 colls,  1846 par   17.97s    2.26s     0.0012s    0.0071s

      Parallel GC work balance: 2.00 (577773617 / 288947149, ideal 8)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :   14.86s    ( 15.03s)       6.07s    (  6.10s)
      Task  1 (worker) :    0.00s    ( 21.13s)       0.00s    (  0.00s)
      Task  2 (bound)  :    0.03s    ( 21.11s)       0.02s    (  0.02s)
      Task  3 (worker) :   14.92s    ( 14.99s)       6.06s    (  6.14s)
      Task  4 (worker) :   14.88s    ( 15.02s)       6.07s    (  6.11s)
      Task  5 (worker) :   14.91s    ( 15.02s)       6.09s    (  6.12s)
      Task  6 (worker) :   14.92s    ( 15.04s)       6.07s    (  6.10s)
      Task  7 (worker) :   14.86s    ( 15.03s)       6.03s    (  6.11s)
      Task  8 (worker) :   14.86s    ( 15.03s)       6.07s    (  6.10s)
      Task  9 (worker) :   14.92s    ( 15.00s)       6.11s    (  6.13s)

      SPARKS: 1000 (1000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time  120.36s  ( 15.18s elapsed)
      GC      time   47.39s  (  5.96s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time  167.75s  ( 21.13s elapsed)

      Alloc rate    930,821,901 bytes per MUT second

      Productivity  71.7% of total user, 569.5% of total elapsed

    gc_alloc_block_sync: 1253157
    whitehole_spin: 21
    gen[0].sync: 4
    gen[1].sync: 19789

What is responsible for these fluctuations in execution time? And, more importantly, how can one improve the parallel GC work balance in this concrete example and in general?
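To put a number on "does not scale", the wall-clock times reported by `time` above can be converted into speedup and parallel efficiency. The sketch below is only an illustration: the names `timings`, `speedup`, `efficiency` and `report` are made up for this example, and the input values are the `real` times from the four runs, converted to seconds. It gives speedups of roughly 1.9, 3.0 and 3.4 for `-N2`, `-N4` and `-N8`, so efficiency falls from about 94% at two capabilities to about 42% at eight.

```haskell
import Text.Printf (printf)

-- Wall-clock ("real") times in seconds from the runs above, keyed by the -N value.
timings :: [(Int, Double)]
timings = [(1, 69.791), (2, 36.981), (4, 23.110), (8, 20.537)]

-- Speedup of a run that took tn seconds relative to the baseline t1.
speedup :: Double -> Double -> Double
speedup t1 tn = t1 / tn

-- Parallel efficiency: speedup divided by the number of capabilities.
efficiency :: Double -> Int -> Double -> Double
efficiency t1 n tn = speedup t1 tn / fromIntegral n

-- Print one line per run.
report :: Double -> (Int, Double) -> IO ()
report t1 (n, tn) =
  printf "-N%d: speedup %.2f, efficiency %.2f\n" n (speedup t1 tn) (efficiency t1 n tn)

main :: IO ()
main = do
  let t1 = snd (head timings)  -- the -N1 run is the baseline
  mapM_ (report t1) timings
```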
The variation is due to the fact that using +RTS -Nn leads to the creation of one bound thread and n worker threads (cf. the output in the question); hence one worker will share a physical core with the bound thread, and the two interfere. It is therefore recommended to pass +RTS -N a number lower than the total number of available physical cores.
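The same advice can also be applied from inside the program instead of on the command line. The following is a minimal sketch, not a definitive recipe: `workerCapabilities` is a hypothetical helper, and "processors minus one" is just one way of staying below the core count. Note that `getNumProcessors` reports the processors visible to the operating system, which may include hyperthreads, so the result is not necessarily the number of physical cores. The program must be linked with `-threaded` for `setNumCapabilities` to have any effect.

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors, setNumCapabilities)

-- Hypothetical policy: leave one processor free for the bound (main)
-- thread, but never request fewer than one capability.
workerCapabilities :: Int -> Int
workerCapabilities procs = max 1 (procs - 1)

main :: IO ()
main = do
  procs <- getNumProcessors          -- logical processors, not physical cores
  setNumCapabilities (workerCapabilities procs)
  caps <- getNumCapabilities
  putStrLn $ "processors: " ++ show procs ++ ", capabilities: " ++ show caps
```

Setting the value on the command line, for example `+RTS -N7` on an eight-core machine, achieves the same thing without changing the code.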
Another potential issue is load balancing: you may need to split the work differently if there is a load imbalance (a ThreadScope profile would help here). Have a look at this paper for more details on tuning.
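One way of splitting the work differently is to change the spark granularity, for instance by evaluating the list in chunks with `parListChunk` from `Control.Parallel.Strategies` rather than creating one spark per element as `parMap` does. The sketch below is an assumption-laden illustration: `work` is a hypothetical stand-in for the per-point computation in the question, `chunkedTotals` is a made-up name, and the chunk size of 25 is arbitrary. Whether chunking actually improves the balance for this program would have to be checked with a profile.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
import Data.List (foldl')

-- Hypothetical stand-in for the per-point `work` function of the question.
work :: Int -> [Double]
work n = [fromIntegral (n * k) | k <- [1 .. 100]]

-- Sum the result of `work` for every input, evaluating the list in
-- parallel with one spark per chunk of `chunkSize` elements.
chunkedTotals :: Int -> [Int] -> [Double]
chunkedTotals chunkSize xs =
  map (foldl' (+) 0 . work) xs `using` parListChunk chunkSize rdeepseq

main :: IO ()
main = print (foldl' (+) 0 (chunkedTotals 25 [1 .. 1000]))
```

Larger chunks mean fewer sparks and less scheduling overhead, while smaller chunks give the runtime more freedom to even out the load, so the chunk size is a parameter to experiment with.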