multithreading - Haskell: sub-optimal parallel GC work balance, no speedup in parallel execution


The description of the problem is practically the same as in this post, and although I think I understand the corresponding solution, I cannot see how to apply it to my problem, if at all.

Here is an example program:

    {-# LANGUAGE BangPatterns #-}

    import System.Random (randoms, mkStdGen)
    import Control.Parallel.Strategies
    import Control.DeepSeq (NFData)
    import Data.List

    data Point = Point !Double !Double

    -- Floating-point modulus, mapping a into [0, b)
    fmod :: Double -> Double -> Double
    fmod a b | a < 0     = b - fmod (abs a) b
             | otherwise = if a < b then a
                           else let q = a / b
                                in b * (q - fromIntegral (floor q :: Int))

    -- One step of the standard map with parameter k
    standardMap :: Double -> Point -> Point
    standardMap k (Point q p) =
        Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))

    -- Like iterate, but strict in the current point
    iterate' gen !p = p : (iterate' gen $ gen p)

    -- Sample the trajectory after each of the given numbers of steps
    iterateN :: (Point -> Point) -> [Int] -> Point -> [Point]
    iterateN _ [] p = [p]
    iterateN gen (dn:dns) p =
        p : (iterateN gen dns $ (head . drop dn) $ iterate' gen p)

    ensemble :: [Point]
    ensemble = zipWith Point qs ps
        where qs = randoms (mkStdGen 42)
              ps = randoms (mkStdGen 21)

    main = let dns = take 100 $ repeat 10000
               ens = take 1000 ensemble
               obs = \(Point p q) -> p^2 - q^2
               work = map obs . (iterateN (standardMap 7.0) dns)
               ps = parMap rdeepseq work ens
           in putStrLn $ show (foldl' (+) 0 $ map (foldl' (+) 0) ps)

The problem is that the program does not scale with the number of threads. For example, on Debian 3.2.46-1 x86_64 with GHC 7.4.1 I get:

    $ ghc -O3 --make stmap.hs -threaded

    $ time ./stmap +RTS -N1
    real    1m9.791s
    user    1m9.448s
    sys     0m0.208s

    $ time ./stmap +RTS -N2
    real    0m36.981s
    user    1m13.113s
    sys     0m0.656s

    $ time ./stmap +RTS -N4
    real    0m23.110s
    user    1m31.310s
    sys     0m0.792s

    $ time ./stmap +RTS -N8
    real    0m20.537s
    user    2m21.921s
    sys     0m21.017s

These numbers may fluctuate a lot. The only indicator of the problem I have found might be the suboptimal parallel GC work balance, for example:

    $ ./stmap +RTS -N8 -sstderr 1>/dev/null
     112,032,905,392 bytes allocated in the heap
          59,112,296 bytes copied during GC
             971,520 bytes maximum residency (35 sample(s))
              96,416 bytes maximum slop
                   8 MB total memory in use (1 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     27032 colls, 27031 par    6.49s    0.81s     0.0000s    0.0015s
      Gen  1        35 colls,    35 par    0.39s    0.05s     0.0014s    0.0028s

      Parallel GC work balance: 4.05 (6799831 / 1680927, ideal 8)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :   14.81s    ( 14.84s)       0.96s    (  0.97s)
      Task  1 (worker) :    0.00s    ( 15.81s)       0.00s    (  0.00s)
      Task  2 (bound)  :    0.03s    ( 15.80s)       0.01s    (  0.01s)
      Task  3 (worker) :   14.72s    ( 14.82s)       0.98s    (  0.99s)
      Task  4 (worker) :   14.70s    ( 14.84s)       0.96s    (  0.97s)
      Task  5 (worker) :   14.69s    ( 14.82s)       0.98s    (  0.99s)
      Task  6 (worker) :   14.69s    ( 14.82s)       0.98s    (  0.99s)
      Task  7 (worker) :   14.72s    ( 14.81s)       0.99s    (  1.00s)
      Task  8 (worker) :   14.76s    ( 14.83s)       0.97s    (  0.98s)
      Task  9 (worker) :   14.76s    ( 14.81s)       1.00s    (  1.00s)

      SPARKS: 1000 (1000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time  118.87s  ( 14.95s elapsed)
      GC      time    6.87s  (  0.86s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time  125.74s  ( 15.81s elapsed)

      Alloc rate    942,488,358 bytes per MUT second

      Productivity  94.5% of total user, 751.8% of total elapsed

    gc_alloc_block_sync: 1130880
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 175

Here the work balance is ~4; in the next run it is worse, ~2:

    $ ./stmap +RTS -N8 -sstderr
    60364.38698300099
     112,033,885,088 bytes allocated in the heap
       4,626,963,592 bytes copied during GC
           2,101,264 bytes maximum residency (1846 sample(s))
             652,528 bytes maximum slop
                  13 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     25497 colls, 25496 par   29.42s    3.70s     0.0001s    0.0022s
      Gen  1      1846 colls,  1846 par   17.97s    2.26s     0.0012s    0.0071s

      Parallel GC work balance: 2.00 (577773617 / 288947149, ideal 8)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :   14.86s    ( 15.03s)       6.07s    (  6.10s)
      Task  1 (worker) :    0.00s    ( 21.13s)       0.00s    (  0.00s)
      Task  2 (bound)  :    0.03s    ( 21.11s)       0.02s    (  0.02s)
      Task  3 (worker) :   14.92s    ( 14.99s)       6.06s    (  6.14s)
      Task  4 (worker) :   14.88s    ( 15.02s)       6.07s    (  6.11s)
      Task  5 (worker) :   14.91s    ( 15.02s)       6.09s    (  6.12s)
      Task  6 (worker) :   14.92s    ( 15.04s)       6.07s    (  6.10s)
      Task  7 (worker) :   14.86s    ( 15.03s)       6.03s    (  6.11s)
      Task  8 (worker) :   14.86s    ( 15.03s)       6.07s    (  6.10s)
      Task  9 (worker) :   14.92s    ( 15.00s)       6.11s    (  6.13s)

      SPARKS: 1000 (1000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time  120.36s  ( 15.18s elapsed)
      GC      time   47.39s  (  5.96s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time  167.75s  ( 21.13s elapsed)

      Alloc rate    930,821,901 bytes per MUT second

      Productivity  71.7% of total user, 569.5% of total elapsed

    gc_alloc_block_sync: 1253157
    whitehole_spin: 21
    gen[0].sync: 4
    gen[1].sync: 19789

What is responsible for these fluctuations in execution time? And, more importantly, how can one improve the parallel GC work balance in this concrete example and in general?

The variation is due to the fact that running with +RTS -N<n> creates one bound thread and n worker threads (cf. the task list in the output above). One worker therefore shares a physical core with the bound thread, and the two interfere. For this reason it is recommended to pass +RTS -N a number lower than the total number of available physical cores.
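The same advice can also be applied from inside the program rather than on the command line. The following is only a minimal sketch, assuming a threaded build (compiled with -threaded) and a GHC whose base library provides getNumProcessors and setNumCapabilities. Leaving exactly one processor free is an illustrative choice, not a measured optimum, and getNumProcessors reports logical processors, which can exceed the number of physical cores when hyper-threading is enabled.

```haskell
import Control.Concurrent (getNumCapabilities, setNumCapabilities)
import GHC.Conc (getNumProcessors)

main :: IO ()
main = do
  -- Processors as reported by the OS (logical CPUs, not necessarily
  -- physical cores).
  procs <- getNumProcessors
  -- Assumption for illustration: keep one processor free, but never
  -- go below a single capability.
  let caps = max 1 (procs - 1)
  setNumCapabilities caps
  n <- getNumCapabilities
  putStrLn ("processors: " ++ show procs ++ ", capabilities: " ++ show n)
```

On a hyper-threaded machine it may be worth trying values down to half the reported processor count and comparing the elapsed times.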

Another potential issue is load balancing: you may need to split the work differently if the load is unbalanced (a ThreadScope profile can help to see this). Have a look at this paper for more details on tuning.
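One way to split the work differently is to change how many list elements go into each spark. The sketch below is not the program from the question: work and inputs are hypothetical stand-ins, and the chunk size of 25 is an arbitrary starting point for experiments. It uses parListChunk from the parallel package, which groups the list into chunks and evaluates each chunk as one spark.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
import Data.List (foldl')

-- Hypothetical stand-in for an expensive per-item computation.
work :: Int -> Double
work n = foldl' (+) 0 [sin (fromIntegral k) | k <- [1 .. n]]

main :: IO ()
main = do
  let inputs  = replicate 1000 20000 :: [Int]  -- hypothetical workload
      chunk   = 25                             -- items per spark (tuning knob)
      -- Evaluate the mapped list in parallel, one spark per chunk,
      -- forcing every element fully.
      results = map work inputs `using` parListChunk chunk rdeepseq
  print (foldl' (+) 0 results)
```

Smaller chunks give the scheduler more sparks to distribute, which can even out the load, while larger chunks reduce per-spark overhead. Which direction helps depends on the workload, so the effect is best checked with -s output or a ThreadScope profile.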

