c++ - Why is iterating a list of objects slower than iterating a list of object pointers? -


after reading blog post how unfriendly list cache: http://www.baptiste-wicht.com/2012/11/cpp-benchmark-vector-vs-list/

... tried make std::list of pointers objects more cache friendly putting actual object each node (thereby removing 1 indirection operation) in hope when current node cached, object too. however, performance decreased. here's code used:

source , binaries: http://wilcobrouwer.nl/bestanden/listtest%202013-8-15%20%233.7z

#include <list> using std::list;  list<object*> case1; list<object> case2;  class object {     public:         object(char i);         ~object();          char dump[256]; };  // should not notice of difference here, equal amounts of memory  // allocated void insertion(test* test) {      // create object, copy pointer     float start1 = clock->gettimesec();     for(int = 0;i < test->size;i++) {         case1.push_back(new object(i));      }     test->insertion1 = clock->gettimesec()-start1;      // create object in place, no temps on stack     float start2 = clock->gettimesec();     for(int = 0;i < test->size;i++) {         case2.emplace_back(i);      }     test->insertion2 = clock->gettimesec()-start2; }  // case 2 removes 1 layer of derefence, should more cache  // friendly, because when list node found in cache, object should // there void iteration(test* test) {      // faster case2 reason     float start1 = clock->gettimesec();     int tmp1 = 0;     for(list<object*>::iterator = case1.begin();i != case1.end();i++) {         tmp1 += (**i).dump[128];      }     test->iteration1 = clock->gettimesec()-start1;      // why hell slower? removed dereference     float start2 = clock->gettimesec();     int tmp2 = 0;     for(list<object>::iterator = case2.begin();i != case2.end();i++) {         tmp2 += (*i).dump[128]; // equal tmp1, no mistakes...     }     test->iteration2 = clock->gettimesec()-start2; }  // case 2 removes 1 layer of derefence, should more cache  // friendly, because when list node found in cache, object should // there void deletion(test* test) {      // again, faster case2 reason     float start1 = clock->gettimesec();     int size1 = case1.size();     for(list<object*>::iterator = case1.begin();i != case1.end();i++) {         delete *i;     }     case1.clear();     test->deletion1 = clock->gettimesec()-start1;      // before: why slower? removed dereference     float start2 = clock->gettimesec();     int size2 = case2.size();     case2.clear();     test->deletion2 = clock->gettimesec()-start2; } 

these functions run test->size values linearly varying 1 100000, , time differences between clock->gettimesec() saved disk after calculations finished. plot of results can found here:

http://wilcobrouwer.nl/bestanden/listtestfix.png
as can see, case 2 10% faster @ inserting , deleting, 10% slower @ iterating, means dereference needed iterating case 1 makes faster!

what missing here?

edit 1: cpu phenom ii x4 @ 3.5ghz (constant frequency) 64k/1mb/6mb of cache, , i'm compiling way (please note -m64 implied, implies ban on x87 via -mfpmath=ssse):

compiler: tdm-gcc 4.7.1 64-bit release rm -f obj/clock.o obj/main.o obj/object.o listtest.exe g++.exe -c clock.cpp -o obj/clock.o -std=gnu++11 g++.exe -c main.cpp -o obj/main.o -std=gnu++11 g++.exe -c objecst.cpp -o obj/object.o -std=gnu++11 g++.exe obj/clock.o obj/main.o obj/object.o -o listtest.exe -static-libgcc 

edit 2: answer dale wilson: list mean std::list. answer mats petersson: summary has been added picture. optimization checks under way. answer guy asked bigger data sets: sorry, i've got 4gib of ram, , plots current maximum filling quite boring.

edit 3: i've enabled -o3 (-o2 produces similar results), made stuff worse:

http://wilcobrouwer.nl/bestanden/listtesto3fix.png
this time, case 2 20% faster @ inserting , deleting, time 1~5 times slower @ iteration (gets worse @ higher test sizes). same conclusion.

edit 4: answer maxim yegorushkin: cpu frequency scaling happens disabled (forgot mention), cpu runs @ 3.5ghz. also, picking averages or best results out of more tests done too, because there more enough sample points on x axis. optimization enabled too: -o3, -m64 , mfpmath=sse set. adding identical tests after each other std::vector tests (check source) did not change significant.

edit 5: fixed few typos (deletion results not shown, iteration results shown twice. has cleared deletion problem, iteration problem remains.

a bit off-topic such benchmarking methodology not yield correct , repeatable results because ignores cache effects, cpu frequency scaling , process scheduler.

to measure times correctly needs run each micro-benchmark (i.e. each , every loop) few times (say @ least 3) , pick best time. best time best possible time achievable when cpu cache, tlb , branch predictor hot. need best times because worst times have no upper bound, can not meaningfully compared.

when benchmarking need disable cpu frequency scaling, not switch frequencies in middle of benchmark. ought run real-time priority reduce scheduling noise resulting other processes pre-empting benchmark.

and don't forget compile optimization.

next, lets review benchmarks:

  • insertion: measures time of 2 memory allocations (list<object*>) vs. 1 memory allocation (list<object>).
  • deletion: same above, replace allocation deallocation.
  • iteration: object size 256 bytes, 4x64-byte cache lines. such object size big compared list node size, measuring time of cache misses when reads byte 256-byte object.

what want measure iteration of list vs. iteration on array while reading bytes of object (e.g. sum bytes of object). hypothesis when objects laid out in array , accessed sequentially cpu pre-loads next object cache when access don't incur cache miss. whereas when objects stored in list nodes not contiguous in memory cache read-ahead not improve speed because next object not adjacent in memory current one, when chases pointer of list incurs cache miss.


Comments

Popular posts from this blog

c# - Send Image in Json : 400 Bad request -

jquery - Fancybox - apply a function to several elements -

An easy way to program an Android keyboard layout app -