Today I needed to measure the memory bandwidth of four architectures.
They are nodes of a cluster with the following features:
Dell PowerEdge M600:
– 2 sockets
– 2x Intel Xeon E5410, 4c (Harpertown), UMA with front-side bus
– 15GiB FB-DIMM DDR2 667MHz
Dell PowerEdge M605
– 2 sockets
– 2x AMD Opteron 2356, 4c (Barcelona), NUMA
– 32 GiB DDR2 800MHz
Dell PowerEdge M610
– 2 sockets
– 2x Intel Xeon E5645, 6c (Westmere-EP), NUMA
– 32 GiB DDR3 1333MHz
– 4 sockets
– 4x AMD Opteron 6272, 12c (Interlagos), NUMA
– 256GB (n/d)
Here is how I did it:
- Download the STREAM benchmark source code, stream.c, version 1.5.0.
- Determine how much cache the processors have. Check the specs or run “lstopo” to see the architecture. Add up the sizes of the highest-level (last-level) caches across all sockets. We sum them because we want to use all the available cores at once, which is also why we build with OpenMP.
- Calculate the minimum size of the arrays in stream.c:
As stated inside stream.c:
(a) Each array must be at least 4 times the size of the available cache memory.
If we look inside stream.c, we see that the elements of arrays a, b and c are doubles. On these machines, with 64-bit Linux and the standard GCC of RHEL 6.x (or any RHEL-based distro), a double is 8 bytes.
STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET]; /*we leave offset=0*/
- These are the resulting array sizes:
AMD2356: 2048KiB * 2 = 4096KiB = 4194304B
So the array should be 4194304*4 = 16777216B = 16384KiB = 16MiB
E5645: 2 * 12MiB = 24576KiB = 25165824B
So the array should be 25165824*4 = 100663296B = 98304KiB = 96MiB
E5410: 6144KiB * 4 = 24576KiB = 25165824B
So the array should be 25165824*4 = 100663296B = 98304KiB = 96MiB
OPT6272: 64MiB = 65536KiB = 67108864B
So the array should be 67108864*4 = 268435456B = 262144KiB = 256MiB
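The arithmetic above can be sketched as a couple of small shell functions. This is only an illustration: the function names are made up for this post and are not part of stream.c. Since stream.c declares its arrays as STREAM_ARRAY_SIZE elements, the sketch also prints the element count that corresponds to the byte figure, assuming 8-byte doubles. Passing the byte figure itself as the element count, as done in this post, gives larger arrays than required and still satisfies rule (a).

```shell
#!/bin/sh
# Hypothetical helpers (not part of stream.c) for the "4x cache" rule.

# min_array_bytes TOTAL_CACHE_BYTES -> minimum size of ONE array, in bytes
min_array_bytes() {
    echo $(( $1 * 4 ))
}

# min_array_elements TOTAL_CACHE_BYTES [ELEMENT_BYTES] -> minimum element count
# ELEMENT_BYTES defaults to 8, the size of a double on these machines.
min_array_elements() {
    echo $(( $1 * 4 / ${2:-8} ))
}

# Example: the E5645 node, 2 sockets x 12 MiB of last-level cache.
cache=$(( 2 * 12 * 1024 * 1024 ))
echo "cache bytes:    $cache"
echo "array bytes:    $(min_array_bytes "$cache")"
echo "array elements: $(min_array_elements "$cache")"
```

For the E5645 this gives 100663296 bytes per array, matching the 96MiB figure above.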
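- Now we can either use the largest value for all the tests or compile one binary per architecture. Note that STREAM_ARRAY_SIZE counts elements, not bytes, so passing the byte figure makes each array 8 times larger than the minimum, which still satisfies rule (a). To keep things simple and avoid modifying the stream.c source code, pass the size as a preprocessor definition on the compile line: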
gcc -O -fopenmp -DSTREAM_ARRAY_SIZE=100663296 stream.c -o stream.96M
But be careful: when compiling with, for example, 256M, you may get the following error:
relocation truncated to fit: R_X86_64_32 against `.bss'
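This happens because the statically allocated arrays no longer fit in the 2GB that the default (small) x86-64 code model can address, so the linker (ld) cannot fit their addresses into 32-bit relocations. To work around this, compile with -mcmodel=medium, and make sure the machine has enough memory to hold the arrays: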
gcc -O -fopenmp -mcmodel=medium -DSTREAM_ARRAY_SIZE=268435456 stream.c -o stream.256M
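As a rough sketch, the decision to add -mcmodel=medium can be scripted. The function name and the 2GiB threshold on the three arrays alone are assumptions made for illustration; the real limit also depends on the rest of the program's code and data, so treat this as a heuristic. The script only prints the gcc command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: does the static footprint of the three STREAM arrays
# reach 2 GiB? Exit status 0 means "yes, use -mcmodel=medium".
# Usage: needs_medium_model ELEMENTS [ELEMENT_BYTES]   (default 8 bytes)
needs_medium_model() {
    total=$(( $1 * ${2:-8} * 3 ))      # arrays a, b and c
    [ "$total" -ge 2147483648 ]        # 2 GiB
}

size=268435456
flags="-O -fopenmp"
if needs_medium_model "$size"; then
    flags="$flags -mcmodel=medium"
fi

# Print the resulting compile line (not executed here).
echo gcc $flags -DSTREAM_ARRAY_SIZE=$size stream.c -o stream.256M
```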
See what this flag does at:
- Set OMP_NUM_THREADS=<number of cores> and run stream.
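A minimal sketch of this step, assuming the binary is called stream.96M as in the compile line above and sits in the current directory. It uses getconf _NPROCESSORS_ONLN to count online CPUs; note that this counts logical CPUs, so on a node with Hyper-Threading enabled you may prefer to set the number of physical cores by hand.

```shell
#!/bin/sh
# Use one OpenMP thread per online (logical) CPU.
cores=$(getconf _NPROCESSORS_ONLN)
export OMP_NUM_THREADS="$cores"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

# Run the benchmark if it has been built; the binary name is an assumption.
if [ -x ./stream.96M ]; then
    ./stream.96M
else
    echo "build ./stream.96M first"
fi
```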
- Grab the results:
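One possible way to pull the numbers out of the output is sketched below. It assumes the usual stream.c summary, where each kernel is reported on a line starting with "Copy:", "Scale:", "Add:" or "Triad:" followed by the best rate in MB/s. The function name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read STREAM output on stdin and print
# "<kernel> <best rate in MB/s>" for each of the four kernels.
stream_best_rates() {
    awk '$1 ~ /^(Copy|Scale|Add|Triad):$/ { sub(":", "", $1); print $1, $2 }'
}

# Usage, assuming the binary built earlier exists:
if [ -x ./stream.96M ]; then
    ./stream.96M | stream_best_rates
fi
```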
See more at: http://www.streambench.org/