#How to run the LINPACK benchmark on Linux#
Question (Holger):

I am using LINPACK from MKL under Linux, so most probably option 2) applies. What I do at the moment is the following (see output_linpack1.txt in the attachment):

    mpirun -machinefile $M_NAME -host r05n01 -env HPL_HOST_NODE=0,1 -np 1 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic : -host r05n01 -env HPL_HOST_NODE=2,3 -np 1 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

This delivers about 5.8 TFLOPs at the beginning of the LINPACK run.

    mpirun -machinefile $M_NAME -np 2 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

On the same machine the output looks the same regarding thread placement, I think, but you only get 2 TFLOPs.

    mpirun -machinefile $M_NAME -np $NUM_PROCS /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

Now it claims to start one thread per MPI process, but in fact (top says so) it starts 36 threads per process. I don't understand why these extra threads are spawned. Therefore I think that in the second example there are also more threads per process, and both MPI ranks are trying to use the whole machine. Unfortunately, to use this four-socket machine together with our two-socket nodes, I would like to be able to start two MPI processes.

Answer:

What is your exact test machine? It is a 4-socket Broadwell system, so 18 cores * 2 (HT) * 4 sockets = 144 threads, right? What does your top output look like?

1. MP_LINPACK does not use OpenMP threads, so you cannot use the OpenMP thread count to control the MKL threads.
2. Case 2 and case 1 should perform the same, but case 1 has clearer affinity. Judging from the CPU usage, case 1 uses less CPU yet gets good performance; the difference may be caused by different memory usage etc.
3. In the runme_intel64_dynamic script, MPI_PROC_NUM is the number of actual physical servers, which equals P*Q, and MPI_PER_NODE should be 1 whether you have a single-socket or a dual-socket system. If you put 2 on a dual-socket system, htop will show 40% memory usage while in fact 80% is in use, and there will be two controlling threads instead of one.

So for your one-node, 4-socket machine it should be simple to just run mpirun -np 1 (I get almost the same result if I use case 1). As for your two-node setup, a 2-socket Skylake plus a 4-socket Broadwell: I tried on my 2-socket Skylake (numactl --hardware), and here is the result, which almost reproduces your problem:

    MPI startup(): Multi-threaded optimized library
    MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
    MPI startup(): Rank Pid Node name Pin cpu

By default, HPL will use the whole resource, so creating two HPL processes makes them share most of the resources; that is the bad performance in case 2. To summarize so that more developers may refer to this, pin each rank explicitly, either

A: export HPL_HOST_NODE=$(($PMI_RANK * 2 + 0)),$(($PMI_RANK * 2 + 1))

or B: mpirun -env HPL_HOST_NODE=0,1 -np 1 xhpl_intel64_dynamic : -env HPL_HOST_NODE=2,3 -np 1 xhpl_intel64_dynamic (where P=2, Q=1 in HPL.dat).

Case A and case B should be similar, as in Holger's test, and case C should give almost the same result as A and B. It is also fine if you have more nodes in the system with each node running 1 MPI rank. A wrapper-script sketch of option A follows below.
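To make option A concrete, here is a minimal wrapper-script sketch. It is an illustration built on the thread's own export line, not code from the thread itself: the script name hpl_wrap.sh is hypothetical, PMI_RANK is the per-rank variable Intel MPI exports to each process, and the arithmetic assumes four NUMA nodes with two nodes per rank, as on the Broadwell box above.

    #!/bin/bash
    # hpl_wrap.sh (hypothetical name): per-rank NUMA pinning for MP_LINPACK.
    # Intel MPI sets PMI_RANK for every rank, so rank 0 is assigned NUMA
    # nodes 0,1 and rank 1 is assigned NUMA nodes 2,3.
    export HPL_HOST_NODE=$(($PMI_RANK * 2 + 0)),$(($PMI_RANK * 2 + 1))
    # Run the binary passed on the command line (e.g. xhpl_intel64_dynamic).
    exec "$@"

Launched as mpirun -np 2 ./hpl_wrap.sh ./xhpl_intel64_dynamic with P=2, Q=1 in HPL.dat, this reproduces the per-socket split of case B without writing out the node lists by hand, and the same arithmetic scales to more ranks.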
Follow-up (Holger): In fact, using HPL_HOST_CORE and HPL_HOST_NODE, respectively, is helping me.
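For completeness, the core-granularity variant that Holger mentions works analogously. The following one-liner is a hedged sketch, not from the thread: it assumes a two-socket node whose cores are numbered 0-17 on socket 0 and 18-35 on socket 1 (check the real numbering with numactl --hardware), and it assumes HPL_HOST_CORE accepts range syntax on your MKL version.

    mpirun -env HPL_HOST_CORE=0-17 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_CORE=18-35 -np 1 ./xhpl_intel64_dynamic

HPL_HOST_CORE pins a rank to an explicit list of cores rather than to whole NUMA nodes, which helps when the desired split does not line up with node boundaries.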