Scaling of VASP

HSE06 hybrid calculations

Test benchmark case: VASP 5.4.1, gamma-only build (except for the GPU port, which runs vasp_std)

ENCUT = 600.00 eV (ENAUG = 1200.00 eV)

PRECFOCK=Fast
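
For reference, a minimal sketch of how these settings could be written into the INCAR from the top of a jobfile. Only ENCUT, ENAUG and PRECFOCK are quoted from this page; the HSE06 tags (LHFCALC, HFSCREEN) and ALGO = Normal (blocked Davidson, the IALGO = DAV row of the table below) are assumed values:

cat > INCAR <<'EOF'
ENCUT    = 600.00     ! plane-wave cutoff (eV)
ENAUG    = 1200.00    ! augmentation-charge cutoff (eV)
PRECFOCK = Fast       ! coarser FFT grid for the exact-exchange routines
LHFCALC  = .TRUE.     ! hybrid functional (assumed HSE06 tags)
HFSCREEN = 0.2        ! HSE06 range-separation parameter (1/Angstrom), assumed
ALGO     = Normal     ! blocked Davidson, matching the IALGO = DAV row below
EOF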

NAME                           budapest2   uv2       debrecen2    szeged        fast
Perf. per Node (1/s)           120.25      615.78    543.88       131.23        89.29
Perf. per Core (1/s)           6.01        6.41      181.29       2.73          7.44
Total Perf. (1/s)              0.84        0.62      2.18         0.52          0.36
Time of one SCF iteration (s)  1188        1624      460          1905          2800
NCORE                          5           1         1            6             1
Total cores                    140         96        12           192           48
Node count                     7           1         4            4             4
MPI processes per node         20          96        3            48            12
IALGO                          DAV         DAV       DAV          DAV           DAV
CPU / GPU                      E5-2680V2   E5-4610   Nvidia K20x  Opteron 6174  E5-2620
TDP (W)                        115         95        235          115           115
Count per node                 2           16        3            4             2
Supplement (W)                 50          50        150          75            50
Per node (W)                   280         1570      855          535           280
Total (W)                      1960        1570      3420         2140          1120
Efficiency (1/W)               4295        3922      6361         2453          3189

Notes:

- There is one MPI rank per physical GPU, and there are 3 GPUs per node on debrecen2 (so the "per core" entry is really per GPU). Since no gamma-only GPU build of VASP is available yet, the GPU performance would be even higher with one.

- The AMD Opteron 6174 is considerably slower than recent Intel CPUs; its cores run at roughly half the speed of an Intel core. It is best to set NCORE = 6, which groups the six cores of each physical CPU die: the Opteron 6174 is actually two six-core dies in the same package, so szeged's nodes contain 8 hexa-core dies, 48 cores in total.

- On Intel CPUs, NCORE > 1 becomes important when there are more than ~100 MPI ranks. A value of 2 may already be enough to restore linear scaling and avoid the bandwidth limitations.

- The GPU port is the most energy efficient, although the gain is not exceptional. Keep in mind that there is no gamma-only version of VASP for GPUs yet! Also note that the TDP of Intel CPUs is generally fully utilized because of the Turbo Boost feature, whereas in typical VASP runs the GPUs only draw approximately 100-150 W.

- The supplemental (miscellaneous) power draw per node is only a rough heuristic estimate; the resulting efficiency row is checked in the sketch below.
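
The derived rows of the benchmark table appear to be simple reciprocals of the SCF iteration time, divided by the node count, core count or total power and rescaled. A quick sanity check for the budapest2 column (the 1e3/1e6/1e10 scale factors are my reading of the table, not stated on this page):

awk 'BEGIN { t = 1188; nodes = 7; cores = 140; watt = 1960      # budapest2 inputs
  printf "Total Perf.      %.2f\n", 1e3  / t                    # -> 0.84
  printf "Perf. per Node   %.2f\n", 1e6  / (t * nodes)          # -> 120.25
  printf "Perf. per Core   %.2f\n", 1e6  / (t * cores)          # -> 6.01
  printf "Efficiency       %.0f\n", 1e10 / (t * watt) }'        # -> 4295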

Performance of GPU port

Limits on the NCORE flag, and Intel MKL SMP parallelization

- NCORE = 1 (the default) offers linear scaling up to ~100 MPI ranks on Intel Xeon CPUs connected with InfiniBand; SGI MPT optimizations offer a little more.

- Generally I suggest the following: NCORE = "number of CPU cores sharing an L3 cache and memory controller". This setting confines the calculation of one band to a single CPU die. If you set NCORE to more than the cores of one die, there will be insufficient communication bandwidth between CPU sockets, and worse between nodes. On Intel CPUs, please set NCORE = "number of cores per die" for highest performance and minimal memory footprint (a small detection sketch follows below).
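
A minimal sketch of how one might pick this value on a Linux node (assumes lscpu is available; the helper variable name is made up). On single-die Intel CPUs the cores per socket equal the cores per die, while on multi-die packages like the Opteron 6174 you would halve the number to get the 6 cores per die recommended above:

# guess NCORE from the topology reported by lscpu, then append it to the INCAR
NCORE_GUESS=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
echo "NCORE = ${NCORE_GUESS}" >> INCAR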

If you run out of memory: Intel MKL SMP parallelization

- You can use Intel MKL's built-in shared-memory parallelization. This is not done in the VASP binary; it is embedded in the Intel MKL runtime. You just have to add the following lines to your jobfile and divide the number of MPI ranks (and NCORE) by that value.

export OMP_NUM_THREADS=2                                                                                                       
export MKL_DYNAMIC=FALSE                                                                                                            
export MKL_NUM_THREADS=2
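
The matching launch line in the jobfile might then look like the fragment below (the binary path is a placeholder). For the 2-node, 48-core AMD example that follows, 48 cores divided by 2 MKL threads gives 24 MPI ranks, and NCORE is halved from 6 to 3:

# 48 cores on 2 nodes, MKL_NUM_THREADS = 2  ->  48 / 2 = 24 MPI ranks, NCORE = 3
mpirun -np 24 /path/to/vasp_gam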

This yields almost linear scaling up to 2 threads per MPI rank, but offers no scaling beyond 4. The method greatly reduces the memory footprint: 24 MPI ranks with no MKL parallelization use nearly the same amount of memory as those 24 MPI ranks started with OMP_NUM_THREADS=2.

- For comparison, here is a 216-carbon-atom supercell calculation with a GGA functional, on 2 nodes with AMD cores, 48 cores in total:

MKL_NUM_THREADS                       1        2        3        6
SCF iteration time (s), IALGO = 38    0.5285   0.58     0.6956   1.1354
Performance per core,   IALGO = 38    1.00     0.91     0.76     0.47
SCF iteration time (s), IALGO = 48    0.1835   0.2236   0.2796   0.4748
Performance per core,   IALGO = 48    1.00     0.82     0.66     0.39
NCORE                                 6        3        2        1
MPI ranks per node                    24       12       8        4
Total cores used per node             24       24       24       24
Nodes used                            2        2        2        2
Total cores used                      48       48       48       48

- This scaling is the same for Intel CPUs too.

- The scaling is worse with RMM-DIIS (IALGO = 48) than with blocked Davidson (IALGO = 38); the relative numbers are checked in the sketch below.
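
The "performance per core" rows above are simply the MKL_NUM_THREADS = 1 iteration time divided by the iteration time at each thread count (the total core count stays at 48). A quick check using the values from the table:

# relative per-core performance = t(1 MKL thread) / t(n MKL threads)
awk 'BEGIN {
  n = split("1 2 3 6", thr, " ")                    # MKL_NUM_THREADS values
  split("0.5285 0.58 0.6956 1.1354", dav, " ")      # IALGO = 38 iteration times (s)
  split("0.1835 0.2236 0.2796 0.4748", rmm, " ")    # IALGO = 48 iteration times (s)
  for (i = 1; i <= n; i++) {
    printf "MKL_NUM_THREADS = %s:  DAV %.2f  RMM-DIIS %.2f\n", thr[i], dav[1]/dav[i], rmm[1]/rmm[i]
  }
}'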