Scaling of VASP
HSE06 hybrid calculations
Benchmark case: VASP 5.4.1, gamma-only build (except on the GPU port, which uses vasp_std)
ENCUT = 600.00 eV (ENAUG = 1200 eV)
PRECFOCK=Fast
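As a reference, a minimal INCAR sketch reproducing these benchmark settings might look as follows. The HSE06-specific tags (LHFCALC, HFSCREEN) and ALGO = Normal (the blocked Davidson algorithm, IALGO = 38, used below) are standard VASP tags assumed here rather than taken verbatim from the benchmark inputs; NCORE varies per machine as listed in the table below.

```
ENCUT    = 600        ! plane-wave cutoff (eV)
ENAUG    = 1200       ! augmentation-charge cutoff (eV)
PRECFOCK = Fast       ! FFT grid for the exact-exchange routines
LHFCALC  = .TRUE.     ! assumed: switch on hybrid (HSE06) exchange
HFSCREEN = 0.2        ! assumed: HSE06 screening parameter (1/Angstrom)
ALGO     = Normal     ! blocked Davidson, IALGO = 38
NCORE    = 1          ! machine dependent, see the table below
```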
NAME | budapest2 | uv2 | debrecen2 | szeged | fast |
---|---|---|---|---|---|
Perf. per node (arb. units) | 120.25 | 615.78 | 543.88 | 131.23 | 89.29 |
Perf. per core (arb. units) | 6.01 | 6.41 | 181.29 | 2.73 | 7.44 |
Total perf. (arb. units) | 0.84 | 0.62 | 2.18 | 0.52 | 0.36 |
time of one SCF iteration (s) | 1188 | 1624 | 460 | 1905 | 2800 |
NCORE | 5 | 1 | 1 | 6 | 1 |
Total Cores | 140 | 96 | 12 | 192 | 48 |
Node count | 7 | 1 | 4 | 4 | 4 |
MPI processes per Node | 20 | 96 | 3 | 48 | 12 |
IALGO | DAV | DAV | DAV | DAV | DAV |
CPU / GPU | E5-2680V2 | E5-4610 | Nvidia K20x | Opteron 6174 | E5-2620 |
TDP (W) | 115 | 95 | 235 | 115 | 115 |
CPU/GPU count per node | 2 | 16 | 3 | 4 | 2 |
supplemental power (W) | 50 | 50 | 150 | 75 | 50 |
power per node (W) | 280 | 1570 | 855 | 535 | 280 |
total power (W) | 1960 | 1570 | 3420 | 2140 | 1120 |
efficiency (perf. per W, arb. units) | 4295 | 3922 | 6361 | 2453 | 3189 |
Notes:
- There is one MPI process per physical GPU, and there are 3 GPUs per node in debrecen2. Since no gamma-only VASP build is available for GPUs, the performance would be even higher with one.
- The AMD Opteron 6174 is considerably slower than recent Intel CPUs; one of its cores delivers only about half the speed of an Intel core. It is best to set NCORE = 6 (see the one-line INCAR sketch after these notes), which groups bands onto the six cores of a real CPU die: the Opteron 6174 is actually two six-core CPUs inside the same package, so szeged's four-socket nodes contain 8 hexa-core dies, totaling 48 cores.
- On Intel CPUs, NCORE > 1 becomes important when there are more than ~100 MPI processes. A value of 2 may already be enough to restore linear scaling and avoid bandwidth limitations.
- The GPU port is the most energy-efficient option, although the gain is not exceptional. Keep in mind that there is no gamma-only VASP build for GPUs yet! Also note that the TDP of Intel CPUs is generally fully utilized because of Turbo Boost, whereas in typical VASP runs the GPUs draw only approximately 100-150 W.
- The supplemental, miscellaneous power draw per node is only a heuristic estimate.
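For instance, on the szeged Opteron nodes described in the notes above, the corresponding one-line INCAR setting (matching the benchmark table) is simply:

```
NCORE = 6    ! group each band onto one six-core Opteron die
```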
Performance of GPU port
Limits on the NCORE flag and Intel MKL SMP parallelization
- NCORE = 1 (the default) offers linear scaling up to ~100 MPI processes on Intel Xeon CPUs connected with InfiniBand; SGI MPT optimizations offer a little more.
- Generally I suggest NCORE = "number of CPU cores sharing the L3 cache and memory controller" (see the INCAR sketch below). This setting confines the calculation of one band to a single CPU die. Setting NCORE higher than the core count of a die leads to insufficient communication bandwidth between CPU sockets, and even worse between nodes. On Intel CPUs I suggest NCORE = "cores per die" for the highest performance and minimal memory footprint, but stop at NCORE values of 4-8 on very recent CPUs with very high core counts (e.g. the 24-core Broadwell-E series).
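As a concrete sketch of this rule (the node layout here is a hypothetical example, not one of the benchmark machines): on a dual-socket node with two 8-core Xeon dies, each die sharing one L3 cache and memory controller, the suggested INCAR setting would be:

```
! hypothetical dual-socket node: 2 x 8-core Xeon, 16 MPI processes per node
NCORE = 8    ! spread each band over the 8 cores of a single die
```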
If you run out of memory: Intel MKL SMP parallelization
- You can use Intel MKL's built-in shared-memory parallelization. This is not done in the VASP binary itself; it is embedded in the Intel MKL runtime. You just have to add the following lines to your jobfile and divide the number of MPI processes (and NCORE) by that value, as in the sketch below.
export OMP_NUM_THREADS=2
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=2
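For example (a hedged sketch; the mpirun launcher and the vasp_std binary name are assumptions, adapt them to your installation): a run that previously used 48 MPI processes with NCORE = 6 would, with the 2 MKL threads set above, be launched with half as many MPI processes and half the NCORE:

```bash
# before: mpirun -np 48 vasp_std  with NCORE = 6 in the INCAR
# after:  2 MKL threads per MPI process -> 48/2 = 24 MPI processes, NCORE = 6/2 = 3
mpirun -np 24 vasp_std
```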
This setup yields almost linear scaling up to 2 threads per MPI process (~1.7x) with DFT functionals such as PBE, but offers no further gain beyond 4 threads (~2.5x). It also greatly reduces the memory footprint! Hybrid functionals, however, scale more modestly; for example, expect a ~1.4x speedup with 2 threads with the HSE functional.
- For comparison, here is a 216-carbon-atom supercell calculation with a GGA functional, on 2 nodes with AMD cores, 48 cores in total:
MKL_NUM_THREADS = | 1 | 2 | 3 | 6 |
---|---|---|---|---|
SCF iteration time (s), IALGO = 38 | 0.5285 | 0.58 | 0.6956 | 1.1354 |
relative performance per core, IALGO = 38 | 1.00 | 0.91 | 0.76 | 0.47 |
SCF iteration time (s), IALGO = 48 | 0.1835 | 0.2236 | 0.2796 | 0.4748 |
relative performance per core, IALGO = 48 | 1.00 | 0.82 | 0.66 | 0.39 |
NCORE | 6 | 3 | 2 | 1 |
MPI processes per node | 24 | 12 | 8 | 4 |
Total cores used per node | 24 | 24 | 24 | 24 |
nodes used | 2 | 2 | 2 | 2 |
Total cores used | 48 | 48 | 48 | 48 |
- This scaling is the same on Intel CPUs too.
- The scaling is worse with RMM-DIIS (IALGO = 48) than with blocked Davidson (IALGO = 38).
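For completeness, the two algorithms compared above correspond to the following standard INCAR settings:

```
IALGO = 38    ! blocked Davidson (equivalent to ALGO = Normal)
IALGO = 48    ! RMM-DIIS (equivalent to ALGO = VeryFast)
```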
Hybrid OpenMP + MPI scalability on Komondor (64-core AMD gen 4 EPYC processors)
You can use the following jobfile
/project/c_dqubits/share/vasp_example/5.4.1_mpich/mkl4_thread
It is set up for an exemplary 1728-atom diamond system with an N3V defect inside.
The trick is to use the following:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32   # (was 128, because 1 node contains 128 CPU cores)
#SBATCH --cpus-per-task=4      # (was not used before)
Do NOT set NCORE, or leave it at NCORE = 1 as it is; NCORE = 2 or 4 here degrades the performance.
This will start 2*32 = 64 VASP (MPI) processes across the two nodes, so the VASP stdout will begin with something like this:
running on 64 total cores
distrk: each k-point on 64 cores, 1 groups
distr: one band on 1 cores, 64 groups
Additionally, each VASP (MPI) process can use 4 CPU cores. You can see this in the stdout file: each MPI process holds 4 Intel MKL OpenMP threads. This low-level parallelism is provided by the Intel Math Kernel Library (MKL); since cpus-per-task=4 is used, the jobfile sets:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
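Putting the pieces together, a minimal sketch of such a jobfile might look like this (the job name and the srun launch line are assumptions; take the module setup and binary path from the example directory above):

```bash
#!/bin/bash
#SBATCH --job-name=vasp_hybrid     # hypothetical name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32       # 32 MPI processes per 128-core node
#SBATCH --cpus-per-task=4          # 4 MKL/OpenMP threads per MPI process

# MKL threading follows the cores reserved per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

# assumed launch line; use the binary from the example directory
srun vasp_std
```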
Because we now use a lower process count per node, the memory footprint is a lot lower.
For this specific case, the old solution with NCORE=16 and --ntasks-per-node=128 used about 1.5 GB of memory per MPI process, thus nearing the limit at roughly 192 of the 240 GB per node.
The new solution uses about 1.7 GB per MPI process but only 54.4 of the 240 GB of node memory.
Please run your own tests; you may go with the following (to use even less memory, or for better multi-node scalability):
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8
or (single-node calculations may run ~5% faster with this, but beyond 2-4 nodes it is slower):
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=2
Just make sure that the product of the two numbers is 128, the core count of a node.
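If you want to guard against mistakes, a small sanity check can go near the top of the jobfile; this sketch relies on the SLURM_NTASKS_PER_NODE and SLURM_CPUS_PER_TASK environment variables that SLURM exports when the two options above are set:

```bash
# abort early if the task layout does not fill the 128-core node
if [ $(( SLURM_NTASKS_PER_NODE * SLURM_CPUS_PER_TASK )) -ne 128 ]; then
    echo "ntasks-per-node x cpus-per-task must equal 128" >&2
    exit 1
fi
```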