Computing performance of VASP on modern high-end systems

Posted: Sat Oct 08, 2011 12:05 am
by acadien
I'm looking to purchase a high performance server, starting with 1 box and expanding later depending on performance and funds.

From what I can tell, the current high performance server market has come down to 2 choices.
-More cores (8-12) and low clock speed (~2.5GHz)
-Fewer cores (4-6) and high clock speed (~3.5GHz).

Can I expect VASP to perform better in either of these configurations?

Also, I've read in a few places that, for some reason, VASP performs better when the number of threads is a multiple of 3 (please correct me if this is wrong; it seems like voodoo magic to me). This would seem to limit the choices to the Intel Xeon X5600 series (6 cores) and the AMD 12-core Opterons. I haven't found any solid benchmarks comparing these when running something like ScaLAPACK. Any info or links to relevant information would be greatly appreciated!

Posted: Sat Oct 08, 2011 6:30 pm
by jber
Hi,

Benchmarking a single machine is a tricky business, and a system composed of many servers
is even more difficult because of the choice of network.
Let's limit ourselves to just one machine and assume that it will always be utilizing all the cores.
I don't think VASP supports OpenMP, so I won't comment on your question about performance gains
from running with a specific number of threads (you can read about it, for example, at
http://www.nersc.gov/users/computationa ... on-hopper/ ).
One of the questions to ask in this case is whether your jobs
are memory demanding or not, because of the so-called "memory bandwidth" problem.
You can read about it at http://en.wikipedia.org/wiki/Opteron#Mu ... r_features,
with the note that memory bandwidth on current Xeons does not look as bad as it did in the past.
Limited memory bandwidth will, for example, cause multiple serial VASP processes running at once to each run slower than
the same single process run with the other cores idle.

On the other hand, if a job can be efficiently MPI parallelized, you gain almost a factor of two
by using 12 AMD cores compared to 6 Intel cores (assuming AMD cores are as fast as Intel cores,
and they are not).
In most cases VASP will parallelize well on such a small number of cores,
but the actual gain will depend on the job.
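
As a rough illustration of the 12-versus-6-core comparison above, here is a minimal sketch based on Amdahl's law. It is not a VASP measurement: the parallel fraction (0.95) and the relative per-core speed (0.7) are made-up assumptions, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_throughput(parallel_fraction, cores, per_core_speed=1.0):
    """Throughput relative to one reference core.

    `per_core_speed` is an assumed ratio (1.0 = same speed as the reference core).
    """
    return per_core_speed * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, not measured on any real job
    six_cores = relative_throughput(p, 6)
    twelve_equal = relative_throughput(p, 12)
    twelve_slower = relative_throughput(p, 12, per_core_speed=0.7)  # assumed slower cores
    print(f"12 equal cores vs 6 cores:  {twelve_equal / six_cores:.2f}x")
    print(f"12 slower cores vs 6 cores: {twelve_slower / six_cores:.2f}x")
```

The point of the sketch is only that the "almost twice" figure shrinks once the serial part of the job and the per-core speed difference are taken into account; real numbers have to come from your own benchmarks.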

In conclusion: there are guides on the internet about parallel VASP performance,
but forget about high-performance strong-scaling results on large numbers of cores;
usually they don't tell you anything about real-life applications.
The best thing you can do is to create your own
set of benchmarks (the typical jobs you will be running) and run them on the candidate servers.
Make sure to spend time benchmarking different compilers
(gfortran, Intel, Open64 - if you can make it work) and libraries (ACML, Intel MKL, https://github.com/xianyi/OpenBLAS, ATLAS).
You may run into compilation troubles, so in most cases I would recommend taking
http://en.wikipedia.org/wiki/Rational_ignorance into account.
I guess your jobs will be too small to notice the effect of ScaLAPACK,
but of course you can try to build it.
Start with free products, in order to avoid situations where VASP does not
compile or run correctly with the latest versions of commercial compilers/libraries.
Benchmark demo versions of commercial products against the free ones.
I'm happy with gfortran 4.1.2 and ACML 4.0.1 on Intel Xeon.
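
The "create your own set of benchmarks" advice could be automated along the following lines. This is only a sketch under stated assumptions: the helper names are hypothetical, the commands in the demo are harmless stand-ins so the script runs anywhere, and the launch line shown in the comment (binary name, core count) is an invented example, not a recommended setup.

```python
import subprocess
import sys
import time


def time_command(command, workdir=".", repeats=3):
    """Run `command` `repeats` times in `workdir`; return the best wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, cwd=workdir, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def compare_builds(builds, workdir=".", repeats=3):
    """Time every build and return (name, seconds) pairs sorted fastest first."""
    timings = {name: time_command(cmd, workdir, repeats)
               for name, cmd in builds.items()}
    return sorted(timings.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without VASP installed.
    # For a real comparison, replace them with your own launch lines, e.g.
    # (hypothetical binary name) ["mpirun", "-np", "6", "./vasp_gfortran_acml"],
    # and point `workdir` at a directory holding one of your typical jobs.
    builds = {
        "build_a": [sys.executable, "-c", "pass"],
        "build_b": [sys.executable, "-c", "sum(range(10**6))"],
    }
    for name, seconds in compare_builds(builds, repeats=2):
        print(f"{name}: {seconds:.3f} s")
```

Taking the best of several repeats reduces noise from other activity on the machine; for long jobs a single run per build may be all you can afford.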

[ Edited Sat Oct 08 2011, 09:49PM ]

Posted: Mon Oct 10, 2011 10:34 am
by alex
Hi acadien,

I agree with jber and I'd like to emphasize the point about memory bandwidth. Modern CPUs have increased core counts and clock speeds, but the memory interface has hardly kept pace. AMD Opteron used to be much better than Intel in this respect, but Intel CPUs have improved a lot.
If you are planning to use a lot of memory, this is a potential bottleneck that you should address during benchmarking.

Cheers,

alex

Posted: Thu Oct 13, 2011 4:33 am
by acadien
jber & alex, thanks for your responses. I understand the memory bandwidth problem, but I was under the impression that it is uncommon for VASP to hit this bottleneck, especially since L3 caches keep growing.

I have looked into the architecture of AMD's 4P motherboard, and it appears that each processor gets its own dedicated memory bus. This, combined with triple-channel memory with low-latency reads, will hopefully eliminate the possibility of hitting the memory bandwidth limit. If I have the time, I'll run a series of benchmarks, likely the same ones posted by terragrid. I think results like this are highly relevant as Intel prepares to roll out its first 4P servers and AMD extends its line. It would be great to get a few other VASP users on board with this.

-Adam

Posted: Thu Oct 13, 2011 11:01 am
by alex
Short comment: there is a 100% chance of hitting the memory bottleneck, unless you are calculating tiny systems.
Just think about your typical system sizes and the memory used: more than 500 MB/core? More than 1 GB/core?

Well, how large is the cache?

Here you go.

Cheers,

alex
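
The cache-versus-memory argument above can be put into numbers with a small sketch. The figures are illustrative assumptions (a 12 MB L3 shared by 6 cores, and the 500 MB and 1 GB per-core working sets mentioned in the post), and the function names are hypothetical.

```python
def cache_share_per_core_mb(shared_cache_mb, cores):
    """Evenly split a shared last-level cache across cores (a simplification)."""
    return shared_cache_mb / cores


def fraction_in_cache(working_set_mb_per_core, shared_cache_mb, cores):
    """Upper bound on the fraction of a core's working set that fits in its cache share."""
    share = cache_share_per_core_mb(shared_cache_mb, cores)
    return min(1.0, share / working_set_mb_per_core)


if __name__ == "__main__":
    shared_l3_mb = 12  # assumed shared L3 size
    cores = 6          # assumed core count
    for working_set_mb in (500, 1000):
        fraction = fraction_in_cache(working_set_mb, shared_l3_mb, cores)
        print(f"{working_set_mb} MB/core: at most {fraction:.2%} fits in cache")
```

With working sets of that size, well under one percent of the data can sit in cache at any moment, so most accesses go to main memory, which is why the bandwidth limit is hard to avoid.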