Runtime environment: CentOS 5.8 Final
Software:

CUDA 5 driver and Toolkit
LAMMPS 16Nov12
fftw-2.1.5
openmpi-1.4.5

Hardware:

Intel Xeon E5-2640
128GB DDR3-1600 ECC
WD 1TB HDD
Nvidia Tesla C2050
Nvidia Tesla K10 (Kepler)

LAMMPS, the Large-scale Atomic/Molecular Massively Parallel Simulator, is used mainly for molecular dynamics calculations and simulations; generally speaking, the LAMMPS code covers most of the areas that molecular dynamics touches. LAMMPS is developed by Sandia National Laboratories in the US and released under the GPL license, i.e. it is open source and free to obtain and use, which means users can modify the source code to suit their own needs. LAMMPS supports systems of millions of atoms or molecules in gas, liquid, or solid phases and under a variety of ensembles, and it provides many potential functions. It also has good parallel scalability.

Building

LAMMPS's parallel runs require passwordless SSH access to the local machine, so first configure SSH:

ssh-keygen -t rsa

Press Enter through the prompts; this produces .ssh/id_rsa and .ssh/id_rsa.pub.

cd ~/.ssh
cp id_rsa.pub authorized_keys

The machine can now be reached without a password.
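A quick sanity check of the passwordless login (sshd's default StrictModes rejects an authorized_keys file with loose permissions, so tighten it first):

chmod 600 ~/.ssh/authorized_keys   # sshd refuses group/world-writable key files
ssh localhost hostname             # should print the hostname without a password prompt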
Installing FFTW2

tar zxvf fftw-2.1.5.tar.gz
cd fftw-2.1.5
./configure --prefix=/opt/fftw2 --enable-float --enable-shared
make
make install

Installing and configuring OpenMPI

tar zxvf openmpi-1.4.5.tar.gz
cd openmpi-1.4.5
./configure --prefix=/opt/openmpi
make
make install

Setting environment variables
gedit ~/.bashrc

Add:

export PATH=/opt/cuda5/bin:/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda5/lib64:/opt/openmpi/lib:/opt/fftw2/lib:$LD_LIBRARY_PATH

Finally, source ~/.bashrc.
Check that OpenMPI installed correctly:
which mpicc
which mpiexec
which mpirun
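Each of these should print a path under /opt/openmpi/bin. As a minimal runtime smoke test (hostname is just a convenient stand-in for a real MPI program):

mpicc --showme          # prints the underlying compile/link command line
mpirun -np 4 hostname   # should print the local hostname four times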
Configuring LAMMPS

tar xvf lammps.tar.gz

First, build the gpu package:

cd lammps/lib/gpu
Edit Makefile.linux:
CUDA_HOME = /opt/cuda5
# Kepler CUDA
CUDA_ARCH = -arch=sm_30

Comment out the other CUDA_ARCH lines. (For the Fermi-based Tesla C2050, -arch=sm_20 is the matching setting.)
Finally, run make -f Makefile.linux. This produces nvc_get_devices, which you can run to list the GPUs and their properties.
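For example (nvidia-smi gives a second, driver-level view; both commands assume the CUDA 5 driver is loaded):

./nvc_get_devices   # enumerates CUDA devices as the gpu package sees them
nvidia-smi -L       # lists the GPUs known to the driver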
Next, edit Makefile.lammps:

gpu_SYSINC = -I/opt/cuda5/include
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/opt/cuda5/lib64
Then build the user package; here we need USER-CUDA:
cd ../cuda

Edit Makefile.common:
CUDA_INSTALL_PATH = /opt/cuda5

Then make:
make CUDA_INSTALL_PATH=/opt/cuda5 cufft=2 precision=2 arch=30

This produces liblammpscuda.a. Then install the required packages (the make yes-* commands below are run from lammps/src):

cd ../../src
make yes-asphere
make yes-class2
make yes-colloid
make yes-dipole
make yes-granular
make yes-user-misc
make yes-user-cg-cmm
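To double-check what is enabled at this point, the src makefile has a status target (if your LAMMPS version differs, running make with no target prints the available options):

make package-status   # prints yes/no for every optional package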
Install the GPU and USER-CUDA packages as well:

make yes-gpu
make yes-user-cuda

To build LAMMPS itself, use src/MAKE/Makefile.openmpi as a template:
cd MAKE
cp Makefile.openmpi Makefile.gpu
vi Makefile.gpu

Set the MPI variables:

MPI_INC = -I/opt/openmpi/include
MPI_PATH =
MPI_LIB = -L/opt/openmpi/lib -lmpi
and the FFT variables:

FFT_INC = -I/opt/fftw2/include -DFFT_FFTW
FFT_PATH =
FFT_LIB = -L/opt/fftw2/lib -lfftw
Then return to lammps/src and build:

make gpu
This produces the parallel executable lmp_gpu.
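A quick way to confirm the binary was linked against the intended libraries is to inspect its dynamic dependencies:

ldd ./lmp_gpu | grep -E 'cuda|mpi|fftw'   # paths should resolve under /opt/cuda5, /opt/openmpi, /opt/fftw2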
Testing (with plain CPU runs, then the GPU package, then the USER-CUDA package)
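The in.lj.cpu, in.lj.gpu, and in.lj.cuda inputs used below are parameterized variants of LAMMPS's standard 3d Lennard-Jones melt benchmark; the -v x/y/z/t switches set the box size (in fcc lattice cells) and the step count. As a rough, self-contained sketch of such an input, reconstructed from the run output below rather than copied from the actual bench files (the file name in.lj.test is illustrative):

cat > in.lj.test << 'EOF'
# 3d Lennard-Jones melt; x, y, z, t can be overridden with -v on the command line
variable     x index 8
variable     y index 8
variable     z index 8
variable     t index 100
units        lj
atom_style   atomic
lattice      fcc 0.8442          # reduced density 0.8442 -> lattice spacing 1.6796
region       box block 0 $x 0 $y 0 $z
create_box   1 box
create_atoms 1 box               # 4 atoms per fcc cell, e.g. 64x128x128 cells -> 4194304 atoms
mass         1 1.0
velocity     all create 1.44 87287 loop geom
pair_style   lj/cut 2.5
pair_coeff   1 1 1.0 1.0 2.5
neighbor     0.3 bin
neigh_modify delay 0 every 20 check no
fix          1 all nve
run          $t
EOF
# small CPU-only trial run, e.g. from within lammps/bench/GPU:
mpirun -np 2 ../../src/lmp_gpu -c off -v x 8 -v y 8 -v z 8 -v t 100 < in.lj.test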
cd lammps/bench/GPU

Nvidia Kepler K10
4194304 atoms

CPU

time mpirun -np 12 ../../src/lmp_gpu -c off -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cpu

LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
  1 by 3 by 4 MPI processor grid
Created 4194304 atoms
Setting up run ...
Memory usage per processor = 115.99 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133686   -5.0196696
    1000   0.70371346   -5.6760464            0   -4.6204765   0.70456724
Loop time of 445.893 on 12 procs for 1000 steps with 4194304 atoms

Pair  time (%) = 344.521 (77.2653)
Neigh time (%) = 37.3499 (8.37643)
Comm  time (%) = 34.5695 (7.75287)
Outpt time (%) = 0.00629385 (0.00141152)
Other time (%) = 29.4467 (6.60397)

Nlocal:    349525 ave 349810 max 349270 min
Histogram: 3 0 0 3 0 1 2 0 2 1
Nghost:    88501 ave 88753 max 88106 min
Histogram: 1 0 0 1 0 1 6 2 0 1
Neighs:    1.31018e+07 ave 1.313e+07 max 1.30777e+07 min
Histogram: 1 0 4 1 0 2 2 1 0 1

Total # of neighbors = 157221517
Ave neighs/atom = 37.4845
Neighbor list builds = 50
Dangerous builds = 0

real 7m28.357s
user 88m58.623s
sys 0m6.306s

GPU
time mpirun -np 2 ../../src/lmp_gpu -sf gpu -c off -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.gpu

LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
  1 by 1 by 2 MPI processor grid
Created 4194304 atoms

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla K10.G1.8GB, 1536 cores, 3.4/3.5 GB, 0.74 GHZ (Single Precision)
GPU 1: Tesla K10.G1.8GB, 1536 cores, 3.4/3.5 GB, 0.74 GHZ (Single Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.

Setting up run ...
Memory usage per processor = 336.665 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733679            0   -4.6133684     -5.01967
    1000   0.70407139   -5.6765788            0    -4.620472   0.70226909
Loop time of 163.778 on 2 procs for 1000 steps with 4194304 atoms

Pair  time (%) = 102.784 (62.7581)
Neigh time (%) = 5.78165e-05 (3.53018e-05)
Comm  time (%) = 10.0776 (6.15322)
Outpt time (%) = 0.0124401 (0.0075957)
Other time (%) = 50.9039 (31.081)

Nlocal:    2.09715e+06 ave 2.09736e+06 max 2.09695e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    285880 ave 286182 max 285579 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0

---------------------------------------------------------------------
GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer:   11.9795 s.
Data Cast/Pack:  29.8788 s.
Neighbor copy:   0.0003 s.
Neighbor build:  27.8765 s.
Force calc:      33.9176 s.
GPU Overhead:    0.0555 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  2850.45 MB.
CPU Driver_Time: 0.0564 s.
CPU Idle_Time:   45.4944 s.
---------------------------------------------------------------------

real 3m9.960s
user 5m33.879s
sys 0m21.650s

CUDA
time mpirun -np 2 ../../src/lmp_gpu -sf cuda -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cuda

LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla K10.G1.8GB
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
# Using device 1: Tesla K10.G1.8GB
  1 by 1 by 2 MPI processor grid
Created 4194304 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 2100000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA 7.604725 1.637228
# CUDA: Total Device Memory useage post setup: 1363.265625 MB
Memory usage per processor = 329.441 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133686   -5.0196696
    1000    0.7037135   -5.6760465            0   -4.6204766   0.70456647
Loop time of 171.094 on 2 procs for 1000 steps with 4194304 atoms

Pair  time (%) = 119.582 (69.8926)
Neigh time (%) = 34.4807 (20.153)
Comm  time (%) = 12.0482 (7.04183)
Outpt time (%) = 0.00174761 (0.00102143)
Other time (%) = 4.98143 (2.91151)

Nlocal:    2.09715e+06 ave 2.09761e+06 max 2.09669e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    285910 ave 286389 max 285431 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:  1.57222e+08 ave 1.57269e+08 max 1.57174e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 1

Total # of neighbors = 314443080
Ave neighs/atom = 74.9691
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...

real 3m31.330s
user 6m17.069s
sys 0m21.483s

Nvidia Tesla C2050

2097152 atoms

CPU

time mpirun -np 12 ../../src/lmp_g++ -c off -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cpu

LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
  2 by 2 by 3 MPI processor grid
Created 2097152 atoms
Setting up run ...
Memory usage per processor = 59.9782 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133691   -5.0196698
    1000   0.70398846   -5.6764793            0   -4.6204971    0.7035921
Loop time of 255.275 on 12 procs for 1000 steps with 2097152 atoms

Pair time (%) = 189.553 (74.2546)
Neigh time (%) = 19.7922 (7.75329)
Comm  time (%) = 31.5617 (12.3638)
Outpt time (%) = 0.00327303 (0.00128216)
Other time (%) = 14.3645 (5.62708)

Nlocal:    174763 ave 175050 max 174540 min
Histogram: 1 2 0 2 3 1 2 0 0 1
Nghost:    55156.6 ave 55337 max 55013 min
Histogram: 2 0 3 1 2 0 1 1 1 1
Neighs:    6.55081e+06 ave 6.56937e+06 max 6.53648e+06 min
Histogram: 2 0 0 2 4 2 1 0 0 1

Total # of neighbors = 78609680
Ave neighs/atom = 37.484
Neighbor list builds = 50
Dangerous builds = 0

real 4m16.362s
user 0m0.067s
sys 0m0.018s

GPU
time mpirun -np 2 ../../src/lmp_g++ -sf gpu -c off -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.gpu

LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
  1 by 1 by 2 MPI processor grid
Created 2097152 atoms

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla C2050, 448 cores, 2.6/2.6 GB, 1.1 GHZ (Single Precision)
GPU 1: Tesla C2050, 448 cores, 2.6/2.6 GB, 1.1 GHZ (Single Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.

Setting up run ...
Memory usage per processor = 173.566 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733679            0   -4.6133689   -5.0196703
    1000   0.70365628   -5.6759221            0   -4.6204382   0.70516901
Loop time of 82.1602 on 2 procs for 1000 steps with 2097152 atoms

Pair  time (%) = 49.8815 (60.7125)
Neigh time (%) = 6.53267e-05 (7.95114e-05)
Comm  time (%) = 5.40412 (6.57754)
Outpt time (%) = 0.00573647 (0.00698206)
Other time (%) = 26.8688 (32.7029)

Nlocal:    1.04858e+06 ave 1.04859e+06 max 1.04856e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    173222 ave 173223 max 173220 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0

---------------------------------------------------------------------
GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer:   5.8268 s.
Data Cast/Pack:  15.0256 s.
Neighbor copy:   0.0002 s.
Neighbor build:  13.8191 s.
Force calc:      15.7533 s.
GPU Overhead:    0.0495 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  1426.04 MB.
CPU Driver_Time: 0.0497 s.
CPU Idle_Time:   21.3674 s.
---------------------------------------------------------------------

real 1m29.050s
user 0m0.065s
sys 0m0.028s

CUDA
time mpirun -np 2 ../../src/lmp_g++ -sf cuda -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cuda

LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla C2050
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
# Using device 1: Tesla C2050
  1 by 1 by 2 MPI processor grid
Created 2097152 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 1050000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA 2.088803 0.418611
# CUDA: Total Device Memory useage post setup: 726.984375 MB
Memory usage per processor = 169.36 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133691   -5.0196698
    1000   0.70398844   -5.6764793            0   -4.6204971   0.70359222
Loop time of 49.6546 on 2 procs for 1000 steps with 2097152 atoms

Pair  time (%) = 31.7106 (63.8622)
Neigh time (%) = 9.56514 (19.2634)
Comm  time (%) = 5.88421 (11.8503)
Outpt time (%) = 0.00104213 (0.00209875)
Other time (%) = 2.49368 (5.02204)

Nlocal:    1.04858e+06 ave 1.04861e+06 max 1.04854e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    173368 ave 173410 max 173325 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:  7.86097e+07 ave 7.86114e+07 max 7.8608e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 1

Total # of neighbors = 157219330
Ave neighs/atom = 74.968
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...

real 0m59.271s
user 0m0.071s
sys 0m0.023s
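For reference, the Loop times above condense to the following (speedups computed from the numbers above, relative to the same card's 12-process CPU run):

Card (system size)           Run             Loop time (s)   Speedup
Tesla K10 (4194304 atoms)    CPU, 12 procs   445.9           1.0x
Tesla K10                    GPU package     163.8           ~2.7x
Tesla K10                    USER-CUDA       171.1           ~2.6x
Tesla C2050 (2097152 atoms)  CPU, 12 procs   255.3           1.0x
Tesla C2050                  GPU package      82.2           ~3.1x
Tesla C2050                  USER-CUDA        49.7           ~5.1x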
Reposted from: http://srkli.baihongyu.com/