大约在两年前我撰写了一篇文章记录我为实验室部署GPU云的核心过程。现在给学院部署GPU云,本以为可以照着原来的文章很顺利就可以装完,结果却搞得心态爆炸,一度怀疑人生,所以很有必要再写一篇记录一下。
环境:
实体机:曙光 天阔 W780-G20
操作系统: VMware ESXi, 6.7.0, 14320388
管理平台: VMware vCenter Server Appliance, 6.7.0, 15129973
GPU:NVIDIA GP100GL [Tesla P100 PCIe 16GB] ×8
先简单说一下部署环境和上次的区别:第一是平台不再是商业大厂DELL EMC,很多文档资料缺失;第二是底层esxi和管理平台vcenter都升级到了6.7;第三是虚拟机准备安装Ubuntu Server 1804,之前的16在21年来看有点过老。最关键的是第四点:显卡质量和数量都上了一个巨大台阶,原来单台机器最多3卡(一个泰坦X,两个泰坦XP),遇到的显存最大的卡应该就是p40。当时在安装p40的时候就遇到奇怪的问题,为此还写了一篇记录文章。这次遇到的核心问题和p40那次应该是同一个,都是高级显卡显存过大,需要一些特定的底层配置参数,vmware专门有博客文章(下简称博文)进行了介绍,但是我一开始寻找解决方案的方向有很大偏差,绕了一大圈才找到。

0. 准备工作
0.1. 主机BIOS设置
根据博文内容,需要在运行虚拟GPU服务器的esxi主机的BIOS设置中查找并开启类似这样的选项:“above 4G decoding”、“memory mapped I/O above 4GB”、“PCI 64-bit resource handing above 4G”

0.2. 配置PCI设备直通
在esxi主机中配置好显卡的设备直通

0.3. 上传系统安装镜像
将ubuntu server 1804安装镜像上传到主机数据存储
1. 创建虚拟机
⚠️注意:必须在vsphere client中创建虚拟机,不可以在本地workstation创建虚拟机安装后再上传!
首先按照常规方法创建虚拟机,创建后不要启动!按以下步骤进行预先配置。
1.1. 设置虚拟机引导方式为EFI

在虚拟机页面,“操作”→“编辑设置”→“虚拟机选项”→“引导选项”→“固件”,选择“EFI”。
1.2. 设置虚拟机高级配置参数

在刚才1.1的“引导选项”下方找到“高级”,点击“配置参数”右侧的“编辑配置”,进入高级参数配置窗口。

在窗口中点击“添加配置参数”,然后添加2条:
1 2 |
pciPassthru.use64bitMMIO = "TRUE" pciPassthru.64bitMMIOSizeGB = "[⚠️此处按需填写⚠️]" |
⚠️注意:“pciPassthru.64bitMMIOSizeGB”条目的值需要根据虚拟机具体直通的显卡型号和数量进行计算后得出!
博文中提供了一种简单估算方法:GPU个数 × 一块GPU的显存大小(GB为单位),然后将结果向上取整到下一个2的整数次幂。
例如,要给这个虚拟机直通两个p100显卡,则值应为:2 × 16 = 32,向上取整到下一个2的整数次幂,得到64。但如果是直通v100显卡,64则只能直通一块,因为v100单卡显存就是32G,想要直通两块就需要改为128。
⚠️注意:与通常情况不同的是不可以添加“hypervisor.cpuid.v0 = FALSE”参数!
1.3. 修改虚拟机兼容性

在虚拟机页面,“操作”→“兼容性”→“升级虚拟机兼容性”,打开“配置虚拟机兼容性窗口”。

将虚拟机兼容性改为:ESXi 6.7 Update 2 及更高版本
2. 安装操作系统
现在可以进行第一次开机,并按照提示安装操作系统。安装后,顺便进行基本的系统配置,例如换源、配置网络、软件包升级等。
3. 为虚拟机添加显卡直通
3.1. 启用内存预留

在虚拟机页面,“操作”→“编辑设置”→“虚拟硬件”→“内存”,勾选“预留所有客户机内存(全部锁定)”。
3.2. 添加显卡设备

在虚拟机页面,“操作”→“编辑设置”→“虚拟硬件”→“添加新设备”→“PCI 设备”,然后在PCI设备选项中选择需要添加的显卡。
4. 安装NVIDIA驱动和CUDA
4.1. 检查GPU
1 |
lspci | grep -i nvidia |
4.2. 禁用nouveau
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
# 若有输出则表示nouveau正在加载,需要禁用。 lsmod | grep nouveau sudo nano /etc/modprobe.d/blacklist-nouveau.conf # 文件中添加以下内容 blacklist nouveau options nouveau modeset=0 # “4.15.0-134”可能会根据机器差异发送变化,需要到具体路径下确认。 cd /lib/modules/4.15.0-134-generic/kernel/drivers/gpu/drm/nouveau sudo rm -rf nouveau.ko sudo rm -rf nouveau.ko.org sudo nano /etc/modprobe.d/blacklist.conf # 文件末尾追加以下内容 blacklist rivafb blacklist vga16fb blacklist nouveau blacklist nvidiafb blacklist rivatv # 更新并重启 sudo update-initramfs -u sudo reboot |
4.3. 安装依赖包
1 2 |
sudo apt-get install linux-headers-$(uname -r) sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev |
4.4. 安装NVIDIA驱动
默认已经下载好驱动文件
1 2 3 4 5 6 7 8 9 10 11 |
sudo apt-get remove --purge nvidia-* sudo chmod +x NVIDIA-Linux-x86_64-440.118.02.run sudo ./NVIDIA-Linux-x86_64-440.118.02.run --no-x-check --no-nouveau-check --no-opengl-files # 随后会进入一个简陋的彩色界面,根据它的指示安装就可以了。 # 重启,检查驱动是否已正常运行 sudo reboot nvidia-smi # 应该显示相应的显卡运行情况 |
4.5. 安装CUDA
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 |
sudo chmod +x ./cuda_10.2.89_440.33.01_linux.run # 安装CUDA sudo sh ./cuda_10.2.89_440.33.01_linux.run # 经过一小段时间的等待,会出现用户许可协议,输入accept然后回车继续。 # 之后会有安装选项,是一个文字模拟的选项界面。注意如果已经安装驱动,就不需要再次安装了! # 安装完成后应该显示如下信息 =========== = Summary = =========== Driver: Not Selected Toolkit: Installed in /usr/local/cuda-10.2/ Samples: Installed in /home/ubuntu/ Please make sure that - PATH includes /usr/local/cuda-10.2/bin - LD_LIBRARY_PATH includes /usr/local/cuda-10.2/lib64, or, add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf and run ldconfig as root To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.2/bin Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.2/doc/pdf for detailed information on setting up CUDA. ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 440.00 is required for CUDA 10.2 functionality to work. To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file: sudo <CudaInstaller>.run --silent --driver Logfile is /var/log/cuda-installer.log # 安装完成后应该还需要配置环境变量等操作,但我不需要在宿主机运行cuda,因此略过。 # 重启,检查CUDA是否已正常运行 sudo reboot cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery sudo make ./deviceQuery # 应该返回以下信息 ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) Detected 2 CUDA Capable device(s) Device 0: "Tesla P100-PCIE-16GB" CUDA Driver Version / Runtime Version 10.2 / 10.2 CUDA Capability Major/Minor version number: 6.0 Total amount of global memory: 16281 MBytes (17071734784 bytes) (56) Multiprocessors, ( 64) CUDA Cores/MP: 3584 CUDA Cores GPU Max Clock rate: 1329 MHz (1.33 GHz) Memory Clock rate: 715 Mhz Memory Bus Width: 4096-bit L2 Cache Size: 4194304 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Enabled Device supports Unified Addressing (UVA): Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: Yes Device PCI Domain ID / Bus ID / location ID: 0 / 11 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > Device 1: "Tesla P100-PCIE-16GB" CUDA Driver Version / Runtime Version 10.2 / 10.2 CUDA Capability Major/Minor version number: 6.0 Total amount of global memory: 16281 MBytes (17071734784 bytes) (56) Multiprocessors, ( 64) CUDA Cores/MP: 3584 CUDA Cores GPU Max Clock rate: 1329 MHz (1.33 GHz) Memory Clock rate: 715 Mhz Memory Bus Width: 4096-bit L2 Cache Size: 4194304 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Enabled Device supports Unified Addressing (UVA): Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: Yes Device PCI Domain ID / Bus ID / location ID: 0 / 19 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > > Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU1) : No > Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU0) : No deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 2 Result = PASS |