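First, prepare an Ubuntu Server 16.04 virtual machine with only the basic settings. Do not add special features such as GPU passthrough yet, so that a snapshot can be taken once the base environment is complete.
After installing the system, do the basic setup (configure the SSH port, switch the apt mirror, run apt upgrade, and so on), then shut down and take a snapshot.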
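1. GPU passthrough
1.1. Add GPU passthrough
Check "Reserve all guest memory (All locked)", add the GPU as a PCI device, and confirm the change. If the GPU is not passed through at this first step, the later driver installation may run into problems!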

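1.2. Modify the virtual machine's hardware configuration parameters
Switch to the "Virtual Hardware" tab. Expand "Advanced" and edit "Configuration Parameters". In the "Configuration Parameters" dialog that opens, click "Add parameter", enter "hypervisor.cpuid.v0 = FALSE", and confirm the change. Afterwards, be sure to open "Configuration Parameters" again and check that the parameter really has been added and saved!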
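If SSH access to the ESXi host is available, the saved parameter can also be checked in the virtual machine's .vmx file instead of the dialog. The following is only a minimal sketch: the function name `has_cpuid_flag` and the example datastore path are hypothetical, and it assumes the parameter is stored in the usual `key = "value"` form.

```shell
# Hypothetical helper: succeed if the given .vmx file contains
# hypervisor.cpuid.v0 set to FALSE (case-insensitive, quotes optional).
has_cpuid_flag() {
    grep -Eiq '^[[:space:]]*hypervisor\.cpuid\.v0[[:space:]]*=[[:space:]]*"?FALSE"?[[:space:]]*$' "$1"
}

# Example (the datastore path is an assumption; adjust it to your VM):
# has_cpuid_flag /vmfs/volumes/datastore1/ubuntu/ubuntu.vmx && echo "parameter present"
```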


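1.3. Boot the virtual machine and configure the basic environment
Configure the network, switch the package mirror, update packages, and so on.
After the basic configuration is complete, taking a snapshot of the virtual machine is recommended!
2. Install the GPU driver
2.1. Download the driver
The required driver can be found and downloaded from the NVIDIA driver download page or the GeForce driver download page. Alternatively, the download link can be assembled directly from the pattern
http://cn.download.nvidia.com/XFree86/Linux-x86_64/[driver version]/NVIDIA-Linux-x86_64-[driver version].run
For example, to download the Linux driver version 430.14, run the following directly on the server: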
    cd ~/downloads
    wget https://cn.download.nvidia.cn/XFree86/Linux-x86_64/430.14/NVIDIA-Linux-x86_64-430.14.run
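The link pattern quoted above can be wrapped in a small helper so that only the version number has to be typed. This is a sketch; the function name `nvidia_driver_url` is illustrative and not part of any NVIDIA tooling.

```shell
# Build the download URL for a given driver version, following the
# pattern quoted above. The function name is illustrative.
nvidia_driver_url() {
    version="$1"
    printf 'http://cn.download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$version" "$version"
}

# Example:
# wget -P ~/downloads "$(nvidia_driver_url 430.14)"
```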
2.2. Check the GPU
The following command should list the corresponding PCI devices:
    lspci | grep -i nvidia
2.3. Disable nouveau
    # If this prints anything, nouveau is loaded and needs to be disabled.
    lsmod | grep nouveau

    cd /etc/modprobe.d
    sudo touch blacklist-nouveau.conf
    sudo nano blacklist-nouveau.conf
    # Add the following to the file
    blacklist nouveau
    options nouveau modeset=0

    # "4.4.0-83" may differ from machine to machine; confirm it under the actual path.
    cd /lib/modules/4.4.0-83-generic/kernel/drivers/gpu/drm/nouveau
    sudo rm -rf nouveau.ko
    sudo rm -rf nouveau.ko.org

    sudo nano /etc/modprobe.d/blacklist.conf
    # Append the following to the end of the file
    blacklist rivafb
    blacklist vga16fb
    blacklist nouveau
    blacklist nvidiafb
    blacklist rivatv

    # Update the initramfs and reboot
    sudo update-initramfs -u
    sudo reboot
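The blacklist file can also be written without opening an editor, and the kernel version in the module path can be taken from `uname -r` instead of being typed by hand. The following is a sketch under those assumptions; both function names are hypothetical.

```shell
# Write the nouveau blacklist file into the given modprobe.d directory.
# The directory is a parameter so the function can be tried on a scratch
# directory first; on the real system it would be /etc/modprobe.d (as root).
write_nouveau_blacklist() {
    conf_dir="$1"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' \
        > "$conf_dir/blacklist-nouveau.conf"
}

# Print the nouveau module directory for the running kernel, so the
# kernel version does not have to be looked up manually.
nouveau_module_dir() {
    printf '/lib/modules/%s/kernel/drivers/gpu/drm/nouveau\n' "$(uname -r)"
}

# Example:
# write_nouveau_blacklist /etc/modprobe.d
# ls "$(nouveau_module_dir)"
```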
2.3. Install the GPU driver
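    sudo apt-get remove --purge 'nvidia-*'
    cd ~/downloads
    sudo chmod +x NVIDIA-Linux-x86_64-430.14.run
    sudo ./NVIDIA-Linux-x86_64-430.14.run --no-x-check --no-nouveau-check --no-opengl-files
    # --no-x-check        do not abort the installation if a running X server is detected
    # --no-nouveau-check  do not abort the installation if the nouveau driver is in use
    # --no-opengl-files   install only the driver files, without the OpenGL files
    # A rather plain colored text interface follows; install according to its prompts.

    # Reboot, then check that the driver is running properly
    sudo reboot
    nvidia-smi
    # The status of the GPU(s) should be displayed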
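The CUDA 10.0 installer used later in this guide reports that it needs a driver of at least version 384.00, so it can be worth comparing the installed driver version against that number before continuing. The sketch below is one way to do the comparison; the function name `version_at_least` is hypothetical, and it assumes `sort -V` from GNU coreutils is available (it is on Ubuntu).

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example, on a machine where the driver is installed:
# installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
# version_at_least "$installed" 384.00 && echo "driver is new enough for CUDA 10.0"
```

The smaller of the two versions is sorted to the top; if that is the required minimum, the installed version is at least as new.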
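After installing the driver, taking a snapshot of the virtual machine is recommended!
Then boot and install Docker; the installation method can be found in the help documentation of the Tsinghua mirror. After installation, configure Docker remote management, referring to "docker 配置 TLS 认证开启远程访问" and "Docker 守护进程+远程连接+安全访问+启动冲突解决办法 (完整收藏版)". Once Docker is configured, shut down and take a snapshot.
Add GPU passthrough in ESXi and boot.
Added on June 28, 2020
2.4. Install CUDA
The CUDA installer used here was downloaded when this article was first completed in June 2019, so the installation method is also the one from that time. Judging from the CUDA installation documentation on NVIDIA's website, the currently recommended method appears to be installation through the apt package manager. However, when I followed the latest documentation I ran into two problems: first, none of the download resources (including the repository signing key and the packages) could be downloaded normally from mainland China; second, a GUI appeared on the server system after installation. I therefore decided to use last year's installer and installation method.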
    cd ~/downloads
    sudo chmod +x ./cuda_10.0.130_410.48_linux.run

    # Install the dependencies
    sudo apt-get install linux-headers-$(uname -r)
    sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

    # Install CUDA
    sudo sh ./cuda_10.0.130_410.48_linux.run
    # After a short wait, a very long end-user license agreement appears. Hold Enter until it ends; there is no risk of pressing too long, because "accept" must be typed before the installer continues.
    # The installer then asks questions similar to the ones below. Note that if the driver is already installed, it must not be installed again!
    Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
    (y)es/(n)o/(q)uit: n

    Install the CUDA 10.0 Toolkit?
    (y)es/(n)o/(q)uit: y

    Enter Toolkit Location
     [ default is /usr/local/cuda-10.0 ]:

    Do you want to install a symbolic link at /usr/local/cuda?
    (y)es/(n)o/(q)uit: y

    Install the CUDA 10.0 Samples?
    (y)es/(n)o/(q)uit: y

    Enter CUDA Samples Location
     [ default is /home/ubuntu ]:

    Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
    Installing the CUDA Samples in /home/ubuntu ...
    Copying samples to /home/ubuntu/NVIDIA_CUDA-10.0_Samples now...
    Finished copying samples.

    # When the installation finishes, the following should be shown
    ===========
    = Summary =
    ===========

    Driver:   Not Selected
    Toolkit:  Installed in /usr/local/cuda-10.0
    Samples:  Installed in /home/ubuntu

    Please make sure that
     -   PATH includes /usr/local/cuda-10.0/bin
     -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

    To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin

    Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

    ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 10.0 functionality to work.
    To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
        sudo <CudaInstaller>.run -silent -driver

    Logfile is /tmp/cuda_install_6983.log

    # Environment variables and similar settings would normally still have to be configured after installation, but I do not need to run CUDA on the host itself, so I skip that.

    # Reboot, then check that CUDA is working properly
    sudo reboot
    cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
    sudo make
    ./deviceQuery
    # The following should be returned
    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 2 CUDA Capable device(s)

    Device 0: "GeForce RTX 2080 Ti"
      CUDA Driver Version / Runtime Version          10.2 / 10.0
      CUDA Capability Major/Minor version number:    7.5
      Total amount of global memory:                 11019 MBytes (11554717696 bytes)
      (68) Multiprocessors, ( 64) CUDA Cores/MP:     4352 CUDA Cores
      GPU Max Clock rate:                            1545 MHz (1.54 GHz)
      Memory Clock rate:                             7000 Mhz
      Memory Bus Width:                              352-bit
      L2 Cache Size:                                 5767168 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  1024
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 1: "GeForce RTX 2080 Ti"
      CUDA Driver Version / Runtime Version          10.2 / 10.0
      CUDA Capability Major/Minor version number:    7.5
      Total amount of global memory:                 11019 MBytes (11554717696 bytes)
      (68) Multiprocessors, ( 64) CUDA Cores/MP:     4352 CUDA Cores
      GPU Max Clock rate:                            1545 MHz (1.54 GHz)
      Memory Clock rate:                             7000 Mhz
      Memory Bus Width:                              352-bit
      L2 Cache Size:                                 5767168 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  1024
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : No
    > Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : No

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.0, NumDevs = 2
    Result = PASS
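The installer summary asks for /usr/local/cuda-10.0/bin in PATH and /usr/local/cuda-10.0/lib64 in LD_LIBRARY_PATH, a step this guide skips. For readers who do need CUDA on the machine itself, the following is a minimal sketch of adding those directories without creating duplicates on repeated logins; the function name `prepend_path` is hypothetical.

```shell
# Prepend a directory to a colon-separated path list unless it is already
# present, and print the resulting list.
prepend_path() {
    dir="$1"
    list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *) printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}

# Example (could be placed in ~/.profile):
# export PATH="$(prepend_path /usr/local/cuda-10.0/bin "$PATH")"
# export LD_LIBRARY_PATH="$(prepend_path /usr/local/cuda-10.0/lib64 "${LD_LIBRARY_PATH:-}")"
```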
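After installing CUDA, taking a snapshot of the virtual machine is recommended!
3. Install Docker
3.1. Install Docker
Install Docker by following the help documentation of the Tsinghua open-source mirror site.
3.2. Install nvidia-docker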
    cd ~
    # Add the package repository
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
    sudo apt-get install -y nvidia-docker2

    # Modify the Docker configuration
    sudo pkill -SIGHUP dockerd
    sudo nano /etc/docker/daemon.json
    # Sample configuration file; $PORT is the port number you chose
    {
        "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:$PORT"],
        "registry-mirrors": ["https://******.mirror.aliyuncs.com"],
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    # The startup arguments in the main unit file conflict with the configuration above, so restarting Docker fails with:
    Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
    # Therefore, modify the main unit file as well
    sudo nano /lib/systemd/system/docker.service
    # Change ExecStart to the form below, dropping the original arguments
    # "-H fd:// --containerd=/run/containerd/containerd.sock" (systemd does not support trailing comments on the same line):
    ExecStart=/usr/bin/dockerd

    # Reload the unit files and restart Docker
    sudo systemctl daemon-reload
    sudo systemctl restart docker
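Editing /lib/systemd/system/docker.service directly works, but that file belongs to the Docker package and may be replaced on upgrade. A commonly used alternative is a systemd drop-in that clears and redefines ExecStart. The sketch below assumes the standard drop-in location /etc/systemd/system/docker.service.d; the helper name `write_docker_dropin` and the file name override.conf are illustrative.

```shell
# Write a drop-in that resets ExecStart so that dockerd takes its listening
# addresses from /etc/docker/daemon.json only.
# The empty "ExecStart=" line clears the value inherited from the main unit.
write_docker_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    printf '%s\n' '[Service]' 'ExecStart=' 'ExecStart=/usr/bin/dockerd' \
        > "$dropin_dir/override.conf"
}

# Example (as root), followed by the usual reload and restart:
# write_docker_dropin /etc/systemd/system/docker.service.d
# systemctl daemon-reload && systemctl restart docker
```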
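After installing and configuring Docker, taking a snapshot of the virtual machine is recommended!
4. Configure TLS remote management and connect Portainer
TLS remote management is configured on the Docker server (the server running the Docker daemon), so $HOST below is the IP address of the Docker server!
4.1. Modify the CA section of the openssl configuration
On Ubuntu 16.04 the openssl configuration file is located at /usr/lib/ssl/openssl.cnf; other systems can use this as a reference.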
    sudo su
    nano /usr/lib/ssl/openssl.cnf
    # Modify the following section; the path is only an example
    [ CA_default ]
    dir = /etc/openssl

    # Create the required directories and files
    mkdir /etc/openssl
    cd /etc/openssl
    mkdir -p {certs,private,tls,crl,newcerts}
    # Note that serial must contain a number, otherwise an error is reported
    echo 00 > serial
    touch index.txt
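The directory setup above can be collected into one function that is safe to run more than once, so an existing serial number is not reset by accident. This is a sketch; the function name `init_ca_dir` is hypothetical, and the `chmod 700` on the private directory is an added precaution that is not part of the original steps.

```shell
# Create the CA directory layout used above under the given directory.
init_ca_dir() {
    ca_dir="$1"
    mkdir -p "$ca_dir/certs" "$ca_dir/private" "$ca_dir/tls" \
             "$ca_dir/crl" "$ca_dir/newcerts"
    # serial must contain a number; keep an existing one untouched.
    [ -s "$ca_dir/serial" ] || echo 00 > "$ca_dir/serial"
    # index.txt starts out empty; keep an existing one untouched.
    [ -e "$ca_dir/index.txt" ] || : > "$ca_dir/index.txt"
    # Added precaution: keep the private key directory closed to other users.
    chmod 700 "$ca_dir/private"
}

# Example (as root):
# init_ca_dir /etc/openssl
```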
4.2. Generate the private key and self-sign the certificate
    # Generate the CA private key. A passphrase must be set for it; it is important and has to be entered in later steps
    openssl genrsa -out private/cakey.pem -des 4096
    # Output:
    Generating RSA private key, 4096 bit long modulus
    ..............++
    .........................................................................................................................++
    e is 65537 (0x10001)
    Enter pass phrase for private/cakey.pem:
    Verifying - Enter pass phrase for private/cakey.pem:

    # Self-sign the CA certificate
    openssl req -new -x509 -key private/cakey.pem -days 3650 -out cacert.pem
    # Output:
    # The CA key passphrase is requested first
    Enter pass phrase for private/cakey.pem:
    # Then some information has to be filled in; Common Name is required! $HOST is the IP of the Docker server (the server running the Docker daemon)
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:CN
    State or Province Name (full name) [Some-State]:Beijing
    Locality Name (eg, city) []:Beijing
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:BISTU
    Organizational Unit Name (eg, section) []:PCIJLab
    Common Name (e.g. server FQDN or YOUR name) []:$HOST
    Email Address []:
4.3. Issue the certificate
    # Generate the key file for the certificate to be issued
    openssl genrsa -out private/docker.key 4096
    # Output:
    Generating RSA private key, 4096 bit long modulus
    ...................++
    ...................................................++
    e is 65537 (0x10001)

    # Generate the certificate signing request
    openssl req -new -key private/docker.key -days 3650 -out docker.csr
    # The request information has to be filled in; Common Name is required! $HOST is the IP of the Docker server (the server running the Docker daemon)
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:CN
    State or Province Name (full name) [Some-State]:Beijing
    Locality Name (eg, city) []:Beijing
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:BISTU
    Organizational Unit Name (eg, section) []:PCIJLab
    Common Name (e.g. server FQDN or YOUR name) []:$HOST
    Email Address []:
    # The challenge password and company name below can both be left blank
    Please enter the following 'extra' attributes
    to be sent with your certificate request
    A challenge password []:
    An optional company name []:

    # Issue the certificate
    openssl ca -in docker.csr -out certs/docker.crt -days 3650
    # The CA key passphrase is requested
    Using configuration from /usr/lib/ssl/openssl.cnf
    Enter pass phrase for /etc/openssl/private/cakey.pem:
    # The certificate details are then shown and confirmation is requested
    Check that the request matches the signature
    Signature ok
    Certificate Details:
            Serial Number: 0 (0x0)
            Validity
                Not Before: Jul 8 10:07:47 2020 GMT
                Not After : Jul 6 10:07:47 2030 GMT
            Subject:
                countryName               = CN
                stateOrProvinceName       = Beijing
                organizationName          = BISTU
                organizationalUnitName    = PCIJLab
                commonName                = $HOST
            X509v3 extensions:
                X509v3 Basic Constraints:
                    CA:FALSE
                Netscape Comment:
                    OpenSSL Generated Certificate
                X509v3 Subject Key Identifier:
                    9A:21:68:DD:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF
                X509v3 Authority Key Identifier:
                    keyid:19:D5:EF:76:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF:FF

    Certificate is to be certified until Jul 6 10:07:47 2030 GMT (3650 days)
    Sign the certificate? [y/n]:y

    1 out of 1 certificate requests certified, commit? [y/n]y
    Write out database with 1 new entries
    Data Base Updated
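The steps above are interactive. To rehearse the flow in a scratch directory before touching /etc/openssl, a non-interactive variant can be sketched as follows. Note the differences, which are assumptions of this sketch and not the procedure described in this article: it signs with `openssl x509 -req` instead of `openssl ca` (so no serial or index.txt files are involved), it leaves the keys unencrypted, and it uses 2048-bit keys and a 30-day validity to keep the test run short. The function name and file names are hypothetical.

```shell
# Create a throwaway CA and a certificate for host $2 inside directory $1,
# without interactive prompts. For test runs only: keys are unencrypted.
make_test_certs() {
    dir="$1"
    host="$2"
    openssl genrsa -out "$dir/ca.key" 2048 2>/dev/null &&
    openssl req -new -x509 -key "$dir/ca.key" -days 30 \
        -subj "/CN=Test CA" -out "$dir/ca.pem" &&
    openssl genrsa -out "$dir/server.key" 2048 2>/dev/null &&
    openssl req -new -key "$dir/server.key" \
        -subj "/CN=$host" -out "$dir/server.csr" &&
    openssl x509 -req -in "$dir/server.csr" -CA "$dir/ca.pem" \
        -CAkey "$dir/ca.key" -CAcreateserial -days 30 \
        -out "$dir/server.crt" 2>/dev/null
}

# Example (192.0.2.10 is a documentation address standing in for $HOST):
# d=$(mktemp -d) && make_test_certs "$d" 192.0.2.10 &&
#     openssl verify -CAfile "$d/ca.pem" "$d/server.crt"
```

The same `openssl verify -CAfile cacert.pem certs/docker.crt` check can be run in /etc/openssl against the real certificate issued above.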
4.4. Configure Docker to use TLS authentication
    # Copy over the generated key and certificates
    cd /etc/docker/
    mkdir ssl
    cd /etc/docker/ssl
    cp /etc/openssl/cacert.pem ./ca.pem
    cp /etc/openssl/newcerts/00.pem ./cert.pem
    cp /etc/openssl/private/docker.key ./key.pem

    # Modify the configuration file
    nano /etc/docker/daemon.json
    # Sample configuration file; $PORT is the port number you chose
    {
        "tlsverify": true,
        "tlscacert": "/etc/docker/ssl/ca.pem",
        "tlscert": "/etc/docker/ssl/cert.pem",
        "tlskey": "/etc/docker/ssl/key.pem",
        "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:$PORT"],
        "registry-mirrors": ["https://******.mirror.aliyuncs.com"],
        "default-shm-size": "32G",
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    # Restart Docker
    systemctl restart docker
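If one of the three files referenced in daemon.json is missing, the Docker daemon will not start, so a quick check before the restart can save a round of debugging. The following is a minimal sketch; the function name `check_tls_files` is hypothetical.

```shell
# Succeed only if ca.pem, cert.pem and key.pem all exist and are non-empty
# in the given directory; report each missing file on stderr otherwise.
check_tls_files() {
    ssl_dir="$1"
    missing=0
    for name in ca.pem cert.pem key.pem; do
        if [ ! -s "$ssl_dir/$name" ]; then
            echo "missing or empty: $ssl_dir/$name" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example:
# check_tls_files /etc/docker/ssl && systemctl restart docker
```

After the restart, the connection can be tried from a client that holds the three files, for instance with `docker --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem -H=$HOST:<port> version`, where `<port>` stands for the port chosen in daemon.json.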
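4.5. Connect Portainer
If the server has already been added and only the TLS configuration is being updated, only the corresponding certificates need to be uploaded: copy the cert.pem and key.pem files from the /etc/docker/ssl/ directory to the local machine and upload them to Portainer.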

If it is a new server, add it with the following steps:


After TLS remote management is configured, taking a snapshot of the virtual machine is recommended!