[记录]配置Docker出现的错误

  为了方便管理实验室的GPU资源,上了多层虚拟化,在部署和运维Docker节点的过程中踩了一些坑,在这记录一下,以便日后查阅。

1、Docker服务无法启动

问题定位: Docker启动的时候传入了命令行参数,同时也指定了配置文件,两个配置发生了冲突。

解决方法:

[记录]英伟达Tesla K80显卡直通出现未知错误

  实验室添置了新显卡,超微塔式服务器没地方了,只好把旧显卡换到戴尔刀片里,之后在做显卡直通的时候遇到一些莫名其妙的问题,折腾了好久才解决,在这记录一下。

环境:

实体机:DELL PowerEdge R730
操作系统: VMware ESXi, 6.5.0, 6765664
管理平台: VMware vCenter Server Appliance, 6.7.0.21000, 11726888
设备: Nvidia Tesla K80

  问题是这样的:在主机上开虚拟机,并直通其他PCI设备(usb控制器),一切正常,且虚拟机正常启动,正常识别设备。给虚拟机直通Nvidia Tesla K80,虚拟机无法开机,报未知异常,如下图:

  尝试了很多国内外常见的解决方案都无法生效。怀疑是跟显卡型号有关,最后在官方社区找到一个出现完全相同问题的讨论:Passing through Tesla k80 Issue…。下面有一个官方人员的回答是:

Re: Passing through Tesla k80 Issue…

A previous version of this post included advice to add two VMX file entries (efi.legacyBoot.enabled and efi.bootOrder) as part of the solution. These two settings should NOT be used. Instead, following the directions below.
 ——–
You should be able to pass a single GPU (that is, half of a K80) to a VM running on ESX 6 by creating an EFI-bootable VM, doing an EFI installation of your guest OS, and then adding the following to the VM’s VMX file.
pciPassthru.use64bitMMIO=”TRUE”
Trying to pass more than one of these GPUs into the same VM will currently hit a platform memory limit and the VM will fail to boot. (NOTE: This limit has been removed in ESX 6.5).
A smaller card like the K2 does not have this issue: GPGPU Blog Entry
If the above does not work for you, send me email directly at “simons at vmware dot com”. In either case, please share your experience with others on the thread.
And if you have any other questions about running HPC applications in a VMware environment, I’d be happy to hear from you directly.
If you are interested in learning more of what we’ve been doing related to HPC, you can check out our HPC entries on the VMware CTO blog site here: HPC Blog Entries
 
Josh Simons
High Performance Computing
Office of the CTO
VMware, Inc.

  关键是要添加以下高级参数: