Original Post from vlenzker.net: https://vlenzker.net/2020/04/better-together-vsphere-7-and-horizon-7-12-with-nvidia-vgpus-in-high-end-vdi-environments/
The current situation is accelerating the demand for virtual desktops and a proper virtual desktop infrastructure. I have been working for years in the field of the software-defined datacenter and virtual desktops delivered via VMware Horizon. While standard office virtual desktops have become something of a commodity, hardware-accelerated graphics workloads and high-end demands are still comparatively new and not as mature as non-GPU workloads.
I am involved in a large-scale project that creates a virtual desktop landscape for engineers doing a lot of computer-aided engineering (CAE). This is a very interesting environment that comes with a lot of difficult user requirements, listed here together with how each can be solved:
- 3D acceleration required during normal “working” operations when modelling / analyzing components. (NVIDIA vGPU for the win)
- Huge amount of local persistent storage to cache models loaded/checked out from a network location. (vSAN is everywhere in the cluster, included in Horizon Advanced & brutally fast if you do it right).
- High throughput between the virtual desktops and the network location (The big benefit of a VDI setup).
- Secure access from everywhere & sharing of the session (Another big benefit of VDI).
- Windows & Linux Desktops must be supported (Horizon can do that)
- Huge amount of memory per virtual desktop, since models can sometimes be 100 to 200 GB in size and need to reside in the memory of the virtual machine. (vSphere scales indefinitely [almost :P]).
- Most engineers will require a dedicated Linux desktop where they can individually use modules/kernels/packages based on their needs (an IT management nightmare). (Dedicated user assignment with persistent VDIs).
- Different working behaviours (more than 500 different user-interaction scenarios) that cannot easily be matched to a single use-case.
- Store certain states of the virtual desktop. Loading / pre-processing might take 1-XX hours and needs to be repeated several times (VM snapshots with memory ftw).
- Certain desktops should be shared among team members (working in parallel or sequentially).
In theory we have a solution or feature in place for every requirement. From a sales talk perspective I could do the check; check; check and just sell & build the solution.
As always in IT, the risks, problems and frictions are in the details & interoperability. Don’t get me wrong: such a solution works just fine and satisfies the users (honestly: thanks to such a solution a lot of highly skilled and important engineers can still do their daily work while their offices have been closed down).
Still, I want to improve things and point to the pain points that, in my experience, should be improved. I want to demonstrate why vSphere 7 & Horizon 7.12 work better together in certain aspects, and which constraints or technical boundaries keep us from making such a platform even better.
First of all: what is the advantage of a hardware-accelerated graphics processor within a virtual desktop?
- The usage of applications that require 3D acceleration (CATIA, ANSYS, etc.)
- Better “subjective” performance of Windows 10
- Reduction of ESXi host CPU cycles (due to H.264/H.265 offloading to the GPU)
But we need to keep certain limitations in mind here. I created a few slides for an internal NVIDIA workshop I held a few months ago. The problems especially came into play because we were not able to utilize floating / stateless VDIs with GPUs (which would make things a lot easier for us, but is not acceptable for most of the users).
What are the technical limitations when utilizing vGPUs in a dedicated user assignment manner?
- No vSphere console once the GPU is used
- No Horizon Direct Connect agent for Linux
- No full version compatibility across branches for the NVIDIA GRID host driver & guest OS driver.
- No different vGPU profiles (e.g. 2Q & 4Q with 2GB / 4GB framebuffer ) on a single hardware GPU
- Full memory reservation required for vGPU usage
- No proper placement intelligence for vGPUs within Horizon or DRS. Powering on VMs really becomes some kind of wheel of fortune here.
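The "no different vGPU profiles on a single hardware GPU" rule is the one that makes placement tricky, so here is a minimal sketch of it. The 2 GB / 4 GB framebuffer values for 2Q / 4Q come from the list above; the 16 GB card size, the class name and the VM names are assumptions made up for illustration, not values from a real environment.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Framebuffer per vGPU profile in GB (2Q / 4Q as named in the list above).
PROFILE_FB_GB = {"2Q": 2, "4Q": 4}


@dataclass
class PhysicalGpu:
    framebuffer_gb: int                # hypothetical card size, e.g. 16 GB
    profile: Optional[str] = None      # profile currently active on this GPU
    vms: List[str] = field(default_factory=list)

    def free_slots(self, profile: str) -> int:
        """Slots left for `profile`; 0 if the GPU already runs another profile."""
        if self.profile not in (None, profile):
            return 0
        return self.framebuffer_gb // PROFILE_FB_GB[profile] - len(self.vms)

    def attach(self, vm: str, profile: str) -> bool:
        """Attach a VM if the profile matches and a slot is free."""
        if self.free_slots(profile) <= 0:
            return False
        self.profile = profile
        self.vms.append(vm)
        return True


gpu = PhysicalGpu(framebuffer_gb=16)
print(gpu.attach("cae-01", "4Q"))   # True: the first VM pins the GPU to 4Q
print(gpu.attach("cae-02", "2Q"))   # False: a second profile is not allowed
print(gpu.free_slots("4Q"))         # 3: 16 // 4 slots minus the one in use
```

The sketch does not release the profile again when the last VM leaves the GPU; a real model would have to handle that.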
Due to the requirements we need to size each VDI in a way that it can also handle the sudden requirements of a “huge” model. In the best case we would have some autoscaling & memory hot-plug during runtime (I am currently evaluating something like this).
This matters especially in a scenario where we would like to suspend VMs to free up resources (using vRealize Operations to find idle desktops & suspend them) and power them on again quickly. For that we need DRS to work properly, since we have to take into consideration both the available memory in the cluster & the available GPU resources on a single host.
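To make the two conditions concrete, here is a small sketch of the check a placement engine would need before powering on or resuming a desktop: enough free memory for the full reservation and a free vGPU slot of the right profile on the same host. The host names, numbers and the "most free memory wins" tie-break are assumptions for illustration; this is not how DRS is actually implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    free_mem_gb: int
    free_vgpu_slots: Dict[str, int]   # free slots per vGPU profile


@dataclass
class Desktop:
    name: str
    mem_gb: int       # fully reserved, so it must fit into free host memory
    profile: str


def candidate_hosts(vm: Desktop, hosts: List[Host]) -> List[Host]:
    """Hosts that satisfy BOTH the memory and the vGPU slot requirement."""
    return [
        h for h in hosts
        if h.free_mem_gb >= vm.mem_gb
        and h.free_vgpu_slots.get(vm.profile, 0) > 0
    ]


def pick_host(vm: Desktop, hosts: List[Host]) -> Optional[Host]:
    """Assumed tie-break: prefer the candidate with the most free memory."""
    fits = candidate_hosts(vm, hosts)
    return max(fits, key=lambda h: h.free_mem_gb) if fits else None


cluster = [
    Host("esx01", free_mem_gb=512, free_vgpu_slots={"4Q": 0}),  # memory, no slot
    Host("esx02", free_mem_gb=128, free_vgpu_slots={"4Q": 2}),  # slot, no memory
    Host("esx03", free_mem_gb=384, free_vgpu_slots={"4Q": 1}),  # both
]
target = pick_host(Desktop("cae-linux-01", mem_gb=256, profile="4Q"), cluster)
print(target.name if target else "no suitable host")   # esx03
```

Checking only one of the two resources is what leads to the wheel of fortune: esx01 and esx02 each look fine from one perspective and still cannot start the desktop.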
So far we were only able to describe the placement characteristics on a single ESXi host, via its GPU assignment policy.
That means that on a single host with multiple physical GPUs the local placement tried either to consolidate the vGPUs or not.
The diagram demonstrates “Group VMs on GPU until full (GPU consolidation)”.
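The two host-local behaviours can be sketched in a few lines. This is a simplified model under my own assumptions (one profile, the same number of slots on every GPU, hypothetical function names); it is not the ESXi implementation.

```python
from typing import List, Optional


def pick_gpu(vms_per_gpu: List[int], capacity: int, consolidate: bool) -> Optional[int]:
    """Return the index of the GPU for the next VM, or None if all are full."""
    open_gpus = [i for i, n in enumerate(vms_per_gpu) if n < capacity]
    if not open_gpus:
        return None
    if consolidate:
        # Group VMs on a GPU until it is full: take the fullest GPU with room.
        return max(open_gpus, key=lambda i: vms_per_gpu[i])
    # Otherwise spread: take the emptiest GPU.
    return min(open_gpus, key=lambda i: vms_per_gpu[i])


def place(n_vms: int, n_gpus: int, capacity: int, consolidate: bool) -> List[int]:
    """Place n_vms one after another and return the VM count per GPU."""
    load = [0] * n_gpus
    for _ in range(n_vms):
        i = pick_gpu(load, capacity, consolidate)
        if i is None:
            break
        load[i] += 1
    return load


print(place(5, n_gpus=4, capacity=4, consolidate=True))    # [4, 1, 0, 0]
print(place(5, n_gpus=4, capacity=4, consolidate=False))   # [2, 1, 1, 1]
```

With consolidation two GPUs stay completely empty and remain available for a different profile, while spreading touches all four GPUs with the same profile.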
That was quite useful when you use different profiles within a cluster. But DRS was still not capable of selecting a host that has a suitable GPU with a free slot in place. (That’s why the usage of DRS is not supported by NVIDIA GRID until vSphere 7).
The result was weird errors within vSphere until a suitable host had been found.
For sure we worked around this with a script that detects vGPU profiles & assigns the VMs to the proper DRS groups. But this is something I want to have out of the box.
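The core idea of such a workaround is just a mapping from vGPU profile to DRS VM group. The sketch below shows only that grouping step; it is not the script we actually use. The VM names, the profile strings and the "vgpu-" group prefix are hypothetical, and reading the inventory or creating the groups and rules through the vSphere API is deliberately left out.

```python
from collections import defaultdict
from typing import Dict, List


def drs_groups_by_profile(vm_profiles: Dict[str, str], prefix: str = "vgpu-") -> Dict[str, List[str]]:
    """Map each vGPU profile to one DRS VM group (group naming is assumed)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for vm, profile in sorted(vm_profiles.items()):
        groups[prefix + profile.lower()].append(vm)
    return dict(groups)


# Hypothetical inventory: VM name -> vGPU profile detected on the VM.
inventory = {
    "cae-linux-01": "grid_p40-4q",
    "cae-win-07": "grid_p40-2q",
    "cae-linux-02": "grid_p40-4q",
}

for group, vms in drs_groups_by_profile(inventory).items():
    print(group, vms)
```

Each resulting VM group would then be tied to a host group whose GPUs are meant for that profile.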
-> Voilà: vSphere 7 finally has support for Assignable Hardware, solving the DRS problem for initial placement.
Niels Hagoort gave us a great summary of the characteristics here.
Horizon 7.12 is already supported on vSphere 7. So once NVIDIA has certified vSphere 7 as a supported hypervisor, we will finally get this problem solved (I will try to summarize my first unsupported hands-on experience during the next weeks).
What hasn’t changed so far is that a full memory reservation is still required.
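A bit of back-of-the-envelope arithmetic shows why this matters for sizing: with fully reserved memory, the number of desktops per host is the smaller of a memory bound and a GPU bound. All numbers below are made up for illustration, and hypervisor overhead is ignored.

```python
def desktops_per_host(host_mem_gb: int, vm_mem_gb: int,
                      gpus: int, gpu_fb_gb: int, profile_fb_gb: int) -> int:
    """Desktops that fit on one host when memory is fully reserved."""
    by_memory = host_mem_gb // vm_mem_gb            # reservations cannot overcommit
    by_gpu = gpus * (gpu_fb_gb // profile_fb_gb)    # vGPU slots across all GPUs
    return min(by_memory, by_gpu)


# Hypothetical host: 1536 GB RAM, 2 GPUs with 24 GB framebuffer, 4 GB profile.
print(desktops_per_host(1536, 200, gpus=2, gpu_fb_gb=24, profile_fb_gb=4))  # 7
print(desktops_per_host(1536, 32, gpus=2, gpu_fb_gb=24, profile_fb_gb=4))   # 12
```

For desktops sized for a 200 GB model, memory is the limit long before the GPUs are full; for small desktops it is the other way round.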
I will dive deeper into how we can rightsize & dynamically adjust virtual desktops at runtime over the next months. If you like this post, let me know and I will keep you updated :)