
VMQ issues with NIC Teaming


Hi All

Apologies if this is a long one but I thought the more information I can provide the better.

We have recently designed and built a new Hyper-V environment for a client, utilising Windows Server 2012 R2 / System Center 2012 R2. However, since putting it into production, we have been seeing problems with Virtual Machine Queues (VMQ). These manifest themselves as very high latency inside virtual machines (we’re talking 200 – 400 ms round trip times), packet loss or complete connectivity loss for VMs. Not all VMs are affected, but the problem does manifest itself on all hosts. I am aware of similar issues having cropped up in the past with Broadcom NICs.

I'll give you a little bit of background into the problem...

First, the environment is based entirely on Dell hardware (EqualLogic storage, PowerConnect switching and PE R720 VM hosts). It was originally built on Server 2012, and a decision was taken to bring it up to R2. This was due to a number of quite compelling reasons, mainly surrounding reliability. The core virtualisation infrastructure consists of four VM hosts in a Hyper-V Cluster.

Prior to the redesign, each VM host had 12 NICs installed:

  • Quad port on-board Broadcom 5720 daughter card: Two NICs assigned to a host management team whilst the other two NICs in the same adapter formed a Live Migration / Cluster heartbeat team, to which a VM switch was connected with two vNICs exposed to the management OS. Latest drivers and firmware installed. The Converged Fabric team here was configured in LACP Address Hash (Min Queues mode), each NIC having the same two processor cores assigned. The management team is identically configured.

  • Two additional Intel i350 quad port NICs: 4 ports teamed for the production VM Switch uplink and 4 for iSCSI MPIO. Latest drivers and firmware. The VM Switch team spans both physical NICs to provide some level of NIC level fault tolerance, whilst the remaining 4 ports for iSCSI MPIO are also balanced across the two NICs for the same reason.

The initial driver for upgrading was that we were once again seeing issues with VMQ on the converged fabric in the old design. The two vNICs in the management OS for the Live Migration and Cluster Heartbeat networks were tagged to specific VLANs (which were accessible to the same designated NICs in each of the VM hosts).

In this setup, we experienced an issue similar to our present one. The Converged Fabric vNICs in the Host OS would, on occasion, either lose connectivity or exhibit very high round trip times and packet loss. This seemed to correlate with a significant increase in bandwidth through the converged fabric, such as when initiating a Live Migration, and would affect both vNICs' connectivity. This would cause packet loss / connectivity loss for both the Live Migration and Cluster Heartbeat vNICs, which in turn would trigger all sorts of horrid goings on in the cluster. If we disabled VMQ on the physical adapters and the team multiplex adapter, the problem went away. Obviously disabling VMQ is something that we really don’t want to resort to.

So…. The decision to refresh the environment with 2012 R2 across the board (which was also driven by other factors and not just this issue alone) was accelerated.

In the new environment, we replaced the Quad Port Broadcom 5720 Daughter Cards in the hosts with new Intel i350 QP Daughter cards to keep the NICs identical across the board. The Cluster heartbeat / Live Migration networks now use an SMB Multichannel configuration, utilising the same two NICs as in the old design in two isolated untagged port VLANs. This part of the re-design is now working very well (Live Migrations now complete much faster I hasten to add!!)

However…. The same VMQ issues that we witnessed previously have now arisen on the production VM Switch which is used to uplink the virtual machines on each host to the outside world.

The Production VM Switch is configured as follows:

  • Same configuration as the original infrastructure: 4 Intel 1GbE i350 NICs, two of which are in one physical quad port NIC, whilst the other two are in an identical NIC, directly below it. The remaining 2 ports from each card function as iSCSI MPIO interfaces to the SAN. We did this to try and achieve NIC level fault tolerance. The latest Firmware and Drivers have been installed for all hardware (including the NICs) fresh from the latest Dell Server Updates DVD (V14.10).

  • In each host, the above 4 VM Switch NICs are formed into a Switch Independent, Dynamic team (Sum of Queues mode). Each physical NIC has RSS disabled and VMQ enabled, and the Team Multiplex adapter also has RSS disabled and VMQ enabled. Secondly, each NIC is configured to use a single processor core for VMQ. As this is a Sum of Queues team, cores do not overlap, and as the host processors have Hyper Threading enabled, only cores (not logical execution units) are assigned to RSS or VMQ. The configuration of the VM Switch NICs looks as follows when running Get-NetAdapterVMQ on the hosts:

Name                  InterfaceDescription               Enabled BaseVmqProcessor MaxProcessors NumberOfReceiveQueues
----                  --------------------               ------- ---------------- ------------- ---------------------
VM_SWITCH_ETH01       Intel(R) Gigabit 4P I350-t A...#8  True    0:10             1             7
VM_SWITCH_ETH03       Intel(R) Gigabit 4P I350-t A...#7  True    0:14             1             7
VM_SWITCH_ETH02       Intel(R) Gigabit 4P I350-t Ada...  True    0:12             1             7
VM_SWITCH_ETH04       Intel(R) Gigabit 4P I350-t A...#2  True    0:16             1             7
Production VM Switch  Microsoft Network Adapter Mult...  True    0:0                            28
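
As a rough way to sanity-check a layout like this offline, here is a minimal Python sketch. It is not part of the original setup: the function names are hypothetical, the values are typed in by hand from the Get-NetAdapterVMQ output, and it assumes that with Hyper Threading enabled only even-numbered logical processors are used for VMQ, so each additional processor is two logical processors further on.

```python
def vmq_processor_range(base, max_processors, hyperthreading=True):
    """Return the logical processors a NIC's VMQ may use.

    Assumption: with Hyper-Threading enabled, only even-numbered logical
    processors are counted, so each step moves to the next physical core.
    """
    step = 2 if hyperthreading else 1
    return [base + i * step for i in range(max_processors)]


def check_vmq_layout(adapters, hyperthreading=True):
    """Report overlapping or odd-numbered VMQ processor assignments.

    adapters maps adapter name -> (BaseVmqProcessor number, MaxProcessors).
    Returns a list of human-readable problems; empty means none found.
    """
    problems = []
    owner = {}  # logical processor -> first adapter that claimed it
    for name, (base, max_procs) in adapters.items():
        if hyperthreading and base % 2:
            problems.append(f"{name}: base processor {base} is odd (a hyper-thread sibling)")
        for cpu in vmq_processor_range(base, max_procs, hyperthreading):
            if cpu in owner:
                problems.append(f"{name}: processor {cpu} already used by {owner[cpu]}")
            else:
                owner[cpu] = name
    return problems


# Values copied by hand from the Get-NetAdapterVMQ output (processor group 0).
team = {
    "VM_SWITCH_ETH01": (10, 1),
    "VM_SWITCH_ETH02": (12, 1),
    "VM_SWITCH_ETH03": (14, 1),
    "VM_SWITCH_ETH04": (16, 1),
}
print(check_vmq_layout(team) or "no overlaps found")
```

For the four team members listed here the check finds nothing to report, which matches the intent of a Sum of Queues team where no two NICs share a core. It only inspects the numbers. It says nothing about how the driver or the team actually places queues at runtime.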

Load is hardly an issue on these NICs and a single core seems to have sufficed in the old design, so this was carried forward into the new.

The loss of connectivity / high latency (200 – 400 ms as before) only seems to arise when a VM is moved via Live Migration from host to host. If I set up a constant ping to a test candidate VM and move it to another host, I get about 5 dropped pings at the point where the remaining memory pages / CPU state are transferred, followed by a dramatic increase in latency once the VM is up and running on the destination host. It seems as though the destination host is struggling to allocate the VM NIC to a queue. I can then move the VM back and forth between hosts and the problem may or may not occur again. It is very intermittent. There is always a lengthy pause in VM network connectivity during the live migration process, however, and it is longer than I have seen in the past (usually only a ping or two are lost, but we are now seeing 5 or more before VM network connectivity is restored on the destination host, this being enough to cause a disruption to the workload).
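
To put numbers on a test like this rather than eyeballing the ping window, something along these lines could be used. It is a minimal Python sketch with a hypothetical helper name. The sample data is made up to resemble the behaviour described above, not captured from the hosts, and the 200 ms threshold is simply the lower end of the latency range mentioned in this post.

```python
def summarize_pings(rtts_ms, high_latency_ms=200.0):
    """Summarize a run of ping results.

    rtts_ms is a list of round-trip times in milliseconds, with None for
    each lost reply. Returns the longest run of consecutive losses and the
    number of replies at or above the high-latency threshold.
    """
    longest_loss = current_loss = 0
    slow_replies = 0
    for rtt in rtts_ms:
        if rtt is None:
            current_loss += 1
            longest_loss = max(longest_loss, current_loss)
        else:
            current_loss = 0
            if rtt >= high_latency_ms:
                slow_replies += 1
    return {"longest_loss": longest_loss, "slow_replies": slow_replies}


# Made-up sample: normal replies, five lost pings during the final
# migration phase, then high latency on the destination host.
sample = [1.0, 1.2, None, None, None, None, None, 310.0, 280.0, 1.1]
print(summarize_pings(sample))
```

Comparing the summary for runs with VMQ enabled and disabled on the same host would make the difference between the two cases easier to report.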

If we disable VMQ entirely on the VM NICs and VM Switch Team Multiplex adapter on one of the hosts as a test, things behave as expected. A migration completes within the time of a standard TCP timeout.

VMQ looks to be working: if I run Get-NetAdapterVMQQueue on one of the hosts, I can see that queues are being allocated to VM NICs accordingly. I can also see that VM NICs are appearing in Hyper-V Manager with “VMQ Active”.

It goes without saying that we really don’t want to disable VMQ. However, given the nature of our client's business, we really cannot afford for these issues to crop up. If I can’t find a resolution here, I will be left with no choice as, ironically, we see fewer issues with VMQ disabled than with it enabled.

I hope this is enough information to go on and if you need any more, please do let me know. Any help here would be most appreciated.

I have gone over the configuration again and again and everything appears to have been configured correctly, however I am struggling with this one.

Many thanks

Matt


