Fault tolerance, or FT, is a means of providing zero downtime to select virtual machines by maintaining an exact mirror copy on a second physical server. FT is not needed for every workload; it was never intended for widespread use, and there are plenty of more appropriate alternatives for databases, mail servers, and other elements of the infrastructure designed with high-availability and redundancy in mind.
What FT is good for, however, is protecting critical workloads that do not have sufficient redundancy capabilities out of the box, such as a legacy application that fulfills a crucial role but was never architected for failover, or a next-generation workload still in early phases that lacks the robustness that comes with maturity.
Consider Hadoop. Doing interesting things with massive amounts of information is front and center these days, and Hadoop is the de facto standard when it comes to analysis of big data. Hadoop, designed to process jobs in a distributed fashion, is tolerant of compute node failures within a cluster, but a couple of its management components have not yet reached the same level of resiliency as the data processing nodes and remain single points of failure. A scenario like this is a perfect match for VMware vSphere Fault Tolerance, so it should be no surprise that the VMware Performance Team has recently published a study characterizing the scale of such a solution.
It’s clear that despite one of the popular objections to FT (single-vCPU support only), the feature fills a significant gap and critically enhances the overall reliability of a distributed system, thereby contributing to the uptime of hundreds of compute nodes. And for those wishing for FT VMs with multiple vCPUs, the SMP FT technology was previewed and demonstrated in a session at VMworld 2012.
Although it’s not widely known, fault tolerance is expected to be available in Windows Hyper-V someday. Don’t just take my word for it; check out this announcement to learn more about the “need” for fault tolerant Windows servers.
Evidently, the Hyper-V team has been “jealous” of this amazing zero-downtime vSphere advantage for years, so it’s no wonder that they pre-announced the capability for their own product well ahead of availability. You’ll know that they are getting close to finally providing FT capabilities when they cease to criticize VMware FT.
As with private cloud, just because Microsoft talks about something, doesn’t mean it exists.
…last I checked the single CPU wasn’t the only issue with FT. In fact, you had to write an entire article on all the requirements.
Let’s see – a few of the good ones:
* FT requires that the hosts for the Primary and Secondary VMs use the same CPU model, family, and stepping.
* You cannot back up an FT-enabled virtual machine using VCB, vStorage API for Data Protection, VMware Data Recovery or similar backup products that require the use of a virtual machine snapshot, as performed by ESX/ESXi.
* Storage VMotion is not supported for VMware FT VMs.
* No USB
* No IPv6
* No Snapshots
* No Microsoft clustered VMs
* Can’t be a template or linked clone
* VMDirectPath not available with FT
* EPT is automatically disabled
* No more than four VMware FT-enabled virtual machine primaries or secondaries on any single ESX/ESXi host
* Requires a dedicated Gigabit Ethernet network between the physical servers
* Reduce the number of file system operations, or ensure that the fault tolerant virtual machine is on a VMFS volume that does not have an abundance of other virtual machines that are regularly being powered on, powered off, or migrated using VMotion.
* When Fault Tolerance is turned on, vCenter Server unsets the virtual machine’s memory limit and sets the memory reservation to the memory size of the virtual machine. While Fault Tolerance remains turned on, you cannot change the memory reservation, size, limit, or shares.
* Disabling the virtual machine restart priority setting for a fault tolerant virtual machine causes the Turn Off Fault Tolerance operation to fail. In addition, fault tolerant virtual machines with the virtual machine restart priority setting disabled cannot be deleted.
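For anyone scripting around these constraints, the checks in the list above can be sketched as a simple pre-flight validation. This is an illustrative sketch only; the `vm` dictionary, its field names, and the `ft_preflight` helper are hypothetical stand-ins for inventory data you would pull from vCenter, not a real vSphere API.

```python
# Hypothetical pre-flight check for VMware FT prerequisites, based on the
# limitations listed above. The "vm" dict and its keys are illustrative
# stand-ins, not a real vSphere API.

def ft_preflight(vm):
    """Return a list of reasons this VM cannot have FT enabled (empty = OK)."""
    problems = []
    if vm.get("vcpus", 0) != 1:
        problems.append("FT requires a single vCPU")
    if vm.get("snapshots"):
        problems.append("snapshots are not supported with FT")
    if vm.get("usb_devices"):
        problems.append("USB devices are not supported with FT")
    if vm.get("ipv6"):
        problems.append("IPv6 is not supported with FT")
    if vm.get("is_template") or vm.get("is_linked_clone"):
        problems.append("templates and linked clones cannot be FT-enabled")
    if vm.get("vmdirectpath"):
        problems.append("VMDirectPath is not available with FT")
    return problems


def host_ft_capacity_ok(ft_vm_count_on_host):
    """Host-level check: no more than four FT primaries/secondaries per host."""
    return ft_vm_count_on_host < 4
```

For example, `ft_preflight({"vcpus": 2, "snapshots": ["pre-upgrade"]})` would flag both the vCPU count and the snapshot, which matches the kind of cleanup work the comments below describe having to do before FT could even be turned on.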
Nice. Great feature. I like the way you call out the fact that you can only support VMs with one vCPU and leave all of this out.
That said, I’m a VMware certified consultant with dozens of implementations, and I don’t have a single customer who, after looking through this, actually implemented it. Sounds great during a keynote…not so great in the real world.
The linked article is old; do these limitations still hold true for 5.1?
…the silence is deafening…
IMHO, this is a feature for the sake of a feature…the lack of development in 5.1 indicates that. It’s something VMware does that Microsoft doesn’t, but generally speaking, it’s pretty worthless.
Think about it – there’s no application fault tolerance at all. Something goes awry on VM1…that’s immediately replicated to the fault tolerant VM2. There may be some big data uses as Eric points out – but what does that represent…5% of VMware’s customers? Probably less…
I dunno – like I said…I’ve put it through a few POCs with 5.0, and the mess that we had to make of the servers with FT VMs was enough to convince the customer that it was a pretty worthless feature. Perhaps they’ll get around to it after they get caught up to what Microsoft is doing. Definitely a first to see VMware ‘copying’ things like live migration with no shared storage, etc… Times they are a changin’…