Nutanix DR Multi Site Recovery

Nth Generation
May 19, 2020


Read the recent article published by Sujaikumar Jayarao about Nutanix DR Multi Site Recovery:

"Alas, if Disaster were to come with a warning much ahead of time, we could have been better prepared."

In a world where uncertainty is certain and IT disasters don't come with a warning, IT leaders cannot afford to take the risk of not being prepared.

We understand that uptime needs vary from application to application. For example, mission-critical applications that deal with financial transactions, stock exchange trading, computerized hospital patient records, emergency call centers, and life support services need 24x7x365 uptime, while applications related to engineering services, government services, and DevOps may not have such strict uptime requirements. In addition, IT operational challenges such as system upgrades, migrations, and data corruption cannot be ignored.

Adding to the complexity, most enterprises today are locked in to a particular set of vendors and a particular hypervisor. The IT topology becomes even more complex, and operational challenges grow exponentially, when multiple copies of application data must be synchronized and maintained in different physical locations, whether the sites are in the same room, in different buildings on the same campus, or thousands of kilometers apart. Such a complex topology may be required by compliance regulations, distributed business operations, or collaboration needs, all while abiding by regional data privacy laws.

In short, the IT systems have to be resilient to handle faults and disasters, in order to ensure business continuity. A research report by the Ponemon Institute pegs datacenter outage costs at around $9,000 per minute.

Keeping these factors in mind, we have built High Availability (fault management for disks, network cards, and power supplies) and data protection into our AOS platform. Our Disaster Recovery solution extends continuous availability to multiple clusters through recovery plans and runbook planning. Disaster Recovery is measured in terms of RPO (the maximum amount of data a customer is willing to lose), RTO (the time allowed to restore operations after an IT failure), and cost. A DR topology is a combination of replication and recovery orchestration, and different DR topologies support different RPO and RTO targets.
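
To make these metrics concrete, here is a minimal Python sketch (not Nutanix code). It treats the data-loss window as the gap between the last replicated point and the failure, and prices the outage with the Ponemon figure of roughly $9,000 per minute quoted earlier. The function names and the sample timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

# Approximate per-minute outage cost from the Ponemon report cited above.
COST_PER_MINUTE_USD = 9_000


def data_loss_window(last_replicated: datetime, failure: datetime) -> timedelta:
    """Writes made after the last replicated point are lost (compare with RPO)."""
    return failure - last_replicated


def downtime(failure: datetime, restored: datetime) -> timedelta:
    """Time taken to restore operations (compare with RTO)."""
    return restored - failure


def outage_cost_usd(failure: datetime, restored: datetime) -> float:
    """Rough outage cost: minutes of downtime times the per-minute figure."""
    minutes = downtime(failure, restored).total_seconds() / 60
    return minutes * COST_PER_MINUTE_USD


# Hypothetical incident: last replication 20 s before the failure,
# operations restored 15 minutes after it.
failure = datetime(2020, 5, 19, 12, 0, 0)
loss = data_loss_window(failure - timedelta(seconds=20), failure)
cost = outage_cost_usd(failure, failure + timedelta(minutes=15))
print(loss.total_seconds(), cost)
```

In this made-up incident, 20 seconds of writes are lost and 15 minutes of downtime comes to $135,000 at the quoted rate.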

The Nutanix DR journey

Diagram comparing DR options: Async (RPO 1 Hour), NearSync (RPO 20 Seconds), MetroSync (RPO Zero). Each shows RTO as Minutes.

Let’s go through each of the supported DR topologies.

1. Asynchronous DR (Async)

Asynchronous disaster recovery can be configured by backing up a group of entities (VMs and volume groups) locally to the Nutanix cluster and optionally configuring replication to one or more remote sites. Only schedules with an RPO of 60 minutes or more can be configured in this mode. "Configuring Asynchronous DR" provides more details on the implementation guidelines.

2. Near-Synchronous DR (Near-Sync)

Near-Sync is built on the Async snapshots. With Near-Sync we support Lightweight Snapshots (LWS, which are OpLog-based markers) running on SSDs. Because taking an LWS is a constant-time, O(1), operation, the impact on user I/O is minimal. This architecture makes LWS highly scalable and distributed. LWSs are replicated continuously to the remote site. An intermediate snapshot is created every hour and retained for 6 hours. One daily snapshot is created and retained for 5 days. The intermediate async snapshots act as checkpoints to help with RTO. In AOS 5.17 we support an RPO as low as 20 seconds. "Configuring Near Sync DR" provides more details on the implementation guidelines.
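
The retention rules above (hourly snapshots kept for 6 hours, daily snapshots kept for 5 days) can be pictured with a small pruning routine. This is an illustrative Python sketch only. The Snapshot type and the prune function are hypothetical and do not reflect how AOS implements retention internally.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    kind: str          # "hourly" or "daily" (hypothetical labels)
    created: datetime


# Retention windows taken from the description above.
RETENTION = {
    "hourly": timedelta(hours=6),
    "daily": timedelta(days=5),
}


def prune(snapshots: List[Snapshot], now: datetime) -> List[Snapshot]:
    """Keep only the snapshots that are still inside their retention window."""
    return [s for s in snapshots if now - s.created <= RETENTION[s.kind]]


now = datetime(2020, 5, 19, 12, 0)
snapshots = [
    Snapshot("hourly", now - timedelta(hours=7)),  # older than 6 hours: dropped
    Snapshot("hourly", now - timedelta(hours=3)),  # kept
    Snapshot("daily", now - timedelta(days=4)),    # kept
    Snapshot("daily", now - timedelta(days=6)),    # older than 5 days: dropped
]
kept = prune(snapshots, now)
```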

Diagram showing "Local cluster" and "Remote cluster" with nodes S0, S1, S2 connected by arrows and labeled LWS. Green and blue visuals.

3. Metro DR (Metro/Sync)

With Metro or Sync DR, we can achieve zero RPO at VM-level granularity. Synchronous replication is supported between sites with less than 5 ms of latency. To achieve continuous availability of applications and zero data loss, a secondary copy of all data, including VM data, VM metadata, and the Protection Policies applied to VMs, is maintained across two clusters. This ensures that there is no data loss in case of site failure. It also allows VM live migration between sites to be supported easily.

Diagram of VMware HA Cluster with two VMs, synchronous replication between active and standby containers, and a <5ms RTT.

Note:

All the above DR topologies can be managed through the Prism UI.
Unplanned failover from the primary site to the secondary site is supported with all the above topologies.
Planned failover from the primary site to the secondary site is only supported with Near-Sync and Sync DR topologies.
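
As a rough illustration of how the RPO figures above map to a choice of topology, the following Python sketch picks the least demanding topology that still meets a target RPO. The thresholds (60 minutes for Async, 20 seconds for Near-Sync, under 5 ms latency for Metro) come from the descriptions above. The function name and return values are assumptions made for this example, not a Nutanix API.

```python
from typing import Optional

ASYNC_MIN_RPO_S = 60 * 60    # Async schedules need an RPO of 60 minutes or more
NEARSYNC_MIN_RPO_S = 20      # Near-Sync RPO floor quoted for AOS 5.17
METRO_MAX_LATENCY_MS = 5.0   # Metro needs latency under 5 ms between sites


def pick_topology(target_rpo_s: int, latency_ms: float) -> Optional[str]:
    """Return the least demanding topology that meets the target RPO, or None."""
    if target_rpo_s < 0:
        raise ValueError("target RPO cannot be negative")
    if target_rpo_s >= ASYNC_MIN_RPO_S:
        return "Async"
    if target_rpo_s >= NEARSYNC_MIN_RPO_S:
        return "Near-Sync"
    # Anything tighter than 20 seconds needs synchronous replication.
    if latency_ms < METRO_MAX_LATENCY_MS:
        return "Metro"
    return None  # no topology described here can meet this target


print(pick_topology(2 * 60 * 60, 50.0))
print(pick_topology(60, 50.0))
print(pick_topology(0, 2.0))
print(pick_topology(0, 10.0))
```

A result of None signals that the target cannot be met between those two sites, for example a zero RPO over a link with 10 ms of latency.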

Nutanix Multi Site Replication

So far, we have looked at how each individual DR topology can help meet RPO and RTO requirements. By combining Metro and Near-Sync, we now provide the gold standard for protecting business-critical workloads.

Highlights of the multi-site replication features

Diagram of production and recovery zones connected by lines showing RPO times. VMs are in the production zone. Text: "0 RPO", "20 sec RPO".

Provide a zero data loss environment for customers with the most stringent requirements across multiple sites
0 RPO for sites within 400 km of each other or with less than 5 ms latency
20 sec RPO for a recovery site with no distance limitation
30 min RPO for a fourth site with no distance limitation
Disaster recovery orchestration can be done by VMware SRM or scripts

Let’s now look at specific multi-site disaster scenarios and their recovery workflows with Nutanix DR.

Note: In all our scenarios we consider a multi-site topology spanning four sites (A, B, C and D) with the following configuration:

Site A is the Primary Site and Site C is the DR site
Sites A and B are in the Production Availability Zone
Sites C and D are in the Recovery Availability Zone
Sync Replication (0 RPO) between Sites A ↔ B
Near-Sync Replication (20 sec RPO) between Sites A ↔ C
Async Replication (30 min RPO) between Sites A ↔ D
There are four different clusters, one at each site
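
The configuration above can be written down as plain data, which makes the scenarios that follow easier to reason about. The Python sketch below is purely illustrative: the Link type, the ZONES mapping, and the helper function are hypothetical names, and only the sites, replication modes, and RPO values come from the list above.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    source: str
    target: str
    mode: str
    rpo_seconds: int


# Availability zone of each site, as listed above.
ZONES = {"A": "Production", "B": "Production", "C": "Recovery", "D": "Recovery"}

# Replication relationships from primary Site A.
LINKS = [
    Link("A", "D", "Async", 30 * 60),
    Link("A", "B", "Sync", 0),
    Link("A", "C", "Near-Sync", 20),
]


def targets_by_rpo(source: str) -> List[Link]:
    """Replication links from a site, ordered from tightest to loosest RPO."""
    return sorted((l for l in LINKS if l.source == source),
                  key=lambda l: l.rpo_seconds)


for link in targets_by_rpo("A"):
    print(link.target, ZONES[link.target], link.mode, link.rpo_seconds)
```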

SCENARIO 1: Production Site Failure (Single Site - Primary cluster)

Diagram showing Production and Recovery Availability Zones with VMs. Lightning bolt at A indicates disruption. Arrows mark data paths, RPO noted.

Recovery Procedure:

Metro Remote (Cluster B) has the most recent copy of the data
This data is sent to Site C (through an Out of Band snapshot)
Only a 20 second delta snapshot is transmitted.
Snapshot received at Site C is activated on Recovery Cluster C
Metro/Sync replication is established from Site C to Site D and application can resume
When Site A (Cluster A) is back online, 20 second RPO can be established back to Site A (Cluster C to Cluster A)

SCENARIO 2: Complete Region Failure (Two Sites – Production Availability Zone)

Diagram of a production and recovery availability zone with a red lightning bolt. Arrows show data flow, labeled with RPO times.

Recovery Procedure:

DR Site C (Cluster C) has the snapshot that is 20 seconds old
This latest Snapshot at Site C is activated on Recovery Cluster C
Metro/Sync replication is established from Site C to Site D and application can resume
When Site A (Cluster A) and Site B (Cluster B) are back online, 20 second RPO can be established back to Site A (Cluster C to Cluster A), and 30 minute RPO can be established back to Site A (Cluster D to Cluster A)
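
Scenarios 1 and 2 share the same first step: find the surviving site that holds the most recent copy of the data. A minimal Python sketch of that decision follows, using the replication lags from the four-site configuration above. The dictionary and function names are hypothetical. Actual orchestration would be done with VMware SRM or scripts, as noted earlier.

```python
# How far each replica lags behind primary Site A, in seconds
# (Sync = 0, Near-Sync = 20 s, Async = 30 min), per the configuration above.
LAG_SECONDS = {"B": 0, "C": 20, "D": 30 * 60}


def freshest_survivor(failed: set) -> str:
    """Return the surviving replica site with the smallest lag."""
    survivors = {site: lag for site, lag in LAG_SECONDS.items()
                 if site not in failed}
    if not survivors:
        raise RuntimeError("no surviving replica")
    return min(survivors, key=survivors.get)


def delta_to_ship(source: str, target: str) -> int:
    """Seconds of changes the target is missing relative to the source."""
    return max(LAG_SECONDS[target] - LAG_SECONDS[source], 0)


# Scenario 1: only Site A fails, so Metro remote B has the latest data
# and Site C is missing roughly a 20-second delta.
scenario_1_source = freshest_survivor({"A"})
scenario_1_delta = delta_to_ship(scenario_1_source, "C")

# Scenario 2: Sites A and B both fail, so Site C holds the latest copy.
scenario_2_source = freshest_survivor({"A", "B"})
```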

SCENARIO 3: Data Corruption Restore

Diagram shows Production and Recovery Availability Zones. Arrows indicate data flow between servers A, B, C, and D, labeled with RPO times. Green VMs and LWS marked.

Recovery Procedure:

Either restore to any available 20 second LWS snapshot or restore to one of the last hourly snapshots
Changes are then propagated to all the other Sites (Clusters).
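
Choosing a restore point after corruption amounts to picking the latest recovery point taken before the corruption occurred. The Python sketch below assumes the corruption time is known and that recovery points are available as timestamps. Both the helper name and the sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_clean_point(points: Iterable[datetime],
                       corrupted_at: datetime) -> Optional[datetime]:
    """Return the newest recovery point strictly before the corruption time."""
    clean = [p for p in points if p < corrupted_at]
    return max(clean) if clean else None


# Hypothetical recovery points: an LWS every 20 s over the last 2 minutes,
# plus the hourly snapshots from the last 6 hours.
now = datetime(2020, 5, 19, 12, 2, 0)
lws = [now - timedelta(seconds=20 * i) for i in range(7)]
hourly = [datetime(2020, 5, 19, h, 0, 0) for h in range(6, 12)]

restore_to = latest_clean_point(lws + hourly,
                                corrupted_at=datetime(2020, 5, 19, 12, 1, 30))
print(restore_to)
```

If the corruption predates every available LWS, the same function falls back to the most recent hourly snapshot taken before it, and returns None when no clean point exists.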

Summary

Nutanix data protection and Disaster Recovery provide options to configure protection for applications based on their criticality and business requirements.

Comparison chart of DR options: Async with RPO 1 Hour, NearSync with RPO 20 Seconds, and Metro/Sync with RPO Zero. RTO is Minutes.

Now mission-critical applications can be protected with multiple copies stored in multiple sites and managed seamlessly through Prism.
All the above features are available in AOS 5.17 with VMware ESXi.

________________________________________________________________________

Nutanix is a strategic partner of Nth Generation. To learn more about Nutanix, contact your Nth Representative at 800.548.1883, or email info@nth.com.

________________________________________________________________________

About Nutanix

Nutanix is a global leader in cloud software and a pioneer in hyperconverged infrastructure solutions, making computing invisible anywhere. Organizations around the world use Nutanix software to leverage a single platform to manage any app at any location at any scale for their private, hybrid and multi-cloud environments. Learn more at www.nutanix.com or follow us on Twitter @nutanix. For more information, visit Nutanix’s website.

Source: https://www.nutanix.com/blog/nutanix-dr-multi-site-recovery?utm_source=sprout&utm_medium=social&utm_campaign=nutanix-product&utm_content=aos-update-yellow Accessed 5/19/20
