BLOG: IBM Spectrum Virtualize Continuous Availability and Three-Site Replication

April 26th, 2021
Ian Wright
Systems Engineer

 

While it is not a requirement for every business, three-site replication has become a critical need for some. Prior to 2001, it was not uncommon for businesses to maintain a single disaster recovery location, sometimes near the production site to allow for Synchronous Replication. In New York City, for example, it was not uncommon for businesses to replicate from a data center in Manhattan to northern New Jersey. In Europe, some major data centers had their disaster recovery centers across the street.

After the attacks of September 11, 2001, there were many issues found with Disaster Recovery planning. They were summarized in the “Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System: Business continuity sound practices developed by the FRB, SEC, and OCC to ensure the continued functioning of critical financial services” published in April 2003. While I am not going to review the entire paper, one segment that became rather important then, and going forward, was “Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.”

Recognizing the risk of keeping two data centers close together created a problem. In the financial services sector, a single transaction could be worth a significant amount of money, so synchronous replication was a crucial level of data protection. However, it was now important, and in some countries required, to have a disaster recovery center in a different geographic area. That meant asynchronous replication was also critical, despite its higher Recovery Point Objective (RPO). This led to the development of three-site replication, which allows businesses to maintain an RPO of zero while also having out-of-region data centers to protect against regional disasters.

In the world of IBM Spectrum Virtualize, a key member of the IBM Storage family, there are a couple of different ways that three-site replication can be approached. This blog post will review some of the considerations for implementation, as well as the highlights of each solution.

Enhanced Stretched System

IBM Spectrum Virtualize was originally the software part of the IBM SAN Volume Controller or SVC, which provides a virtualization or hypervisor layer between the compute hosts and the managed Storage (IBM System Storage, or many other brands including EMC, Hitachi, Pure, Infinidat, and the list goes on). There were some early adopters who wanted to improve their availability by moving half of their nodes to a separate rack. It did not address replication, but because SVC was architected into IO Groups of active-active nodes, it would protect the availability of the data if one rack were to have network or power issues.

FIGURE 1: EXAMPLE OF A STANDARD, SINGLE SITE, IMPLEMENTATION OF SPECTRUM VIRTUALIZE

In 2009, IBM released SVC version 5.1, which included the ability to mirror a volume. The idea was that the volume (or vdisk) was an abstraction that existed only at the cluster layer. The data was simply being presented from managed disk groups (or storage pools) that were virtualized. So, with volume mirroring, a single volume was still presented to the host, but every time data was written, it was written to two storage pools instead of one.
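
As a rough illustration of that idea, here is a minimal Python sketch of a volume that is presented once but written to two pools. It is a toy model only, not how Spectrum Virtualize is implemented; the class names `StoragePool` and `MirroredVolume` and the pool names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StoragePool:
    """Hypothetical stand-in for a storage pool (managed disk group)."""
    name: str
    blocks: dict[int, bytes] = field(default_factory=dict)


@dataclass
class MirroredVolume:
    """One volume presented to hosts, backed by a copy in each pool."""
    name: str
    copies: list[StoragePool]

    def write(self, lba: int, data: bytes) -> None:
        # Every host write is applied to all copies of the volume.
        for pool in self.copies:
            pool.blocks[lba] = data

    def read(self, lba: int) -> bytes:
        # Any remaining copy that holds the block can satisfy the read.
        for pool in self.copies:
            if lba in pool.blocks:
                return pool.blocks[lba]
        raise KeyError(lba)


site_a = StoragePool("pool_site_a")
site_b = StoragePool("pool_site_b")
vol = MirroredVolume("vol0", [site_a, site_b])
vol.write(0, b"payload")

# Simulate losing one pool: the same volume still serves the data.
vol.copies.remove(site_a)
survived = vol.read(0)
```

The host only ever sees `vol0`; which pool answers a read is hidden behind the volume abstraction.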

At that point, it was possible to combine volume mirroring with stretching the nodes of a cluster, and this was eventually turned into the Enhanced Stretched Cluster or Enhanced Stretched System. The concept was to place half of your nodes in one data center and half in another data center located a short distance away, while maintaining connectivity between all of them. Volume mirroring would maintain a copy of the data for the volume in each location, and because the volume exists at the cluster level, the volume would always be available. If one site went down, no failover process was needed to make the volume available in the alternate data center, because the volume was active-active by its nature.

This also opened another possibility. The volume in each site was a single virtualized volume. It was specifically not using the Metro Mirror function of the SAN Volume Controller, which could only have a single target and could not be part of a three-site replication environment. Because of that, it is possible to create a Global Mirror relationship from that volume to a third location.

This means that for any local outage, it is possible to maintain continuous availability through the stretched cluster. Regional failures would require a failover to the out-of-region location (be it an IBM Spectrum Virtualize for Public Cloud hybrid-cloud implementation or another data center running IBM Spectrum Virtualize on SVC or IBM FlashSystems).

FIGURE 2: ENHANCED STRETCHED SYSTEM. THE “LOCAL” IO GROUP IS SPREAD ACROSS THE TWO SITES BUT STILL ONLY PRESENTING A SINGLE VOLUME. THAT VOLUME CAN BE REPLICATED FROM EITHER SITE TO AN OUT-OF-REGION LOCATION WITH A SEPARATE CLUSTER

Stretched Systems need to be understood for what they are: a single data center environment that has been stretched out to multiple locations. The volumes, because they are copied through the Volume Mirroring function, are not part of consistency groups (sets of volumes that maintain copies, as a unit, to the auxiliary site).

It is also important to understand that with a single IO Group stretched between two locations, a loss of half the nodes during a site failure has an impact on the remaining nodes. Specifically, SVC nodes can use their cache to improve the performance of the underlying storage. But if the nodes cannot mirror their cache to a partner (thus protecting it), they will not cache write data and will go into “write-through” mode. This is not usually a major impact, and the cache of the virtualized disk systems will be used as normal, but it does need to be understood.
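
The cache behavior described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names and the latency figures are hypothetical and exist only to show which way the numbers move when a node loses its partner.

```python
from enum import Enum


class CacheMode(Enum):
    WRITE_BACK = "write-back"        # acknowledge once the write is mirrored in cache
    WRITE_THROUGH = "write-through"  # acknowledge only after back-end storage has it


def cache_mode(partner_online: bool) -> CacheMode:
    """A node caches writes only while it can mirror them to its partner."""
    return CacheMode.WRITE_BACK if partner_online else CacheMode.WRITE_THROUGH


def write_ack_ms(mode: CacheMode, cache_mirror_ms: float, backend_ms: float) -> float:
    """Time until the host sees the write acknowledged (illustrative only)."""
    return cache_mirror_ms if mode is CacheMode.WRITE_BACK else backend_ms


# Hypothetical latencies, chosen only to show the direction of the change.
normal = write_ack_ms(cache_mode(partner_online=True), 0.2, 0.6)
degraded = write_ack_ms(cache_mode(partner_online=False), 0.2, 0.6)
```

In this model the surviving node still completes every write; it simply waits for the back-end storage (and that storage's own cache) instead of acknowledging from its own mirrored cache.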

One final point is that Volume Mirroring is a built-in function of the base license of Spectrum Virtualize. It does not require a Copy Services license.

HyperSwap with Spectrum Virtualize Three-Site Orchestrator

The biggest issue with Enhanced Stretched Systems is that the design only applies to SAN Volume Controller, because SVC is organized into pairs of independent nodes. With IBM FlashSystems (and earlier Storwize systems), the node “canisters” are capable of the same things as SVC nodes and run IBM Spectrum Virtualize, but they run inside of a Control Enclosure rather than independently. So, the concept of splitting an IO Group to run in different locations does not apply to them.

HyperSwap was meant to resolve that issue. Like an Enhanced Stretched System, it is built on an active-active technology (Metro Mirror Active-Active – which does use the Spectrum Virtualize Copy Services License). In this environment, the cluster has IO groups assigned to each site rather than stretching the IO Groups between sites. It also allows volumes to be put into Consistency Groups. However, until IBM Spectrum Virtualize 8.4.0, it was not possible to replicate to a third location.

FIGURE 3: HYPERSWAP CLUSTER WITH THREE-SITE REPLICATION. IN THIS CASE, THE IO GROUPS ARE INTACT IN EACH SITE, BUT THE VOLUMES ARE STILL PRESENTED AS ACTIVE/ACTIVE IN EACH LOCATION. SPECTRUM VIRTUALIZE THREE-SITE ORCHESTRATOR SETS UP AND MANAGES THE THIRD SITE

With the release of 8.4.0 (still a recent version) also came the IBM Spectrum Virtualize Three-Site Orchestrator. This is a Linux-based tool that helps to configure and manage a three-site environment. Its original launch supported Metro Mirror and Global Mirror. With version 2.0, HyperSwap was added, and with 3.0, a GUI was added.

The Orchestrator allows the HyperSwap Cluster (or, if desired, the Metro Mirror Primary site and Auxiliary/Secondary site clusters) to replicate to a third site. In the event of a failure of one site, it allows the remaining two sites to replicate without having to completely rebuild the replication environment.

What if I don’t have a third site?

The great news is that you do not need to build this entirely on “IBM Systems” in terms of hardware to get this functionality. IBM has worked hard to develop IBM Spectrum Virtualize for Public Cloud. It is the same software that runs in SVC, IBM Storwize, and IBM FlashSystems. This means that if you feel like you need a third location, but you don’t have the budget for a third data center, you can build out your third copy using IBM Cloud or Amazon Web Services to receive replication from your Enhanced Stretched System or HyperSwap cluster. This creates a hybrid-cloud solution that provides both out-of-region DR and your quorum witness.

Split Brain Syndrome

One requirement that is common to both HyperSwap and Enhanced Stretched Systems is the need to avoid a Split-Brain Syndrome. Split Brain is what happens if there is a network outage that causes a cluster (any cluster, not just Spectrum Virtualize) to lose visibility into its partners in the cluster. At that point, the only assumption that can be made by each part of the cluster is that the rest of the cluster is gone and that they should handle all the work as normal. This could lead to serious data consistency issues if, in fact, the rest of the cluster is not gone.

FIGURE 4: SPLIT BRAIN SYNDROME – BY USING QUORUM WITNESSES AS “TIE BREAKERS” IN THE EVENT OF A NETWORK OUTAGE, STRETCHED SYSTEMS AND HYPERSWAP AVOID THIS PROBLEM

To avoid this situation, the concept of a quorum witness is used. Spectrum Virtualize already uses quorum disks on virtualized storage systems. These quorum disks take up extents on the managed disks and are used to resolve local network failures.

When Stretched Systems and Enhanced Stretched Systems were implemented, the quorum witness was originally a Fibre Channel attached disk system in a third location. Eventually, Spectrum Virtualize incorporated the idea of using software and an IP connection to present the quorum. But the job is the same.

In the event of a network failure that would result in a split brain, the nodes will all try to talk to the quorum. The quorum will then act as a tie breaker to determine which part of the cluster continues to serve, which will avoid data consistency problems.
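
The tie-breaker role can be shown with a small Python sketch. This is a toy model, assuming a simple rule in which the first listed site that can still reach the quorum wins; the real arbitration logic in Spectrum Virtualize is more involved, and the function and site names here are hypothetical.

```python
def surviving_sites(sites, inter_site_link_up, can_reach_quorum):
    """Return the set of sites that keep serving I/O.

    sites: site names in preference order (an assumption of this sketch).
    inter_site_link_up: whether the sites can still see each other.
    can_reach_quorum: mapping of site name to quorum reachability.
    """
    if inter_site_link_up:
        # No partition: the whole cluster keeps running as one.
        return set(sites)
    for site in sites:
        if can_reach_quorum.get(site, False):
            # The quorum grants the tie-break to exactly one side.
            return {site}
    # Neither side can prove it should continue, so neither does.
    return set()


healthy = surviving_sites(["site_a", "site_b"], True, {})
partitioned = surviving_sites(
    ["site_a", "site_b"], False, {"site_a": True, "site_b": True}
)
a_isolated = surviving_sites(
    ["site_a", "site_b"], False, {"site_a": False, "site_b": True}
)
```

The important property is that a partition never leaves both halves serving the same volumes, which is exactly the data consistency problem that split brain would otherwise cause.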

Because of this unique role, we do not want the quorum to be on-premises with either part of the highly available storage. The quorum should be in a third site and use a separate network from the one used to communicate internally to the cluster. If the goal is to run in a three-site environment, it is completely acceptable for the out-of-region disaster recovery site to also host the quorum witness or even to have the quorum witness hosted in a Public Cloud provider.

What this means is that the links that carry replication traffic from your Stretched System or HyperSwap system to the remote site are also a good place to carry the IP connection for your quorum.

Which is better?

As with virtually any question in IT, this is best answered with “it depends.” Both options provide resiliency in the face of something that would normally cause downtime while a recovery was performed. Prior to IBM Spectrum Virtualize 8.4.0, there was no option to do a three-site mirror with HyperSwap, so users who have configured their environment to use 8.3.x and are not ready to go to 8.4 will want to use an Enhanced Stretched System if they need a third site. You can, if you want, use traditional copy services at 8.3.x (Metro Mirror and Global Mirror) to do three-site replication, but it will not provide the high availability that is offered by Stretched Systems or HyperSwap.

However, if you are using or are ready to move to 8.4.0, then there is no reason to use an Enhanced Stretched System when HyperSwap is available. Being able to do a three-site mirror with active-active volumes while also maintaining the integrity of your IO Groups is a definite improvement over what the Enhanced Stretched Cluster was designed to perform.
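
For readers who prefer to see the decision written down, here is a small Python sketch of the three-site choices described in this section. It only encodes the rules of thumb from this post; the function name, platform labels, and option strings are hypothetical, and it is not a substitute for a proper design review.

```python
def three_site_options(platform: str, version: tuple) -> list:
    """List three-site approaches, most preferred first.

    platform: "svc" or "flashsystem" (labels assumed for this sketch).
    version: Spectrum Virtualize code level as a tuple, e.g. (8, 3, 1).
    """
    if version >= (8, 4, 0):
        # From 8.4.0 onward, HyperSwap can replicate to a third site.
        return ["HyperSwap with Three-Site Orchestrator"]
    options = ["Metro Mirror + Global Mirror (no continuous availability)"]
    if platform == "svc":
        # Stretching an IO Group across sites is only possible on SVC.
        options.insert(0, "Enhanced Stretched System + Global Mirror")
    return options


svc_old = three_site_options("svc", (8, 3, 1))
flash_old = three_site_options("flashsystem", (8, 3, 1))
current = three_site_options("flashsystem", (8, 4, 0))
```

Comparing versions as tuples keeps the check simple: `(8, 3, 1) >= (8, 4, 0)` is false, so an 8.3.1 system falls through to the older options.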

The Total Solution

It is worth remembering that this is addressing continuous availability at the storage level. For a complete solution, more needs to be considered. As one example, automation is key to maintaining application availability. This can be done in a variety of ways, such as using VMware Site Recovery Manager (with the IBM Spectrum Virtualize Family Storage Replication Adapter to allow integration) so that VMware is able to coordinate the failover of the applications in the event of a site outage.

Another possibility is clustering hosts or applications between sites. You could, as another example, run a database such as Microsoft SQL Server or Oracle, with clustered instances in both sites (though you would need to manage your licenses appropriately). The volumes will always be available on both sides of the synchronous replication in either configuration.

It is also highly recommended that FlashCopy be included in these solutions (it is, in fact, a part of HyperSwap, where it maintains “change volumes”). Even beyond that, regular backups and point-in-time copies help to protect against data corruption, accidental deletions, or even ransomware. The IBM Spectrum Virtualize software on all platforms is even capable of initiating FlashCopy to create a point-in-time copy in a bucket using the Amazon Simple Storage Service (S3) API, which keeps the copy isolated from the rest of the environment. The bucket does not have to be on AWS itself; other products, such as IBM Cloud Object Storage, also present S3.

You will also need to consider how you build your systems. In a HyperSwap or Enhanced Stretched System configuration, the volumes are active-active and being synchronously mirrored. Because of this, you will want the two sides to be as close to equal in performance as possible. To the extent that it is possible, it is best to stick to Flash storage (SSDs, Storage Class Memory, IBM FlashCore Modules) to provide the highest performance. Easy Tier, which is a great choice in many environments, may not be the right one to use here. The primary site and the secondary site (both active) may have completely different read activity profiles in the heatmaps for their respective storage pools, which could lead to data being tiered differently in each site.

In any case, the volumes being mirrored need to have the same storage capacity and, ideally, will be on the same storage hardware type, at least for the synchronous link. It is *possible* to replicate using HyperSwap between different tiers of Spectrum Virtualize systems, but I would, as an extreme example, avoid running from a FlashSystem 9200 to a FlashSystem 5200 or to an old Storwize V7000, as the performance imbalance could seriously impact latency to the application.

I would also highly recommend taking advantage of IBM Spectrum Control and IBM Storage Insights. This will allow you to better manage what can be a complex environment and get ahead of any potential problems. It also allows IBM Support to quickly analyze any data if there is a problem and will help you get up and running much faster.

One final thing that will need to be considered is the amount of bandwidth and latency. Regardless of how the data is being replicated, you will need to make sure that the available network, be it dark fiber or a WAN, can handle the write workload that will come from the copy service being used. Along with that, because Stretched Systems and HyperSwap both use synchronous replication for their respective copy services, you will want to make sure that the latency on the link is kept low enough that copying each write will not create an impact on your application.
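
A back-of-the-envelope calculation can help frame that conversation. The Python sketch below rests on common rules of thumb rather than measured values: light travels through fiber at roughly 200,000 km/s (about 5 microseconds per kilometer, one way), and the 30% bandwidth headroom is an arbitrary assumption. The function names are hypothetical, and real links add switch, protocol, and routing delays on top of this.

```python
FIBER_US_PER_KM_ONE_WAY = 5.0  # rule of thumb: ~200,000 km/s in glass


def added_write_latency_ms(distance_km: float, round_trips: int = 1) -> float:
    """Propagation delay a synchronous write pays for the inter-site link."""
    one_way_us = distance_km * FIBER_US_PER_KM_ONE_WAY
    return one_way_us * 2 * round_trips / 1000


def required_bandwidth_mbps(peak_write_mb_per_s: float, headroom: float = 1.3) -> float:
    """Link size in megabits per second for a peak write rate in megabytes per second."""
    return peak_write_mb_per_s * 8 * headroom


# 100 km between sites, one round trip per write: about 1 ms added per write.
latency_100km = added_write_latency_ms(100)

# A hypothetical 200 MB/s peak write workload with 30% headroom.
link_mbps = required_bandwidth_mbps(200)
```

Under these assumptions, every 100 km of distance adds about a millisecond to each synchronous write, and a 200 MB/s write peak needs a link of a little over 2,000 Mbit/s. That is why synchronous sites tend to stay close together while the out-of-region copy is asynchronous.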

Regardless of what data protection or availability strategy you feel matches your business needs, Mainline is available to help architect and implement the solution. For more information on continuous availability, storage infrastructure, or storage solutions in general, reach out to your Mainline Account Executive directly or contact us. We are an IBM Platinum business partner, the highest level in the IBM Business Partner program, with great experience and knowledge in these areas and we look forward to helping you.
