This is part one of a three-part blog post.
- In part one we cover the basics, discuss some more general topics and look at the planning phase. This lays the groundwork for the follow-up posts.
- In part two we will discuss the initial tasks needed to set up the service.
- Finally, in part three we will create the actual DR configuration and show how test validation, failover and failback work.
Customers are progressing in their cloud journey, and an easy on-ramp is doing Disaster Recovery (DR) in the cloud. The beauty of doing DR in the cloud is that you don't need dedicated hardware: you can choose either a cold datacenter (On-Demand) or a hot datacenter (Pilot Light) option.
Welcome to the modern way of DR.
A lot of companies have been thinking about DR, and many of them already have written DR procedures. Then comes the second and even more important part: planning the DR tests. After all, why have all those documented DR strategies if you are not absolutely sure they work end to end? Testing should not happen just once but on a regular basis. As we live in a software-defined world, it becomes easier to shorten the time between tests. Remember the days when the first action of the DR plan was to drive to the bank vault, fetch the tapes and drive to the secondary site before the actual restore could start?
We see that a lot of companies have thought about and are protecting themselves against datacenter disasters, but what about other types of disasters, like attacks, ransomware, human error, natural disasters and so on? With the sharp rise in ransomware attacks over the last couple of years, a lot of companies are developing a ransomware strategy. If you haven't already, you should too.
So let's summarize the use cases:
- A DR site in the cloud, which can also replace existing DR site(s)
- Crypto-attacks / Ransomware
- Human error
What is VMware Cloud Disaster Recovery (VCDR) and how can you benefit?
- Let's take a little step back in history. VMware has had a Disaster Recovery as a Service (DRaaS) offering for quite some time. This was based on Site Recovery Manager (SRM) and VMware Cloud on AWS (VMConAWS), but with the acquisition of Datrium it is being adapted to the Datrium way. If you have ever worked with SRM, you will feel quite at home in the DRaaS console. The concepts are very similar.
- VCDR is actually a two-part service: one part is VMConAWS and the other is VCDR itself, which is a DR layer on top.
- VCDR is leveraging proven technologies.
- Its cloud-native design helps you replicate your data to the cloud and use it when disaster strikes. The seamless integration into VMConAWS makes DR in the cloud easy.
- The protected site can be an on-premises datacenter but it can also be a VMConAWS SDDC you already own and consume.
- VCDR leverages VMConAWS as the target site, which brings all the benefits of VMConAWS. If your data grows beyond the vSAN capacity threshold during operation, VMware's Elastic DRS automatically adds a host when the Recovery SDDC's free space drops below 25%. Hosts are never removed automatically, but they can be removed in the VMware Cloud DR UI, down to a minimum of three hosts. In other words, you cannot remove hosts from a 3-node cluster Recovery SDDC.
- Beware that choosing the wrong options when expanding the SDDC has repercussions. See the topic on adding and removing hosts in the documentation.
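To make the scaling behavior more concrete, here is a minimal Python sketch of the rules described above: a host is added when free space drops below 25%, hosts are never removed automatically, and a manual removal is refused at three hosts. This only models the rules as stated in this post and is not how Elastic DRS is implemented. The `RecoverySddc` class, the per-host capacity and the function names are hypothetical.

```python
from dataclasses import dataclass

FREE_SPACE_THRESHOLD = 0.25  # from the text: add a host below 25% free space
MIN_HOSTS = 3                # from the text: no removal below three hosts


@dataclass
class RecoverySddc:
    """Hypothetical model of a Recovery SDDC, for illustration only."""
    hosts: int
    capacity_per_host_tib: float  # assumed usable capacity per host
    used_tib: float

    @property
    def free_fraction(self) -> float:
        total = self.hosts * self.capacity_per_host_tib
        return (total - self.used_tib) / total


def elastic_scale_out_step(sddc: RecoverySddc) -> bool:
    """Add one host if free space is below the threshold. Never removes hosts."""
    if sddc.free_fraction < FREE_SPACE_THRESHOLD:
        sddc.hosts += 1
        return True
    return False


def remove_host_manually(sddc: RecoverySddc) -> bool:
    """Manual removal; refused when the cluster is already at three hosts."""
    if sddc.hosts <= MIN_HOSTS:
        return False
    sddc.hosts -= 1
    return True


if __name__ == "__main__":
    sddc = RecoverySddc(hosts=3, capacity_per_host_tib=10.0, used_tib=24.0)
    print(f"free before: {sddc.free_fraction:.0%}")
    print("host added:", elastic_scale_out_step(sddc))
    print(f"hosts: {sddc.hosts}, free after: {sddc.free_fraction:.0%}")
```

Note that the sketch does not check whether the remaining hosts can still hold the data after a manual removal. That is one of the repercussions the documentation warns about.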
What about costs?
- There is a possibility to optimize costs. You can choose the On-Demand VMConAWS option, which deploys the SDDC just-in-time. This way you don't pay for your VMConAWS SDDC until you need it, and from that moment on you pay for as long as you need it. In this case, do not forget to decommission it afterwards.
- In the case of Pilot Light, the VMConAWS datacenter is sitting there until a disaster happens, which shortens the RTO. What you could do here is, e.g., use the VMConAWS SDDC as a development or test site during normal operations and shut down those workloads when the capacity is needed for recovery. The idea behind Pilot Light is that this infrastructure is waiting for those applications that have very short RTOs.
What components are part of the VMware Cloud DR?
- Scale-out Cloud File System:
A cloud component that enables the efficient storage of backups of protected VMs in cloud storage and allows VMs to be recovered quickly, without requiring data rehydration.
- SaaS Orchestrator:
A cloud component that presents a user interface (UI) to automate the disaster recovery process.
- DRaaS Connector:
A virtual appliance installed in the VMware vSphere environment to protect VMs using snapshot replication from protection groups.
- Protection Groups:
A configuration component that allows you to create regularly scheduled snapshots of VMs which are replicated to the cloud file system.
- Disaster Recovery (DR) Plans:
An orchestration component that defines the steps required to recover VMs from snapshots from the cloud file system to a recovery SDDC. VMware Cloud DR cloud components (scale-out cloud file system and orchestrator) are deployed and managed by VMware in an AWS account dedicated to each tenant.
How does it work?
- Through the DRaaS Connector, data is replicated to the Scale-Out Cloud File System (SCFS), or the other way around. The connector also acts as the proxy between the SaaS Orchestrator and the on-premises vCenter to execute the necessary actions, like powering on VMs.
- Data sits in the SCFS until you need it. It is stored as forever-incremental snapshots in an encrypted format, and you'll find that it is kept in a native vSphere VM format.
- The VCDR SaaS orchestration component is the magic component that glues the solution together and orchestrates failover and failback.
- An On-Demand or Pilot Light VMConAWS datacenter is waiting for a disaster to happen and for you to consume it as a target.
- The SCFS can be live mounted into the VMConAWS datacenter as an NFS datastore to run the replicated VMs.
- High-frequency snapshots can be used to keep the RPO as low as 30 minutes. We will talk about RPOs in more detail later on.
- It is possible to use a hub-and-spoke model. In this setup you deploy DRaaS Connectors on multiple sites and point them to one SCFS, which can then be used as the base for a recovery SDDC.
- Bandwidth can be throttled.
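To illustrate how the snapshot interval relates to the RPO, here is a small Python sketch. It assumes, for simplicity, that a snapshot is usable in the SCFS the moment it is taken, so replication time is ignored. The function names and the example times are made up for illustration and are not part of the product.

```python
from datetime import datetime, timedelta


def snapshot_times(start: datetime, interval: timedelta, count: int) -> list[datetime]:
    """Return `count` snapshot times, spaced `interval` apart, beginning at `start`."""
    return [start + i * interval for i in range(count)]


def data_loss_window(snapshots: list[datetime], disaster: datetime) -> timedelta:
    """Time between the last snapshot taken before the disaster and the disaster itself."""
    earlier = [s for s in snapshots if s <= disaster]
    if not earlier:
        raise ValueError("no snapshot exists before the disaster")
    return disaster - max(earlier)


if __name__ == "__main__":
    # A 30-minute schedule, matching the lowest RPO mentioned above.
    schedule = snapshot_times(datetime(2024, 1, 1, 0, 0), timedelta(minutes=30), 4)
    loss = data_loss_window(schedule, datetime(2024, 1, 1, 1, 29))
    print(f"data written in the last {loss} would be lost")
```

With a 30-minute schedule the window can never exceed 30 minutes under this assumption, which is what an RPO of 30 minutes means in practice.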
What are your options?
- When the VCDR service is enabled, you have the option to use an On-Demand (cold) DR datacenter or a hot DR datacenter (Pilot Light). Both give you the option to back up your data to the cloud. With the On-Demand, or just-in-time, DR datacenter you need to allow about 120 minutes for the datacenter to power on.
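The difference between the two options can be sketched as a simple comparison of recovery time against standby cost. In the Python sketch below, only the roughly 120 minutes of On-Demand start-up time comes from this post. The number of standby hosts, the host-hour rate and the VM recovery time are placeholder values chosen for illustration and are not real pricing or sizing guidance.

```python
from dataclasses import dataclass

ON_DEMAND_SDDC_STARTUP_MIN = 120  # from the text: about 120 minutes to power on


@dataclass(frozen=True)
class DrOption:
    """Hypothetical description of a DR target option."""
    name: str
    sddc_startup_min: int  # time before the SDDC can accept workloads
    standby_hosts: int     # hosts paid for while no disaster is in progress


def rto_minutes(option: DrOption, vm_recovery_min: int) -> int:
    """Rough recovery time: SDDC start-up plus time to bring up the VMs."""
    return option.sddc_startup_min + vm_recovery_min


def standby_cost(option: DrOption, host_hour_rate: float, hours: float) -> float:
    """Cost of keeping the option ready for `hours` without any disaster."""
    return option.standby_hosts * host_hour_rate * hours


ON_DEMAND = DrOption("On-Demand", ON_DEMAND_SDDC_STARTUP_MIN, standby_hosts=0)
PILOT_LIGHT = DrOption("Pilot Light", 0, standby_hosts=3)  # assumed host count

if __name__ == "__main__":
    for option in (ON_DEMAND, PILOT_LIGHT):
        # 30 minutes of VM recovery and a rate of 10.0 per host-hour are placeholders.
        print(option.name,
              "RTO (min):", rto_minutes(option, vm_recovery_min=30),
              "standby cost per 720 h:", standby_cost(option, 10.0, 720))
```

The point of the sketch is the shape of the trade-off: On-Demand costs nothing in standby hosts but adds the start-up time to every recovery, while Pilot Light pays for hosts continuously to remove that delay.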
Our thoughts on VCDR
- Simplicity is key when you need to keep cool in a stressful situation and complexity is your enemy in those situations.
- As we are currently seeing other hyperscalers being added to the VMware Cloud Console, we believe you can expect more hyperscalers to be supported for VCDR in the future.
- There is a trade-off between On-Demand and Pilot Light: RTO versus cost.
Please stay tuned for further information about VCDR and VMConAWS.
Contact us via email or LinkedIn if you have further questions or want to follow up with a discussion.