How To Create A Bullet-proof Disaster Recovery Plan

Disaster recovery (DR) ensures the continuity of IT operations during disruptive events. Every serious business has a DR plan, yet when an actual disaster strikes, very often that plan turns out to be unworkable. Read this article to understand how to create a better DR plan and protect your organization from service interruptions.

Before anything else, you should know that creating and maintaining a DR environment is costly – it requires investment in technology and maintenance. When you consider having a DR environment, you should work out exactly what resources you need in order to feel comfortable in any disaster. If your organization cannot provide these resources, you are better off without a DR plan at all. Operating from a low-budget, handicapped DR environment may cause you more trouble when a disaster occurs than simply waiting for the disaster to pass.

Attention to detail in the DR design

Where to start: find a different geographical location

The first thing to consider when creating your DR plan is choosing a different geographic location. This is important for isolating local outages and disasters, and it is why DR environments are always placed in a different city or even a different country. In the new location, choose different vendors and suppliers – a different ISP, a different electricity company, and so on – so that outages affecting a specific provider are avoided.

Still, make sure to choose a location relatively close to your business because network delay may be an issue. It's also a good idea to be able to physically access the DR environment in case hardware work has to be done on site.

Being able to operate your IT infrastructure from a different geographic location does matter, but its importance is frequently overestimated. In fact, most legacy DR plans are all about this. However, with today's large number of varying threats, such geo-oriented DR plans prove outdated and unworkable. That's why the following plan items should also be taken into consideration.

Diversify and simplify the technology

Try to diversify the technology in your DR environment as much as is reasonable. If your main site is built in Java, consider creating a simple PHP-based, or even plain HTML, DR site. In the DR environment you should support only the most essential, core business processes which cannot be interrupted even for a few minutes. Sacrificing extra, non-essential functionality in the DR environment is reasonable. It will help you under higher load and heavier traffic, and it may also prevent security bugs and flaws from being exploited.

Furthermore, pay special attention to the front-end servers. In the DR plan there should always be proxying servers in front of every publicly accessible service. These proxying servers should allow you to audit, inspect and filter the incoming traffic. To learn how this can be done for the web service, read the article on how to protect and audit your web server with ModSecurity.

Also, use a different software supplier for your DR environment. For example, if you run Apache or Nginx in front, substitute it with Varnish in the DR environment, as described in the article on how to improve performance and security with Varnish. This will protect you against vulnerabilities and bugs, including unknown ones, which are the most critical. In the case of Varnish, it will also allow you to handle a larger number of visitors in the DR environment.

Delay updates from the production to the DR environment

When something goes wrong in the production environment you must be able to stop the propagation of any changes to the DR environment. Not doing so corrupts your DR environment and makes it as faulty as your production one. Such unwanted updates usually include synchronization of files and databases.

An obvious measure against dangerous updates is to avoid making manual updates on the production and DR environments simultaneously. However, many updates are automatic, such as periodic rsync file synchronization and database replication.

When configuring file synchronization, make it as infrequent as your business needs allow. Also, distinguish between incremental updates, which add new content or modify existing content, and updates that delete content. If you use rsync, the --delete option separates the two kinds of updates. In some cases it's fine to make frequent, incremental updates so that newer production content is synced to the DR environment. However, set up a different, less frequent scheduled job (cron) for the rsync runs that are meant to delete content (using the --delete option) on the destination server, as shown in the sketch below. This may save your day if content somehow gets deleted on the production environment and you switch to the DR one.
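Here is a minimal sketch of such a split schedule, assuming the content lives in /var/www/html and the DR server is reachable over SSH as dr.example.com (both are placeholders for your own paths and hosts):

    # /etc/cron.d/dr-sync on the production server (hypothetical paths and hostnames)
    # Every 15 minutes: push new and changed files to the DR server, never delete anything there
    */15 * * * * root rsync -a /var/www/html/ dr.example.com:/var/www/html/
    # Once a day: mirror deletions too, so files removed in production eventually disappear on DR
    30 3 * * * root rsync -a --delete /var/www/html/ dr.example.com:/var/www/html/

If content is deleted in production by mistake, you have until the next nightly run to disable the --delete job and keep the files intact on the DR side.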

Furthermore, consider that some disasters are connected to databases, and more precisely to database queries. DR environments often run slave database servers which are updated from the master servers in the production environment. Thus a database disaster in production is automatically and instantaneously transferred to the DR environment.

You can prevent unwanted database updates in two ways. First, use a database firewall / proxy as described in the mysql proxy article. With such a tool you can filter certain types of database queries, including potentially dangerous ones, before they reach the DR environment.

Second, you can configure delayed database replication. In MySQL 5.6 and later, use the statement CHANGE MASTER TO MASTER_DELAY = n; where n is the number of seconds by which you wish the DR slave server to delay the replication. The value should be high enough for you to be able to stop the replication on the DR slave server when needed, for example 1200 seconds.
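A minimal sketch of setting this up on the DR slave server (the 1200-second delay is just the example value from above):

    # On the DR slave server: pause replication, add a 20-minute delay, resume
    mysql -u root -p -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 1200; START SLAVE;"

    # In an actual disaster, freeze the DR copy before the harmful changes are applied
    mysql -u root -p -e "STOP SLAVE SQL_THREAD;"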

Being able to delay and control updates between the production and DR environments is more than highly recommended. Ultimately, it allows you to rely on an automatic, unattended, fast fail-over from production to DR. This is very important because most DR plans include such a fail-over, which de facto provides the high availability. In contrast, if the wrong updates reach the DR environment, you have to resort to restoring the failing components (files, databases, etc.) from backups or to trying to reverse the changes. This always means time-consuming, manual work, which further results in downtime, data loss and the compromise of the whole DR plan.

Ensure more resources

The occurrence of many disasters results in higher resource demands – more CPU, memory and network bandwidth is needed. The good thing is that these resources are not needed for long, but only during the disaster. Thus you may rent on-demand or elastic cloud resources by the hour for your DR environment. On one hand, this gives you the flexibility to expand as much as needed. On the other hand, it saves you money when the DR environment is idle and you don't use the resources.

Special attention has to be paid to the network bandwidth during certain types of disasters – malicious DDoS attacks. Such attacks may consume 10-20 Gbit/s of your bandwidth, or even more. They usually just flood your network without specifically targeting a service such as web or mail. To protect against such attacks and the resulting disasters, you need a network filtering solution. Such a solution can be provided by your datacenter or by specialized third parties. Coping with such attacks on your own is usually neither feasible nor cost-effective.

Enhance security

The security level in the DR environment should be tougher than in production. This includes taking the following actions:

  • Patch security flaws as soon as possible. Still, make sure that you don't apply the same security patches on the production and the DR environment at the same time. Such patches sometimes fix one bug but introduce another, so you may end up with two secure but faulty environments.
  • Restrict access and allow no one but the senior administrators to access the DR environment. Very often disasters occur because of compromised accounts, so the fewer people with access to the DR servers, the better.
  • Use a different authentication mechanism than the one(s) used in production. Centralized authentication such as LDAP is very convenient, but it introduces a single point of failure which should be avoided in the DR environment. That's why, if you use LDAP in production, you should use local accounts on the DR servers.
  • Harden every service, especially the publicly accessible ones, and the operating system as a whole, as illustrated by the SSH sketch after this list. Read the article about Linux server hardening for detailed instructions and best practices.
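As one small illustration of the last two points, the following sketch restricts SSH on a DR server to key-based logins by a couple of local admin accounts. The usernames are placeholders, and the service name may be ssh or sshd depending on the distribution:

    # Allow only key-based SSH logins for two hypothetical local admin accounts
    cat >> /etc/ssh/sshd_config <<'EOF'
    PermitRootLogin no
    PasswordAuthentication no
    AllowUsers admin1 admin2
    EOF
    systemctl restart sshd   # or "systemctl restart ssh" on Debian/Ubuntu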

In addition to tougher security, in the DR environment you may also increase the content caching, as explained in the Varnish article linked above. This may help you serve heavier web traffic better.

Allow a safe fail back, i.e. return to production

Last but not least, when you design your DR plan you should ensure that you can safely return operations to your main site. This is not always easy, because of a few problems.

The first and most common problem with failing back is the database replication, especially when the DR database server is only a slave. In order to transfer the database changes from the DR environment to the main site, you have to promote the DR database slave server to being a master. In this case the server in the main site becomes a slave of the DR database server. The risk in this scenario arises from the fact that the replicating servers might have been disconnected from each other for a long time during the disaster. In the worst case the main site continues to function partially during the disaster, which ultimately results in serious database inconsistency.

Inconsistency should first be prevented logically, by design. You must ensure that a site which remains partially operational will not cause database inconsistency. If you cannot ensure that, you have to make sure such partially operational sites are shut down as soon as possible. For this purpose, as soon as the main site begins experiencing problems, all services, including the database, should be locked or shut down. Only then is it safe to transfer the business to the DR environment. Otherwise, you may end up with overlapping, conflicting database records. In such fatal cases the only solution is frequently to purge the conflicting records on one of the database servers.

Furthermore, you should take full advantage of all the database replication features that may help you with the seamless promotion of slave servers to masters. For MySQL there are many advanced features and constant improvements in this regard, which you can learn about from the article on how to improve replication reliability and performance with MySQL 5.6.
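To make the promotion concrete, here is a minimal sketch of the MySQL steps, assuming GTID-based replication (a MySQL 5.6 feature) and placeholder hostnames and credentials:

    # On the DR database server: once all relayed events have been applied, stop
    # replicating from the failed production master and drop its slave configuration
    mysql -u root -p -e "STOP SLAVE; RESET SLAVE ALL;"

    # Later, when failing back: point the recovered production server at the DR
    # server so it catches up on the changes made during the disaster
    mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='db.dr.example.com',
        MASTER_USER='repl', MASTER_PASSWORD='replica-password',
        MASTER_AUTO_POSITION=1; START SLAVE;"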

Switching between production and DR environments

Switching to the DR environment should be as fast as possible – from an immediate, seamless transfer up to a few minutes. This implies that the switch is automatic and unattended. For this purpose, use a monitoring script that checks a specific functionality or service availability. Once this script detects a problem, it should make the switch to the DR environment. There are two common ways to execute the actual switch.
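A minimal sketch of such a monitoring check is shown below. The health URL and the switch-to-dr.sh script are placeholders; the latter would contain whichever of the two switching methods described next you choose:

    #!/bin/bash
    # Hypothetical fail-over watchdog: require 3 consecutive failed checks
    # before switching, so a single network hiccup does not trigger a fail-over.
    HEALTH_URL="https://www.example.com/health"   # placeholder production health check
    FAILS=0
    while true; do
        if curl -fsS --max-time 10 "$HEALTH_URL" > /dev/null; then
            FAILS=0
        else
            FAILS=$((FAILS + 1))
        fi
        if [ "$FAILS" -ge 3 ]; then
            /usr/local/bin/switch-to-dr.sh   # placeholder for the DNS or BGP switch
            exit 0
        fi
        sleep 20
    done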

The first and most common way is by changing DNS records. Once a disaster occurs, the A or CNAME DNS record should be pointed to the DR environment (see the sketch after the list below). For efficient DNS switches two prerequisites have to be met:

  • DNS should be hosted separately and should not be affected by most common disasters. It's more and more common nowadays to outsource DNS to external providers, who in turn ensure the availability of the service.
  • The time to live (TTL) of the DNS records, which are to be changed, should be low enough – 5 minutes for example. This ensures that a switch to and back from the DR environment can happen fast without delays from DNS caching.
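If your DNS provider or server accepts standard dynamic updates (RFC 2136), the switch itself can be a single nsupdate call. The key file, name server, hostname and addresses below are all placeholders:

    # Re-point www.example.com to the DR environment with a dynamic DNS update
    nsupdate -k /etc/dr-switch.key <<'EOF'
    server ns1.example.com
    zone example.com
    update delete www.example.com A
    update add www.example.com 300 A 203.0.113.10
    send
    EOF

The 300-second TTL matches the low-TTL prerequisite above, so clients pick up the change within about five minutes.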

The second popular way to perform the production-to-DR switch is by manipulating the network traffic itself. This can be as simple as a virtual network tunnel from the production site to the DR site, but that approach has one obvious disadvantage: the switch will fail if the network equipment in your primary site is unavailable.

A better and preferred network-level switch can be done using the Border Gateway Protocol (BGP). It allows you to move the network dynamically and independently of the main site over to the DR site. When a disaster happens, you start advertising new routes for your autonomous system (AS) from the DR environment. For this purpose you don't need access to the production environment, because the advertisements can be made from a router inside the DR environment. The drawback of this method is that it requires a more complex configuration and full control over the network, including owning an AS.
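Purely as an illustration, announcing your prefix from the DR site could look roughly like this with the FRRouting (or Quagga) suite. The AS numbers, prefix and peer address are placeholders, and the real setup depends entirely on your upstream provider:

    # On a router in the DR site running FRRouting: announce the company prefix
    # from AS 64512 to the upstream peer so that traffic is drawn to the DR site
    vtysh \
      -c 'configure terminal' \
      -c 'router bgp 64512' \
      -c 'neighbor 198.51.100.1 remote-as 64500' \
      -c 'address-family ipv4 unicast' \
      -c 'network 203.0.113.0/24'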

No matter which switching method you choose, manual switches to the DR environment should also be supported. They are helpful when the monitoring script fails to detect a problem even though the production environment is suffering from a serious one. Such manual switches should follow the exact switching logic implemented in the monitoring script.

It's a good practice to schedule regular switches between the DR and the production environment. Thus you can test the readiness of your DR environment and ensure flawless operation when an actual disaster occurs.

In conclusion, operating a reliable DR environment is a very serious and hard task. Unfortunately, most DR plans are not well designed, as the outages of even the most substantial businesses all over the world frequently prove.

