Lucky or Good?

Disaster Recovery Planning = Good.

July 2011

Lucky

We've all heard the phrase, "I'd rather be lucky than good." Let's look at a couple of real-life examples and see how it holds up.

My phone rang at about 4 AM. My colleague was at the data center, and the news wasn't good. There had been a power failure, and all the CRAC (computer room air conditioner) units were down. It was over 130 degrees in the data center. I pulled out of my driveway in less than five minutes. As my brain began to warm up, the questions started to come quickly. Did he say ALL the units were down? How could that be (typical data centers have dual, distinct power paths as well as battery backup)? How did it get so hot before we were alerted? I got right back on the phone.

All the CRAC units were on "B" power (which failed) due to maintenance on the "A" side. The temperature monitoring and alerting system did not work as designed, and we were only notified after systems began to fail. When he arrived, the on-call engineer began flipping power switches on racks and other equipment (forcing very dangerous, hard system shutdowns) just to get the temperature down enough that he could stand to be in the room for more than a few minutes. He had fired up a portable air conditioner unit, but there was no way it was going to cool down the entire data center. The electricians were on their way. When I walked in, I was briefed by the Director of Operations. Essentially, he was circling the airport until the electricians arrived. There was nothing for him to do until we got down to at least 80 degrees. This was where I disagreed.

Where was the DR plan? What was the power-up procedure? Which systems would you bring up first, and why? Second? Third? What if one didn't recover? Did you have a mitigation plan for every system? What was the communication plan? Have we alerted our key vendors to stage replacement parts for critical systems? Have we alerted our business partners? What was the notification process? The answers were not at all comforting. Many were downright disturbing, like when he said he didn't need a plan because he "knew what he was doing."

When it was all over some 18 hours later, we had cost the business an entire day's worth of revenue. Not so lucky. The lucky part was that it didn't last into a second day. Every single system recovered, although we had to take subsequent scheduled outages to replace disks and other equipment that we feared might have suffered from the heat. No formal post-mortem analysis was conducted, and no processes or procedures were implemented or modified as preventive measures based on lessons learned. Everyone seemed to accept that it was just an unfortunate accident and that we got lucky. This company was in grave danger.

Now let's look at an example of "good".

Every quarter, half of the IT personnel were killed in a horrible, catastrophic disaster. Sometimes it was a hurricane, other times a flood, and once it was an evil terrorist attack. We even had a fire in the building one time. If you were dead, you didn't find out until about three o'clock in the morning. The remaining operations managers were paged by the HelpDesk after the VP of IT officially declared a disaster. A Director-level leader (a different one each time) was then appointed to coordinate the recovery operation. Managers mobilized their teams, and several pre-assigned conference bridges were activated to facilitate seamless communication. User communities were alerted (the ones left alive, of course) and put on stand-by. According to documentation, redundant systems were brought up in specific priority order. Database replication was halted, and datasets were verified. Backup application servers were pointed to the replica databases and certified by operations personnel. DNS records were updated, and the user communities were then asked to certify that the backup systems were up and running. Of course, this was all a drill, conducted with zero impact to production systems and the business.
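
For readers who want to picture the mechanics, here is a minimal sketch of that failover sequence written as a runbook script. It is Python purely for illustration; every function name is a hypothetical placeholder for whatever tooling your environment actually uses, not a real API.

```python
# Skeleton of a failover drill mirroring the sequence described above.
# Every helper below is a hypothetical stand-in for site-specific tooling.

def run_failover_drill(systems_in_priority_order: list[str]) -> None:
    for system in systems_in_priority_order:   # bring systems up in priority order
        halt_replication(system)               # stop database replication
        verify_datasets(system)                # confirm the replica data is consistent
        repoint_app_servers(system)            # point backup app servers at the replicas
        update_dns_records(system)             # cut user traffic over to the backup site
        request_user_certification(system)     # ask the business to certify the system

# Placeholders only; real implementations depend entirely on your stack.
def halt_replication(system: str) -> None: ...
def verify_datasets(system: str) -> None: ...
def repoint_app_servers(system: str) -> None: ...
def update_dns_records(system: str) -> None: ...
def request_user_certification(system: str) -> None: ...

if __name__ == "__main__":
    run_failover_drill(["order-entry", "billing", "intranet-wiki"])
```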

Though the SLA was four hours or less for critical systems, we were often ready in less than two. There were plans for more than 400 applications, all classified by business criticality and prioritized accordingly. Some needed to be up within four hours, others within 24, and still others were of such low importance that we could take up to five days to recover them. It was all pretty routine. We had done it countless times before. This was a test, and we did it every quarter.
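
As a rough illustration of that kind of classification, the sketch below groups a few invented applications into recovery tiers by recovery time objective. The application names, tier labels, and hour values are assumptions made up for the example, not the actual catalog.

```python
# Hypothetical application catalog grouped into recovery tiers by recovery
# time objective (RTO). Names, tiers, and hour values are illustrative only.
RECOVERY_TIERS = {        # tier -> maximum recovery time objective, in hours
    "critical": 4,
    "important": 24,
    "deferrable": 120,    # roughly five days
}

APPLICATIONS = {
    "order-entry": "critical",
    "billing": "important",
    "intranet-wiki": "deferrable",
}

def recovery_deadline_hours(app: str) -> int:
    """Return the maximum allowed recovery time for an application."""
    return RECOVERY_TIERS[APPLICATIONS[app]]

print(recovery_deadline_hours("order-entry"))  # -> 4
```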

These guys were good. They were deliberate. Every quarter, a different scenario was played out, preparing us for nearly anything. There was simply no way an organization this prepared would be unlucky enough to cost the business a day's worth of revenue. Some would say we were lucky not to have had a power failure like the one in the previous example, but I'd rather say that our data center guys were good enough never to move all the CRAC units to the same power feed, and that our thermal alert systems were tested every quarter as well. I think you'd agree that we'd rather be good than lucky.

Plan

So what to do? If you haven't already classified your data and systems, you'll need to do that. If you're subject to any sort of compliance regime, this is usually a first step, but it's an essential one for any size or type of organization.

Once that's done, you'll need to conduct a detailed business risk and impact analysis. The purpose of the business impact analysis is to study the risks that various disaster scenarios present to sustained business operations. During this step, the critical business functions of the enterprise are enumerated. Recovery time and recovery point objectives are defined by the owners of the business processes to be protected and recovered, according to the requirements for restoring each one. Each business process is then mapped to its underlying information and communication technologies and data assets. Scenarios are described wherein one or more systems are made unavailable, so that the ability of the business to function without them is clearly understood. Possible disaster scenarios, both natural and man-made (fire, inclement weather, pandemic, computer hacking, civil unrest, war, earthquake, or any other incident that could occur), are then listed along with their probability of occurrence and their effect on the availability of various systems.

From this information, a probability and impact matrix can be assembled as an aid to prioritizing the development of disaster recovery solutions as well as the implementation of disaster recovery plans. The probability and impact matrix is also an effective tool for securing commitment to the disaster recovery program from leadership and business process owners.
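
To make the matrix concrete, here is a minimal sketch in Python. The scenarios, probabilities, and impact scores are invented for illustration; a real analysis would use the figures produced by your own business impact analysis.

```python
# Illustrative probability/impact matrix for prioritizing disaster recovery
# work. Scenario names, probabilities, and impact scores are hypothetical
# placeholders, not real assessments.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    probability: float  # rough annual likelihood, 0.0 to 1.0
    impact: int         # business impact score, 1 (minor) to 5 (severe)

scenarios = [
    Scenario("Data center power failure", 0.10, 5),
    Scenario("Regional flood", 0.02, 4),
    Scenario("Pandemic staffing shortage", 0.05, 3),
    Scenario("Ransomware outbreak", 0.08, 5),
]

# Risk score = probability x impact; the highest-risk scenarios drive the
# order in which recovery solutions are designed, funded, and implemented.
for s in sorted(scenarios, key=lambda s: s.probability * s.impact, reverse=True):
    print(f"{s.name:30} risk = {s.probability * s.impact:.2f}")
```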

Next, you need to sell it. No disaster recovery program can be successful without the active participation of the owners of the business processes that are to be protected and recovered. Solutions will consist of an array of technical measures, including but not limited to off-site data backups, redundant and high-availability systems, uninterruptible power, and stand-by facilities. Solution planning, implementation, regular assurance testing, and maintenance all require the close involvement of the business process owners and their organizations to be effective. It is especially important that these business process owners be made part of the disaster recovery program from its inception so that they can help identify risks, establish recovery objectives, design workable recovery solutions, and dedicate sufficient resources for solution development, implementation, testing, and maintenance. Involvement early in the process will not only contribute to the effectiveness of the overall program, but will also foster buy-in, since the business process owners will have helped develop their recovery solutions rather than having them forced upon them.

Meetings with each business process owner will be held during each phase of the program's development, during which its goals and strategies will be formulated. A senior leader for each business process will be identified and designated as a champion, with responsibility for successful implementation of the recovery strategy. The champion will participate in all phases of the program's development, allocating resources and serving as a communication liaison to their business unit.

Armed with detailed project plans for the construction and implementation of each disaster recovery solution, you then implement each one according to the priorities established during the business risk and impact analysis. Resources from both the information technology staff and the business should be involved in solution implementation. Implementation can be iterative, providing incremental protection and recovery capabilities that grow until the recovery objectives are fully met. Incremental testing of communication and decision-making tactics should be included in each implementation, so that plans can be adjusted as opportunities for improvement are discovered.

Once a given solution has been implemented, it transitions to the testing and maintenance phase. Regular testing of each solution, at a frequency determined by its priority and complexity, will be scheduled to prove its effectiveness and readiness in the event of an actual disaster. Robust testing of all disaster recovery solutions, including notification, communication, decision-making, activation, recovery, and the transition back to normal operations, will be performed. Tests will be conducted at regular intervals, as well as at random, to properly gauge the quality of the disaster recovery solutions. Tests will be based on disaster scenarios, with resource availability staged to the particular scenario. Personnel from both the information technology staff and the business units will be rotated, both to eliminate dependency on individuals to execute recovery and to assess the quality of the disaster recovery procedural documentation. The tests must prove that systems can be recovered and support their corresponding business processes without compromising the actual business processes during the test. Subsequent to each test, a rigorous assessment will be conducted to grade its effectiveness. Feedback from these assessments will serve as input for improving the recovery solutions and procedures. Even failed tests are valuable, as they often provide the highest-quality feedback.
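
As a small illustration of how test frequency might follow from priority and complexity, the sketch below maps a recovery tier and a complexity rating to a testing interval in months. The specific intervals are assumptions for the example, not a recommendation or policy.

```python
# One illustrative way to derive a testing interval from a solution's
# recovery tier and complexity. The intervals are assumptions, not policy.
def test_interval_months(tier: str, complexity: str) -> int:
    base = {"critical": 3, "important": 6, "deferrable": 12}[tier]
    # Complex solutions are exercised more often so that documentation and
    # cross-trained staff stay current.
    return max(1, base - 2) if complexity == "high" else base

print(test_interval_months("critical", "high"))   # -> 1
print(test_interval_months("important", "low"))   # -> 6
```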

Good

Because every technology infrastructure is different, the specifics of a disaster recovery program development plan will be unique to each environment. The plan will no doubt have to be adjusted to conform to the technological architecture, business goals, social culture, and other circumstances and conditions present at a given organization. With a robust and comprehensive disaster recovery program in place, your customers, employees, stakeholders, and management can rest easy knowing that the company has the capability to sustain business operations when faced with any unfortunate event.