Re: Disaster Recovery Planning and Testing

Post #3,458 by cforde 8/1/01 2:06:32 PM
Re: Disaster Recovery Planning and Testing

DRP (Disaster Recovery Planning) is a big miserable job. It must have visible support from some senior VP, and there must be measurable targets with fixed dates. E.g., the test will be on this date and we expect these results (whatever they are); failing that, sufficient data will be collected to enable those results to be achieved next time.

The point of a recovery is to restore normal operations as quickly as possible. The best way to achieve that is for everyone to be responsible for the same stuff that they are responsible for in normal operations. This means the operations people are responsible for the platform (OS and all software and data files) and the applications people are responsible for the applications (correct operation and data integrity). This also helps with the scheduling of tasks during tests: the applications people come in after the operations people have recovered the system to a usable state, to make sure that the systems are in fact usable.

With that basis it's pretty clear who is responsible for what. It is then up to each group to develop their own policies and procedures to make their recoveries as quick and painless as possible. To be effective these policies and procedures must affect the normal day-to-day operations. This is where you will run into a great deal of resistance. This is also why you must have regularly scheduled tests with fixed dates. No fudging allowed. In a real recovery situation you don't get to move the date, and you take into the test only what has been previously planned for.

To answer your questions:

Who should develop the plan?
There should be one master plan that is very high level, providing the overall goals and framework. There should be separate plans for each system that cover how backups are done and how recovery is to be performed (the order of machines and procedures for each machine). The operations and applications people will need separate plans developed by and for themselves.

What role did the hot-site play in planning and testing?
In my experience the recovery site provided hardware and people to mount tapes. The rest was up to us.

Who should pay for the testing (App dev, Computer Opns, Client, etc.)?
This should not even be a question. DRP has a huge cost in both time and money. If the company wants to do it, the company should be paying for it, meaning it should be part of everyone's budget.

How did you do the testing? (By application, site, platform?)
We shipped the tapes and some very experienced operations people to the recovery site. There the systems were restored and connectivity was provided to another site (not the regular business offices, but someplace local to them) where the applications people verified the recovery and the integrity of their applications.

Frequency of the testing? What was the determination?
Initially 3-4 times a year, until we were confident in our ability to perform a successful recovery. Then twice a year. In normal operations software and hardware changes occur daily, and it is very easy to forget something. Even if nothing has been forgotten, you don't really know whether a new system can be recovered until you've done it.

What was the level of involvement by the application developers during the test?
Same as everyone else -- they had to plan for the recovery of their applications, the operation of those applications at a remote site, and the integrity of their data.

What was the determination of a successful test?
Set objectives for each test and check whether they were met. In the first tests this was as simple as: were the OSes restored, and was connectivity achieved to the offsite location? With practice this became: were the systems recovered within the target timeframe sufficiently to allow production usage? (Production usage is the goal.)

Add anything else that was related to the writing or testing.
Recovery is something you must design your systems for. If done right, it will permeate your entire computing services operations. It all starts with the backup procedures.

Have fun,
Carl Forde
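
As a rough illustration of the "set objectives for each test and check whether they were met" approach in the post above, here is a minimal Python sketch. The `Objective` class, the `evaluate_test` function, and all the sample values are hypothetical and not from the thread; the only ideas taken from the post are that objectives are fixed before the test, that later tests add a target timeframe, and that a failed test should still produce data to work from next time.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objective:
    """One pass/fail objective agreed before the test (hypothetical structure)."""
    name: str
    met: bool


def evaluate_test(objectives, elapsed, target):
    """Return (passed, unmet_names) for one recovery test.

    The test passes only if every objective was met and the recovery
    finished within the target timeframe. The list of unmet items is
    the data carried into the next test.
    """
    unmet = [o.name for o in objectives if not o.met]
    if elapsed > target:
        unmet.append("recovered within target timeframe")
    return len(unmet) == 0, unmet


# Illustrative values only; not taken from the thread.
early_test = [
    Objective("OSes restored", True),
    Objective("connectivity achieved to the offsite location", False),
]
passed, unmet = evaluate_test(early_test, timedelta(hours=30), timedelta(hours=24))
print(passed, unmet)
```

The unmet list matters as much as the pass/fail flag: it is what each group takes back to fix in its own procedures before the next fixed test date.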

Post #3,532 by tseliot 8/1/01 7:56:51 PM
Good summary.

"DRP is a big miserable job. It must have visible support from some senior VP and there must be measurable targets with fixed dates. Eg. the test will be on this date and we expect these results (whatever they are), failing that there will be sufficient data collected to enable those results to be achieved next time."

And in a large company (or un-namable government org, ahem) you need somebody, usually your department's VP, to work with the MBAs under him and come up with some hard numbers on risk vs. cost, even if the ratio of one to the other sounds obvious to a large public service org. This will not only give your project the strength to get going but will also give the board or other higher-ups something to chew on when your first test fails horribly: you can show in dollars what the cost would be.

"...The best way to achieve that is for everyone to be responsible for the same stuff that they are responsible for in normal operations....To be effective these policies and procedures must affect the normal day-to-day operations. This is where you will run into a great deal of resistance."

Which is why you need higher-ups to provide the political clout; if you don't have them, then run away screaming NOW. Line workers and even their immediate managers have their own goals for their departments, which seem in direct opposition to your goals. Hence the need for a higher power.

"What role did the hot-site play in planning and testing. In my experience the recovery site provided hardware and people to mount tapes. The rest was up to us."

Absolutely. Although there are service providers you can subcontract to for more than this. My question is, Joe, when you speak of "the hot site team", what do these people do day-to-day when they're not recovering from disaster? Or is it a disaster contractor?

That's her, officer! That's the woman that programmed me for evil!

Post #3,599 by jbrabeck 8/2/01 9:37:25 AM
Re: Good summary.

"Absolutely. Although there are service providers you can subcontract to for more than this. My question is, Joe, when you speak of "the hot site team", what do these people do day-to-day when they're not recovering from disaster? Or is it a disaster contractor?"

The hot-site is owned by the gov.org. Their day-to-day mission is to perform disaster tests for all the applications, for all the sites. There are multiple computer sites across the US. Their staff must review the disaster recovery plans that are submitted and then execute the test plan (on a scheduled basis). We are mandated to test the "critical" applications once a year. For the MF (mainframe) environment the hot-site has operators, schedulers and support staff. I don't know how the unix side is staffed. That's for next FY.

Joe

Post #3,601 by jbrabeck 8/2/01 9:41:25 AM
Excellent. Sent copy to my boss. She sends her thanks.

Do you mind if we use some of your statements in our meetings?

Joe

Post #3,616 by cforde 8/2/01 12:06:54 PM
Re: Excellent. Sent copy to my boss. She sends her thank

No problem. It's been over 5 years since my DRP involvement, but some of the memories are still fresh... :->

Have fun,
Carl Forde
