
Disaster Recovery Planning and Testing
\\vent on

I have been contracted to assist with the writing and testing of disaster recovery plans. I never realized how political an item this is. The application developers claim no responsibility... it belongs to operations. Operations say that they don't know the applications... Now I'm hearing that the disaster recovery hot site should take the responsibility of providing the disaster recovery plans... \\vent off

This site consists of 2 mainframes and 200 unix boxes (scheduled to grow to 800-1500). My group is responsible for coordinating the disaster recovery planning and testing.

If you have done disaster recovery testing...

Who should develop the plan?
What role did the hot-site play in planning and testing?
Who should pay for the testing (App dev, Computer Opns, Client, etc.)?
How did you do the testing? (By application, site, platform?)
Frequency of the testing? What was the determination?
What was the level of involvement by the application developers during the test?
What was the determination of a successful test?
Add anything else that was related to the writing or testing.

My boss is being asked these questions. I can give my experiences, but it's only being counted as my "opinion" (not by my boss, but above her). So, I need to gather information.

Tell me your experiences. Please.
Joe

and now you know why the stamp price went up again!
Re: Disaster Recovery Planning and Testing
The application developers claim no responsibility... it belongs to operations. Operations say that they don't know the applications

Sounds like you need a meeting to discuss this.

Not joking - have a meeting, and whatever comes out of it is binding. That'll make sure everybody shows up. :)

The problem is you need a champion from the top of the company, to give direction (and assign the needed people).

Who should develop the plan?

Isn't that you? By definition?

I've always found $$$ to be the barrier in DR. Basically, you need the company to tell you *what* has to be recoverable, and in what time frame. Probably several scenarios - no point in taking orders if Manufacturing is under a volcano. :)

What role did the hot-site play in planning and testing?

Full roll-over for the DR test.

Who should pay for the testing (App dev, Computer Opns, Client, etc.)?

Oh, shit. You're in it deep.

The company should be - this should be a full line item from SOMEBODY up higher than those departments.

How did you do the testing? (By application, site, platform?)

Yes.

Well, simulated. Bring up the DR system and see if connectivity could be established from various test locations, and if the systems came up correctly.
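
If it helps, the connectivity part of that can be scripted. A minimal sketch, assuming a Python environment; every host name and port in it is made up, so substitute your own boxes and services:

import socket

# Hypothetical (host, port) pairs to probe once the DR systems
# are up -- replace with the services you actually care about.
CHECKS = [
    ("dr-mf-gateway", 23),    # mainframe terminal gateway
    ("dr-unix-app01", 22),    # unix application box, ssh
    ("dr-unix-db01", 1521),   # database listener
]

def reachable(host, port, timeout=5):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in CHECKS:
    print(host, port, "up" if reachable(host, port) else "DOWN")

It won't tell you the applications work, only that the boxes answer - the application people still have to sign off on usability.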

Frequency of the testing? What was the determination?

Not enough. Usually once to test methodology, and again to double-check/test "fixes" that had been found.

What was the level of involvement by the application developers during the test?

Some were assigned to be available for the DR, and to take notes and see what happened, etc.

What was the determination of a successful test?

If the systems designated as recoverable were usable in the specified time frame.

Addison
Re: Disaster Recovery Planning and Testing
I need to put it into perspective. I'm at a gov org.

Sounds like you need a meeting to discuss this.

Not joking - have a meeting, and whatever comes out of it is binding. That'll make sure everybody shows up. :)

The problem is you need a champion from the top of the company, to give direction (and assign the needed people).


There is no "champion", nor a "top of the company". The building I work in houses multiple entities within the same org, and each entity has its own reporting chain. The next level up is a VP for dev, opns, and each client function. Up from that is where dev and opns merge.

Who should develop the plan?

Isn't that you? By definition?

I've always found $$$ to be the barrier in DR. Basically, you need the company to tell you *what* has to be recoverable, and in what time frame. Probably several scenarios - no point in taking orders if Manufacturing is under a volcano. :)


I should have phrased that a little better and asked "When should the plan be developed?" For the existing systems, I am trying to put together plans. However, this org is bringing in new, mission-critical apps all the time, without planning for disaster. Mainframe opns is told to find time in the schedule; the UNIX side is given new equipment and told to run. And some of the new UNIX apps are being moved down from the mainframe. My position is "before an app can be put into production, we should have the disaster plan written".

As to the scope, this site performs the payroll, HR and accounting services for the entire org. If we were to go down and these services were interrupted, the entire US would be affected.

What role did the hot-site play in planning and testing?

Full roll-over for the DR test.


I have managers trying to say that the hot-site should "learn" the applications and take an active role in the development of the test plans. (The hot-site is owned by the org.) My thinking is that we (the development staff) should create the plan, and that the hot-site should be able to recover the system based upon our documentation.

Who should pay for the testing (App dev, Computer Opns, Client, etc.)?

Oh, shit. You're in it deep.

The company should be - this should be a full line item from SOMEBODY up higher than those departments.


DR testing has been "overhead". From what I've been told, next FY the clients are being told to budget for it. I don't know the amount being requested, nor do I know if a client can "opt out". They might be able to change the level of criticality (how fast an application has to be recovered).
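
For what it's worth, when I picture "level of criticality" I picture something as simple as a tier table the client signs off on instead of opting out. The tiers and numbers below are invented, just to show the idea:

# Hypothetical criticality tiers mapping to how fast an application
# must be recovered (its recovery time objective). Examples only.
TIERS = {
    "critical":   {"rto_hours": 24,  "test": "annually"},   # e.g. payroll
    "important":  {"rto_hours": 72,  "test": "annually"},
    "deferrable": {"rto_hours": 168, "test": "on request"},
}

def rto_for(app, assignments):
    """Look up an application's agreed tier; default to critical."""
    return TIERS[assignments.get(app, "critical")]["rto_hours"]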

I was originally contracted just to prepare the plans for the applications that had already been tested, then to write the plans for the apps yet to be tested. Now I, and the group I work with, have been tasked with planning the disaster tests for next FY (required by regulation to be tested annually), and yet the person we work for has only administrative authority, not functional authority, over the developer and operations personnel we need to work with. Therefore, we are stuck in the position of "It's their responsibility.... No, it's not, it's yours...."

/frustration....
Joe

one "letter" more than UPS
Re: Disaster Recovery Planning and Testing

DRP is a big, miserable job. It must have visible support from some senior VP, and there must be measurable targets with fixed dates. E.g., the test will be on this date and we expect these results (whatever they are); failing that, there will be sufficient data collected to enable those results to be achieved next time.

The point of a recovery is to restore normal operations as quickly as possible. The best way to achieve that is for everyone to be responsible for the same stuff that they are responsible for in normal operations. This means the operations people are responsible for the platform (OS and all software and data files) and the applications people are responsible for the applications (correct operation and data integrity). This also helps with the scheduling of tasks during tests. The applications people come in after the operations people have recovered the system to a usable state, to make sure that the systems are in fact usable. With that basis it's pretty clear who is responsible for what. It is then up to each group to develop their own policies and procedures to make their recoveries as quick and painless as possible.

To be effective these policies and procedures must affect the normal day-to-day operations. This is where you will run into a great deal of resistance. This is also why you must have regularly scheduled tests with fixed dates. No fudging allowed. In a real recovery situation you don't get to move the date and you only take into the test what has been previously planned for.

To answer your questions:
Who should develop the plan?
There should be one master plan that is very high level, providing the overall goals and framework. There should be separate plans for each system that cover how backups are done and how recovery is to be performed (the order of machines and the procedures for each machine). The operations and applications people will need separate plans, developed by and for themselves. (There's a rough sketch of what a per-system plan might look like at the end of this post.)
What role did the hot-site play in planning and testing?
In my experience the recovery site provided hardware and people to mount tapes. The rest was up to us.
Who should pay for the testing (App dev, Computer Opns, Client, etc.)?
This should not even be a question. DRP (Disaster Recovery Planning) has a huge cost in both time and money. If the company wants to do it, the company should be paying for it. Meaning it should be part of everyone's budget.
How did you do the testing? (By application, site, platform?)
We shipped the tapes and some very experienced operations people to the recovery site. There the systems were restored and connectivity provided to another site, not the regular business offices, but someplace local to that, where the applications people verified the recovery and the integrity of their applications.
Frequency of the testing? What was the determination?
Initially 3-4 times a year, until we were confident in our ability to perform a successful recovery. Then twice a year. In normal operations, software and hardware changes occur daily. It is very easy to forget something. Even if something hasn't been forgotten, you don't really know if a new system can be recovered until you've done it.
What was the level of involvement by the application developers during the test?
Same as everyone else -- they had to plan for the recovery of their applications, the operation of those applications at a remote site and the integrity of their data.
What was the determination of a successful test?
Set objectives for each test and check whether they were met. In the first tests this was as simple as: were the OSes restored, and was connectivity achieved to the offsite location? With practice this became: were the systems recovered within the target timeframe, sufficiently to allow production usage? (Production usage is the goal.)
Add anything else that was related to the writing or testing.
Recovery is something you must design your systems for. It is something that if done right will permeate your entire computing services operations. It all starts with the backup procedures.
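
To make the "order of machines" and target-timeframe ideas concrete, here's a rough sketch of how a per-system plan could be written down as data and checked against the test clock. Everything in it - machine names, procedure documents, hour targets - is invented for illustration; it's the shape that matters, not the contents:

import time

# Hypothetical per-system recovery plan: machines listed in
# bring-up order, each with a procedure reference and a target
# completion time in hours from test start. Contents invented.
PAYROLL_PLAN = [
    {"machine": "db01",  "procedure": "restore-db.doc",  "target_hours": 8},
    {"machine": "app01", "procedure": "restore-app.doc", "target_hours": 12},
    {"machine": "web01", "procedure": "restore-web.doc", "target_hours": 14},
]

def run_test(plan, recover):
    """Walk the plan in order. recover() is whatever actually does
    the work for a machine and returns once it is usable; it is a
    parameter here so the sketch stays self-contained."""
    start = time.time()
    results = []
    for step in plan:
        recover(step["machine"], step["procedure"])
        elapsed = (time.time() - start) / 3600  # hours since test start
        results.append((step["machine"], round(elapsed, 1),
                        elapsed <= step["target_hours"]))
    return results

def successful(results):
    """A successful test = every machine usable within its target."""
    return all(met for _, _, met in results)

Once the plan is data, "was the test successful" stops being an argument and becomes a lookup.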

Have fun,
Carl Forde
Good summary.
DRP is a big, miserable job. It must have visible support from some senior VP, and there must be measurable targets with fixed dates. E.g., the test will be on this date and we expect these results (whatever they are); failing that, there will be sufficient data collected to enable those results to be achieved next time.


And in a large company (or un-namable government org *ahem*) you need somebody, usually your department's VP, to work with the MBAs under him and come up with some hard numbers on risk vs. cost, even if the ratio of one to the other sounds obvious for a large public-service org. This will not only give your project the strength to get going, but will also give the board or other higher-ups something to chew on when your first test fails horribly: you can show in dollars what the cost would be.

...The best way to achieve that is for everyone to be responsible for the same stuff that they are responsible for in normal operations....To be effective these policies and procedures must affect the normal day-to-day operations. This is where you will run into a great deal of resistance.


Which is why you need higher-ups to provide the political clout; if you don't have them, run away screaming NOW. Line workers and even their immediate managers have their own goals for their departments, which can seem in direct opposition to yours. Hence the need for a higher power.

What role did the hot-site play in planning and testing?
In my experience the recovery site provided hardware and people to mount tapes. The rest was up to us.


Absolutely. Although there are service providers you can subcontract to for more than this. My question is, Joe, when you speak of "the hot site team", what do these people do day-to-day when they're not recovering from disaster? Or is it a disaster contractor?
That's her, officer! That's the woman that programmed me for evil!
Re: Good summary.
Absolutely. Although there are service providers you can subcontract to for more than this. My question is, Joe, when you speak of "the hot site team", what do these people do day-to-day when they're not recovering from disaster? Or is it a disaster contractor?

The hot-site is owned by the gov.org. Their day-to-day mission is to perform disaster tests for all the applications, for all the sites. There are multiple computer sites across the US. Their staff must review the disaster recovery plans that are submitted and then execute the test plan (on a scheduled basis). We are mandated to test the "critical" applications once a year.

For the MF environment the hot-site has operators, schedulers and support staff. I don't know how the unix side is staffed. That's for next FY.
Joe
Excellent. Sent copy to my boss. She sends her thanks.
Do you mind if we use some of your statements in our meetings?
Joe
Re: Excellent. Sent copy to my boss. She sends her thanks.
No problem. It's been over 5 years since my DRP involvement, but some of the memories are still fresh... :->
Have fun,
Carl Forde
Well, at least we now know
That the TROR(+1) isn't powered by NT :-)
-----
Steve