DRP is a big miserable job. It must have visible support from some senior VP and there must be measurable targets with fixed dates. E.g. the test will be on this date and we expect these results (whatever they are); failing that, there will be sufficient data collected to enable those results to be achieved next time.
The point of a recovery is to restore normal operations as quickly as possible. The best way to achieve that is for everyone to be responsible for the same things they are responsible for in normal operations. This means the operations people are responsible for the platform (OS and all software and data files) and the applications people are responsible for the applications (correct operation and data integrity). This also helps with the scheduling of tasks during tests. The applications people come in after the operations people have recovered the system to a usable state, to make sure that the systems are in fact usable. On that basis it's pretty clear who is responsible for what. It is then up to each group to develop their own policies and procedures to make their recoveries as quick and painless as possible.
To be effective these policies and procedures must affect the normal day-to-day operations. This is where you will run into a great deal of resistance. This is also why you must have regularly scheduled tests with fixed dates. No fudging allowed. In a real recovery situation you don't get to move the date and you only take into the test what has been previously planned for.
To answer your questions:
- Who should develop the plan?
- There should be one very high-level master plan providing the overall goals and framework. There should be separate plans for each system that cover how backups are done and how recovery is to be performed (the order of machines and procedures for each machine). The operations and applications people will need separate plans developed by and for themselves.
- What role did the hot-site play in planning and testing?
- In my experience the recovery site provided hardware and people to mount tapes. The rest was up to us.
- Who should pay for the testing (App dev, Computer Opns, Client, etc.)
- This should not even be a question. DRP (Disaster Recovery Planning) has a huge cost in both time and money. If the company wants to do it, the company should be paying for it. Meaning it should be part of everyone's budget.
- How did you do the testing? (By application, site, platform?)
- We shipped the tapes and some very experienced operations people to the recovery site. There the systems were restored and connectivity provided to another site, not the regular business offices, but someplace local to that, where the applications people verified the recovery and the integrity of their applications.
- Frequency of the testing? What was the determination?
- Initially 3-4 times a year until we were confident in our ability to perform a successful recovery. Then twice a year. In normal operations software and hardware changes occur daily. It is very easy to forget something. Even if something hasn't been forgotten, you don't really know if a new system can be recovered until you've done it.
- What was the level of involvement by the application developers during the test?
- Same as everyone else -- they had to plan for the recovery of their applications, the operation of those applications at a remote site and the integrity of their data.
- What was the determination of a successful test?
- Set objectives for each test and check whether they were met. In the first tests this was as simple as: were the OSes restored, and was connectivity achieved to the offsite location? With practice this became: were the systems recovered within the target timeframe sufficiently to allow production usage? (Production usage is the goal.)
- Add anything else that was related to the writing or testing.
- Recovery is something you must design your systems for. It is something that if done right will permeate your entire computing services operations. It all starts with the backup procedures.
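If it helps to make the "set objectives and check whether they were met" idea concrete, here is a minimal Python sketch of how a test scorecard could be recorded. Everything in it is hypothetical: the `Objective` class, the `evaluate` function, the objective names and the hour figures are made up for illustration and are not taken from our actual tests.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Objective:
    """One measurable target for a recovery test (hypothetical structure)."""
    name: str
    target: timedelta            # time allowed, measured from the start of the test
    actual: Optional[timedelta]  # time actually taken; None if never achieved

    def met(self) -> bool:
        # An objective counts as met only if it was achieved within its target.
        return self.actual is not None and self.actual <= self.target


def evaluate(objectives: list[Objective]) -> tuple[bool, list[str]]:
    """Return (overall success, names of the objectives that were missed)."""
    missed = [o.name for o in objectives if not o.met()]
    return (not missed, missed)


# Illustrative figures only (assumptions, not real results).
objectives = [
    Objective("OS restored", timedelta(hours=8), timedelta(hours=6)),
    Objective("Connectivity to offsite location", timedelta(hours=12), timedelta(hours=13)),
    Objective("Applications verified for production use", timedelta(hours=24), None),
]

success, missed = evaluate(objectives)
print("Test passed" if success else f"Test failed; missed: {', '.join(missed)}")
```

The list of missed objectives is the useful part: it is the data you carry into the next scheduled test when the targets are not met the first time.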