Here's a little report that I whipped up
that may interest some people here. I may
flesh it out and present it at Sage.
-----------------------------------------
Oracle restore via checkpoints
I attempted to do a single tablespace checkpoint rollback.
It was BAD. I ended up trashing my base.
The good: I got my 330G Oracle base back without resorting
to tape.
The bad: It took about 4 hours to copy the files around
when it should have taken 10 seconds of rolling back the
checkpoints.
The ugly: This notice is in the first paragraph of the
Veritas vxckptadm man page, and vxckptadm is the command needed
to roll back from a checkpoint.
DESCRIPTION
The vxckptadm utility is not an end-user supported utility
and should not be run by users. The VxDBA utility interfaces
to this command, allowing management of Storage Checkpoints.
If the VxDBA utility WORKED, I wouldn't need to deal with this command directly.
There was a bug in the initial Veritas checkpoint save script that
put bad entries in the Veritas list of files to restore, which then
caused the restore step to abort.
The stupid: If I had spent a few more minutes rereading and
testing, I could have avoided the copy time. After finishing the
"copy restore", I reread the man page, checked the failure log,
and realised that the rollback needed file names, and would
not work on directories or file systems. So I then rolled it
all back via a large list of file names, and it worked fine.
On the other hand, this is why we test.
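Since the rollback wants file names rather than directories or file systems, the "large list of file names" has to be generated somehow. Here is a minimal sketch of one way to do that, assuming the datafiles live under a single directory tree. The function name and the idea of skipping symlinks are assumptions for illustration, not part of the Veritas tooling.

```python
import os
import sys


def list_regular_files(root):
    """Return a sorted list of every regular file under root (no directories)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Keep plain files only; skip symlinks and other special entries.
            if os.path.isfile(full) and not os.path.islink(full):
                paths.append(full)
    return sorted(paths)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one path per line so the list can be saved and reviewed first.
    for path in list_regular_files(sys.argv[1]):
        print(path)
```

The output is one file name per line, which can be checked by eye before it goes anywhere near a rollback. How the list is then passed to vxckptadm is left out on purpose, since the report does not give the exact arguments.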
It seems that all the user-level checkpoint tools deal with the
creation and mounting of checkpoints, not rolling back. fsckptadm
is the command for most of the work, but it has no way of rolling
back. The vxckptadm man page talks about rolling back Oracle instances
(which failed) and individual files, but NOT the entire file system,
which is what I wanted but could not have.
Lessons learned:
Never trust Veritas stuff without testing. We always knew that,
but it is worth repeating. In this case their scripts use 'sqlplus' for
some system setup, but they get confused by the output if things such
as "set timing on" are in the glogin.sql file.
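A quick pre-flight check for this kind of problem can be sketched as below. It scans the text of a glogin.sql file for settings that add extra output to every sqlplus session. Only "set timing on" comes from the report; including "set echo on" is an illustrative guess, and the function name is hypothetical.

```python
import re

# Settings assumed to add extra output to every sqlplus session.
# "timing" is from the report; "echo" is an illustrative guess.
NOISY = re.compile(r"^\s*set\s+(timi(ng)?|echo)\s+on\b", re.IGNORECASE)


def find_noisy_settings(glogin_text):
    """Return (line_number, line) pairs for settings that may confuse scripts."""
    hits = []
    for lineno, line in enumerate(glogin_text.splitlines(), start=1):
        if NOISY.match(line):
            hits.append((lineno, line.strip()))
    return hits
```

For example, calling find_noisy_settings(open(path).read()) on the site's glogin.sql before running the checkpoint scripts would flag the line to comment out, where path is wherever that file lives on the system.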
Keep a percentage of every file system empty to be used for checkpoint
data. Checkpoints work great, but you need to know what you are doing. The
amount of free space required depends on the amount of data being changed.
Checkpoints survive a system reboot, so you can keep a LONG history
around if you are willing to spend the disk. Checkpoints become stale and
unusable as you run out of disk.
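To put a rough number on "a percentage", here is a back-of-the-envelope sketch for a single checkpoint. It assumes the space a checkpoint needs grows with the amount of data changed since it was taken, and can never exceed one full copy of the data. The change rate and retention period are hypothetical inputs that have to be measured on the real system.

```python
def reserve_estimate_gb(data_gb, changed_gb_per_day, days_kept):
    """Rough upper bound on space for ONE checkpoint kept for days_kept days."""
    worst_case = changed_gb_per_day * days_kept
    # A single checkpoint never needs more than one full copy of the data.
    return min(worst_case, data_gb)


def reserve_percent(data_gb, changed_gb_per_day, days_kept):
    """Same estimate, expressed as a percentage of the data size."""
    return 100.0 * reserve_estimate_gb(data_gb, changed_gb_per_day, days_kept) / data_gb
```

With a 330 GB base, an assumed 10 GB of changes per day, and a one week retention, this gives 70 GB. Treat it as a ceiling to plan against, not a measurement.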
Before any major operation, take a cold Oracle checkpoint. This takes
about 2 minutes. You could do a hot one, but the restore is far more
complex, and since the data warehousing instance is not in archivelog mode,
hot backups (even via checkpoint) are chancy.
It takes 5 hours to back up 345GB to EZ17 via mounted checkpoints. You
can do this while the base is up and active.
It takes 5 hours to restore 345GB from EZ17. To make the restore easier,
it is better to have a single large partition to bring it back into,
rather than deal with multiple partitions and the associated links.
It takes 4 hours of copying from a mounted checkpoint via disk to restore
those same files. You can mount writable checkpoints to test program changes,
without fear of damaging your original base. The docs say you can
automatically generate a test instance with a new SID, but I haven't tested
that yet.
It takes 10 seconds of rollback via checkpoint to restore those same
files.
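The timings above are easier to compare side by side. The sketch below uses only the figures from this report (345GB, 5 hours, 4 hours, 10 seconds) and assumes 1 GB = 1024 MB. The dictionary keys and function names are just labels for illustration.

```python
SIZE_GB = 345

# Durations in seconds, taken from the figures in the report.
METHODS = {
    "backup to EZ17 via mounted checkpoint": 5 * 3600,
    "restore from EZ17": 5 * 3600,
    "copy from mounted checkpoint": 4 * 3600,
    "checkpoint rollback": 10,
}


def throughput_mb_per_s(size_gb, seconds):
    """Effective data rate, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / seconds


def slowdown_vs_rollback(seconds, rollback_seconds=10):
    """How many times longer a method takes than the 10 second rollback."""
    return seconds / rollback_seconds


for name, secs in METHODS.items():
    rate = throughput_mb_per_s(SIZE_GB, secs)
    ratio = slowdown_vs_rollback(secs)
    print(f"{name}: {rate:.1f} MB/s effective, {ratio:.0f}x the rollback time")
```

The "MB/s" figure for the rollback itself is not a real transfer rate, since a rollback does not copy the data. The useful numbers are the ratios: the disk copy takes 1440 times as long as the rollback, and the EZ17 restore takes 1800 times as long.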