Recovering from a failed boot disk is not a difficult procedure with Solstice DiskSuite when the system has been properly set up and documented initially. Recovering from failures of this nature is a good example of the power of DiskSuite, even in a simple two-disk mirroring situation. For the purposes of this demonstration, we will use the same setup presented in Mirroring Disks with Solstice DiskSuite.
The first step is obviously to identify which piece of hardware failed. In this example it will be the boot disk, which is /dev/dsk/c0t0d0. Once the failed disk has been identified, it is important to boot the system from the second half of the mirror before the failed device is replaced. Booting the second half of the mirror can be a simple procedure or a complicated one, depending on how the system was configured initially. If an alternate boot device had been defined as an nvalias, simply boot off of that device:
ok boot altdisk

If the alternate boot alias was not created, it may be possible to boot off of one of the built-in alternate devices provided in the OpenBoot PROM. These are numbered disk0 through disk6 and generally apply only to disks on the system's internal SCSI controller. If all else fails, use probe-scsi-all to determine the device path to the secondary disk and create an alias to boot from.
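For example, an alias can be created directly from the OpenBoot PROM once the device path of the secondary disk is known. The path used below is purely illustrative; substitute the path reported by probe-scsi-all (or devalias) on your own hardware:

ok probe-scsi-all
ok nvalias altdisk /pci@1f,4000/scsi@3/disk@1,0
ok boot altdisk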
When the system comes up, it will complain about stale database replicas and will only allow you to boot into single-user mode. Technically, the system cannot boot without the minimum number of database replicas. Since we have lost an entire disk with half of the total number of replicas, DiskSuite does not have enough information to boot. We must log in to the system in single-user mode and delete the database replicas that were on the failed disk. Use the metadb command without any arguments to list the replicas and identify which have failed. Then delete the stale replicas using metadb -d, as follows:
# metadb
# metadb -d /dev/dsk/c0t0d0s3
# metadb -d /dev/dsk/c0t0d0s5
# metadb -d /dev/dsk/c0t0d0s6
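The exact listing varies between DiskSuite releases, but replicas on the failed disk are typically flagged with errors (an M or W in the flags column) and may show unknown block information. Assuming the good disk keeps its replicas on the same slices, an illustrative listing might look something like this:

        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t1d0s3
     a    p  luo        16              1034            /dev/dsk/c0t1d0s5
     a    p  luo        16              1034            /dev/dsk/c0t1d0s6
    M     p             unknown         unknown         /dev/dsk/c0t0d0s3
    M     p             unknown         unknown         /dev/dsk/c0t0d0s5
    M     p             unknown         unknown         /dev/dsk/c0t0d0s6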
The system can now be shut down and rebooted with the new disk installed. Remember, you will still need to boot from the second half of the mirror.
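A minimal sketch of that step, assuming the altdisk alias from earlier, is to halt the system to the OpenBoot prompt, replace the disk, and then boot the surviving half of the mirror:

# init 0
ok boot altdisk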
The first task once the system is back up is to partition the replacement disk. This disk must be partitioned identically to its mirror. While there are several ways of doing this, the simplest by far is using prtvtoc to print out the volume table of contents (VTOC) of the good disk, and then using that with the fmthard command to write the table to the new disk. An example is below:
# prtvtoc /dev/rdsk/c0t1d0s2 > /tmp/format.out
# fmthard -s /tmp/format.out /dev/rdsk/c0t0d0s2
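Before proceeding, it does no harm to print the VTOC of the new disk and confirm that its partition entries now match those of the good disk (the comment header naming the device will of course differ):

# prtvtoc /dev/rdsk/c0t0d0s2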
Once the new disk has been partitioned, the state database replicas deleted earlier must now be recreated. These are created in the exact same way they were created originally.
# metadb -a /dev/dsk/c0t0d0s3
# metadb -a /dev/dsk/c0t0d0s5
# metadb -a /dev/dsk/c0t0d0s6
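Running metadb with no arguments once more should now list all of the replicas, with the newly created ones on c0t0d0 showing no error flags:

# metadb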
The failed submirrors must now be detached from the mirror. Detaching the failed submirrors stops any kind of read or write operations to that half of the mirror when activity occurs on the metadevice. Since the submirrors to be detached are reported in an error state, the detach must be forced, as follows:
# metadetach -f d0 d10
# metadetach -f d1 d11
# metadetach -f d4 d14
# metadetach -f d7 d17
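metastat can be used at any point to confirm which submirrors are in an error state; prior to a forced detach, the failed half of each mirror is reported as needing maintenance (the exact wording varies by release):

# metastat d0
# metastat d1
# metastat d4
# metastat d7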
Once the failed submirrors have been detached, they need to be cleared. This will allow us to later reinitialize the submirrors and reattach them to the metadevice.
# metaclear d10
# metaclear d11
# metaclear d14
# metaclear d17
Now that the failed submirrors have been detached and the metadevices cleared, it is possible to recreate the submirrors and reattach them to the metadevices, causing an immediate resynchronization of the submirrors. Recreating and reattaching the submirrors uses exactly the same steps as creating them in the first place. In our example, we will do the following:
# metainit d10 1 1 c0t0d0s0
# metainit d11 1 1 c0t0d0s1
# metainit d14 1 1 c0t0d0s4
# metainit d17 1 1 c0t0d0s7
# metattach d0 d10
# metattach d1 d11
# metattach d4 d14
# metattach d7 d17

The status of the resynchronization process can be monitored using the
metastat command. Resyncing the submirrors can take a long time depending on the amount of data on the partitions. Once resynchronization is completed, the system can be rebooted. Ensure that the system is now properly booting from its primary boot disk and that everything is operating normally.
# metastat | more
# init 6
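For reference, while the resync is running metastat reports the percentage complete for each mirror. An illustrative fragment for d0 is shown below; the name of the surviving submirror (d20 here) and the ordering of the submirrors depend on how the mirrors were originally built:

d0: Mirror
    Submirror 0: d20
      State: Okay
    Submirror 1: d10
      State: Resyncing
    Resync in progress: 12 % done
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)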
DiskSuite provides robust tools for mirroring disks, including the critical boot disk, which usually contains the / (root), /usr, and swap partitions necessary for the proper operation of the system. Working with metadevices can be a daunting task, especially if you do not work with DiskSuite on a daily basis. Many administrators are faced with the scenario of setting up mirroring on a server, not working with it once it is running, and then having to deal with recovery many months down the road. Sometimes even worse is having to recover a system that you didn't configure in the first place. The key to a successful recovery is having proper system documentation in the first place and outlining an exact procedure on paper before beginning the recovery process. Hopefully the example presented above gives most administrators a framework they can quickly adapt to their individual situation.