Manual: (OSF1-V5.1-alpha)



volwatch(8)							  volwatch(8)



NAME

  volwatch - Monitors the Logical Storage Manager (LSM) for failure events
  and performs hot sparing

SYNOPSIS

  /usr/sbin/volwatch [-m] [-s] [-o volrecover_arg] [mail-addresses...]

OPTIONS

  -m  Runs volwatch with mail notification support, which notifies root (by
      default) or other specified users when a failure occurs. This option is
      enabled by default.

  -s  Runs volwatch with hot spare support.

  -o volrecover_arg
      Specifies an argument to pass directly to volrecover when volwatch runs
      it with hot spare support enabled.

DESCRIPTION

  The volwatch command monitors LSM, waiting for exception events to occur.
  When an exception event occurs, the volwatch command uses mailx(1) to send
  mail to:

    +  The root account.

    +  The user accounts specified when you use the rcmgr command to set the
       VOLWATCH_USERS variable in the /etc/rc.config.common file.

    +  The user accounts that you specify on the command line with the
       volwatch command.

  The volwatch command uses the volnotify command to wait for events to
  occur. When an event occurs, there is a 15-second delay before the failure
  is analyzed and the message is sent. This delay allows a group of related
  events to be collected and reported in a single mail message. By default,
  the volwatch command starts automatically when the system boots.
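The effect of the 15-second delay can be illustrated with a small model. The following Python sketch is not the volwatch implementation (volwatch is driven by volnotify, whose output is not described here); it assumes each event is a hypothetical (seconds, description) pair and groups every event that arrives within 15 seconds of the first event of a group into one message. The name batch_events is hypothetical.

```python
from typing import List, Tuple

# Hypothetical event model: (arrival time in seconds, short description).
Event = Tuple[float, str]

DELAY_SECONDS = 15.0  # collection delay documented for volwatch


def batch_events(events: List[Event], window: float = DELAY_SECONDS) -> List[List[Event]]:
    """Group events that would be reported together in one mail message.

    A group opens with the earliest unreported event; any event arriving
    within `window` seconds of that opening event joins the same group.
    """
    batches: List[List[Event]] = []
    for event in sorted(events):
        opened_at = batches[-1][0][0] if batches else None
        if opened_at is not None and event[0] - opened_at <= window:
            batches[-1].append(event)  # still inside the collection window
        else:
            batches.append([event])  # window closed: this starts a new message
    return batches
```

For example, events arriving at 0, 3.5, and 40 seconds produce two messages: one covering the first two events and one covering the third.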

  To start the volwatch command with hot-spare support, enter the
  volwatch -s command. Hot-spare support:

    +  Detects LSM events resulting from the failure of a disk, plex, or
       RAID5 subdisk.

    +  Sends mail to the root account (and other specified accounts) with
       notification about the failure and identifies the affected LSM
       objects.

    +  Determines which subdisks to relocate, finds space for those subdisks
       in the disk group, relocates the subdisks, and notifies the root
       account (and other specified accounts) of these actions and their
       success or failure.

       When a partial disk failure occurs (that is, a failure affecting only
       some subdisks on a disk), redundant data on the failed portion of the
       disk is relocated, and the existing volumes composed of the unaffected
       portions of the disk remain accessible.

				     Note

       Hot-sparing is performed only for redundant (mirrored or RAID5)
       subdisks on a failed disk. Non-redundant subdisks on a failed disk are
       not relocated, but you are notified of the failure.

       Only one volwatch daemon can be running on a system or cluster node at
       any time.

       Hot-sparing does not guarantee the same layout of data or the same
       performance after relocation. You may want to make some configuration
       changes after hot-sparing occurs.

  Mail Notification Support


  The following is a sample mail notification sent when a failure is detected:

       Failures have been detected by the Logical Storage Manager:

       failed disks:

       medianame

       ...

       failed plexes:

       plexname

       ...

       failed log plexes:

       plexname

       ...

       failing disks:

       medianame

       ...

       failed subdisks:

       subdiskname

       ...

       The Logical Storage Manager will attempt to find spare disks,
       relocate failed subdisks and then recover the data in the failed plexes.

  The sections of the mail message are as follows:

    +  The medianame list under failed disks specifies disks that appear to
       have failed completely.

    +  The medianame list under failing disks indicates a partial disk
       failure or a disk that is in the process of failing. When a disk has
       failed completely, the same medianame appears under both failed disks
       and failing disks.

    +  The plexname list under failed plexes shows plexes that have been
       detached due to I/O failures experienced while attempting to do I/O to
       subdisks they contain.

    +  The plexname list under failed log plexes indicates RAID5 or dirty
       region log (DRL) plexes that have experienced failures.

    +  The subdiskname list under failed subdisks specifies subdisks in RAID5
       volumes that have been detached due to I/O errors.
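Because a completely failed disk is listed under both failed disks and failing disks, the two lists together distinguish complete failures from partial ones. The following Python sketch assumes the two medianame lists have already been extracted from a notification; classify_disks is a hypothetical helper, not an LSM tool.

```python
from typing import Dict, Iterable


def classify_disks(failed: Iterable[str], failing: Iterable[str]) -> Dict[str, str]:
    """Map each medianame to how it appears to have failed.

    Names under "failed disks" appear to have failed completely (they are
    also repeated under "failing disks"). Names that appear only under
    "failing disks" indicate a partial failure or a disk still failing.
    """
    failed_set = set(failed)
    result: Dict[str, str] = {}
    for name in failed_set | set(failing):
        result[name] = "complete" if name in failed_set else "partial"
    return result
```

For example, classify_disks(["disk01"], ["disk01", "disk03"]) reports disk01 as a complete failure and disk03 as a partial one.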

  Enabling Hot-Sparing


  By default, hot-sparing is disabled. To enable hot-sparing, enter the
  volwatch command with the -s option, for example:

  # volwatch -s

  To use hot-spare support, you should configure a disk as a spare, which
  identifies the disk as an available site for relocating failed subdisks.
  Disks that are identified as spares are not used for normal allocations
  unless you explicitly specify otherwise. This ensures that a pool of spare
  disk space is available for relocating failed subdisks and that this disk
  space is not consumed by normal operations.

  Spare disk space is the first space used to relocate failed subdisks.
  However, if no spare disk space is available, or if the available spare
  disk space is not suitable or sufficient, free disk space is used.

  You must initialize a spare disk and place it in a disk group as a spare
  before it can be used for replacement purposes. If no disks are designated
  as spares when a failure occurs, LSM automatically uses any available free
  disk space in the disk group in which the failure occurs. If there is not
  enough spare disk space, a combination of spare disk space and free disk
  space is used.
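The order of preference for relocation space (spare disk space first, then free disk space, and a mix of both when spare space alone is not enough) can be sketched as a first-fit placement. This Python model is an illustration under stated assumptions, not the LSM allocator: sizes are in hypothetical block units, each failed subdisk is placed whole on a single disk, subdisks are placed largest first, and redundancy constraints are ignored. The name place_subdisks is hypothetical.

```python
from typing import Dict, Optional


def place_subdisks(
    subdisks: Dict[str, int],
    spare_space: Dict[str, int],
    free_space: Dict[str, int],
) -> Optional[Dict[str, str]]:
    """Map each failed subdisk to a target disk, preferring spare disks.

    subdisks maps subdisk name -> size; spare_space and free_space map disk
    media name -> available blocks. Returns None when some subdisk fits
    nowhere, which corresponds to no relocation taking place.
    """
    # Copy the pools so the caller's dictionaries are not modified.
    pools = [dict(spare_space), dict(free_space)]  # spare space is tried first
    placement: Dict[str, str] = {}
    for name, size in sorted(subdisks.items(), key=lambda item: -item[1]):
        for pool in pools:
            target = next((disk for disk in sorted(pool) if pool[disk] >= size), None)
            if target is not None:
                pool[target] -= size
                placement[name] = target
                break
        else:
            return None  # neither spare nor free space can hold this subdisk
    return placement
```

With one 120-block spare and 80 free blocks elsewhere, a 100-block subdisk goes to the spare and a 50-block subdisk falls back to free space, giving the mixed case described above.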

  When hot-sparing selects a disk for relocation, it preserves the redundancy
  characteristics of the LSM object to which the relocated subdisk belongs.
  For example, hot-sparing ensures that subdisks from a failed plex are not
  relocated to a disk containing a mirror of the failed plex. If redundancy
  cannot be preserved using available spare disks and/or free disk space,
  hot-sparing does not take place. If relocation is not possible, mail is
  sent indicating that no action was taken.
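The redundancy rule can be expressed as a filter over candidate disks. This Python sketch assumes a hypothetical view of the configuration in which each disk media name maps to the set of volumes that have storage on it, and it applies the simplifying rule that a disk already holding any part of the same volume is not eligible; the actual LSM checks depend on the volume layout. The name eligible_targets is hypothetical.

```python
from typing import Dict, List, Set


def eligible_targets(
    volume: str,
    candidates: List[str],
    volumes_on_disk: Dict[str, Set[str]],
) -> List[str]:
    """Return candidate disks that hold no storage belonging to `volume`.

    Putting a relocated subdisk on a disk that already holds another copy
    (or another part) of the same volume would let a single disk failure
    affect both, so such disks are skipped. An empty list means hot-sparing
    would not take place for this volume.
    """
    return [disk for disk in candidates if volume not in volumes_on_disk.get(disk, set())]
```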

  When hot-sparing takes place, the failed subdisk is removed from the
  configuration database, and LSM takes precautions to ensure that the disk
  space used by the failed subdisk is not recycled as free disk space.

  Initializing and Removing Hot-Spare Disks


  Although hot-sparing does not require you to designate disks as spares,
  Compaq recommends that you initialize at least one disk as a spare within
  each disk group; this gives you control over which disks are used for
  relocation. If no spare disks exist, LSM uses available free disk space
  within the disk group. When free disk space is used for relocation,
  performance is likely to degrade after the relocation.

  Follow these guidelines when choosing a disk to configure as a spare:

    +  The hot-spare feature works best if you specify at least one spare
       disk in each disk group containing mirrored or RAID5 volumes.

    +  If a given disk group spans multiple controllers and has more than one
       spare disk, set up the spare disks on different controllers (in case
       one of the controllers fails).

    +  For a mirrored volume, the disk group must have at least one disk that
       does not already contain one of the volume's mirrors. This disk should
       either be a spare disk with some available space or a regular disk
       with some free space.

    +  For a mirrored and striped volume, the disk group must have at least
       one disk that does not already contain one of the volume's mirrors or
       another subdisk in the striped plex. This disk should either be a
       spare disk with some available space or a regular disk with some free
       space.

    +  For a RAID5 volume, the disk group must have at least one disk that
       does not already contain the volume's RAID5 plex or one of its log
       plexes. This disk should either be a spare disk with some available
       space or a regular disk with some free space.

    +  If a mirrored volume has a DRL log subdisk as part of its data plex
       (that is, volprint does not list the plex length as LOGONLY), that
       plex cannot be relocated. Therefore, place log subdisks in plexes that
       contain no data (log plexes). By default, the volassist command
       creates log plexes.

    +  For mirroring the root disk, the rootdg disk group should contain an
       empty spare disk that satisfies the restrictions for mirroring the
       root disk.

    +  Although it is possible to build LSM objects on spare disks, it is
       preferable to use spare disks for hot-sparing only.

    +  When relocating subdisks off a failed disk, LSM attempts to use a
       spare disk large enough to hold all data from the failed disk.
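Two of these guidelines (at least one spare in each disk group with mirrored or RAID5 volumes, and spares spread across controllers when a disk group spans several) lend themselves to a simple audit. The following Python sketch works on a hypothetical description of one disk group; gathering that description from the system is not shown, and the Disk type and check_spares function are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    name: str        # disk media name
    controller: int  # controller number
    spare: bool      # True if the disk is designated as a spare


def check_spares(disks: List[Disk], has_redundant_volumes: bool) -> List[str]:
    """Return warnings for one disk group, based on two spare-disk guidelines."""
    warnings: List[str] = []
    spares = [d for d in disks if d.spare]
    if has_redundant_volumes and not spares:
        warnings.append("no spare disk in a disk group with mirrored or RAID5 volumes")
    group_controllers = {d.controller for d in disks}
    spare_controllers = {d.controller for d in spares}
    # More than one spare, several controllers, yet every spare on one controller.
    if len(group_controllers) > 1 and len(spares) > 1 and len(spare_controllers) == 1:
        warnings.append("all spare disks are on the same controller")
    return warnings
```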

  To initialize a disk that has no associated subdisks as a spare, use the
  voldiskadd command and enter y at the following prompt:

       Add disk as a spare disk for newdg? [y,n,q,?] (default: n) y

  To initialize an existing LSM disk as a spare disk, enter:

  # voledit set spare=on medianame

  For example, to initialize a disk called test03 as a spare disk, enter:

  # voledit set spare=on test03

  To remove a disk as a spare, enter:

  # voledit set spare=off medianame

  For example, to make a disk called test03 available for normal use, enter:

  # voledit set spare=off test03

  Replacement Procedure


  In the event of a disk failure, mail is sent, and if volwatch was
  configured to run with hot-sparing support (the -s option), volwatch
  attempts to relocate any subdisks that appear to have failed. This
  involves finding appropriate spare disk space or free disk space in the
  same disk group as the failed subdisk.

  To determine which of the eligible spare disks to use, volwatch tries to
  use the disk that is closest to the failed disk. Closeness depends on the
  controller, target, and disk number of the failed disk. For example, a disk
  on the same controller as the failed disk is closer than a disk on a
  different controller; a disk under the same target as the failed disk is
  closer than one under a different target.
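The closeness rule can be modeled as a sort key. In this Python sketch a disk location is a hypothetical (controller, target, disk number) triple; a candidate on the same controller ranks ahead of one on a different controller, and among those, one under the same target ranks ahead of one under a different target. Using the distance between disk numbers as a final tie-breaker is an assumption, not documented behavior. Location, closeness_key, and closest_spare are hypothetical names.

```python
from typing import Dict, NamedTuple, Tuple


class Location(NamedTuple):
    controller: int
    target: int
    disk: int


def closeness_key(candidate: Location, failed: Location) -> Tuple[bool, bool, int]:
    """Smaller keys are closer: same controller first, then same target."""
    return (
        candidate.controller != failed.controller,  # False sorts before True
        candidate.target != failed.target,
        abs(candidate.disk - failed.disk),  # assumed tie-breaker on disk number
    )


def closest_spare(spares: Dict[str, Location], failed: Location) -> str:
    """Pick the eligible spare closest to the failed disk (spares must be non-empty)."""
    return min(spares, key=lambda name: closeness_key(spares[name], failed))
```

Encoding "same" as False lets Python's tuple ordering rank controller before target without any custom comparison logic.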

  If no spare or free disk space is found, the following mail message is sent
  explaining the disposition of volumes on the failed disk:

       Relocation was not successful for subdisks on disk dm_name
       in volume v_name in disk group dg_name.
       No replacement was made and the disk is still unusable.

       The following volumes have storage on medianame:

       volumename
       ...

       These volumes are still usable, but the redundancy of
       those volumes is reduced. Any RAID-5 volumes with storage
       on the failed disk may become unusable in the face of further
       failures.

  If non-RAID5 volumes are made unusable due to the failure of the disk, the
  following is included in the mail message:

       The following volumes:

       volumename
       ...

       have data on medianame but have no other usable
       mirrors on other disks. These volumes are now unusable
       and the data on them is unavailable. These volumes must
       have their data restored.

  If RAID5 volumes are made unavailable due to the disk failure, the
  following is included in the mail message:

       The following RAID-5 volumes:

       volumename
       ...

       have storage on medianame and have experienced
       other failures. These RAID-5 volumes are now unusable
       and data on them is unavailable. These RAID-5 volumes must
       have their data restored.

  If spare disk space is found, LSM attempts to set up a subdisk on the spare
  disk and use it to replace the failed subdisk. If this is successful, the
  volrecover command runs in the background to recover the data in volumes on
  the failed disk.

  If the relocation fails, the following mail message is sent:

       Relocation was not successful for subdisks on disk dm_name in
       volume v_name in disk group dg_name. No replacement was made
       and the disk is still unusable.

       error message

  If any volumes (RAID5 or otherwise) are rendered unusable due to the
  failure, the following is included in the mail message:

       The following volumes:

       volumename
       ...

       have data on dm_name but have no other usable mirrors on other
       disks. These volumes are now unusable and the data on them is
       unavailable. These volumes must have their data restored.

  If the relocation procedure completes successfully and recovery is under
  way, the following mail message is sent:

       Volume v_name Subdisk sd_name relocated to newsd_name,
       but not yet recovered.

  Once recovery has completed, a message is sent reporting the outcome of the
  recovery procedure. If the recovery was successful, the following is
  included in the mail message:

       Recovery complete for volume v_name in disk group dg_name.

  If the recovery was not successful, the following is included in the mail
  message:

       Failure recovering v_name in disk group dg_name.
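An administrator who wants to track relocations from these notifications could match the message texts quoted on this page. The following Python sketch assumes the first line of each message begins exactly as shown here (real mail may wrap or vary), and it reports the latest status seen for each volume; the PATTERNS list and the classify_line and latest_status functions are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Patterns derived from the message texts quoted in volwatch(8).
PATTERNS = [
    ("relocated", re.compile(r"Volume (?P<vol>\S+) Subdisk \S+ relocated to \S+,")),
    ("recovered", re.compile(r"Recovery complete for volume (?P<vol>\S+) in disk group \S+\.")),
    ("recovery_failed", re.compile(r"Failure recovering (?P<vol>\S+) in disk group \S+\.")),
]


def classify_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (status, volume) for a recognized message line, else None."""
    text = " ".join(line.split())  # collapse tabs and repeated spaces
    for status, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return status, match.group("vol")
    return None


def latest_status(lines: Iterable[str]) -> Dict[str, str]:
    """Map each volume to the last status reported for it."""
    status: Dict[str, str] = {}
    for line in lines:
        result = classify_line(line)
        if result is not None:
            status[result[1]] = result[0]
    return status
```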

SEE ALSO

  mailx(1), rcmgr(8), voldiskadm(8), voledit(8), volintro(8), volrecover(8)