OLAR_intro(5)							OLAR_intro(5)



NAME

  OLAR_intro, olar_intro - Introduction	to Online Addition and Removal (OLAR)
  Management

DESCRIPTION

  Introduction to Online Addition and Removal (OLAR) Management


  Online addition and removal management is used to expand capacity, upgrade
  components, and replace failed components while the operating system, its
  services, and applications continue to run. This functionality, sometimes
  referred to as "hot-swap", provides the benefits of increased system uptime
  and availability during both scheduled and unscheduled maintenance. Start-
  ing with Tru64 UNIX Version 5.1A, CPU OLAR is supported. Additional OLAR
  capabilities are planned for subsequent releases of the operating system.

  OLAR management is integrated	with the SysMan	suite of system	management
  applications,	which provides the ability to manage all aspects of the	sys-
  tem from a centralized location.

  You must be a privileged user to perform OLAR management operations.
  Alternatively, you can grant selected users or groups access by configuring
  privileges with Division of Privileges (DOP), as described below.
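
  For example, to launch the DOP configuration application (described in the
  "Tools for Managing OLAR" section below) and assign the
  "HardwareManagement" privilege, issue the command:

	    # sysman dopconfig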

  Note that only one administrator at a time can initiate OLAR operations;
  additional administrators are prevented from initiating OLAR operations
  until the current operation completes.

  CPU OLAR Overview


  Tru64 UNIX supports the ability to add, replace, or remove individual CPU
  modules on supported AlphaServer systems while the operating system and
  applications continue to run. Newly inserted CPUs are automatically recog-
  nized by the operating system, but the system does not begin scheduling
  and executing processes on them until the CPU module is powered on and
  placed online through one of the supported management applications
  described below. Conversely, before a CPU can be physically removed from
  the system, it must be placed offline and then powered off. Processes
  queued for execution on a CPU that is to be placed offline are simply
  migrated to the run queues of other running (online) processors.

  By default, the offline state of a CPU persists across reboot and system
  initialization until the CPU is explicitly placed online. This behavior
  differs from the default behavior of previous versions of Tru64 UNIX,
  where a CPU that was placed offline would automatically return to service
  after a reboot or system restart. Note that for backward compatibility,
  the psradm(8) and offline(8) commands still provide the non-persistent
  offline behavior; however, these commands are not recommended for perform-
  ing OLAR operations.
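
  For example, using the hwmgr interface described later in this reference
  page, you can request the non-persistent behavior explicitly, so that the
  CPU returns to the online state at the next boot:

	    # hwmgr -offline -nosave -name CPU2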

  On platforms supporting this functionality, any CPU can participate in an
  OLAR operation, including the primary CPU and CPUs that handle I/O inter-
  rupts. These roles are delegated to other running CPUs when a CPU cur-
  rently acting as the primary or as an I/O interrupt handler must be placed
  offline or removed.

  Why Perform OLAR on CPUs


  OLAR of CPUs may be performed	for the	following reasons:

  Computational	Capacity Expansion

      A system manager wants to provide additional computational capacity to
      the system without having to bring the system down. For example, an
      AlphaServer GS320 with available CPU slots can have its CPU capacity
      expanded by adding CPU modules to the system while the operating sys-
      tem and applications continue to run.

  Maintenance Upgrade

      A	system manager wants to	upgrade	specific system	components to the
      latest model without having to bring the system down. As an example, a
      GS160 with earlier model Alpha CPU modules can be	upgraded to later
      model CPUs with higher clock rates, while	the operating system contin-
      ues to run.

  Failed Component Replacement

      A	system component is indicating a high incidence	of correctable errors
      and the system manager wants to perform a	proactive replacement of the
      failing component	before it results in a hard failure.  As an example,
      the Component Indictment facility	(described below) has indicated
      excessive	correctable errors in a	CPU module and has therefore recom-
      mended its replacement. Once the CPU module has been placed offline and
      powered off, either through the Automatic	Deallocation Facility (also
      described	below) or through manual intervention, the CPU module can be
      replaced while the operating system continues to run.

  Cautions Before Performing OLAR on CPUs


  Before performing an OLAR operation, be aware	of the following cautions:

    +  When offlining or removing one or more CPUs, processes scheduled to
       run on the affected CPUs are scheduled to execute on other running
       CPUs, thus redistributing the processing load among the remaining
       CPUs. In general, this results in a system performance degradation,
       proportional to the number of CPUs taken out of service and the
       current system load, for the duration of the OLAR operation. Multi-
       threaded applications that are written to take advantage of a known
       degree of CPU concurrency can expect significant performance degrada-
       tion during the OLAR operation.

    +  The OLAR	management utilities do	not presently operate with processor
       sets. Processor sets are	groups of processors that are dedicated	for
       use by selected processes (see processor_sets(4)). If a process has
       been specifically bound to run on a processor set (see runon(1),
       assign_pid_to_pset(3) ),	and an OLAR operation is attempted on the
       last running CPU	in the processor set, you will not be notified by the
       OLAR utilities that you are effectively shutting	down the entire	pro-
       cessor set. Offlining the last CPU in a processor set will cause	all
       processes bound to that processor set to	suspend	until the processor
       set has at least	one running CPU. Therefore, use	caution	when perform-
       ing CPU OLAR operations on systems that have been configured with
       processor sets.

    +  If a process has been specifically bound to execute on a CPU (see
       runon(1), bind_to_cpu(3), and bind_to_cpu_id(3) for more information),
       and an OLAR operation is attempted on that CPU, the OLAR utilities
       notify you that processes are bound to the CPU before any operation
       is performed. You may choose to continue or cancel the OLAR opera-
       tion. If you continue, processes bound to the CPU suspend their
       execution until they are unbound or the CPU is placed back online.
       Note that offlining a CPU that has bound processes may have detrimen-
       tal consequences for the application, depending upon its characteris-
       tics.

    +  If a process has been specifically bound to execute on a Resource
       Affinity Domain (RAD) (see runon(1) and rad_bind_pid(3) for more
       information), and an OLAR operation is attempted on the last running
       CPU in the RAD, the OLAR utilities notify you that processes are
       bound to the RAD and that the last CPU in the RAD has been requested
       to be placed offline. If you continue, processes bound to the RAD
       suspend their execution until they are unbound or at least one CPU in
       the RAD is placed online. Note that offlining the last CPU in a RAD
       with bound processes may have detrimental consequences for the appli-
       cation, depending upon its characteristics.

    +  If you are using program profiling utilities that are aware of the
       system's CPU configuration, such as dcpi, kprofile, or uprofile,
       unpredictable results may occur when performing OLAR operations. It
       is therefore recommended that these profiling utilities be disabled
       prior to performing an OLAR operation. Ensure that all the processes,
       including any associated daemons, that are related to these utilities
       have been stopped before performing OLAR operations on system CPUs.

       The device drivers used by these	profiling utilities are	usually	con-
       figured into the	kernel dynamically, so the tools can be	disabled
       before each OLAR	operation with the following commands:


	    # sysconfig	-u pfm


	    # sysconfig	-u pcount

       The appropriate driver can be re-enabled	with one of the	following:


	    # sysconfig	-c pfm


	    # sysconfig	-c pcount

       The automatic deallocation of CPUs, enabled through the Automatic
       Deallocation Facility, should be disabled whenever the pfm or pcount
       device drivers are configured into the kernel; conversely, those
       drivers should be unconfigured whenever automatic deallocation is
       enabled. Refer to the documentation and reference pages for these
       utilities for additional information.
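
       To check whether these drivers are currently configured before
       starting an OLAR operation, you can query their subsystem state
       (this assumes the sysconfig -s query option, which reports whether
       a subsystem is loaded):

	    # sysconfig -s pfm
	    # sysconfig -s pcount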

  General Procedures for Online	Addition and Removal of	CPUs




				    Caution

       Pay attention to	the system safety notes	as outlined in the
       GS80/160/320 Service Manual.

    +  Removing	a CPU Module

       To perform an online removal of a CPU module, follow these steps	using
       your preferred management application, described	in the section "Tools
       for Managing OLAR".

	1.  Off-line the CPU. The operating system will	stop scheduling	and
	    executing tasks on this CPU. Using your preferred OLAR management
	    application, make note of the quad building	block (QBB) number
	    where this CPU is inserted.	 This is the "hard" (or	physical) QBB
	    number, and	does not change	if the system is partitioned.

	2.  Power the CPU module off. The LED on the CPU module	will
	    illuminate yellow, indicating that the CPU module is un-powered,
	    and	safe to	be removed.

	3.  Physically remove the CPU module. Note that the operating system
	    automatically recognizes that the CPU module has been physically
	    removed. There is no need to perform a scan operation to update
	    the hardware configuration.

    +  Adding a	CPU module

       To perform an online addition of	a CPU module, follow these steps
       using your preferred management application, described in the section
       "Tools for Managing OLAR".

	1.  Select an available	CPU slot in one	of the configured quad build-
	    ing	blocks (QBB). If there are available slots in several QBBs,
	    it is typically best to equally distribute the number of CPUs
	    among the configured QBBs.

	2.  Insert the CPU module into the CPU slot. Ensure that you align
	    the color-coded decal on the CPU module with the color-coded
	    decal on the CPU slot. The LED on the CPU module will illuminate
	    yellow, indicating that the CPU module is un-powered. Note that
	    the CPU will be automatically recognized by the operating sys-
	    tem, even though it is un-powered. There is no need to perform a
	    scan operation for the operating system to identify the CPU
	    module.

	3.  Power the CPU module on. The CPU module will undergo a short
	    self-test (7-10 secs), after which the LED will illuminate green,
	    indicating the module is powered-on	and has	passed its self-test.

	4.  On-line the	CPU. Once the CPU is on-line, the operating system
	    will automatically begin to	schedule and execute tasks on this
	    CPU.
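
       As a concrete illustration, both procedures map onto the hwmgr com-
       mand line interface (described in the next section), using CPU2 as
       an example:

	    # hwmgr -offline -name CPU2      (removal step 1: offline)
	    # hwmgr -power off -name CPU2    (removal step 2: LED yellow)
	      (physically remove or insert the module)
	    # hwmgr -power on -name CPU2     (addition step 3: LED green)
	    # hwmgr -online -name CPU2       (addition step 4: scheduling
					      resumes)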

  Tools	for Managing OLAR


  When it is necessary to perform an OLAR operation, use the following tools
  which	are provided as	part of	the SysMan suite of system management
  utilities.

  Manage CPUs


  "Manage CPUs"	is a task-oriented application that provides the following
  functions:

    +  Change the state	of a CPU to online or offline

    +  Power on	or power off a CPU

    +  Determine the status of each inserted CPU

  The "Manage CPUs" application	can be run equivalently	from an	X Windows
  display, a terminal with curses capability, or locally on a PC (as
  described below), thus providing a great deal	of flexibility when perform-
  ing OLAR operations.

				     Note

       You must	be a privileged	user to	run the	"Manage	CPUs" application.
       Non-root	users may also run the "Manage CPUs" application if they are
       assigned	the "HardwareManagement" privilege. To assign a	user the
       "HardwareManagement" privilege, issue the following command to launch
       the "Configure DOP" application:

	    # sysman dopconfig [-display <hostname>]

       Please refer to the dop(8) reference page and the on-line help in the
       'dopconfig' application for further information.	Additionally, the
       Manage CPUs application provides	online help capabilities that
       describe	the operation of this application.

  The "Manage CPUs" application	can be invoked using one of the	following
  methods:

    +  SysMan Menu

	1.  At the command prompt in a terminal	window,	enter the following
	    command:

	    [Note that the "DISPLAY" shell environment variable	must be	set,
	    or the "-display" command line option must be used,	in order to
	    launch the X Windows version of SysMan Menu.  If there is no
	    indication of which	graphics display to use, or if invoking	from
	    a character	cell terminal, then the	curses version of SysMan Menu
	    will be launched.]


		 # sysman [-display <hostname>]

	2.  Highlight the "Hardware" entry and press "Select"

	3.  Highlight the "Manage CPUs"	entry and press	"Select"

    +  SysMan command line accelerator

       To launch the Manage CPUs application directly via the command prompt
       in a terminal window, enter the following command:


	    # sysman hw_manage_cpus [-display hostname]

       [Note that the "DISPLAY"	shell environment variable must	be set,	or
       the "-display" command line option must be used,	in order to launch
       the X Windows version of	Manage CPUs.  If there is no indication	of
       which graphics display to use, or if invoking from a character cell
       terminal, then the curses version of Manage CPUs	will be	launched.]

    +  System Management Station

       To launch the Manage CPUs application from the System Management	Sta-
       tion, do	the following:

	1.  At the command prompt in a terminal	window from a system that
	    supports graphical display,	enter the following command:


		 # sysman -station [-display hostname]

	    When the System Management Station launches, two separate windows
	    will appear. One window is the Status Monitor view,	and the	other
	    window is the Hardware view, providing a graphical depiction of
	    the	hardware connected to your system.

	2.  Select the Hardware	view window.

	3.  Select the CPU for an OLAR operation by left-clicking once with
	    the	mouse.

	4.  Select Tools from the menu bar, or right-click once	with the
	    mouse. A list of menu options will appear.

	5.  Select Daily Administration	from the list.

	6.  Select the Manage CPUs application.

    +  Manage CPUs from	a PC or	Web Browser

       You can also perform OLAR management from your PC desktop or from
       within a	web browser. Specifically, you can run Manage CPUs via the
       System Management Station client	installed on your desktop, or by
       launching the System Management Station client from within a browser
       pointed to the Tru64 UNIX System	Management home	page. For a detailed
       description of options and requirements,	visit the Tru64	UNIX System
       Management home page, available from any	Tru64 UNIX system running
       V5.1A (or higher), at the following URL:

       http://hostname:2301/SysMan_Home_Page

       where "hostname"	is the name of a Tru64 UNIX Version 5.1A, (or higher)
       system.

  hwmgr	Command	Line Interface (CLI)


  In addition to its set of generic hardware management	capabilities, the
  hwmgr(8) command line	interface incorporates the same	level of OLAR manage-
  ment functionality as	the Manage CPUs	application. You must be root to run
  the hwmgr command; this command does not currently operate with DOP.

  The following describes the OLAR-specific commands supported by hwmgr. To
  obtain general help on the use of hwmgr, issue the command:


       # hwmgr -help

  To obtain help on a specific option, issue the command:

       # hwmgr -help "option"

  where	option is the name of the option you want help on.

   1.  To obtain the status and	state information of all hardware components
       the operating system is aware of, issue the following command:
	    # hwmgr -status comp
			     STATUS   ACCESS  HEALTH	  INDICT
	     HWID: HOSTNAME  SUMMARY  STATE   STATE	  LEVEL	  NAME
	    -------------------------------------------------------------
	       3:  wild-one	      online   available	   dmapi
	      49:  wild-one	      online   available	   dsk2
	      50:  wild-one	      online   available	   dsk3
	      51:  wild-one	      online   available	   dsk4
	      52:  wild-one	      online   available	   dsk5
	      56:  wild-one	      online   available	   Compaq
								   Alpha Server
								   GS160 6/731
	      57:  wild-one	      online   available	   CPU0
	      58:  wild-one	      online   available	   CPU2
	      59:  wild-one	      online   available	   CPU4
	      60:  wild-one	      online   available	   CPU6



       or, to obtain status on an individual component,	use the	hardware id
       (HWID) of the component and issue the command:


	    # hwmgr -status comp -id 58

			       STATUS	ACCESS	  HEALTH    INDICT
	     HWID:  HOSTNAME   SUMMARY	STATE	  STATE	    LEVEL   NAME
	    -------------------------------------------------------------
	       58:  wild-one		online	  available	    CPU2



       To see the complete list	of options for "-status", issue	the command:


	    # hwmgr -help status

   2.  To view a hierarchical listing of all hardware components the operat-
       ing system is aware of, issue the command:


	    # hwmgr -view hier
	     HWID: hardware hierarchy (!)warning (X)critical (-)inactive (see -status)
	     -------------------------------------------------------------------------
		1: platform Compaq AlphaServer GS160 6/731
		9:   bus wfqbb0
		10:	connection wfqbb0slot0
		11:	  bus wfiop0
		12:	    connection wfiop0slot0
		13:	      bus pci0
		14:		connection pci0slot1

		 o
		 o
		 o

		57:	cpu qbb-0 CPU0
		58:	cpu qbb-0 CPU2


       This example shows that CPU0 and	CPU2 are children of bus name
       "wfqbb0", and that their	physical location is (hard) qbb-0.  Note that
       hard QBB	numbers	do not change as the system partitioning changes.

       To quickly identify which QBB a CPU is associated with, issue the com-
       mand:


	    # hwmgr -view hier -id 58
	    HWID:   hardware hierarchy
	    -----------------------------------------------------
		58:   cpu qbb-0 CPU2

   3.  To offline a CPU that is currently in the online state, issue the
       command:


	    # hwmgr -offline -id 58

       or


	    # hwmgr -offline -name CPU2

       Note that device	names are case sensitive. In this example, CPU2	must
       be upper	case. To verify	the new	status of CPU2,	issue the command:


	    # hwmgr -status comp -id 58

			       STATUS	ACCESS	  HEALTH     INDICT
	     HWID:  HOSTNAME   SUMMARY	STATE	  STATE	     LEVEL   NAME
	    --------------------------------------------------------------
	       58:  wild-one   critical	offline	  available	     CPU2



       Note that the offline state will	be saved across	future reboots of the
       operating system, including power cycling the system. If	you want the
       component to return to the online state the next	time the operating
       system is booted, use the "-nosave" switch.


	    # hwmgr -offline -nosave -id 58

       or


	    # hwmgr -offline -nosave -name CPU2

       Once again, to verify the status	of CPU2, issue the command:


	    # hwmgr -status comp -id 58

			      STATUS   ACCESS		 HEALTH	      INDICT
	     HWID:  HOSTNAME   SUMMARY	STATE		 STATE	      LEVEL   NAME
	    ----------------------------------------------------------------------
	       58:  wild-one   critical	offline(nosave)	 available	      CPU2



   4.  To power	off a CPU that is currently in the offline state, issue	the
       command:


	    # hwmgr -power off -id 58

       or


	    # hwmgr -power off -name CPU2

       Note that a component must be in the offline state before power can
       be removed using hwmgr. Once power has been removed from a component,
       it is safe to remove that component from the system.

   5.  To power	on a CPU that is currently powered off,	issue the command:


	    # hwmgr -power on -id 58

       or


	    # hwmgr -power on -name CPU2

   6.  To place	a CPU online so	that the operating system can start schedul-
       ing processes to	run on that CPU, issue the command:


	    # hwmgr -online -id	58

       or


	    # hwmgr -online -name CPU2



  Refer	to the hwmgr(8)	reference page for additional information on the use
  of hwmgr.
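
  The following sh script is a minimal sketch showing how the commands
  above can be combined to take a CPU out of service by name; the script
  name and error handling are illustrative, not part of the hwmgr inter-
  face:

	    #!/sbin/sh
	    # cpu_offline.sh (hypothetical) - offline and power off a CPU
	    # so that its module can be physically removed.
	    # Usage: cpu_offline.sh <cpu-name>     (for example, CPU2)
	    # Device names are case sensitive.
	    cpu=$1
	    hwmgr -offline -name "$cpu" || exit 1
	    hwmgr -power off -name "$cpu" || exit 1
	    # Display the resulting component states for verification.
	    hwmgr -status comp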

  Component Indictment Overview


  Component indictment is a proactive notification from a fault analysis
  utility, indicating that a component is experiencing a high incidence of
  correctable errors and therefore should be serviced or replaced. Com-
  ponent indictment is the process of analyzing specific failure patterns
  from error log entries, either immediately or over a given time interval,
  and recommending a component's removal. The fault analysis utility sig-
  nals the running operating system that a given component is suspect,
  causing the operating system to distribute this information via an EVM
  indictment event so that interested applications, including the System
  Management Station, Insight Manager, and the Automatic Deallocation
  Facility, can update their state information, as well as take appropriate
  action if so configured (see the discussion of the Automatic Deallocation
  Facility below).

  It is	possible for more than one component to	be indicted simultaneously if
  the exact source of error cannot be pinpointed.  In these cases, the most
  likely suspect will be indicted with a `high`	probability. The next likely
  suspect will be indicted with	a `medium` probability,	and the	least likely
  suspect will be indicted with	a `low`	probability. When this situation
  arises, the indictment events	can be tied together by	examining the
  "report_handle" variable within the indictment events. Indictment events
  for the same error will contain the same "report_handle" value.
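
  For example, to list the indictment events that have occurred and search
  for their report handles, you can combine the EVM commands shown below
  with grep (this sketch assumes the report_handle variable appears by name
  in the formatted event output):

       # evmget -f '[name sys.unix.hw.state_change.indicted]' | \
	     evmshow -t "@name" -x | grep report_handle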

  The indicted state of a component will persist across reboot and system
  initialization if no action is taken to remedy the suspect component,
  such as an online repair operation. Once an indictment has occurred for a
  given component, another indictment event will not be generated for that
  component unless the utility has determined, through additional analysis,
  that the original indictment probability should be updated. In this case,
  the component will be re-indicted with the new probability. Once the
  indicted component has been serviced, it is necessary to manually clear
  the indicted component state with the following hwmgr command:

       # hwmgr -unindict -id <hwid>

  where <hwid> is the hardware id (HWID) of the component.

  Requiring the operator to manually clear the indicted state ensures posi-
  tive confirmation that a replaced component is operating properly.
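
  After clearing the indictment, you can verify that the INDICT LEVEL field
  is clear by checking the component status:

       # hwmgr -status comp -id <hwid>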

  All component	indictment EVM events have an event prefix of
  sys.unix.hw.state_change.indicted. You may view the complete list of all
  possible component indictment	events that may	be posted, including a
  description of each event, by	issuing	the command:

       # evmwatch -i -f '[name sys.unix.hw.state_change.indicted]' | \
	     evmshow -t "@name" -x | more

  You may view the list	of indictment events that have occurred	by issuing
  the command:

       # evmget	-f '[name sys.unix.hw.state_change.indicted]' |	evmshow	-t "@name"

  Component indictment is currently supported for CPU modules and memory
  pages.

  Compaq Analyze, included as part of the Web-Based Enterprise Services
  (WEBES) 4.0 product (or higher), is the fault	analysis utility that sup-
  ports	component indictment on	a Tru64	UNIX (V5.1A or higher) system. The
  WEBES	product	is included as part of the Tru64 UNIX operating	system dis-
  tribution, and must be installed after installation of the base operating
  system. Please refer to the Compaq Analyze documentation, distributed	with
  the WEBES product, for a list	of AlphaServer platforms that support the
  component indictment feature.

  Automatic Deallocation Facility Overview


  The Automatic Deallocation Facility provides the ability to automatically
  take an indicted component out of service, enabling the system to heal
  itself and improving its reliability and availability. The Automatic
  Deallocation Facility currently supports taking indicted CPUs and memory
  pages out of use.

  The behavior of the Automatic Deallocation Facility can be tailored on
  both standalone and clustered systems through the text-based OLAR Policy
  Configuration files. When operating in a clustered environment, the auto-
  matic deallocation policy specified in the cluster-wide file
  /etc/olar.config.common applies to all members of the cluster by default.
  However, individual cluster-wide policy variables can be overridden using
  the member-specific configuration file /etc/olar.config.

  The OLAR Policy Configuration files contain configuration variables that
  control specific behaviors of the Automatic Deallocation Facility, such
  as whether automatic deallocation is enabled and during what times of day
  it may occur. You can also specify a user-supplied script or executable
  that acts as the gating factor in deciding whether an automatic dealloca-
  tion operation should proceed.
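
  As an illustration only, a policy file might contain entries along the
  following lines; the variable names shown here are hypothetical, so con-
  sult olar.config(4) for the actual names and syntax:

       # Illustrative sketch of /etc/olar.config.common; variable names
       # below are hypothetical; see olar.config(4) for the real ones.
       AUTO_DEALLOC_ENABLED=1            # enable automatic deallocation
       AUTO_DEALLOC_START_TIME=01:00     # window during which automatic
       AUTO_DEALLOC_END_TIME=05:00       #   deallocation is permitted
       AUTO_DEALLOC_GATING_SCRIPT=/usr/local/sbin/olar_gate.sh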

  Automatic deallocation is supported for those	platforms that support the
  component indictment feature,	as described in	the Component Indictment
  Overview section above.

  Refer	to the olar.config(4) reference	page for additional information	about
  the OLAR Policy Configuration	files.

SEE ALSO

  Commands: sysman(8), sysman_menu(8), sysman_station(8), hwmgr(8),
  codconfig(8), dop(8)

  Files: olar.config.common(4)

  System Administration

  Configuring and Managing Systems for Increased Availability Guide