CRASH(8)                    System Manager's Manual                   CRASH(8)

       crash - what to do when the system crashes

       This  section  gives  at  least a few clues about how to proceed if the
       system crashes.  It can't pretend to be complete.

       Bringing it back up.  If the reason for the crash is not  evident  (see
       below for guidance on `evident') you may want to try to dump the system
       if you feel up to debugging.  At the moment a dump can be taken only on
       magtape.  With a tape mounted and ready, stop the machine, load address
       44, and start.  This should write a copy of all of  core  on  the  tape
       with  an EOF mark.  Caution: Any error is taken to mean the end of core
       has been reached.  This means that you must be sure the ring is in, the
       tape  is  ready, and the tape is clean and new.  If the dump fails, you
       can try again, but some of the registers will be lost.  See  below  for
       what to do with the tape.

       In  restarting  after  a crash, always bring up the system single-user.
       This is accomplished by following the directions in boot(8) as modified
       for  your particular installation; a single-user system is indicated by
       having a particular value in the switches (173030 unless you've changed
       init)  as  the  system starts executing.  When it is running, perform a
       dcheck and icheck(1) on all file systems which could have been  in  use
       at  the  time  of  the  crash.  If any serious file system problems are
       found, they should be repaired.  When you are satisfied with the health
       of your disks, check and set the date if necessary, then come up
       multi-user.  This is most easily accomplished by changing the
       single-user value in the switches to something else, then logging
       out by typing an EOT.
       To boot UNIX at all, three files (and the directories leading to
       them) must be intact.  First, the initialization program /etc/init must
       be present and executable.  If it is not, the CPU  will  loop  in  user
       mode  at location 6.  For init to work correctly, /dev/tty8 and /bin/sh
       must be present.  If  either  does  not  exist,  the  symptom  is  best
       described  as  thrashing.  Init will go into a fork/exec loop trying to
       create a Shell with proper standard input and output.
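
       As an illustration only, the three-file requirement can be checked
       from a backup system with the suspect root mounted somewhere.  The
       sketch below is in Python, which is not part of this system; the
       function name missing_boot_files and the mount point argument are
       hypothetical, and only existence (plus the execute bit on init) is
       tested, not the directories' internal consistency.

```python
import os

# Files named in the text above, relative to the root file system.
BOOT_FILES = ["etc/init", "dev/tty8", "bin/sh"]

def missing_boot_files(root):
    """Return the required boot files that are absent under `root`.

    `root` is a hypothetical mount point for the file system being
    examined.  /etc/init must also be executable.
    """
    missing = []
    for rel in BOOT_FILES:
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            missing.append("/" + rel)
        elif rel == "etc/init" and not os.access(path, os.X_OK):
            missing.append("/" + rel + " (not executable)")
    return missing
```

       An empty list means only that the three files are present; it says
       nothing about whether the file system itself is sound.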

       If you cannot get the  system  to  boot,  a  runnable  system  must  be
       obtained  from  a backup medium.  The root file system may then be doc-
       tored as a mounted file system as described below.  If  there  are  any
       problems  with  the root file system, it is probably prudent to go to a
       backup system to avoid working on a mounted file system.

       Repairing disks.  The first rule to keep in mind is that an addled disk
       should be treated gently; it shouldn't be mounted unless necessary, and
       if it is very valuable yet in quite bad shape,  perhaps  it  should  be
       dumped  before  trying surgery on it.  This is an area where experience
       and informed courage count for much.

       The problems reported by icheck typically fall into two  kinds.   There
       can  be  problems  with  the free list: duplicates in the free list, or
       free blocks also in files.  These can be cured easily  with  an  icheck
       -s.   If the same block appears in more than one file or if a file con-
       tains bad blocks, the files should be deleted, and the free list recon-
       structed.   The  best way to delete such a file is to use clri(1), then
       remove its directory entries.  If any of the affected files  is  really
       precious, you can try to copy it to another device first.

       Dcheck may report files which have more directory entries than links.
       Such situations are potentially dangerous; clri(1) discusses a
       special case of the problem.  All the directory entries for the file
       should be removed.  If on the other hand there are more links than
       directory entries, there is no danger of spreading infection, merely
       some disk space that is lost to use.  It is sufficient to copy the
       file (if it has any entries and is useful), then use clri on its
       inode and remove any directory entries that do exist.

       Finally, there may be inodes reported by dcheck that have 0 links and 0
       entries.   These  occur  on  the root device when the system is stopped
       with pipes open, and on other file systems when the system  stops  with
       files  that  have  been deleted while still open.  A clri will free the
       inode, and an icheck -s will recover any missing blocks.
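
       The three dcheck cases above can be summarized as a small decision
       table.  This is a sketch in Python for illustration; the function
       name dcheck_action is hypothetical, and the returned strings merely
       restate the advice in the text rather than run any command.

```python
def dcheck_action(links, entries):
    """Suggest a repair for one inode reported by dcheck.

    `links` is the link count stored in the inode; `entries` is the
    number of directory entries found that refer to it.
    """
    if links == 0 and entries == 0:
        # Open pipe or deleted-while-open file left behind by the crash.
        return "clri the inode, then icheck -s"
    if entries > links:
        # Potentially dangerous: more names than the inode admits to.
        return "remove all directory entries for the file"
    if links > entries:
        # Harmless, but the space is lost until the inode is cleared.
        return "copy the file if useful, clri the inode, remove remaining entries"
    return "consistent; nothing to do"
```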

       Why did it crash?  UNIX types a message on the console typewriter  when
       it  voluntarily  crashes.   Here  is the current list of such messages,
       with enough information to provide a hope at least of the remedy.   The
       message has the form `panic: ...', possibly accompanied by other infor-
       mation.  Left unstated in all cases is the possibility that hardware or
       software error produced the message in some unexpected way.

       blkdev
            The getblk routine was called with a nonexistent major device as
            argument.  Definitely hardware or software error.

       devtab
            Null device table entry for the major device used as argument to
            getblk.  Definitely hardware or software error.

       iinit
            An I/O error reading the super-block for the root file system
            during initialization.

       out of inodes
            A mounted file system has no more i-nodes when  creating  a  file.
            Sorry, the device isn't available; the icheck should tell you.

       no fs
            A  device  has  disappeared  from the mounted-device table.  Defi-
            nitely hardware or software error.

       no imt
            Like `no fs', but produced elsewhere.

       no inodes
            The in-core  inode  table  is  full.   Try  increasing  NINODE  in
            param.h.  Shouldn't be a panic, just a user error.

       no clock
            During initialization, neither the line nor programmable clock was
            found to exist.

       swap error
            An unrecoverable I/O error during a swap.  Really shouldn't  be  a
            panic, but it is hard to fix.

       unlink - iget
            The  directory  containing  a  file  being deleted can't be found.
            Hardware or software.

       out of swap space
            A program needs to be swapped out,  and  there  is  no  more  swap
            space.  It has to be increased.  This really shouldn't be a panic,
            but there is no easy fix.

       out of text
            A pure procedure program is being executed, and the table for such
            things is full.  This shouldn't be a panic.

       trap
            An unexpected trap has occurred within the system.  This is
            accompanied by three numbers: a `ka6', which is the contents of
            the segmentation register for the area in which the system's
            stack is kept; `aps', which is the location where the hardware
            stored the program status word during the trap; and a `trap
            type' which encodes which trap occurred.  The trap types are:

       0         bus error
       1         illegal instruction
       2         BPT/trace
       3         IOT
       4         power fail
       5         EMT
       6         recursive system call (TRAP instruction)
       7         11/70 cache parity, or programmed interrupt
       10        floating point trap
       11        segmentation violation

       In some of these cases it is possible for octal 20 to be added into the
       trap  type; this indicates that the processor was in user mode when the
       trap occurred.  If you wish to examine the stack  after  such  a  trap,
       either  dump  the  system, or use the console switches to examine core;
       the required address mapping is described below.
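
       Decoding a trap type by hand means separating the user-mode bit from
       the table index.  The Python sketch below illustrates this; it
       assumes the number is printed in octal (the table above runs 7, 10,
       11), and the names TRAP_NAMES and decode_trap are hypothetical.

```python
# Trap types from the table above, keyed by their octal values.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}
USER_MODE = 0o20  # added in when the processor was in user mode

def decode_trap(text):
    """Decode a trap type given as an octal string.

    Returns (description, user_mode).  Assumes octal console output.
    """
    value = int(text, 8)
    user_mode = bool(value & USER_MODE)
    name = TRAP_NAMES.get(value & ~USER_MODE, "unknown")
    return name, user_mode
```

       For example, decode_trap("31") separates octal 31 into octal 20
       (user mode) plus octal 11, a segmentation violation.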

       Interpreting dumps.  All file system problems should be taken care of
       before attempting to look at dumps.  The dump should be read into the
       file /usr/sys/core; cp(1) will do.  At this point, you should execute
       ps -alxk and who to print the process table and the users who were on
       at the time of the crash.

       You should dump (od(1)) the first 30 bytes of /usr/sys/core.
       Starting at location 4, the registers R0, R1, R2, R3, R4, R5, SP and
       KDSA6 (KISA6 for 11/40s) are stored.  If the dump had to be
       restarted, R0 will not be correct.  Next, take the value of KA6 (the
       saved KDSA6 or KISA6, at location 022(8) in the dump), multiply it by
       0100(8), and dump 02000(8) bytes starting from there.  This is the
       per-process data associated with the process running at the time of
       the crash.  Relabel the addresses 140000 to 141776.

       R5 is C's frame or display pointer.  Stored at (R5) is the old R5
       pointing to the previous stack frame.  At (R5)+2 is the saved PC of
       the calling procedure.  Trace this calling chain until you obtain an
       R5 value of 141756, which is where the user's R5 is stored.  If the
       chain is broken, you have to look for a plausible R5, PC pair and
       continue from there.  Each PC should be looked up in the system's
       name list using adb(1) and its `:' command, to get a reverse calling
       order.  In most cases this procedure will give an idea of what is
       wrong.  A more complete discussion of system debugging is impossible
       here.
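
       The procedure above can be sketched in Python for use on a copy of
       the dump taken to another machine.  This is an illustration under
       stated assumptions, not a supported tool: all function and constant
       names are hypothetical, words are taken to be 16-bit little-endian
       as on the PDP-11, and the per-process area is taken to be the 1024
       bytes that span addresses 140000 to 141776.

```python
import struct

REG_NAMES = ["R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6"]
U_BASE = 0o140000        # address the per-process area is relabeled to
U_SIZE = 0o2000          # bytes spanned by 140000..141776 (assumption)
USER_R5_SLOT = 0o141756  # where the user's R5 is stored

def read_registers(core):
    """Return the eight words saved at location 4 of the dump."""
    words = struct.unpack_from("<8H", core, 4)
    return dict(zip(REG_NAMES, words))

def u_area(core, ka6):
    """Return the per-process data: starts at KA6 * 0100 (octal)."""
    start = ka6 * 0o100
    return core[start:start + U_SIZE]

def word_at(area, addr):
    """Fetch the word at a relabeled address (140000..141776)."""
    return struct.unpack_from("<H", area, addr - U_BASE)[0]

def call_chain(area, r5, limit=64):
    """Follow saved R5 links, collecting the saved PC of each caller.

    Stops at the user's R5 slot, or early if the chain is broken
    (R5 odd or outside the per-process area).
    """
    pcs = []
    while r5 != USER_R5_SLOT and len(pcs) < limit:
        if r5 % 2 or not (U_BASE <= r5 <= U_BASE + U_SIZE - 4):
            break  # broken chain: look for a plausible R5, PC pair by hand
        pcs.append(word_at(area, r5 + 2))  # saved PC at (R5)+2
        r5 = word_at(area, r5)             # old R5 at (R5)
    return pcs
```

       A typical use would read the file, call read_registers, pass its
       KA6 value to u_area, and hand the result and the saved R5 to
       call_chain.  The PCs returned still have to be looked up in the
       system's name list.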

