AIX Tip of the Week: Reconfiguring AIX's System Dump

Audience: AIX Administrators

Date: July 24, 1999

I generally recommend changing the default setup for AIX's system dump facility. The default setting stops the system from rebooting after an unexpected halt.

A system dump copies selected areas of the kernel to disk (or tape) if the system halts unexpectedly. In AIX 4.x, the default dump location is the paging space. When a system attempts to reboot after an unexpected halt, it stops to warn the operator that the paging space contains a dump. The reboot does not continue until the operator tells the system what to do with the dump.

The system can be made to reboot automatically by changing the dump device from the paging space to a raw partition on the disk. The procedure for making this change is in the attached HTML file. See your AIX documentation for more information.


Managing System Dump Devices [manage.dump.32-42.cmd]


-------------------------------------------------------------------------------

Contents

About this document
Related documentation
Managing system dump devices
Determining proper size for dump device
Setting a tape drive as a dump device
Extended options in AIX 4.x
Dumping to a mirrored logical volume
Remote dumps over a network
-------------------------------------------------------------------------------

About this document

This document discusses how to manage storage devices used by AIX to store a
system dump in the event of a catastrophic operating system software failure.

Its intent is to help the system administrator ensure that a system dump will
be complete and usable for troubleshooting purposes.

This document applies to AIX versions 3.2 and 4.x.
-------------------------------------------------------------------------------

Related documentation

For more in-depth coverage of this subject, the following IBM documents are
recommended:

  o AIX Version 4.1 Software Problem Debugging and Reporting for the RISC
    System/6000 (GG24-2513)
  o Common Diagnostics and Service Guide (SA23-2687)
  o Diagnostic Information For Micro Channel Bus Systems (SA23-2765)
  o Diagnostic Information for Multiple Bus Systems (SA38-0509)
  o Problem Solving Guide and Reference (SA23-2204) (SA23-2606)
  o System Management Guide, V3.2 (SC23-2457)
  o System Management Guide, V4 (SC23-2525)

-------------------------------------------------------------------------------

Managing system dump devices

When an unexpected system halt occurs, the system dump facility automatically
copies selected areas of kernel data to the primary dump device. These areas
include kernel segment 0 as well as other areas registered in the Master Dump
Table by kernel modules or kernel extensions.

There are two dump devices: a primary and a secondary. To view information
about the current dump devices, enter:

sysdumpdev -l

Example:

# sysdumpdev -l
primary              /dev/hd7
secondary            /dev/sysdumpnull

In this example, the primary dump device is the logical volume hd7.

When the operating system is installed, the primary dump device is
automatically configured.

In AIX 3.2, the default primary dump device is /dev/hd7. This is a logical
volume dedicated for system dumps.

In AIX 4.x, the default dump device is /dev/hd6. This is the primary paging
space logical volume.

In both AIX 3.2 and 4.x the default secondary dump device is /dev/sysdumpnull.
This is a null device and any dump written to this device is lost.
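
As a minimal sketch, the check for whether the primary dump device is the
paging space could be scripted as shown here. It assumes the two-column
sysdumpdev -l layout from the example above. The function names
primary_dump_device and is_paging_dump are hypothetical helpers, not part of
AIX.

```shell
# Hypothetical helper: read `sysdumpdev -l` output on stdin and print the
# device listed on the "primary" line.
primary_dump_device() {
    awk '$1 == "primary" { print $2 }'
}

# Hypothetical helper: succeed if the given device is /dev/hd6, the default
# AIX 4.x primary paging space named in the text.
is_paging_dump() {
    [ "$1" = "/dev/hd6" ]
}

# Intended use on an AIX system (not run here):
#   dev=$(sysdumpdev -l | primary_dump_device)
#   is_paging_dump "$dev" && echo "primary dump device is the paging space"
```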
-------------------------------------------------------------------------------

Determining proper size for dump device

The default dump device created for system use may NOT be large enough for a
complete dump. To determine how large the dump device is, first identify the
primary dump device with sysdumpdev -l, as described in the previous section.
If the dump device is not currently set to a tape drive, then this device
should be a logical volume. To retrieve information about this logical volume,
enter:

lslv <LOGICAL VOLUME NAME>

Example:

lslv hd7

This command will return a screen of information. Obtain the values for LPs and
PP SIZE. Multiply these two values to get the size of the dump device in
megabytes.
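
The multiplication can be sketched as a small filter. This is an illustration
only: it assumes the lslv layout shown later in this document (an "LPs:" line
and a "PP SIZE: n megabyte(s)" field), and lv_size_mb is a hypothetical name.

```shell
# Hypothetical helper: read `lslv` output on stdin and print LPs * PP SIZE,
# the size of the logical volume in megabytes.
lv_size_mb() {
    awk '
        /^LPs:/ { lps = $2 }            # skips the "MAX LPs:" line
        /PP SIZE:/ {
            for (i = 1; i < NF; i++)
                if ($i == "SIZE:") pp = $(i + 1)
        }
        END {
            if (lps == "" || pp == "") exit 1
            print lps * pp
        }
    '
}

# Intended use on an AIX system (not run here):
#   lslv hd7 | lv_size_mb
```

For a logical volume with 4 LPs and a PP SIZE of 4 megabytes, the filter
prints 16.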

Next, determine how large the dump device for your machine should be.

To view an estimate of how large the dump device should be, enter:

sysdumpdev -e

Example:

# sysdumpdev -e
Estimated dump size in bytes: 4526080

NOTE: This value is what the machine would require in its CURRENT running
state. It can change based on the activity of the machine, so it is best to
run this command when the machine is under its heaviest workload.

The command returns a value in bytes. The primary dump device should be at
least as large as the value returned. In this case, the dump space needs to be
about 4.5 megabytes. A typical system has a physical partition size of 4
megabytes for rootvg, and the dump device can only grow in multiples of this
size. A dump space of 4 megabytes would not be large enough to hold this dump,
so the next size up, 8 megabytes, is required.
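
The rounding described above can be sketched in shell arithmetic. The function
name required_dump_mb is hypothetical, and the sketch assumes a megabyte of
1048576 bytes.

```shell
# Hypothetical helper: round an estimated dump size up to a whole number of
# physical partitions and print the result in megabytes.
#   $1 = estimated dump size in bytes (from sysdumpdev -e)
#   $2 = physical partition size in megabytes (PP SIZE from lslv)
required_dump_mb() {
    bytes=$1
    pp_mb=$2
    pp_bytes=$((pp_mb * 1024 * 1024))
    # Ceiling division: partitions needed to hold the whole dump
    pps=$(( (bytes + pp_bytes - 1) / pp_bytes ))
    echo $((pps * pp_mb))
}

# Example from the text: required_dump_mb 4526080 4   prints 8
```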

At AIX levels prior to 3.2.4, this command option may not be available. If this
is the case, a general rule of thumb is to make the dump device 1/4 of the size
of your total RAM. To obtain the size of your total RAM, enter:

bootinfo -r
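
A minimal sketch of the rule of thumb follows. It assumes that bootinfo -r
reports real memory in kilobytes; verify this on your AIX level before relying
on it. The function name quarter_ram_mb is hypothetical.

```shell
# Hypothetical helper: given real memory in kilobytes, print one quarter of
# it in megabytes, rounded up to a whole megabyte.
quarter_ram_mb() {
    ram_kb=$1
    # 1/4 of RAM in MB = ram_kb / 4 / 1024 = ram_kb / 4096, rounded up
    echo $(( (ram_kb + 4095) / 4096 ))
}

# Intended use on AIX (ASSUMPTION: bootinfo -r prints kilobytes):
#   quarter_ram_mb "$(bootinfo -r)"
```

For example, a machine with 131072 kilobytes (128 megabytes) of RAM gives a
suggested dump device size of 32 megabytes.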

If the dump device is a standard dump logical volume, such as hd7, then use the
command extendlv to increase its size. If it is the primary paging space hd6,
use the command chps.
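
Both commands take the number of logical partitions to add rather than a
target size. The sketch below works out that number. The function name
extra_lps_needed is hypothetical, and the extendlv and chps invocations in the
comments are illustrations to check against the command documentation for
your AIX level.

```shell
# Hypothetical helper: print how many logical partitions must be added so
# that a device reaches at least the target size.
#   $1 = current number of LPs
#   $2 = target size in megabytes
#   $3 = physical partition size in megabytes
extra_lps_needed() {
    current_lps=$1
    target_mb=$2
    pp_mb=$3
    needed=$(( (target_mb + pp_mb - 1) / pp_mb ))
    if [ "$needed" -gt "$current_lps" ]; then
        echo $((needed - current_lps))
    else
        echo 0    # already large enough, nothing to add
    fi
}

# Illustrations only (not run here):
#   extendlv hd7 "$(extra_lps_needed 1 8 4)"     # dedicated dump LV
#   chps -s "$(extra_lps_needed 16 80 4)" hd6    # primary paging space
```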
-------------------------------------------------------------------------------

Setting a tape drive as a dump device

If you do not have sufficient space on the system to store a dump, use a tape
drive as the dump device. To accomplish this, put a blank tape in the desired
tape drive and enter:

sysdumpdev -Pp /dev/rmt#

In this case, rmt# refers to the specific tape drive you want to use as the
dump device (for example, rmt0, rmt1, or rmt2).

Be aware that the tape drive will not be usable by any other application until
you re-assign the dump device to another location.
-------------------------------------------------------------------------------

Extended options in AIX 4.x

In AIX 4.x, sysdumpdev -l shows three extra attributes that are not available
in AIX 3.2.

Example:

# sysdumpdev -l
primary              /dev/hd6
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE

The copy directory entry specifies a filesystem in the rootvg volume group
where the dump will be copied upon reboot after a system dump. This only
applies if the primary dump device is the primary paging space (hd6).

The forced copy flag entry specifies what happens if there is not enough space
in the specified filesystem. If it is set to TRUE, the system prompts you to
copy the dump to external media. If it is set to FALSE and the system cannot
copy the dump to the filesystem, it discards the contents of the dump.

The always allow dump flag is a security measure. If this is set to FALSE, then
the only way to force a system dump is to turn the service key to Service and
then press Reset. It also prevents forcing a dump of any kind on machines with
no service key, such as all PCI-based machines.

If the primary dump device is the primary paging device, the system can copy
the dump to the filesystem save area only if that filesystem has enough free
space. Use the df command to check the free space. If it is smaller than the
space required for the dump (sysdumpdev -e), then either increase the size of
the filesystem, remove files from it until enough space is available, or move
the save area to another filesystem with the required space. The last option
can be accomplished with the sysdumpdev command. The save area filesystem must
be in the rootvg volume group.
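
The comparison between free space and the dump estimate can be sketched as
follows. This is an illustration under assumptions: the sysdumpdev -e output
format is the one shown earlier in this document, the free space has already
been read from df and converted to kilobytes, and the function names are
hypothetical.

```shell
# Hypothetical helper: read `sysdumpdev -e` output on stdin and print the
# estimated dump size in bytes.
estimated_dump_bytes() {
    awk -F': *' '/Estimated dump size in bytes/ { print $2 }'
}

# Hypothetical helper: succeed if the free space can hold the dump.
#   $1 = free space in the copy directory filesystem, in kilobytes
#   $2 = estimated dump size in bytes
copy_dir_has_room() {
    free_kb=$1
    dump_bytes=$2
    dump_kb=$(( (dump_bytes + 1023) / 1024 ))    # round up to whole KB
    [ "$free_kb" -ge "$dump_kb" ]
}

# Intended use on AIX (free_kb is a value you read from df yourself):
#   bytes=$(sysdumpdev -e | estimated_dump_bytes)
#   copy_dir_has_room "$free_kb" "$bytes" || echo "not enough space"
```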
-------------------------------------------------------------------------------

Dumping to a mirrored logical volume

AIX does not support dumping to a mirrored logical volume, because the dump is
written to only one copy of the logical volume. In other words, only one of the
mirrors will contain the dump. Since the logical volume is not being handled
like a mirrored logical volume, the newly written data (the dump) is not
synchronized with the other mirrors. Thus, when crash tries to read the dump,
it can obtain data from both mirrors, only one of which actually contains the
dump. That is, crash sees good dump data mixed with garbage data, and will not
read the dump.

Splitting the logical volume into one logical volume per copy of the original
leaves one of them with a good dump. This can be accomplished with the
splitlvcopy command. To split the logical volume, first run lslv to get the
LV IDENTIFIER:

# lslv hd7
LOGICAL VOLUME: hd7                 VOLUME GROUP: rootvg
LV IDENTIFIER:  0000335216021417.12 PERMISSION:   read/write
VG STATE:       active/complete     LV STATE:     opened/syncd
TYPE:           dump                WRITE VERIFY: off
MAX LPs:        128                 PP SIZE:      4 megabyte(s)
COPIES:         2                   SCHED POLICY: parallel
LPs:            4                   PPs:          8
STALE PPs:      0                   BB POLICY:    relocatable
INTER-POLICY:   minimum             RELOCATABLE:  yes
INTRA-POLICY:   middle              UPPER BOUND:  32
MOUNT POINT:    N/A                 LABEL:        None
MIRROR WRITE CONSISTENCY: on
EACH LP COPY ON A SEPARATE PV ?: no

Notice that there are two copies. This means that there is one mirror. Use the
splitlvcopy command to split the logical volume, hd7 in this case, into two
logical volumes.

# splitlvcopy 0000335216021417.12 1

A message similar to the following may appear:

splitlvcopy: WARNING! The logical volume being split, hd7,
        is open. Splitting an open logical volume may cause
        data loss or corruption and is not supported by IBM.
        IBM will not be held responsible for data loss or
        corruption caused by splitting an open logical
        volume. Do you wish to continue? y(es) n(o)? lv02

Enter y. The command will complete and show the name of the new logical volume
it created, e.g., lv02. At this point, hd7 contains one copy of the original
hd7, and lv02 contains the other. This is exactly what we need.

If there had been three copies, shown by lslv, then lv02 would contain two
copies of the original hd7.

Run crash on /dev/hd7 first to see whether it holds the good copy. If crash
does not give error messages, the correct copy has been found. If the dump is
unusable and the original hd7 had two copies, lv02 holds the only other copy,
so run crash on /dev/lv02. If the original hd7 had three copies, lv02 now has
two: run lslv lv02 to get its LV IDENTIFIER, then run splitlvcopy <LVID> 1 to
split lv02 into one logical volume per mirror, and run crash on each.
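
The identifier lookup and the split command from this procedure can be
sketched as a filter that only prints the command, so that it can be reviewed
before anything is run. It assumes the lslv layout shown above, and
split_command is a hypothetical name.

```shell
# Hypothetical helper: read `lslv` output on stdin. If the logical volume has
# more than one copy, print the splitlvcopy command that leaves one copy in
# the original logical volume. Nothing is executed.
split_command() {
    awk '
        /^LV IDENTIFIER:/ { id = $3 }
        /^COPIES:/        { copies = $2 + 0 }
        END {
            if (id != "" && copies > 1)
                print "splitlvcopy " id " 1"
        }
    '
}

# Intended use on an AIX system (not run here):
#   lslv hd7 | split_command
```

When the logical volume has only one copy, the filter prints nothing, which
signals that no further split is needed.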

This may not work for dumps taken to mirrored paging space, because the pager
may have already overwritten the dump.
-------------------------------------------------------------------------------

Remote dumps over a network

Currently, the system dump does not handle ARP requests received from the
server, or the gateway used, during the dump. If an ARP request is received
while taking a dump, this causes the dump to hang. If your system takes a
system dump and hangs on 0c7, this is likely the problem. At this point, power
the system off and reboot.

To avoid this problem, create a permanent ARP entry for the client (the dumping
machine) on the server or gateway. The machine that needs the permanent ARP
entry is the machine on the same local network or ring as the client. This can
be thought of as the logical server, since, if it is not the real server, the
dump data must pass through it to get to the real server.

NOTE: "Real server" refers to the machine designated in the remote dump
specification on the client.

Run the following steps on the real server to establish a permanent ARP entry
on the server or gateway machine.

  1. Ensure an ARP entry exists by pinging the client. Example:

         ping myclient.xyz.com

  2. Use arp -a to see the ARP table. Example:

      # arp -a

      myclient.xyz.com (128.3.56.9) at 10:0:5a:9:e:7d [token ring]
      myserver.xyz.com (128.3.56.20) at 10:0:5a:8f:12:bf [token ring]

  3. Now use the arp command to make the dumping client's entry permanent.
     Example:

          # arp -s 802.5 myclient.xyz.com 10:0:5a:9:e:7d

The 802.5 refers to a token-ring network. Valid network types are listed in the
documentation for the arp command, and are currently ether (802.3), fddi, and
802.5.
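
Steps 2 and 3 can be combined in a small filter that looks up the client's
hardware address and prints the matching arp -s command for review. This is a
sketch under assumptions: each arp -a entry is on a single line in the form
"host (address) at hwaddr [type]", and permanent_arp_command is a hypothetical
name.

```shell
# Hypothetical helper: read `arp -a` output on stdin and print the `arp -s`
# command that would make the entry for a host permanent. Nothing is executed.
#   $1 = host name exactly as it appears in the arp table
#   $2 = network type, for example 802.5
permanent_arp_command() {
    awk -v host="$1" -v nettype="$2" '
        $1 == host {
            # The hardware address is the field that follows "at"
            for (i = 2; i < NF; i++)
                if ($i == "at") {
                    print "arp -s " nettype " " host " " $(i + 1)
                    exit
                }
        }
    '
}

# Intended use on the server or gateway (not run here):
#   arp -a | permanent_arp_command myclient.xyz.com 802.5
```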

NOTE: If the dump hangs and the client must be rebooted, the partial dump on
the server may still be useful.

Techdocs Ref:90605210214768                    4FAX Ref:6221