Troubleshooting UNIX

What to do when you have system problems?

The basic idea is to follow an organized procedure and not just randomly jump from one idea to another and quickly loose sight of what you have checked. A good basic technique is to divide the problem into areas and check each area until you are reasonably certain that the problem is not there and then after noting what you have checked, go on to another area.

File System


Tour of Unix File system


/ is the root of the unix file system. Everything grows from here.

/bin This contains a set of binaries used when the machine is first booted. Usually only a small subset of commands are available here.

/etc This contains system configuration files and executables. The password file and boot scripts are two examples of configuration files.

/dev This contains special files that are used to communicate with disks, tape, cdrom, and terminals.

/home This usually has the users home directories.

/list+found This contains lost files. Files that are recovered by fsck are placed in this directory.

/mnt A temporary mount point directory. It should be empty. Used for temporarily mounting directories.

/tmp Temporary directory available to all users as scratch space. Used by editors for temporary working copies of files. This is why you cannot send mail, edit or such activities when /tmp is full.

/var Spool files, accounting data and other varying data.

Display Free Disk Space

To see what the file system is looking like, including free and used space, then use the df -k command to see what the space is doing.

[www]:[5:01pm]:[/var/log] > df -k
Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s1a    496111    38559   417864     8%    /
/dev/da0s1f  23324075 10301228 11156921    48%    /usr
/dev/da0s1e   9924475   194425  8936092     2%    /var
procfs              4        4        0   100%    /proc
[www]:[5:01pm]:[/var/log] >

One of the more common problems that UNIX will run into but is difficult to intially see is the file system filling up. There are several prominent reasons that this can happen:

  • Little system maintenance has been done in the past and the file system is simply full, most likely from temp files.
  • A hard drive didn't mount properly, either due to a configuration issue or more likely to a problem with the hard drive itself. See fsck below.
  • Configuration issue where a file was supposed to go to a tape drive or remote file system, but instead was stored on the local system and filled up the hard drive.

I/O Statistics


/usr/etc/iostat 5

This reports every 5 seconds.

This gives the I/O activity for disks, by default only 4 disks. This can help determine if disk activity is causing your problems.

 tin tout sps tps msps  sps tps msps  sps tps msps  us ni sy id
   0   15   2   0  2.0    2   0 23.7   16   1 13.7   4  9  3  0 84
   0    0   0   0  0.0    0   0  0.0    0   0  0.0   0  0  0  0100
   0    0   0   0  0.0   19   1 26.5   13   1 12.5   0  0  0  0100

Checking the Memory and CPU Environment


Top CPU use processes


top

This gives an updating display of the top 10 to 15 cpu processes and some cpu statistics. The default update interval is 5 seconds. This may not be available on some systems as if installed with privs it can kill jobs that do not belong to you.

load averages:  0.00,  0.00,  0.00    04:50:47
35 processes:  1 running, 34 sleeping

Memory: Real: 3992K/10M Virt: 8308K/112M Free: 588K


  PID USERNAME PRI NICE  SIZE   RES STATE   TIME   WCPU    CPU COMMAND
27519 dlq       30    0  212K  356K run     0:00  0.00%  0.00% top
27471 dlq       18    0  224K  300K sleep   0:01  0.00%  0.00% csh
27368 dlq       18    0  224K  304K sleep   0:00  0.00%  0.00% csh
25144 dlq       18    0  220K  304K sleep   0:00  0.00%  0.00% csh
27401 dlq       18    0  224K  300K sleep   0:00  0.00%  0.00% csh
   94 root      18  -12  316K  216K sleep  63:15  0.00%  0.00% xntpd
27366 dlq       18    0  388K  160K sleep   0:00  0.00%  0.00% screen
13977 joe       18    0  232K  120K sleep   0:01  0.00%  0.00%
   82 root      18    0  284K  108K sleep   5:11  0.00%  0.00% cron
   80 root      18    0   72K   60K sleep  29:41  0.00%  0.00% update
    1 root      10    0  136K  124K sleep   0:25  0.00%  0.00% init
  125 root      10    0  416K   64K sleep 120:58  0.00%  0.00% httpd
   98 root      10    0   72K   28K sleep   0:00  0.00%  0.00%
   99 root      10    0   72K   28K sleep   0:00  0.00%  0.00%
14239 sally      3    0  224K  224K sleep   0:00  0.00%  0.00%

uptime

The uptime utility displays the current time, the length of time the system has been up, the number of users, and the load average of the system over the last 1, 5, and 15 minutes.

[www]:[5:14pm]:[/home/rnejdl] > uptime
 5:14PM  up 73 days,  1:39, 4 users, load averages: 0.01, 0.05, 0.03
[www]:[5:14pm]:[/home/rnejdl] >

In general, these numbers should be under 1, but a system may in general run high, so it is a good idea to have some historical information for these values before assuming that something is wrong here.

ps

As the load average is high, you need to determine the cause. Use ps -auxw to find out what processes are running on your system. Does the list of processes look normal (you really need to do this before you have problems so you can know what "normal" is)? If not look at the abnormal cases, maybe 20 copies of sendmail running. This is likely not something of your making, more likely it is a network slowness so the messages are moving v e r y s l o w l y. But you would want to check to see if that is the case.

[crimson]:[5:17pm]:[/home/rnejdl] > ps -auxw
USER          PID %CPU %MEM   VSZ  RSS  TT  STAT STARTED      TIME COMMAND
rnejdl      40659  1.0  0.5  1700 1340  p7  Ss    5:17PM   0:00.05 -tcsh (tcsh)
root            1  0.0  0.1   528  208  ??  ILs  20Jul01   0:04.53 /sbin/init --
root            2  0.0  0.0     0    0  ??  DL   20Jul01   0:08.92  (pagedaemon)
root            3  0.0  0.0     0    0  ??  DL   20Jul01   0:00.00  (vmdaemon)
root            4  0.0  0.0     0    0  ??  DL   20Jul01   0:37.37  (bufdaemon)
root            5  0.0  0.0     0    0  ??  DL   20Jul01  17:15.40  (syncer)
root          205  0.0  0.2   928  568  ??  Ss   20Jul01   0:37.76 syslogd -s
root          228  0.0  0.2  1048  628  ??  Is   20Jul01   0:00.75 inetd -wW
root          230  0.0  0.2   964  596  ??  Ss   20Jul01   0:26.17 /usr/sbin/cron
root          237  0.0  0.5  2148 1204  ??  Ss   20Jul01   9:15.74 /usr/sbin/sshd
root          239  0.0  0.2   896  520  ??  Is   20Jul01   0:10.51 /usr/sbin/usbd
root         1230  0.0  0.5  1924 1280  ??  Ss   20Jul01   2:35.82 /usr/local/sbin/httpd
root         1240  0.0  0.2   932  532  v0  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv0
root         1241  0.0  0.2   932  532  v1  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv1
root         1242  0.0  0.2   932  532  v2  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv2
root         1243  0.0  0.2   932  532  v3  Is+  20Jul01   0:00.01 /usr/libexec/getty Pc ttyv3
root         1244  0.0  0.2   932  532  v4  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv4
root         1245  0.0  0.2   932  532  v5  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv5
root         1246  0.0  0.2   932  532  v6  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv6
root         1247  0.0  0.2   932  532  v7  Is+  20Jul01   0:00.00 /usr/libexec/getty Pc ttyv7
nobody       1248  0.0  0.5  1988 1384  ??  I    20Jul01   0:15.65 /usr/local/sbin/httpd
nobody       1249  0.0  0.5  1988 1388  ??  I    20Jul01   0:16.02 /usr/local/sbin/httpd
nobody       1250  0.0  0.5  1988 1384  ??  I    20Jul01   0:15.67 /usr/local/sbin/httpd
nobody       1251  0.0  0.5  1988 1388  ??  I    20Jul01   0:15.51 /usr/local/sbin/httpd
nobody       1252  0.0  0.5  1988 1384  ??  I    20Jul01   0:15.74 /usr/local/sbin/httpd
nobody       1263  0.0  0.5  1988 1380  ??  I    20Jul01   0:15.87 /usr/local/sbin/httpd
nobody       1267  0.0  0.5  1988 1376  ??  I    20Jul01   0:15.90 /usr/local/sbin/httpd
nobody       1285  0.0  0.5  1992 1384  ??  I    20Jul01   0:15.44 /usr/local/sbin/httpd
nobody       1377  0.0  0.5  1988 1376  ??  I    20Jul01   0:15.34 /usr/local/sbin/httpd
nobody       1378  0.0  0.5  1988 1392  ??  I    20Jul01   0:15.31 /usr/local/sbin/httpd
dwhite      66457  0.0  0.6  2140 1560  ??  Is    8Aug01   0:12.68 named
root        35401  0.0  0.7  2432 1688  ??  Is   28Aug01   1:17.87 sendmail: accepting connections (sendmail)
evillarreal 46253  0.0  0.6  2140 1584  ??  Ss   10Sep01   0:04.64 named
evillarreal 47496  0.0  0.6  2140 1584  ??  Is   10Sep01   0:04.63 named
root        71197  0.0  0.9  2908 2376  ??  Ss   20Sep01   0:23.50 /usr/sbin/named
root         9001  0.0  0.6  2232 1564  ??  I    Fri10PM   0:00.04 sshd: cbohlin@ttyp1 (sshd)
cbohlin      9009  0.0  0.5  1708 1320  p1  Is+  Fri11PM   0:00.04 -tcsh (tcsh)
root        38599  0.0  0.6  2232 1568  ??  I     1:24PM   0:00.67 sshd: twashburn@ttyp2 (sshd)
twashburn   38600  0.0  0.5  1692 1332  p2  Is    1:24PM   0:00.05 -tcsh (tcsh)
root        38830  0.0  0.6  2232 1564  ??  I     1:53PM   0:00.19 sshd: xtran@ttyp3 (sshd)
xtran       38831  0.0  0.5  1708 1320  p3  Is+   1:53PM   0:00.08 -tcsh (tcsh)
root        38908  0.0  0.6  2232 1568  ??  I     2:00PM   0:00.14 sshd: ashenkar@ttyp4 (sshd)
ashenkar    38909  0.0  0.5  1704 1316  p4  Is    2:00PM   0:00.07 -tcsh (tcsh)
root        39050  0.0  0.6  2232 1564  ??  I     2:18PM   0:00.05 sshd: kbrown@ttyp5 (sshd)
kbrown      39072  0.0  0.5  1688 1316  p5  Is+   2:21PM   0:00.04 -tcsh (tcsh)
twashburn   39952  0.0  0.9  2544 2252  p2  I+    4:02PM   0:00.66 /usr/local/bin/mutt
root        40112  0.0  0.6  2232 1568  ??  I     4:17PM   0:00.16 sshd: dwalton
@ttyp0 (sshd)
dwalton     40113  0.0  0.5  1692 1304  p0  Is+   4:17PM   0:00.10 -tcsh (tcsh)
root        40616  0.0  0.6  2232 1568  ??  I     5:12PM   0:00.05 sshd: rojackson@ttyp6 (sshd)
rojackson   40620  0.0  0.5  1712 1332  p6  Is+   5:13PM   0:00.05 -tcsh (tcsh)
ashenkar    40643  0.0  0.5  2128 1244  p4  I+    5:15PM   0:00.01 telnet 204.233.235.1 25
awilson2    40657  0.0  0.0   176   36  ??  I     5:16PM   0:00.00 sleep 33
root        40658  0.4  0.6  2232 1568  ??  S     5:17PM   0:00.04 sshd: rnejdl@ttyp7 (sshd)
root            0  0.0  0.0     0    0  ??  DLs  20Jul01   0:03.35  (swapper)
rnejdl      40663  0.0  0.1   448  216  p7  R+    5:17PM   0:00.00 ps -auxw
[crimson]:[5:17pm]:[/home/rnejdl] >

w

The w command is useful to see who is currently logged into the system and what they are doing at this time.

[www]:[5:01pm]:[/var/log] > w
 5:04PM  up 73 days,  1:28, 4 users, load averages: 0.07, 0.07, 0.02
USER             TTY      FROM              LOGIN@  IDLE WHAT
rnejdl           p0       rnejdl            2:42PM     - vi unixtroubleshooting
rnejdl           p1       tethys.ringofsat Fri07PM 1day  tail -f error_log2
rnejdl           p2       tethys.ringofsat Fri07PM 15:55 -tcsh (tcsh)
rnejdl           p3       rnejdl            4:11PM     - w
[www]:[5:04pm]:[/var/log] >

Virtual Memory Statistics

Another cause of slowness is the system doing nothing but swapping since it has run out of memory. This may be caused by a gradual increase in workload or by having a number of runaway processes consuming all available memory. Here you want to issue a vmstat. One of the fields is swap i/o, you will probably need to do a man vmstat to find out what column(s) are swapping.

This command is useful to give an overview of your system's memory resources and paging information. With some options it gives the i/o rates to disks.

vmstat 5

This reports every 5 seconds

 procs   memory     page                    disks      faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0 s0 s1   in   sy  cs us sy id
 0 0 0  7708   692   79   2   0   0    0   0 0  0  1   111 516  16 13  3 84
 0 0 0  8288   684    2   0   1   0    0   0 0  0  0   106  26   9  0  0 100
 0 0 0  8652   684    1   0   0   0    0   0 0  1  1   107  24   9  0  0 100

System Messages


Check for system errors, error recovery can just halt a system. Look at the console messages and the system log. A full file system can also radically slow a system as it is having to recover from a full file system error after each write.

The system messages are found in the file /var/log/messages and to view this real time, you can use the tail -f command.

[crimson]:[5:20pm]:[/home/awilson2/bin] > tail -f /var/log/messages
Oct  1 16:56:37 crimson sshd[40453]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 16:58:37 crimson sshd[40468]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:00:37 crimson sshd[40483]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:02:37 crimson sshd[40504]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:04:37 crimson sshd[40531]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:06:37 crimson sshd[40546]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:08:37 crimson sshd[40563]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:10:37 crimson sshd[40580]: fatal: Timeout before authentication for 207.55.202.98.
Oct  1 17:12:37 crimson sshd[40603]: fatal: Timeout before authentication for 207.55.202.98.

Networking


Interface Config


The ifconfig command is useful for viewing the current state of the Ethernet interface that a server is using to connect with. Without any command arguments, it will show all of the interfaces that a machine is configured for, but here, we are only interested in the Ethernet interface, which is listed at the top as xl0.

[crimson]:[5:25pm]:[/home/awilson2/bin] > ifconfig
xl0: flags=8843 mtu 1500
        inet 209.39.6.5 netmask 0xffffff00 broadcast 209.39.6.255
        inet6 fe80::2c0:4fff:fe04:e77f%xl0 prefixlen 64 scopeid 0x1
        ether 00:c0:4f:04:e7:7f
        media: autoselect (100baseTX ) status: active
        supported media: autoselect 100baseTX  100baseTX 10baseT/UT
P  10baseT/UTP 100baseTX 
lp0: flags=8810 mtu 1500
faith0: flags=8000 mtu 1500
gif0: flags=8010 mtu 1280
gif1: flags=8010 mtu 1280
gif2: flags=8010 mtu 1280
gif3: flags=8010 mtu 1280
lo0: flags=8049 mtu 16384
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        inet6 ::1 prefixlen 128
        inet 127.0.0.1 netmask 0xff000000
ppp0: flags=8010 mtu 1500
sl0: flags=c010 mtu 552
[crimson]:[5:25pm]:[/home/awilson2/bin] >

One thing to watch our for is the media line to ensure that the network card is configured for 100 Megabits and Full-Duplex, which in this case it is. With Sun servers, the media type that the Ethernet card chooses often does not match that with the Catalyst switch at the other end of the connection. With these instances, hard coding the configurations for both ends tends to solve packletloss problems that come up.

Network Statistics


/usr/etc/netstat 5

This reports every 5 seconds.

   input    (we0)     output            input   (Total)    output
  packets  errs   packets  errs colls    packets  errs   packets  errs colls
46779720    25  8490605     1 100434  46781383    25  8492268     1 100434
      16     0        2     0     0        16     0        2     0     0
      14     0        1     0     0        14     0        1     0     0

Network errors can also bring your system to an effective crawl for users not logged on via directly attached terminals or the console. Sometimes you can find these by netstat 5. This gives you a count of packets, errors and collisions. The first line is a cumulative total, so you are interested in the following lines that are 5 second interval numbers. Lots of errors is very bad news, and you need to contact your networking support folks. A number of collisions equal to the number of packets sent is also a bad sign, usually it means that you have a conflict between transceiver and system in what they want for heartbeat or sqe test signals.