Troubleshooting UNIX
What to do when you have system problems?
The basic idea is to follow an organized procedure and not just randomly jump from one idea to another and quickly loose sight of what you have checked. A good basic technique is to divide the problem into areas and check each area until you are reasonably certain that the problem is not there and then after noting what you have checked, go on to another area.
File System
Tour of Unix File system
/ is the root of the unix file system. Everything grows from here.
/bin This contains a set of binaries used when the machine is first booted. Usually only a small subset of commands are available here.
/etc This contains system configuration files and executables. The password file and boot scripts are two examples of configuration files.
/dev This contains special files that are used to communicate with disks, tape, cdrom, and terminals.
/home This usually has the users home directories.
/list+found This contains lost files. Files that are recovered by fsck are placed in this directory.
/mnt A temporary mount point directory. It should be empty. Used for temporarily mounting directories.
/tmp Temporary directory available to all users as scratch space. Used by editors for temporary working copies of files. This is why you cannot send mail, edit or such activities when /tmp is full.
/var Spool files, accounting data and other varying data.
Display Free Disk Space
To see what the file system is looking like, including free and used space, then use the df -k command to see what the space is doing.
[www]:[5:01pm]:[/var/log] > df -k Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/da0s1a 496111 38559 417864 8% / /dev/da0s1f 23324075 10301228 11156921 48% /usr /dev/da0s1e 9924475 194425 8936092 2% /var procfs 4 4 0 100% /proc [www]:[5:01pm]:[/var/log] >One of the more common problems that UNIX will run into but is difficult to intially see is the file system filling up. There are several prominent reasons that this can happen:
- Little system maintenance has been done in the past and the file system is simply full, most likely from temp files.
- A hard drive didn't mount properly, either due to a configuration issue or more likely to a problem with the hard drive itself. See fsck below.
- Configuration issue where a file was supposed to go to a tape drive or remote file system, but instead was stored on the local system and filled up the hard drive.
I/O Statistics
/usr/etc/iostat 5
This reports every 5 seconds.
This gives the I/O activity for disks, by default only 4 disks. This can help determine if disk activity is causing your problems.
tin tout sps tps msps sps tps msps sps tps msps us ni sy id 0 15 2 0 2.0 2 0 23.7 16 1 13.7 4 9 3 0 84 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0 0100 0 0 0 0 0.0 19 1 26.5 13 1 12.5 0 0 0 0100Checking the Memory and CPU Environment
Top CPU use processes
top
This gives an updating display of the top 10 to 15 cpu processes and some cpu statistics. The default update interval is 5 seconds. This may not be available on some systems as if installed with privs it can kill jobs that do not belong to you.
load averages: 0.00, 0.00, 0.00 04:50:47 35 processes: 1 running, 34 sleeping Memory: Real: 3992K/10M Virt: 8308K/112M Free: 588K PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND 27519 dlq 30 0 212K 356K run 0:00 0.00% 0.00% top 27471 dlq 18 0 224K 300K sleep 0:01 0.00% 0.00% csh 27368 dlq 18 0 224K 304K sleep 0:00 0.00% 0.00% csh 25144 dlq 18 0 220K 304K sleep 0:00 0.00% 0.00% csh 27401 dlq 18 0 224K 300K sleep 0:00 0.00% 0.00% csh 94 root 18 -12 316K 216K sleep 63:15 0.00% 0.00% xntpd 27366 dlq 18 0 388K 160K sleep 0:00 0.00% 0.00% screen 13977 joe 18 0 232K 120K sleep 0:01 0.00% 0.00% 82 root 18 0 284K 108K sleep 5:11 0.00% 0.00% cron 80 root 18 0 72K 60K sleep 29:41 0.00% 0.00% update 1 root 10 0 136K 124K sleep 0:25 0.00% 0.00% init 125 root 10 0 416K 64K sleep 120:58 0.00% 0.00% httpd 98 root 10 0 72K 28K sleep 0:00 0.00% 0.00% 99 root 10 0 72K 28K sleep 0:00 0.00% 0.00% 14239 sally 3 0 224K 224K sleep 0:00 0.00% 0.00%uptime
The uptime utility displays the current time, the length of time the system has been up, the number of users, and the load average of the system over the last 1, 5, and 15 minutes.
[www]:[5:14pm]:[/home/rnejdl] > uptime 5:14PM up 73 days, 1:39, 4 users, load averages: 0.01, 0.05, 0.03 [www]:[5:14pm]:[/home/rnejdl] >In general, these numbers should be under 1, but a system may in general run high, so it is a good idea to have some historical information for these values before assuming that something is wrong here.
ps
As the load average is high, you need to determine the cause. Use ps -auxw to find out what processes are running on your system. Does the list of processes look normal (you really need to do this before you have problems so you can know what "normal" is)? If not look at the abnormal cases, maybe 20 copies of sendmail running. This is likely not something of your making, more likely it is a network slowness so the messages are moving v e r y s l o w l y. But you would want to check to see if that is the case.
[crimson]:[5:17pm]:[/home/rnejdl] > ps -auxw USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND rnejdl 40659 1.0 0.5 1700 1340 p7 Ss 5:17PM 0:00.05 -tcsh (tcsh) root 1 0.0 0.1 528 208 ?? ILs 20Jul01 0:04.53 /sbin/init -- root 2 0.0 0.0 0 0 ?? DL 20Jul01 0:08.92 (pagedaemon) root 3 0.0 0.0 0 0 ?? DL 20Jul01 0:00.00 (vmdaemon) root 4 0.0 0.0 0 0 ?? DL 20Jul01 0:37.37 (bufdaemon) root 5 0.0 0.0 0 0 ?? DL 20Jul01 17:15.40 (syncer) root 205 0.0 0.2 928 568 ?? Ss 20Jul01 0:37.76 syslogd -s root 228 0.0 0.2 1048 628 ?? Is 20Jul01 0:00.75 inetd -wW root 230 0.0 0.2 964 596 ?? Ss 20Jul01 0:26.17 /usr/sbin/cron root 237 0.0 0.5 2148 1204 ?? Ss 20Jul01 9:15.74 /usr/sbin/sshd root 239 0.0 0.2 896 520 ?? Is 20Jul01 0:10.51 /usr/sbin/usbd root 1230 0.0 0.5 1924 1280 ?? Ss 20Jul01 2:35.82 /usr/local/sbin/httpd root 1240 0.0 0.2 932 532 v0 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv0 root 1241 0.0 0.2 932 532 v1 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv1 root 1242 0.0 0.2 932 532 v2 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv2 root 1243 0.0 0.2 932 532 v3 Is+ 20Jul01 0:00.01 /usr/libexec/getty Pc ttyv3 root 1244 0.0 0.2 932 532 v4 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv4 root 1245 0.0 0.2 932 532 v5 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv5 root 1246 0.0 0.2 932 532 v6 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv6 root 1247 0.0 0.2 932 532 v7 Is+ 20Jul01 0:00.00 /usr/libexec/getty Pc ttyv7 nobody 1248 0.0 0.5 1988 1384 ?? I 20Jul01 0:15.65 /usr/local/sbin/httpd nobody 1249 0.0 0.5 1988 1388 ?? I 20Jul01 0:16.02 /usr/local/sbin/httpd nobody 1250 0.0 0.5 1988 1384 ?? I 20Jul01 0:15.67 /usr/local/sbin/httpd nobody 1251 0.0 0.5 1988 1388 ?? I 20Jul01 0:15.51 /usr/local/sbin/httpd nobody 1252 0.0 0.5 1988 1384 ?? I 20Jul01 0:15.74 /usr/local/sbin/httpd nobody 1263 0.0 0.5 1988 1380 ?? I 20Jul01 0:15.87 /usr/local/sbin/httpd nobody 1267 0.0 0.5 1988 1376 ?? I 20Jul01 0:15.90 /usr/local/sbin/httpd nobody 1285 0.0 0.5 1992 1384 ?? I 20Jul01 0:15.44 /usr/local/sbin/httpd nobody 1377 0.0 0.5 1988 1376 ?? I 20Jul01 0:15.34 /usr/local/sbin/httpd nobody 1378 0.0 0.5 1988 1392 ?? I 20Jul01 0:15.31 /usr/local/sbin/httpd dwhite 66457 0.0 0.6 2140 1560 ?? Is 8Aug01 0:12.68 named root 35401 0.0 0.7 2432 1688 ?? Is 28Aug01 1:17.87 sendmail: accepting connections (sendmail) evillarreal 46253 0.0 0.6 2140 1584 ?? Ss 10Sep01 0:04.64 named evillarreal 47496 0.0 0.6 2140 1584 ?? Is 10Sep01 0:04.63 named root 71197 0.0 0.9 2908 2376 ?? Ss 20Sep01 0:23.50 /usr/sbin/named root 9001 0.0 0.6 2232 1564 ?? I Fri10PM 0:00.04 sshd: cbohlin@ttyp1 (sshd) cbohlin 9009 0.0 0.5 1708 1320 p1 Is+ Fri11PM 0:00.04 -tcsh (tcsh) root 38599 0.0 0.6 2232 1568 ?? I 1:24PM 0:00.67 sshd: twashburn@ttyp2 (sshd) twashburn 38600 0.0 0.5 1692 1332 p2 Is 1:24PM 0:00.05 -tcsh (tcsh) root 38830 0.0 0.6 2232 1564 ?? I 1:53PM 0:00.19 sshd: xtran@ttyp3 (sshd) xtran 38831 0.0 0.5 1708 1320 p3 Is+ 1:53PM 0:00.08 -tcsh (tcsh) root 38908 0.0 0.6 2232 1568 ?? I 2:00PM 0:00.14 sshd: ashenkar@ttyp4 (sshd) ashenkar 38909 0.0 0.5 1704 1316 p4 Is 2:00PM 0:00.07 -tcsh (tcsh) root 39050 0.0 0.6 2232 1564 ?? I 2:18PM 0:00.05 sshd: kbrown@ttyp5 (sshd) kbrown 39072 0.0 0.5 1688 1316 p5 Is+ 2:21PM 0:00.04 -tcsh (tcsh) twashburn 39952 0.0 0.9 2544 2252 p2 I+ 4:02PM 0:00.66 /usr/local/bin/mutt root 40112 0.0 0.6 2232 1568 ?? I 4:17PM 0:00.16 sshd: dwalton @ttyp0 (sshd) dwalton 40113 0.0 0.5 1692 1304 p0 Is+ 4:17PM 0:00.10 -tcsh (tcsh) root 40616 0.0 0.6 2232 1568 ?? I 5:12PM 0:00.05 sshd: rojackson@ttyp6 (sshd) rojackson 40620 0.0 0.5 1712 1332 p6 Is+ 5:13PM 0:00.05 -tcsh (tcsh) ashenkar 40643 0.0 0.5 2128 1244 p4 I+ 5:15PM 0:00.01 telnet 204.233.235.1 25 awilson2 40657 0.0 0.0 176 36 ?? I 5:16PM 0:00.00 sleep 33 root 40658 0.4 0.6 2232 1568 ?? S 5:17PM 0:00.04 sshd: rnejdl@ttyp7 (sshd) root 0 0.0 0.0 0 0 ?? DLs 20Jul01 0:03.35 (swapper) rnejdl 40663 0.0 0.1 448 216 p7 R+ 5:17PM 0:00.00 ps -auxw [crimson]:[5:17pm]:[/home/rnejdl] >w
The w command is useful to see who is currently logged into the system and what they are doing at this time.
[www]:[5:01pm]:[/var/log] > w 5:04PM up 73 days, 1:28, 4 users, load averages: 0.07, 0.07, 0.02 USER TTY FROM LOGIN@ IDLE WHAT rnejdl p0 rnejdl 2:42PM - vi unixtroubleshooting rnejdl p1 tethys.ringofsat Fri07PM 1day tail -f error_log2 rnejdl p2 tethys.ringofsat Fri07PM 15:55 -tcsh (tcsh) rnejdl p3 rnejdl 4:11PM - w [www]:[5:04pm]:[/var/log] >Virtual Memory Statistics
Another cause of slowness is the system doing nothing but swapping since it has run out of memory. This may be caused by a gradual increase in workload or by having a number of runaway processes consuming all available memory. Here you want to issue a vmstat. One of the fields is swap i/o, you will probably need to do a man vmstat to find out what column(s) are swapping.
This command is useful to give an overview of your system's memory resources and paging information. With some options it gives the i/o rates to disks.
vmstat 5
This reports every 5 seconds
procs memory page disks faults cpu r b w avm fre flt re pi po fr sr w0 s0 s1 in sy cs us sy id 0 0 0 7708 692 79 2 0 0 0 0 0 0 1 111 516 16 13 3 84 0 0 0 8288 684 2 0 1 0 0 0 0 0 0 106 26 9 0 0 100 0 0 0 8652 684 1 0 0 0 0 0 0 1 1 107 24 9 0 0 100System Messages
Check for system errors, error recovery can just halt a system. Look at the console messages and the system log. A full file system can also radically slow a system as it is having to recover from a full file system error after each write.
The system messages are found in the file /var/log/messages and to view this real time, you can use the tail -f command.
[crimson]:[5:20pm]:[/home/awilson2/bin] > tail -f /var/log/messages Oct 1 16:56:37 crimson sshd[40453]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 16:58:37 crimson sshd[40468]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:00:37 crimson sshd[40483]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:02:37 crimson sshd[40504]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:04:37 crimson sshd[40531]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:06:37 crimson sshd[40546]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:08:37 crimson sshd[40563]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:10:37 crimson sshd[40580]: fatal: Timeout before authentication for 207.55.202.98. Oct 1 17:12:37 crimson sshd[40603]: fatal: Timeout before authentication for 207.55.202.98.Networking
Interface Config
The ifconfig command is useful for viewing the current state of the Ethernet interface that a server is using to connect with. Without any command arguments, it will show all of the interfaces that a machine is configured for, but here, we are only interested in the Ethernet interface, which is listed at the top as xl0.
[crimson]:[5:25pm]:[/home/awilson2/bin] > ifconfig xl0: flags=8843mtu 1500 inet 209.39.6.5 netmask 0xffffff00 broadcast 209.39.6.255 inet6 fe80::2c0:4fff:fe04:e77f%xl0 prefixlen 64 scopeid 0x1 ether 00:c0:4f:04:e7:7f media: autoselect (100baseTX ) status: active supported media: autoselect 100baseTX 100baseTX 10baseT/UT P 10baseT/UTP 100baseTX lp0: flags=8810 mtu 1500 faith0: flags=8000 mtu 1500 gif0: flags=8010 mtu 1280 gif1: flags=8010 mtu 1280 gif2: flags=8010 mtu 1280 gif3: flags=8010 mtu 1280 lo0: flags=8049 mtu 16384 inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8 inet6 ::1 prefixlen 128 inet 127.0.0.1 netmask 0xff000000 ppp0: flags=8010 mtu 1500 sl0: flags=c010 mtu 552 [crimson]:[5:25pm]:[/home/awilson2/bin] > One thing to watch our for is the media line to ensure that the network card is configured for 100 Megabits and Full-Duplex, which in this case it is. With Sun servers, the media type that the Ethernet card chooses often does not match that with the Catalyst switch at the other end of the connection. With these instances, hard coding the configurations for both ends tends to solve packletloss problems that come up.
Network Statistics
/usr/etc/netstat 5
This reports every 5 seconds.
input (we0) output input (Total) output packets errs packets errs colls packets errs packets errs colls 46779720 25 8490605 1 100434 46781383 25 8492268 1 100434 16 0 2 0 0 16 0 2 0 0 14 0 1 0 0 14 0 1 0 0Network errors can also bring your system to an effective crawl for users not logged on via directly attached terminals or the console. Sometimes you can find these by netstat 5. This gives you a count of packets, errors and collisions. The first line is a cumulative total, so you are interested in the following lines that are 5 second interval numbers. Lots of errors is very bad news, and you need to contact your networking support folks. A number of collisions equal to the number of packets sent is also a bad sign, usually it means that you have a conflict between transceiver and system in what they want for heartbeat or sqe test signals.