Network and System Monitoring Primers
So the question is: "What do you need to monitor?" The answer is easy: "Everything". That's a pretty big amount. Let reality kick in and rephrase it to: "Everything you can think of". This is too vague and misses the final goal: "Everything you assume to be the normal conditions for a system or service to run properly".
That is quite a lot, and it is different for each service you provide or device you have on the network. You will have to dig into your network and servers to find out what is going on.
This also means that you need to find out how your network and services behave. In the beginning you will get a number of alerts which are false positives, and you will need to tune your alert settings. Or you will get the same alerts every day at the same time: they are normal system behaviour. You can choose between getting these messages every day and changing the alert settings so that you won't get them anymore. History shows that the latter isn't the smartest option, because it will hide possible issues from you.
Many of the items described in this primer may seem nitpicky, and they might or might not matter to you at this moment. Keep in mind that you should monitor for normal operations! Everything which happens in your network and systems which isn't normal is worth investigating!
This primer only describes active monitoring and realtime monitoring. Passive monitoring (via SNMP traps, syslog messages or monitoring agents) and historical monitoring (history graphs with applications like Cacti) are not described.
This primer tries to be generic, but is based on my experience with Nagios. At the end there will be a link to the scripts I use for non-standard Nagios features.
Nagios is described as "a host, service and network monitoring program". If you are a beginner, you will find its configuration files horrible. But once you get through that, it is easy to expand. Nagios does all its checking by executing scripts in its libexec/ directory. It is not limited to checking network services from the outside: to run checks locally on other hosts there is a program called NRPE, which stands for Nagios Remote Plugin Executor. This program runs as a daemon on the remote hosts and runs the same Nagios scripts, installed locally on those hosts. (FreeBSD: net-mgmt/nagios and net-mgmt/nrpe2)
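As a rough illustration of what such a script looks like: by the Nagios plugin convention a check prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is in Python. The function names, the "VALUE" label and the command line are made up for this example and are not part of Nagios.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value onto a plugin state (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, one_line_message) in the usual plugin style."""
    state = classify(value, warn, crit)
    return state, f"{name} {LABELS[state]} - value is {value}"


def main(argv):
    """Hypothetical command line: check_value.py <value> <warn> <crit>."""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Missing or non-numeric arguments: the check itself could not run.
        print("VALUE UNKNOWN - usage: check_value.py <value> <warn> <crit>")
        return UNKNOWN
    code, message = report("VALUE", value, warn, crit)
    print(message)
    return code
```

A real plugin would finish by passing the return value of main(sys.argv) to sys.exit(), so that Nagios sees the state as the exit code.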
There are several components which need to be monitored on a system: Hardware (disks, CPU), the Operating System and the Services.
- These days you can get a lot of information about components
of your motherboards: CPU temperature, internal temperature,
fan speeds and power voltages. Higher temperatures are bad for
your motherboard. Fan speeds which are suddenly much higher,
or lower, indicate that one of them is broken and might cause
higher temperatures. And power voltage changes indicate problems
with your power supply.
On Linux, this information can be gathered from /proc/acpi. On FreeBSD this can be gathered via sysutils/healthd.
- IDE harddisk characteristics can be monitored via the SMART interface
(Self-Monitoring, Analysis and Reporting Technology), for example
the temperature of the disks and a handful of counters:
Reallocated Sector Count, Seek Error Rate, Spin Retry Count,
Calibration Retry Count, Reallocated Event Count, Current Pending
Sector and UDMA CRC Error Count. If these counters go up, there
might be a problem with your harddisk.
On Linux and FreeBSD this data can be gathered via the smartmontools software (FreeBSD: sysutils/smartmontools).
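A check for rising SMART counters could look roughly like the sketch below. It assumes the attribute table printed by smartctl -A (numeric ID in the first column, attribute name in the second, raw value in the tenth) and the attribute names smartmontools usually prints for the counters listed above. The sample table in the usage note is made up.

```python
# Attribute names as smartctl usually prints them (an assumption; adjust
# the set to whatever your disks actually report).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Spin_Retry_Count",
    "Calibration_Retry_Count",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from smartctl -A style output."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip it
    return values


def increased_counters(previous, current, watched=WATCHED):
    """Return the watched counters whose raw value went up since last time."""
    return sorted(
        name for name in watched
        if current.get(name, 0) > previous.get(name, 0)
    )
```

Usage, with a made-up table: if the previous run stored {"Reallocated_Sector_Ct": 1} and the current output contains a row ending in a raw value of 3 for that attribute, increased_counters() returns ["Reallocated_Sector_Ct"]. Comparing against the previous value instead of against zero is a design choice: a disk can live for years with a few reallocated sectors, it is the increase that matters.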
- RAID hardware is beautiful: a broken harddisk won't wake you
up in the middle of the night anymore (but two broken harddisks
will, so it had better be monitored). There are various ways to check
them, and every vendor seems to have its own software. The
following works on FreeBSD:
- camcontrol for HP/Compaq RAID cards:
    [~] root@freebsd>camcontrol inquiry da3
    pass3: <COMPAQ RAID 1 VOLUME OK> Fixed Direct Access SCSI-0 device
- TW_CLI for 3WARE RAID cards: (sysutils/tw_cli)
    [~] root@freebsd>/usr/local/bin/tw_cli info c0 unitstatus
    # of units: 1
    Unit 0: RAID 5 1.63 TB ( 3516478848 blocks): OK
- aaccli for Adaptec AAC Controllers (sysutils/aaccli)
    [~] root@freebsd>aaccli 'open aac0 : disk list : container list'
    --------------------------------------------------------------------------------
    Adaptec SCSI RAID Controller Command Line Interface
    Copyright 1998-2002 Adaptec, Inc. All rights reserved
    --------------------------------------------------------------------------------
    Executing: open "aac0"
    Executing: disk list
    C:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
    ------  --------------  --------- ----------- ---------------- ------ ----
    0:00:0  Disk            390721968 512         Initialized      NO     100
    0:03:0  Disk            390721968 512         Initialized      NO     100
    Executing: container list
    Num          Total  Oth Stripe          Scsi   Partition
    Label Type   Size   Ctr Size   Usage   C:ID:L Offset:Size
    ----- ------ ------ --- ------ ------- ------ -------------
     0    RAID-5  745GB       64KB Open    0:00:0 64.0KB: 186GB
     /dev/aacd0             raid5          0:03:0 64.0KB: 186GB
- Network interfaces monitoring consists of two items: The first one
is the number of packet errors, which can be gathered from the
output of netstat -ni:
    [~] root@linux>netstat -ni
    Kernel Interface table
    Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
    eth1   1500   0 641851668      0      0      0 646287092      0      0      0 BMRU
    eth2   1500   0 711410096      0      0      0 701868617      0      0      0 BMRU
    lo    16436   0   6611086      0      0      0   6611086      0      0      0 LRU
    [~] root@freebsd>netstat -ni sk0
    Name    Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
    sk0    1500 <Link#1>      00:0f:ea:2c:d5:18  3706970     0  3316491     0     0
    sk0    1500 fe80:1::20f:e fe80:1::20f:eaff:        0     -        2     -     -
    sk0    1500 10.251.1.16/2 10.251.1.18        2923134     -  2536594     -     -
The second one is the media status: how the device is talking to the switch. On Linux this can be found in the output of mii-tool, on FreeBSD on the media line in the output of ifconfig:
    [~] root@linux>mii-tool eth1
    eth1: negotiated 100baseTx-FD flow-control, link ok
    [~] root@freebsd>ifconfig sk0
    sk0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
            options=8<VLAN_MTU>
            inet6 fe80::20f:eaff:fe2c:d518%sk0 prefixlen 64 scopeid 0x1
            inet 10.251.1.18 netmask 0xfffffff0 broadcast 10.251.1.31
            ether 00:0f:ea:2c:d5:18
            media: Ethernet autoselect (100baseTX <full-duplex,flag0,flag1>)
            status: active
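The media check could be automated along these lines. This is a minimal sketch which assumes the FreeBSD ifconfig layout shown above, where the negotiated media is printed between parentheses on the media line. The function names and the expected values are made up for the example.

```python
import re

# Matches e.g. "media: Ethernet autoselect (100baseTX <full-duplex,flag0,flag1>)"
MEDIA_RE = re.compile(r"^\s*media:\s+(?P<type>.+?)\s+\((?P<active>[^)]*)\)\s*$")


def active_media(ifconfig_output):
    """Return the negotiated media text between the parentheses, or None."""
    for line in ifconfig_output.splitlines():
        match = MEDIA_RE.match(line)
        if match:
            return match.group("active")
    return None


def media_ok(ifconfig_output, expected_speed, expect_full_duplex=True):
    """True if the interface negotiated the expected speed (and duplex)."""
    active = active_media(ifconfig_output)
    if not active or not active.split():
        return False  # no media line, or nothing negotiated
    speed = active.split()[0]
    duplex_ok = "full-duplex" in active or not expect_full_duplex
    return speed == expected_speed and duplex_ok
```

With the ifconfig output for sk0 above, media_ok(output, "100baseTX") is true and media_ok(output, "1000baseT") is false. An interface with a fixed media setting prints no parentheses and would be reported as not OK by this sketch.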
- If you have a UPS, see if you can get the status of it. Information
from APC UPSs can be gathered via apcupsd (FreeBSD: sysutils/apcupsd):
    STATUS : ONLINE
Do not check only for the presence of ONLINE; check that ONLINE is the only string, because
    STATUS : ONLINE REPLACEBATT
can be valid too!
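The difference between "contains ONLINE" and "is exactly ONLINE" is easy to get wrong, so here is a small sketch of the stricter test. It assumes the status report is available as text with a "STATUS : ..." line as in the two examples above. The function names are made up.

```python
def ups_status_flags(status_output):
    """Return the words after 'STATUS :' as a list, or None if absent."""
    for line in status_output.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "STATUS":
            return value.split()
    return None


def ups_is_healthy(status_output):
    """True only when ONLINE is the one and only status flag."""
    return ups_status_flags(status_output) == ["ONLINE"]
```

ups_is_healthy("STATUS : ONLINE") is true, while ups_is_healthy("STATUS : ONLINE REPLACEBATT") is false, which is the behaviour asked for above. A missing STATUS line is treated as unhealthy too.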
The Operating System
- Disk space information, or partition information, can be taken from the
output of df, which gives you the free disk
space. Another important piece of information is the number of
inodes you have available (df -i), because if you don't
have any inodes free, you can't create any more files.
    [~] root@freebsd>df -i /
    Filesystem  1K-blocks  Used Avail Capacity iused ifree %iused Mounted on
    /dev/da0s1a    128990 86072 32600    73%    2952 13302   18%  /
    [~] root@linux>df -i
    Filesystem                        Inodes  IUsed    IFree IUse% Mounted on
    /dev/mapper/VolGroup00-LogVol00 35291136 559063 34732073    2% /
    /dev/cciss/c0d0p1                  26104     36    26068    1% /boot
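Instead of parsing df, a script can ask the kernel directly. The sketch below uses Python's os.statvfs() to get both numbers in one call. The function names are made up, and the percentages are computed from all free blocks, so they can differ slightly from the Capacity column of df, which leaves out the blocks reserved for root.

```python
import os


def percent_used(total, free):
    """Percentage of a resource in use; 0.0 when the total is not reported."""
    if total <= 0:
        return 0.0  # some file systems do not report an inode total
    return 100.0 * (total - free) / total


def filesystem_usage(path):
    """Return (percent of blocks used, percent of inodes used) for a path."""
    stats = os.statvfs(path)
    return (
        percent_used(stats.f_blocks, stats.f_bfree),
        percent_used(stats.f_files, stats.f_ffree),
    )
```

A check would call filesystem_usage("/") for each mounted file system and alert when either value crosses its threshold.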
- A freshly installed system should have very few services running,
maybe only crond, inetd, ntpd, sshd and syslogd. They all create
their own PID files, so it is easy to get the process IDs:
    [~] root@freebsd>ps wup `head /var/run/ntpd.pid `
    USER  PID %CPU %MEM  VSZ  RSS TTY STAT START TIME COMMAND
    ntp  3148  0.0  0.1 4044 4044 ?   SLs  Apr14 0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
Of course you need to check first if the PID file exists.
If the service is a network-based service, then besides checking that the PID file and the process exist, you should also check that the service works to some extent: for sshd you should get the SSH banner, for ntpd you can check if the service is synced.
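The PID file check described above could be sketched like this. It first makes sure the PID file exists and contains a number, and then uses signal 0 to ask the kernel whether that process exists. The function names are made up for the example.

```python
import os


def pid_from_file(path):
    """Return the PID stored in a PID file, or None if it is missing or bad."""
    try:
        with open(path) as handle:
            return int(handle.readline().strip())
    except (OSError, ValueError):
        return None


def process_alive(pid):
    """True if a process with this PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only tests for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def service_running(pid_file):
    """True if the PID file exists and points at a live process."""
    pid = pid_from_file(pid_file)
    return pid is not None and pid > 0 and process_alive(pid)
```

For example service_running("/var/run/ntpd.pid"). Note that this only proves that some process has that PID. A stale PID file whose number has been reused by another program will still pass, which is one more reason to also test the service itself.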
- If the server is supposed to transport emails (both mail servers
and application servers), then check if the mail-queue is more
or less empty:
    [~] root@postfix>mailq
    -Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
    A634A5CD036    66814 Mon Apr 16 12:38:57  email@example.com
    [..]
    -- 343 Kbytes in 3 Requests.
    [~] root@postfix>mailq
    Mail queue is empty
    [~] root@sendmail>mailq
    /var/spool/mqueue is empty
                    Total requests: 0
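Getting a number out of the mailq output could be done as follows. This is a sketch based only on the three output formats shown above (Postfix with messages, Postfix empty, Sendmail). Other mail servers or versions may print something different, in which case the function returns None.

```python
import re


def queued_messages(mailq_output):
    """Return the number of queued messages, or None if unrecognised."""
    # Postfix summary line: "-- 343 Kbytes in 3 Requests."
    match = re.search(r"in (\d+) Requests?\.", mailq_output)
    if match:
        return int(match.group(1))
    # Sendmail summary line: "Total requests: 0"
    match = re.search(r"Total requests:\s*(\d+)", mailq_output)
    if match:
        return int(match.group(1))
    # Postfix with nothing queued: "Mail queue is empty"
    if "is empty" in mailq_output:
        return 0
    return None
```

The check then compares the result with a warning and a critical threshold. Treating None as a problem of its own is a sensible default, because it means the check no longer understands what the mail server is telling it.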
- On machines which act as a router, and especially ones which do
dynamic routing, it is important to make sure that the default
gateway is pointing to the expected interface:
    [~] root@freebsd>netstat -rn -f inet | grep ^default
    default            22.214.171.124     UG1         1 885506855   fxp2
    [~] root@linux>netstat -rn | grep ^0.0.0.0
    0.0.0.0         10.252.13.9     0.0.0.0         UG        0 0          0 eth0
If you use dynamic routing in your network, one other thing you need to do on one or more machines is to check if you have all your networks in the routing table. Missing one means that you can't reach that network from that machine!
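A default route check could be sketched as below. It assumes the two netstat -rn layouts shown above, where the gateway is the second field and the interface is the last field of the line. On systems which print extra columns after the interface this assumption does not hold. The function names are made up.

```python
def default_route(netstat_rn_output):
    """Return (gateway, interface) of the first default route, or None."""
    for line in netstat_rn_output.splitlines():
        fields = line.split()
        # FreeBSD prints "default", Linux prints "0.0.0.0" as destination.
        if len(fields) >= 3 and fields[0] in ("default", "0.0.0.0"):
            return fields[1], fields[-1]
    return None


def default_route_ok(netstat_rn_output, expected_gateway, expected_interface):
    """True if the default route uses the expected gateway and interface."""
    route = default_route(netstat_rn_output)
    return route == (expected_gateway, expected_interface)
```

With the Linux line above, default_route() returns ("10.252.13.9", "eth0"). The same approach, with a list of expected destinations instead of one, covers the check for missing networks in the routing table.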
- Number of users, total processes and swap: Easy to measure, and
it might be an indication that there is something wrong. For
machines which are servers, the number of users logged in
shouldn't be too high: Unless work is done on them, nobody
should be logged in.
For swap, preferably it is not in use at all.
    [~] root@freebsd>uptime
    4:13PM up 88 days, 4 mins, 4 users, load averages: 0.24, 0.39, 0.32
    [~] root@freebsd>ps auxw | wc -l
    205
    [~] root@freebsd>swapinfo
    Device          1K-blocks     Used    Avail Capacity
    /dev/ad10s1b      4168496      480  4168016     0%
    [~] root@linux>uptime
    16:12:49 up 4 days, 3:52, 2 users, load average: 0.11, 0.11, 0.06
    [~] root@linux>ps auxw | wc -l
    90
    [~] root@linux>cat /proc/swaps
    Filename                         Type       Size    Used Priority
    /dev/mapper/VolGroup00-LogVol01  partition  2031608 1576 -1
- Uptime. Don't rely on the host not answering pings to
determine if the machine has been rebooted. With today's hardware
and background file system checks the machine is back before the
ping-timeout threshold has been reached. If the uptime has been
reset, then something has happened!
Note that if you monitor this via SNMP, the system.sysUpTime OID returns the time since the snmpd became active (in hundredths of a second), not the time since the machine was booted. Restarting the snmpd will reset this counter!
SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
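Detecting a reset comes down to remembering the previous value and comparing. The sketch below assumes the snmpget/snmpwalk output format of the line above. Timeticks are hundredths of a second, so the value 760344619 is about 7603446 seconds, which matches the 88 days shown. The function names are made up.

```python
import re


def timeticks(snmp_line):
    """Return the raw Timeticks value from an snmpget output line, or None."""
    match = re.search(r"Timeticks:\s*\((\d+)\)", snmp_line)
    return int(match.group(1)) if match else None


def uptime_seconds(snmp_line):
    """Return the uptime in seconds (Timeticks are 1/100 s), or None."""
    ticks = timeticks(snmp_line)
    return None if ticks is None else ticks / 100.0


def has_restarted(previous_ticks, current_ticks):
    """True if the counter went backwards since the previous check."""
    return current_ticks < previous_ticks
```

One caveat: the counter is 32 bits wide and wraps around after roughly 497 days, so a device with a very long uptime will look like it restarted once. Storing the previous value between runs (in a file, for example) is left out of the sketch.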
- FreeBSD Jails are great for setting up small isolated environments,
for example webservers and SMTP servers. The server which hosts
all jails should check for them, and warn if one is missing or
if an unknown one has popped up:
    [~] root@freebsd>jls
       JID  IP Address      Hostname              Path
        11  126.96.36.199   ns0.mavetju.org       /usr/jails/ns0.mavetju.org
        10  188.8.131.52    dhcp.mavetju.org      /usr/jails/dhcp.mavetju.org
         9  184.108.40.206  proxy2.mavetju.org    /usr/jails/proxy2.mavetju.org
         8  220.127.116.11  mail4.mavetju.org     /usr/jails/mail4.mavetju.org
         6  18.104.22.168   tftp.mavetju.org      /usr/jails/tftp.mavetju.org
         5  22.214.171.124  syslog.mavetju.org    /usr/jails/syslog.mavetju.org
         3  126.96.36.199   cvs.mavetju.org       /usr/jails/cvs.mavetju.org
         2  188.8.131.52    mailman.mavetju.org   /usr/jails/mailman.mavetju.org
         1  184.108.40.206  jabber.mavetju.org    /usr/jails/jabber.mavetju.org
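Checking for missing and unknown jails is a comparison of two sets. A minimal sketch, assuming the jls layout shown above with the hostname in the third column and a list of expected hostnames kept by the administrator (the function names are made up):

```python
def jail_hostnames(jls_output):
    """Return the set of hostnames of the jails listed by jls."""
    names = set()
    for line in jls_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric jail ID: JID, IP, hostname, path.
        if len(fields) >= 4 and fields[0].isdigit():
            names.add(fields[2])
    return names


def compare_jails(expected, jls_output):
    """Return (missing, unknown) hostnames, each as a sorted list."""
    running = jail_hostnames(jls_output)
    wanted = set(expected)
    return sorted(wanted - running), sorted(running - wanted)
```

Both lists should be empty. A non-empty first list means a jail has died or was never started, a non-empty second list means somebody started a jail the monitoring doesn't know about.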
The Services
In theory, checking services could be very easy:
- If the service makes a PID file, check if the process is running.
    [~] root@freebsd>ps wup `cat /var/run/named.pid `
    USER PID %CPU %MEM   VSZ   RSS TT STAT STARTED     TIME COMMAND
    root 407  0.0  1.3 46472 42104 ?? Ss   Thu02PM 13:44.89 /usr/sbin/named -c /etc/namedb/named.conf -u root
- If the service doesn't make a PID file, use pgrep to
see if it is running:
    [~] root@freebsd>pgrep -lf named
    407 /usr/sbin/named -c /etc/namedb/named.conf -u root
- If the service is listening on the network, check if you can set up a TCP session towards it.
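The last item, setting up a TCP session, needs nothing more than a socket. A minimal sketch using Python's standard library (the function name and the default timeout are made up):

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """True if a TCP connection to host:port can be established in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For example tcp_port_open("ns0.mavetju.org", 53). A successful connection only proves that something is listening on the port. It does not prove that the service behind it answers sensibly.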
Some extra checks can be made for the following services:
- DNS
See if you can get an answer back from a request for version.bind or version.server. That will show you if the server is actually answering requests.
    [~] root@freebsd>dig @ns0.mavetju.org version.server chaos txt

    ; <<>> DiG 9.3.2 <<>> @ns0.mavetju.org version.server chaos txt
    ; (1 server found)
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39096
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;version.server.            CH      TXT

    ;; ANSWER SECTION:
    version.server.     0       CH      TXT     "Nominum ANS 220.127.116.11"

    ;; Query time: 161 msec
    ;; SERVER: 18.104.22.168#53(22.214.171.124)
    ;; WHEN: Wed Apr 18 16:59:37 2007
    ;; MSG SIZE rcvd: 64
- POP3 / IMAP
Both POP3 and IMAP services return a greeting when you connect to them:
    [~] root@freebsd>telnet pop.mavetju.org pop3
    Trying 126.96.36.199...
    Connected to mail4.mavetju.org.
    Escape character is '^]'.
    +OK DBMAIL pop3 server ready to rock <firstname.lastname@example.org<
    [~] root@freebsd>telnet imap.mavetju.org imap
    Trying 188.8.131.52...
    Connected to mail4.mavetju.org.
    Escape character is '^]'.
    * OK dbmail imap (protocol version 4r1) server 2.0.10 ready to run
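What telnet shows by hand can be scripted: connect, read the first line and see if it starts with the positive greeting of the protocol, "+OK" for POP3 and "* OK" for IMAP. A minimal sketch (the function names and the timeout are made up):

```python
import socket

# Prefix of a positive server greeting per protocol.
GREETING_PREFIX = {"pop3": "+OK", "imap": "* OK"}


def read_greeting(host, port, timeout=10.0):
    """Connect and return the first line the server sends."""
    with socket.create_connection((host, port), timeout=timeout) as connection:
        with connection.makefile("r", encoding="ascii", errors="replace") as stream:
            return stream.readline().rstrip("\r\n")


def greeting_ok(protocol, first_line):
    """True if the first line is a positive greeting for this protocol."""
    return first_line.startswith(GREETING_PREFIX[protocol])
```

For example greeting_ok("pop3", read_greeting("pop.mavetju.org", 110)). Errors such as a refused connection or a timeout are raised by read_greeting() as OSError and should be reported as critical by the caller.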
- SMTP / spam checker / greylisting / virus scanner
SMTP servers (check if the SMTP server is running) listen for network connections on port 25 (smtp) and port 587 (submission). Incoming SMTP traffic might be greylisted (check if the greylist daemon is running). The email received goes through a virus scanner (if you are using a commercial package, make sure your license hasn't expired; make sure the virus scanner daemon is running; make sure that the signatures are up to date):
    [~] root@freebsd>/usr/local/viruscan/kav/bin/aveclient -c -p /var/run/aveserver
    RECORDS 283158 UPDATED 18-04-2007 SERIAL 0367-0003F5-012E4689 EXPIRE 17-04-2008
Then the email goes through the spam checker (make sure that the daemon is running) and then into the mail folder.
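The signature and license dates in the aveclient output shown above can be checked with a few lines of date arithmetic. The sketch assumes that the output consists of keyword and value pairs and that the dates are written as day-month-year, as in that output. The thresholds and function names are made up.

```python
from datetime import datetime


def parse_scanner_status(output):
    """Return {'UPDATED': date, 'EXPIRE': date} for the keys that are found."""
    tokens = output.split()
    found = {}
    for key, value in zip(tokens, tokens[1:]):
        if key in ("UPDATED", "EXPIRE"):
            try:
                found[key] = datetime.strptime(value, "%d-%m-%Y").date()
            except ValueError:
                pass  # not a date in the assumed format, ignore it
    return found


def scanner_problems(output, today, max_signature_age_days=2, min_license_days=30):
    """Return a list of problems; an empty list means everything is fine."""
    status = parse_scanner_status(output)
    problems = []
    if "UPDATED" not in status:
        problems.append("no signature date found")
    elif (today - status["UPDATED"]).days > max_signature_age_days:
        problems.append("signatures are stale")
    if "EXPIRE" not in status:
        problems.append("no license date found")
    elif (status["EXPIRE"] - today).days < min_license_days:
        problems.append("license expires soon")
    return problems
```

The caller passes datetime.date.today() as today. Warning a month before the license runs out is a design choice: it leaves time to get a renewal approved.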
Email can come in bulk. That means that one moment your queue is empty, and the next moment there are 500 messages in the queue. If your users get a mailing like this every day at 17:00, then you will get a daily alert about it.
- NAT gateways
Check the size of the NAT table. The expected size depends on the policy of your network. If your network is open (no proxy server, no restrictions on traffic), then the NAT table will be very big.
If you have a regulated network (HTTP has to go via the proxy server, email has to be delivered to the local SMTP servers, DNS requests have to go to the local DNS server etc), then it will be relatively small. A change in the size can show that there is something wrong.
    [~] root@freebsd>ipnat -l | wc -l
    300
- Database replication
Not only is the consistency of the data in a database very important, so is its replication, and it should be as close to realtime as possible. Slony, the replication service for PostgreSQL, gives these statistics via the sl_status table:
    database=# select st_origin,st_received,st_lag_time from _database.sl_status;
     st_origin | st_received |   st_lag_time
    -----------+-------------+-----------------
             4 |           1 | 00:00:01.271073
             4 |           2 | 00:00:01.091502
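Turning the st_lag_time column into an alert means converting the interval to a duration and comparing it with a limit. The sketch below assumes the rows have already been fetched as text and that the lag is printed as HH:MM:SS with optional fractions, as in the query result above. Lags of a day or more are printed differently by PostgreSQL and are not handled. The 30 second limit and the function names are made up.

```python
from datetime import timedelta


def parse_lag(text):
    """Convert 'HH:MM:SS.ffffff' into a timedelta."""
    hours, minutes, seconds = text.strip().split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def lagging_subscribers(rows, max_lag=timedelta(seconds=30)):
    """rows holds (st_origin, st_received, st_lag_time) tuples.

    Returns the st_received node IDs whose lag exceeds max_lag.
    """
    return [
        received
        for _origin, received, lag in rows
        if parse_lag(lag) > max_lag
    ]
```

With the two rows above the result is an empty list, since both subscribers are about one second behind.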
- Asterisk VoIP
There are a couple of important things to be monitored in Asterisk via the Manager interface: Status of the PRI interfaces, status of the SIP peers, status of the IAX peers.
With the SIP and IAX status, not only the OK status is important but also the time for the answer.
    voip*CLI> pri show spans
    PRI span 1/0: Provisioned, Up, Active
    PRI span 2/0: Provisioned, Up, Active
    PRI span 3/0: Provisioned, Up, Active
    PRI span 4/0: Provisioned, Up, Active
    voip*CLI> sip show peers
    Name/username    Host            Dyn Nat ACL Port  Status
    edwin            184.108.40.206   D   N      2051  Unmonitored
    wen09-vega       10.197.9.12                 5060  OK (7 ms)
    ccm-publisher    10.252.11.130               5060  OK (1 ms)
    3 sip peers [3+0 online, 0 offline, 0 unmonitored]
    voip*CLI> iax2 show peers
    Name/Username    Host                Mask             Port      Status
    bluebox-tardis/  220.127.116.11 (S)  255.255.255.255  4569 (T)  OK (3 ms)
    1 iax2 peers [1 online, 0 offline, 0 unmonitored]
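The peer status and the answer time can be taken from the last column of the peer listings. The sketch below assumes that it gets only the output of one "show peers" command in the layout shown above, without the prompt line. It treats everything that is not "OK (n ms)" as a problem, including Unmonitored peers, which is a choice you may not agree with. The 100 ms limit and the function name are made up.

```python
import re

# Matches a trailing status like "OK (7 ms)" and captures the milliseconds.
STATUS_RE = re.compile(r"\bOK \((\d+) ms\)\s*$")


def peer_problems(show_peers_output, max_ms=100):
    """Return {peer_name: problem} for peers which are not OK or too slow."""
    problems = {}
    for line in show_peers_output.splitlines():
        fields = line.split()
        # Skip empty lines, the header line and the summary line.
        if not fields or fields[0].startswith("Name/") or "peers [" in line:
            continue
        match = STATUS_RE.search(line)
        if match is None:
            problems[fields[0]] = "not OK"
        elif int(match.group(1)) > max_ms:
            problems[fields[0]] = f"slow ({match.group(1)} ms)"
    return problems
```

With the SIP listing above and the default limit this returns {"edwin": "not OK"}.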
Network device monitoring
Gathering information for network device monitoring is a little bit trickier than systems monitoring, because you can't run these fancy scripts on your routers and switches. Often you can only get information via SNMP...
- System Uptime: Embedded devices often reboot very fast,
so they can reboot several times without you even
noticing. With the system.sysUpTime OID you can
get the uptime:
SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
- If you have a clean network, with your network devices
and user devices separated from each other, then there is
a clear border showing where the responsibility lies. It also
gives you an easy way to check if all interfaces on your devices
are in the state you expect them to be in.
If an ifSpeed is suddenly 100Mbps instead of 1Gbps, you know that there is something wrong. If an ifOperStatus is down instead of up, you know that there is a problem. If you have redundancy in your network, these issues might otherwise stay hidden, because the remote subnet has never been unreachable.
    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifDescr
    RFC1213-MIB::ifDescr.1001 = STRING: "hs1-x450/14"
    RFC1213-MIB::ifDescr.1002 = STRING: "hs2-ssg550/e0_2"
    RFC1213-MIB::ifDescr.1003 = STRING: "hs2-ssg550/e0_0"
    RFC1213-MIB::ifDescr.1000006 = STRING: "VLAN 04094 (to-internet)"
    RFC1213-MIB::ifDescr.1000007 = STRING: "rtif(18.104.22.168/29)"
    RFC1213-MIB::ifDescr.1000008 = STRING: "VLAN 04093 (to-sjh)"
    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifSpeed
    RFC1213-MIB::ifSpeed.1001 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1002 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1003 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1000006 = Gauge32: 0
    RFC1213-MIB::ifSpeed.1000007 = Gauge32: 0
    RFC1213-MIB::ifSpeed.1000008 = Gauge32: 0
    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifOperStatus
    RFC1213-MIB::ifOperStatus.1001 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1002 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1003 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000006 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000007 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000008 = INTEGER: up(1)
Routers can "suddenly" have more or fewer interfaces, for example when you create or delete a VLAN. So you have to monitor for the absence of expected VLANs and the presence of unknown VLANs.
This is for a radio link:
    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifDescr
    IF-MIB::ifDescr.1 = STRING: Ethernet Interface
    IF-MIB::ifDescr.2 = STRING: lo0
    IF-MIB::ifDescr.3 = STRING: WORP Interface
    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifSpeed
    IF-MIB::ifSpeed.1 = Gauge32: 100000000
    IF-MIB::ifSpeed.2 = Gauge32: 100000000
    IF-MIB::ifSpeed.3 = Gauge32: 36000000
    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifOperStatus
    IF-MIB::ifOperStatus.1 = INTEGER: up(1)
    IF-MIB::ifOperStatus.2 = INTEGER: up(1)
    IF-MIB::ifOperStatus.3 = INTEGER: up(1)
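Comparing the interface table with what you expect can be sketched as follows. The code parses the snmpwalk output of one column (ifSpeed or ifOperStatus, for example) in the format shown above and compares it with a table of expected values that you maintain yourself. The function names and the wording of the messages are made up.

```python
import re

# Matches e.g. 'IF-MIB::ifOperStatus.3 = INTEGER: up(1)'
WALK_RE = re.compile(r"^\S+::(?P<column>\w+)\.(?P<index>\d+) = \w+: (?P<value>.*)$")


def parse_walk(snmpwalk_output):
    """Return {ifIndex: value} from the snmpwalk output of a single column."""
    table = {}
    for line in snmpwalk_output.splitlines():
        match = WALK_RE.match(line.strip())
        if match:
            table[int(match.group("index"))] = match.group("value").strip().strip('"')
    return table


def interface_deviations(expected, actual):
    """Compare two {ifIndex: value} tables and describe every difference."""
    problems = []
    for index in sorted(set(expected) | set(actual)):
        if index not in actual:
            problems.append(f"interface {index} is missing")
        elif index not in expected:
            problems.append(f"interface {index} is unknown")
        elif expected[index] != actual[index]:
            problems.append(
                f"interface {index}: expected {expected[index]}, got {actual[index]}"
            )
    return problems
```

For the radio link above, an expected ifSpeed table of {1: "100000000", 2: "100000000", 3: "36000000"} gives an empty list. If the WORP interface renegotiates to a lower speed, or if an interface appears or disappears, the list tells you which one.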
- If you are exchanging routing information with your ISP to the
internet or to other 3rd parties, then this goes via BGP.
Checking if your BGP neighbours are up can be done via SNMP:
    [~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org BGP4-MIB::bgpPeerState
    BGP4-MIB::bgpPeerState.22.214.171.124 = INTEGER: established(6)
    BGP4-MIB::bgpPeerState.126.96.36.199 = INTEGER: idle(1)
    BGP4-MIB::bgpPeerState.188.8.131.52 = INTEGER: established(6)
The same goes here: check for the absence of expected neighbours and the presence of unknown neighbours.
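A BGP neighbour check combines both ideas: every expected neighbour must be present and established, and no other neighbour may show up. The sketch below parses bgpPeerState lines in the snmpwalk format shown above. The function names are made up, and the addresses in the usage note come from the documentation range, not from a real network.

```python
import re

# Matches e.g. 'BGP4-MIB::bgpPeerState.192.0.2.1 = INTEGER: established(6)'
PEER_RE = re.compile(
    r"bgpPeerState\.(?P<peer>\d+\.\d+\.\d+\.\d+) = INTEGER: (?P<state>\w+)\(\d+\)"
)


def bgp_peer_states(snmpwalk_output):
    """Return {peer_address: state_name} from a bgpPeerState walk."""
    return {
        match.group("peer"): match.group("state")
        for match in PEER_RE.finditer(snmpwalk_output)
    }


def bgp_problems(expected_peers, snmpwalk_output):
    """Return a list of missing, unknown and not-established neighbours."""
    states = bgp_peer_states(snmpwalk_output)
    expected = set(expected_peers)
    known = set(states)
    problems = [f"{peer} is missing" for peer in sorted(expected - known)]
    problems += [f"{peer} is unknown" for peer in sorted(known - expected)]
    problems += [
        f"{peer} is {state}"
        for peer, state in sorted(states.items())
        if peer in expected and state != "established"
    ]
    return problems
```

If 192.0.2.1 and 192.0.2.2 are expected and the router reports 192.0.2.1 as established and 192.0.2.2 as idle, the result is ["192.0.2.2 is idle"].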
- If a router supports environmental reporting (temperature,
fanspeed), measure it and report anomalies. High temperatures
are bad for hardware!
    EXTREME-SYSTEM-MIB::extremeFanOperational.101 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeFanOperational.102 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeFanOperational.103 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeCurrentTemperature.0 = INTEGER: 27
- If a router has multiple power supplies, it is important
that you check if all of them are active. They're just like
RAID cards: You can live with one less, but not with two!
    [~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org EXTREME-SYSTEM-MIB::extremePowerSupplyStatus
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.1 = INTEGER: presentOK(2)
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.2 = INTEGER: presentOK(2)
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.3 = INTEGER: presentOK(2)