Network and System Monitoring Primers

So the question is: "What do you need to monitor?" The easy answer is "Everything", but that is far too much. Let reality kick in and rephrase it to "Everything you can think of". That is still too vague, and it misses the final goal: "Everything you assume to be the normal conditions for a system or service to run properly".

That is still quite a lot, and it differs for each service you provide and each device you have on the network. It will force you to dig into your network and servers to find out what is going on.

This also means that you need to find out how your network and services behave. In the beginning you will get a number of alerts which turn out to be false positives, and you will need to tune your alert settings. Or you will get the same alerts every day at the same time because they are normal system behaviour. You can choose between getting these messages every day and changing the alert settings so that you won't get them anymore. History shows that the latter option isn't the smartest one, because it can hide real issues from you.

A lot of the items described in this primer can be considered too nitpicky, which might or might not be an issue for you at this moment. Keep in mind that you should monitor for normal operations! Everything that happens in your network and systems which isn't normal is worth investigating!

This primer only describes active monitoring and realtime monitoring. Passive monitoring (via SNMP traps, syslog messages or monitoring agents) and historical monitoring (history graphs with applications like Cacti) are not described.

Software

This primer tries to be generic, but is based on my experience with Nagios. At the end there will be a link to the scripts I use for non-standard Nagios features.

Nagios is described as "a host, service and network monitoring program". If you are a beginner, you will find its configuration files horrible, but once you get through that, it is easy to expand. Nagios does all its checking by executing scripts in its libexec/ directory. This doesn't mean that it is limited to checking services which can be reached over the network on other hosts: for checks which have to run on the remote host itself there is NRPE, the Nagios Remote Plugin Executor. It runs as a daemon on the remote hosts and executes the same Nagios scripts, installed locally there. (FreeBSD: net-mgmt/nagios and net-mgmt/nrpe2)
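
To give an idea of what such a script has to do: by convention a Nagios plugin prints one line of status text and reports the result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The following is only a minimal sketch in Python; the function names and the thresholds are made up for this illustration and do not come from the scripts mentioned above.

```python
# Exit codes which Nagios expects from a plugin.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value onto a Nagios state (higher means worse)."""
    if value is None:
        return UNKNOWN  # the measurement itself failed
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, status_line) for one measurement."""
    state = classify(value, warn, crit)
    return state, "%s %s - value=%s" % (name, LABELS[state], value)
```

A real plugin would print the status line and then call sys.exit() with the exit code, for example for a hypothetical process count check: report("PROCS", 150, 200, 300).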

Systems monitoring

There are several components which need to be monitored on a system: Hardware (disks, CPU), the Operating System and the Services.

Hardware

  • These days you can get a lot of information about the components of your motherboard: CPU temperature, internal temperature, fan speeds and power voltages. Higher temperatures are bad for your motherboard. Fan speeds which are suddenly much higher or lower indicate that a fan is broken, which in turn might cause higher temperatures. Power voltage changes indicate problems with your power supply.
    On Linux, this information can be gathered from /proc/acpi. On FreeBSD this can be gathered via sysutils/healthd.
  • The characteristics of IDE harddisks can be monitored via the SMART interface (Self-Monitoring, Analysis and Reporting Technology), for example the temperature of the disks and a handful of counters: Reallocated Sector Count, Seek Error Rate, Spin Retry Count, Calibration Retry Count, Reallocated Event Count, Current Pending Sector and UDMA CRC Error Count. If these counters go up, there might be a problem with your harddisk.
    On Linux and FreeBSD this data can be gathered via the smartmontools software (FreeBSD: sysutils/smartmontools).
  • RAID hardware is beautiful: a broken harddisk won't wake you up in the middle of the night anymore (but two broken harddisks will, so it had better be monitored). There are various ways to check RAID sets, and every vendor seems to have its own software. The following works on FreeBSD:
    • camcontrol for HP/Compaq RAID cards:
      [~] root@freebsd>camcontrol inquiry da3
      pass3: <COMPAQ RAID 1  VOLUME OK> Fixed Direct Access SCSI-0 device 
      	
    • TW_CLI for 3WARE RAID cards: (sysutils/tw_cli)
      [~] root@freebsd>/usr/local/bin/tw_cli info c0 unitstatus
      # of units: 1
              Unit 0: RAID 5 1.63 TB ( 3516478848 blocks): OK
      	
    • aaccli for Adaptec AAC Controllers (sysutils/aaccli)
      [~] root@freebsd>aaccli 'open aac0 : disk list : container list'
      --------------------------------------------------------------------------------
      Adaptec SCSI RAID Controller Command Line Interface
      Copyright 1998-2002 Adaptec, Inc. All rights reserved
      --------------------------------------------------------------------------------
      Executing: open "aac0" 
      
      Executing: disk list 
      
      C:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
      ------  --------------  --------- ----------- ---------------- ------ ----
      0:00:0   Disk            390721968 512         Initialized      NO     100 
      0:03:0   Disk            390721968 512         Initialized      NO     100 
      
      Executing: container list 
      Num          Total  Oth Stripe          Scsi   Partition    
      Label Type   Size   Ctr Size   Usage   C:ID:L Offset:Size  
      ----- ------ ------ --- ------ ------- ------ -------------
       0    RAID-5  745GB       64KB Open    0:00:0 64.0KB: 186GB 
       /dev/aacd0           raid5            0:03:0 64.0KB: 186GB 
      	
  • Network interfaces monitoring consists of two items: The first one is the number of packet errors, which can be gathered by the output of netstat -ni:
    [~] root@linux>netstat -ni
    Kernel Interface table
    Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
    eth1       1500   0 641851668      0      0      0 646287092      0      0      0 BMRU
    eth2       1500   0 711410096      0      0      0 701868617      0      0      0 BMRU
    lo        16436   0  6611086      0      0      0  6611086      0      0      0 LRU
    
    [~] root@freebsd>netstat -ni sk0
    Name    Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
    sk0    1500 <Link#1>      00:0f:ea:2c:d5:18  3706970     0  3316491     0     0 
    sk0    1500 fe80:1::20f:e fe80:1::20f:eaff:        0     -        2     -     - 
    sk0    1500 10.251.1.16/2 10.251.1.18        2923134     -  2536594     -     - 
    
    The second one is the media status: how the device is talking to the switch. On Linux this can be found in the output of mii-tool; on FreeBSD it is on the media line in the output of ifconfig:
    [~] root@linux>mii-tool eth1
    eth1: negotiated 100baseTx-FD flow-control, link ok
    
    [~] root@freebsd>ifconfig sk0
    sk0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
            options=8<VLAN_MTU>
    	inet6 fe80::20f:eaff:fe2c:d518%sk0 prefixlen 64 scopeid 0x1 
    	inet 10.251.1.18 netmask 0xfffffff0 broadcast 10.251.1.31
    	ether 00:0f:ea:2c:d5:18
    	media: Ethernet autoselect (100baseTX <full-duplex,flag0,flag1>)
    	status: active
    
  • If you have a UPS, see if you can get its status. Information from APC UPSes can be gathered via apcupsd and apcaccess:
        STATUS   : ONLINE
    
    Do not just check for the presence of ONLINE; check that ONLINE is the only string, because
        STATUS   : ONLINE REPLACEBATT
    
    can be valid too! (FreeBSD: sysutils/apcupsd)
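
The "ONLINE must be the only string" rule from the UPS item can be sketched as follows. This is a minimal Python example which assumes that the apcaccess output contains a line in the "STATUS   : ..." form shown above; the function name is made up for this illustration.

```python
def ups_status_ok(apcaccess_output):
    """Return True only if the STATUS line consists of exactly 'ONLINE'."""
    for line in apcaccess_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "STATUS":
            # 'ONLINE REPLACEBATT' splits into two words and is rejected.
            return value.split() == ["ONLINE"]
    return False  # no STATUS line at all is not a healthy answer either
```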

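The packet error counters from netstat -ni only ever go up, so what matters is whether they have grown since the previous check. The sketch below assumes the Linux column layout shown above (Iface, RX-ERR, TX-ERR); the FreeBSD output has different columns and would need its own parser. The function names are made up for this illustration.

```python
def interface_errors(netstat_output):
    """Return {iface: (rx_err, tx_err)} from Linux 'netstat -ni' output."""
    result = {}
    columns = None
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            columns = fields  # remember the header to look up columns by name
            continue
        if columns is None or len(fields) != len(columns):
            continue  # title line or anything else which is not a data row
        row = dict(zip(columns, fields))
        result[row["Iface"]] = (int(row["RX-ERR"]), int(row["TX-ERR"]))
    return result


def new_errors(previous, current):
    """Return the interfaces whose error counters grew between two samples."""
    return sorted(
        iface
        for iface, counts in current.items()
        if iface in previous
        and any(now > before for now, before in zip(counts, previous[iface]))
    )
```
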
The Operating System

  • Diskspace information, or partition information, can be obtained from the output of df, which gives you the free disk space. Another important piece of information is the number of inodes you have available (df -i), because if you don't have any inodes free, you can't create any more files.
    [~] root@freebsd>df -i /
    Filesystem  1K-blocks      Used    Avail Capacity iused    ifree %iused  Mounted on
    /dev/da0s1a    128990     86072    32600    73%    2952    13302   18%   /
    
    [~] root@linux>df -i
    Filesystem            Inodes   IUsed   IFree IUse% Mounted on
    /dev/mapper/VolGroup00-LogVol00
                         35291136  559063 34732073    2% /
    /dev/cciss/c0d0p1      26104      36   26068    1% /boot
    
  • A freshly installed system should have very few services running, maybe only crond, inetd, ntpd, sshd and syslogd. They all create their own PID files, so it is easy to get the process IDs:
    [~] root@freebsd>ps wup `head /var/run/ntpd.pid `
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    ntp       3148  0.0  0.1   4044  4044 ?        SLs  Apr14   0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
    
    Of course you need to check first if the PID file exists.
    If the service is a network-based service, then besides checking that the PID file and the process exist, you should also check that the service works to some extent: for ssh you should get the SSH banner, for ntpd you can check if the service is synced.
  • If the server is supposed to transport emails (both mail servers and application servers), then check if the mail-queue is more or less empty:
    [~] root@postfix>mailq
    -Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
    A634A5CD036    66814 Mon Apr 16 12:38:57  edwin@mavetju.org
    [..]
    -- 343 Kbytes in 3 Requests.
    
    [~] root@postfix>mailq
    Mail queue is empty
    
    [~] root@sendmail>mailq
    /var/spool/mqueue is empty
    		Total requests: 0
    
  • On machines which act as a router, and especially ones which do dynamic routing, it is important to make sure that the default gateway is pointing to the expected interface:
    [~] root@freebsd>netstat -rn -f inet | grep ^default
    default            202.83.178.153     UG1         1 885506855   fxp2
    
    [~] root@linux>netstat -rn  | grep ^0.0.0.0
    0.0.0.0         10.252.13.9     0.0.0.0         UG        0 0          0 eth0
    
    If you use dynamic routing in your network, one other thing you need to do on one or more machines is to check if you have all your networks in the routing table. Missing one means that you can't reach that network from that machine!
  • Number of users, total processes and swap: Easy to measure, and it might be an indication that there is something wrong. For machines which are servers, the number of users logged in shouldn't be too high: Unless work is done on them, nobody should be logged in.
    Swap should preferably not be in use at all.
    [~] root@freebsd>uptime
     4:13PM  up 88 days, 4 mins, 4 users, load averages: 0.24, 0.39, 0.32
    [~] root@freebsd>ps auxw | wc -l
         205
    [~] root@freebsd>swapinfo
    Device          1K-blocks     Used    Avail Capacity
    /dev/ad10s1b      4168496      480  4168016     0%
    
    [~] root@linux>uptime
     16:12:49 up 4 days,  3:52,  2 users,  load average: 0.11, 0.11, 0.06
    [~] root@linux>ps auxw | wc -l
    90
    [~] root@linux>cat /proc/swaps
    Filename                                Type            Size    Used    Priority
    /dev/mapper/VolGroup00-LogVol01         partition       2031608 1576    -1
    
  • Uptime. Don't rely on the host not answering pings to determine if the machine has been rebooted. With today's hardware and background file system checks the machine is back before the ping-timeout threshold has been reached. If the uptime has been reset, then something has happened!
    Note that if you monitor this via SNMP, the system.sysUpTime OID returns the time (in hundredths of a second) since snmpd became active, not the time since the machine was booted. Restarting snmpd will reset this counter!
    SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
    
  • FreeBSD Jails are great for setting up small isolated environments, for example webservers and SMTP servers. The server which hosts all jails should check for them, and warn if one is missing or if an unknown one has popped up:
    [~] root@freebsd>jls
       JID  IP Address     Hostname               Path
        11  212.73.76.0    ns0.mavetju.org        /usr/jails/ns0.mavetju.org
        10  212.73.76.3    dhcp.mavetju.org       /usr/jails/dhcp.mavetju.org
         9  212.73.78.126  proxy2.mavetju.org     /usr/jails/proxy2.mavetju.org
         8  212.73.78.125  mail4.mavetju.org      /usr/jails/mail4.mavetju.org
         6  212.73.78.96   tftp.mavetju.org       /usr/jails/tftp.mavetju.org
         5  212.73.78.95   syslog.mavetju.org     /usr/jails/syslog.mavetju.org
         3  212.73.78.92   cvs.mavetju.org        /usr/jails/cvs.mavetju.org
         2  212.73.78.91   mailman.mavetju.org    /usr/jails/mailman.mavetju.org
         1  212.73.78.90   jabber.mavetju.org     /usr/jails/jabber.mavetju.org
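
The "warn if one is missing or if an unknown one has popped up" check for jails comes down to comparing two sets. A minimal sketch in Python, assuming the jls column layout shown above (JID first, hostname third); the function names are made up for this illustration.

```python
def jail_hostnames(jls_output):
    """Extract the Hostname column from 'jls' output."""
    names = set()
    for line in jls_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric JID, the header row does not.
        if len(fields) >= 4 and fields[0].isdigit():
            names.add(fields[2])
    return names


def compare_jails(expected, running):
    """Return (missing, unknown) as two sorted lists of hostnames."""
    return sorted(expected - running), sorted(running - expected)
```

Both lists should be empty on a healthy host; anything in either of them is worth an alert.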
    

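Detecting a reset of the uptime means remembering the previous value and comparing it with the current one. The sketch below parses the Timeticks value from a line like the sysUpTime example above; the tick count is in hundredths of a second. The function names are made up for this illustration, and where the previous value is stored is left open. Keep in mind that the counter is 32 bits wide, so a wrap after roughly 497 days looks the same as a restart.

```python
import re

TIMETICKS = re.compile(r"Timeticks: \((\d+)\)")


def parse_timeticks(line):
    """Return the raw tick count (hundredths of a second), or None."""
    match = TIMETICKS.search(line)
    return int(match.group(1)) if match else None


def has_restarted(previous_ticks, current_ticks):
    """A counter which went backwards means the agent or host restarted."""
    return current_ticks < previous_ticks
```
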
The Services

In theory checking of services could be very easy:

  • If the service makes a PID file, check if the process is running.
    [~] root@freebsd>ps wup `cat /var/run/named.pid `
    USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
    root   407  0.0  1.3 46472 42104  ??  Ss   Thu02PM  13:44.89 /usr/sbin/named -c /etc/namedb/named.conf -u root
    
  • If the service doesn't make a PID file, use pgrep to see if it is running:
    [~] root@freebsd>pgrep -lf named
    407 /usr/sbin/named -c /etc/namedb/named.conf -u root
    
  • If the service is listening on the network, check if you can set up a TCP session towards it.
It might not be ideal, but it's a good start.
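
The PID file check from the first item can be sketched in Python as follows. It makes sure that the PID file exists and is readable before trusting it, and uses signal 0 to test whether the process exists without actually sending it a signal. The function names are made up for this illustration.

```python
import os


def pid_from_file(path):
    """Return the PID stored in a PID file, or None if missing or unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def process_exists(pid):
    """Check for a process with signal 0, which delivers nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def service_running(pid_file):
    """True if the PID file is readable and its process is alive."""
    pid = pid_from_file(pid_file)
    return pid is not None and process_exists(pid)
```

This only proves that some process with that PID exists, not that it is the right one, so it has the same limits as the ps check above.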

Some extra checks can be made for the following services:

  • DNS
    See if you can get an answer back from the request for version.bind or version.server. That will show you if the server is actually answering requests.
    [~] root@freebsd>dig @ns0.mavetju.org version.server chaos txt
    ; <<>> DiG 9.3.2 <<>> @ns0.mavetju.org version.server chaos txt
    ; (1 server found)
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39096
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
    
    ;; QUESTION SECTION:
    ;version.server.                        CH      TXT
    
    ;; ANSWER SECTION:
    version.server.         0       CH      TXT     "Nominum ANS 2.7.0.2"
    
    ;; Query time: 161 msec
    ;; SERVER: 212.73.76.0#53(212.73.76.0)
    ;; WHEN: Wed Apr 18 16:59:37 2007
    ;; MSG SIZE  rcvd: 64
    
  • POP3 / IMAP
    Both POP3 and IMAP services return a greeting when you connect to them:
    [~] root@freebsd>telnet pop.mavetju.org pop3
    Trying 212.73.78.125...
    Connected to mail4.mavetju.org.
    Escape character is '^]'.
    +OK DBMAIL pop3 server ready to rock <11fd22afdf2b08b14bed8ea5ff16cf21@mail4.mavetju.org>
    
    [~] root@freebsd>telnet imap.mavetju.org imap
    Trying 212.73.78.125...
    Connected to mail4.mavetju.org.
    Escape character is '^]'.
    * OK dbmail imap (protocol version 4r1) server 2.0.10 ready to run
    
  • SMTP / spam checker / greylisting / virus scanner
    SMTP servers listen for network connections on port 25 (smtp) and port 587 (submission); check that the SMTP server is running. Incoming SMTP traffic might be greylisted (check that the greylist daemon is running). The email received then goes through a virus scanner: make sure the virus scanner daemon is running, that the signatures are up to date and, if you are using a commercial package, that your license hasn't expired:
    [~] root@freebsd>/usr/local/viruscan/kav/bin/aveclient  -c -p /var/run/aveserver
    RECORDS 283158
    UPDATED 18-04-2007
    
    SERIAL 0367-0003F5-012E4689
    EXPIRE 17-04-2008
    
    Then the email goes through the spam checker (make sure that the daemon is running) and then into the mail folder.
    Email can come in bulk. That means that one moment your queue is empty, and the next moment there are 500 messages in the queue. If your users get a bulk mailing like this every day at 17:00, then you will get a daily alert about it.
  • NAT gateways
    Check the size of the NAT table. The expected size depends on the policy of your network. If your network is open (no proxy server, no restrictions on traffic), then the NAT table will be very big.
    If you have a regulated network (HTTP has to go via the proxy server, email has to be delivered to the local SMTP servers, DNS requests have to go to the local DNS server etc), then it will be relatively small. A change in the size can show that there is something wrong.
    [~] root@freebsd>ipnat -l | wc -l
         300
    
  • Database replication
    The consistency of the data in a database is very important, but so is its replication, which should be as close to realtime as possible. Slony, the replication service for PostgreSQL, gives these statistics via the sl_status table:
    database=# select st_origin,st_received,st_lag_time from _database.sl_status; 
     st_origin | st_received |   st_lag_time   
    -----------+-------------+-----------------
             4 |           1 | 00:00:01.271073
             4 |           2 | 00:00:01.091502
    
  • Asterisk VoIP
    There are a couple of important things to be monitored in Asterisk via the Manager interface: Status of the PRI interfaces, status of the SIP peers, status of the IAX peers.
    voip*CLI> pri show spans
    PRI span 1/0: Provisioned, Up, Active
    PRI span 2/0: Provisioned, Up, Active
    PRI span 3/0: Provisioned, Up, Active
    PRI span 4/0: Provisioned, Up, Active
    
    voip*CLI> sip show peers
    Name/username              Host            Dyn Nat ACL Port     Status    
    edwin                      121.44.244.57    D   N      2051     Unmonitored
    wen09-vega                 10.197.9.12                 5060     OK (7 ms) 
    ccm-publisher              10.252.11.130               5060     OK (1 ms) 
    3 sip peers [3+0 online, 0 offline, 0 unmonitored]
    
    voip*CLI> iax2 show peers
    Name/Username    Host                 Mask             Port          Status    
    bluebox-tardis/  202.83.176.44   (S)  255.255.255.255  4569 (T)      OK (3 ms) 
    1 iax2 peers [1 online, 0 offline, 0 unmonitored]
    
    With the SIP and IAX status, not only the OK status is important but also the time for the answer.
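
Checking both the OK status and the answer time of the SIP peers can be sketched like this. The example assumes the "sip show peers" layout shown above, with the peer name in the first column and a status of the form "OK (7 ms)" at the end of the line; the function names and the threshold are made up for this illustration.

```python
import re

STATUS = re.compile(r"OK \((\d+) ms\)\s*$")


def peer_latency(line):
    """Return the latency in ms if the status is 'OK (N ms)', else None."""
    match = STATUS.search(line)
    return int(match.group(1)) if match else None


def peer_problems(peers_output, expected, max_ms):
    """Return {peer: reason} for expected peers which are absent, not OK or slow."""
    seen = {}
    for line in peers_output.splitlines():
        fields = line.split()
        if fields:
            seen[fields[0]] = peer_latency(line)
    problems = {}
    for name in expected:
        if name not in seen:
            problems[name] = "missing"
        elif seen[name] is None:
            problems[name] = "no OK status"
        elif seen[name] > max_ms:
            problems[name] = "slow (%d ms)" % seen[name]
    return problems
```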

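The greeting checks for POP3 and IMAP (and the SSH banner mentioned earlier) can be sketched as follows. The prefixes are the ones the protocols define for a normal greeting: "+OK" for POP3, "* OK" for IMAP and "SSH-" for SSH. The function names are made up for this illustration, and read_banner simply assumes that the first read returns the whole greeting line, which is usual but not guaranteed.

```python
import socket

# Start of a normal greeting per protocol
# (POP3: RFC 1939, IMAP: RFC 3501, SSH: RFC 4253).
GREETING_PREFIX = {"pop3": "+OK", "imap": "* OK", "ssh": "SSH-"}


def greeting_ok(protocol, banner):
    """True if the banner starts the way the protocol says it should."""
    return banner.startswith(GREETING_PREFIX[protocol])


def read_banner(host, port, timeout=10.0):
    """Connect and return the first chunk of text the server sends."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(512).decode("ascii", "replace").strip()
```

For example: greeting_ok("pop3", read_banner("pop.mavetju.org", 110)).
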
Network device monitoring

Gathering information for network device monitoring is a little bit trickier than systems monitoring, because you can't run these fancy scripts on your routers and switches. Often you can only get information via SNMP...

  • System Uptime: Embedded devices often reboot very fast, so they can reboot several times without you noticing anything. With the system.sysUpTime OID you can get the uptime:
    SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
    
  • If you have a clean network, and have your network devices and user devices separated from each other, then there is a nice border showing where the responsibility lies. It also gives you an easy way to check if all interfaces on your devices are in the state you expect them to be in.
    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifDescr
    RFC1213-MIB::ifDescr.1001 = STRING: "hs1-x450/14"
    RFC1213-MIB::ifDescr.1002 = STRING: "hs2-ssg550/e0_2"
    RFC1213-MIB::ifDescr.1003 = STRING: "hs2-ssg550/e0_0"
    RFC1213-MIB::ifDescr.1000006 = STRING: "VLAN 04094 (to-internet)"
    RFC1213-MIB::ifDescr.1000007 = STRING: "rtif(202.83.178.178/29)"
    RFC1213-MIB::ifDescr.1000008 = STRING: "VLAN 04093 (to-sjh)"
    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifSpeed
    RFC1213-MIB::ifSpeed.1001 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1002 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1003 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1000006 = Gauge32: 0
    RFC1213-MIB::ifSpeed.1000007 = Gauge32: 0
    RFC1213-MIB::ifSpeed.1000008 = Gauge32: 0
    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifOperStatus
    RFC1213-MIB::ifOperStatus.1001 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1002 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1003 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000006 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000007 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000008 = INTEGER: up(1)
    
    If an ifSpeed is suddenly 100Mbps instead of 1Gbps, you know that there is something wrong. If an ifOperStatus is down instead of up, you know that there is a problem. If you have redundancy in your network, these issues might have been hidden because the remote subnet has never been unreachable.
    Routers can "suddenly" have more or fewer interfaces, for example when you create or delete a VLAN. So you have to monitor for the absence of expected VLANs and the presence of unknown VLANs.

    This is for a radio link:
    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifDescr
    IF-MIB::ifDescr.1 = STRING: Ethernet Interface
    IF-MIB::ifDescr.2 = STRING: lo0
    IF-MIB::ifDescr.3 = STRING: WORP Interface
    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifSpeed
    IF-MIB::ifSpeed.1 = Gauge32: 100000000
    IF-MIB::ifSpeed.2 = Gauge32: 100000000
    IF-MIB::ifSpeed.3 = Gauge32: 36000000
    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifOperStatus
    IF-MIB::ifOperStatus.1 = INTEGER: up(1)
    IF-MIB::ifOperStatus.2 = INTEGER: up(1)
    IF-MIB::ifOperStatus.3 = INTEGER: up(1)
    
  • If you are exchanging routing information with your ISP or with other 3rd parties, then this goes via BGP. Checking if your BGP neighbours are up can be done via SNMP:
    [~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org BGP4-MIB::bgpPeerState
    BGP4-MIB::bgpPeerState.218.100.2.1 = INTEGER: established(6)
    BGP4-MIB::bgpPeerState.218.100.2.62 = INTEGER: idle(1)
    BGP4-MIB::bgpPeerState.221.133.215.61 = INTEGER: established(6)
    
    The same applies here: check for the absence of expected neighbours and the presence of unknown neighbours.
  • If a router supports environmental reporting (temperature, fanspeed), measure it and report anomalies. High temperatures are bad for hardware!
    EXTREME-SYSTEM-MIB::extremeFanOperational.101 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeFanOperational.102 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeFanOperational.103 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeCurrentTemperature.0 = INTEGER: 27
    
  • If a router has multiple power supplies, it is important that you check if all of them are active. They're just like RAID cards: You can live with one less, but not with two!
    [~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org EXTREME-SYSTEM-MIB::extremePowerSupplyStatus
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.1 = INTEGER: presentOK(2)
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.2 = INTEGER: presentOK(2)
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.3 = INTEGER: presentOK(2)
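
The BGP neighbour check described above (every expected neighbour established, no unknown neighbours) can be sketched as follows. The example parses snmpwalk output in the form shown for BGP4-MIB::bgpPeerState; the function names are made up for this illustration, and the list of expected neighbours has to come from your own configuration.

```python
import re

PEER_LINE = re.compile(
    r"bgpPeerState\.(\d+\.\d+\.\d+\.\d+) = INTEGER: (\w+)\(\d+\)"
)


def peer_states(snmpwalk_output):
    """Return {neighbour_ip: state_name} from bgpPeerState output."""
    return {m.group(1): m.group(2) for m in PEER_LINE.finditer(snmpwalk_output)}


def bgp_problems(expected, states):
    """List expected neighbours which are absent or not established,
    followed by neighbours which were not expected at all."""
    problems = []
    for peer in sorted(expected):
        state = states.get(peer)
        if state is None:
            problems.append("%s: missing" % peer)
        elif state != "established":
            problems.append("%s: %s" % (peer, state))
    for peer in sorted(set(states) - set(expected)):
        problems.append("%s: unknown neighbour" % peer)
    return problems
```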