Daily Data --- Everyday Bits and Bytes: Nagios: Preventing data center downtime

No matter how long the downtime lasts, losing data center functionality cuts considerably into a company's profit. First, the company loses business during the actual failure. Then it loses credibility, and additional money, through the loss of service that its clients experience.

Such consequences make data center monitoring a business-critical function. This tip shows you how to use the monitoring software Nagios to help protect your company from costly data center failures. You will learn how to take best advantage of its plug-in, remote management, dependency-notification and failover capabilities.

Nagios is a system and network monitoring tool that can be used to monitor a variety of hosts and services. Nagios is free and open source software licensed under the GPLv2. Its feature overview, as presented in its documentation, is:

Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
Monitoring of host resources (processor load, disk usage, etc.)
Simple plugin design that allows users to easily develop their own service checks
Parallelized service checks
Ability to define network host hierarchy using "parent" hosts, allowing detection of and distinction between hosts that are down and those that are unreachable
Contact notifications when service or host problems occur and get resolved (via email, pager or user-defined method)
Ability to define event handlers to be run during service or host events for proactive problem resolution
Automatic log file rotation
Support for implementing redundant monitoring hosts
Optional Web interface for viewing current network status, notification and problem history, log file, etc.

What does all of that mean to IT management and staff? Well, let's make the assumption that the people using your computer systems expect them to have a certain level of availability (a pretty good assumption). Any unplanned downtime for some services may even directly contribute to your business losing money – not a pleasant thought. So, rather than hiring someone to stare at console screens 24/7 and be ready to fix anything that breaks, you can use a tool like Nagios to keep track of when services go down, notify you that they are down, and even restart them for you.

Nagios is very flexible and extensible. If there is not already a service check available for the service you care to monitor, a plugin to perform the service check can easily be added on. This is where Nagios' active community comes in handy. You could go it alone but chances are that what you are looking for is either included in the Nagios Plugins package, is available on the Nagios Exchange website or has been discussed on the Nagios Community site. In the event that you are the one that needs to start the ball rolling for a new type of service check, there are coding guidelines provided, and you can essentially use any programming/scripting language that you are comfortable with. You just need to keep in mind that the end product should be portable and conform to the proper Nagios plugin return values and command line options.
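To make the return-value convention concrete, here is a minimal sketch of a check written in Python. The exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) are the standard Nagios plugin convention; the check itself, the function names and the 80/90 percent thresholds are assumptions chosen purely for illustration.

```python
"""Hypothetical disk-usage check following the Nagios plugin exit-code convention."""
import shutil

# Exit codes that Nagios interprets as service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn, crit):
    """Map a usage percentage to (exit_code, one-line status text)."""
    if used_percent >= crit:
        code = CRITICAL
    elif used_percent >= warn:
        code = WARNING
    else:
        code = OK
    # Everything after the "|" is optional performance data.
    text = (f"DISK {STATE_NAMES[code]} - {used_percent:.1f}% used"
            f" | used_pct={used_percent:.1f}%;{warn};{crit}")
    return code, text


def main(path="/", warn=80.0, crit=90.0):
    """Run the check, print one status line, and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        used_percent = 100.0 * usage.used / usage.total
    except OSError as exc:
        # A check that cannot run should report UNKNOWN, not OK.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, text = evaluate(used_percent, warn, crit)
    print(text)
    return code

# A real plugin would finish with: sys.exit(main())
```

Nagios reads only the exit code and the first line of output, so any language that can print a line and exit with a status will do. A production plugin would also accept its thresholds on the command line (conventionally -w and -c) rather than hard-coding them.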

Nagios can also monitor local resources on remote machines. This matters because some values, such as the memory usage of a machine, generally cannot be queried remotely. There are two add-ons for Nagios that allow it to monitor local resources on remote hosts: Nagios Remote Plugin Executor (NRPE) and Nagios Service Check Acceptor (NSCA). Both names date from the project's earlier life as NetSaint. NRPE allows the Nagios server to execute plugins (like check_memory) on remote Linux/Unix hosts. With NSCA, the remote server instead runs the check on itself periodically and sends the results to the Nagios server of its own accord. This is called passive checking. Using passive checks can help to reduce the load on the Nagios server and is an important component of setting up distributed monitoring.
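As a rough illustration of what a passive result looks like on the wire, the sketch below formats one check result in two ways: as the tab-separated line that the send_nsca client reads on standard input, and as the PROCESS_SERVICE_CHECK_RESULT external command that Nagios accepts on its command file. The host name, service description and helper function names are hypothetical.

```python
import time

VALID_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def nsca_line(host, service, return_code, output):
    """Build the tab-separated record that send_nsca reads on standard input."""
    if return_code not in VALID_CODES:
        raise ValueError(f"invalid return code: {return_code}")
    # Tabs or newlines inside the plugin output would corrupt the record.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def external_command(host, service, return_code, output, now=None):
    """Build the equivalent entry for the Nagios external command file."""
    if return_code not in VALID_CODES:
        raise ValueError(f"invalid return code: {return_code}")
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")
```

For example, nsca_line("web01", "Memory", 1, "MEM WARNING - 85% used") produces a single line that could be piped to send_nsca on the monitored host. The host and service names must match definitions that already exist on the Nagios server, otherwise the result is discarded.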

If you have hundreds or even thousands of hosts to monitor (plus their services), distributed monitoring and passive checks can be crucial features to use. You can set up a number of distributed servers that could perform checks on an arbitrary group of servers to be monitored (perhaps you could have one distributed server per subnet) and send the data back to the central server. The central server remains responsible for capturing all of the monitoring data, performance data (if you choose), and performing any notifications. You should also make sure that the central server is monitoring the distributed servers so that it can tell the difference between a subnet outage and one distributed server going out of commission. This brings up two more important features in an enterprise monitoring tool that Nagios provides:

Dependencies (host, service, or network)
Failover capability

Network dependencies are configured through parent relationships in Nagios. The basic idea is that Nagios can then tell whether a host is DOWN (every point on the path to the host is fine, but the host itself does not respond) or merely UNREACHABLE (a router between the Nagios server and the monitored host is known to be in a DOWN state). This is very useful when setting up notifications. Notifications for the UNREACHABLE state can even be disabled, so that you are not told that 50 hosts on a subnet are UNREACHABLE when a single notification that the subnet's router is DOWN would be sufficient. Service dependencies can also help to pare notifications down to the essentials and to map out how services relate to one another. You may have a service that you want to monitor for both connectivity and authentication. Authentication depends on connectivity, so if the connectivity check fails, there is no need to check authentication.
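The DOWN versus UNREACHABLE distinction can be sketched in a few lines of Python. This is a simplified model of the idea, not Nagios's actual implementation: it assumes an acyclic parent map and treats a failing host as UNREACHABLE only when none of its parents is UP. The host names and function names are hypothetical.

```python
def classify(failing, parents):
    """Return {host: "UP" | "DOWN" | "UNREACHABLE"}.

    failing: set of hosts whose own check failed.
    parents: host -> list of parent hosts (empty list for hosts that sit
             directly next to the monitoring server). Assumed acyclic.
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host not in failing:
            states[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # With no parents, or with at least one working path, the fault
            # lies with the host itself.
            if not host_parents or any(state_of(p) == "UP" for p in host_parents):
                states[host] = "DOWN"
            else:
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states


def hosts_to_notify(states):
    """Notify only for DOWN hosts, suppressing UNREACHABLE ones."""
    return sorted(host for host, state in states.items() if state == "DOWN")
```

With parents = {"router": [], "web01": ["router"], "web02": ["router"]} and all three hosts failing their checks, only the router is classified as DOWN, so a single notification goes out instead of three.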

Failover is one of the most important enterprise-class features of Nagios. If you can't rely on your monitoring tool, then it is useless. With Nagios, a failover configuration works as follows:

A second, or slave, Nagios server is set up with the same configuration for hosts, services, and contacts as the master.
The slave server will not actively perform any service checks or send any notifications, however. The master server will keep the slave server apprised of the current state of everything by sending it passive service check results.
The slave server will actively monitor the Nagios process on the master server. In the event that this proves DOWN, the slave Nagios server will start doing active checks and notifications.
Once the Nagios process on the master server comes back, the slave will again cease active checks and notifications.
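The switch between standby and active behavior described in the steps above could be driven by an event handler on the slave. The sketch below only builds the external command lines such a handler might submit; ENABLE_NOTIFICATIONS, DISABLE_NOTIFICATIONS, START_EXECUTING_SVC_CHECKS and STOP_EXECUTING_SVC_CHECKS are standard Nagios external commands, while the function name and the decision to act only on OK and CRITICAL results are assumptions made for this example.

```python
import time

# Commands that make the slave take over, and commands that make it stand down.
TAKE_OVER = ("ENABLE_NOTIFICATIONS", "START_EXECUTING_SVC_CHECKS")
STAND_DOWN = ("DISABLE_NOTIFICATIONS", "STOP_EXECUTING_SVC_CHECKS")


def failover_commands(master_check_code, now=None):
    """Return external command lines for the slave's own command file.

    master_check_code: plugin return code from the slave's check of the
    Nagios process on the master (0 = OK, 2 = CRITICAL).
    """
    timestamp = int(time.time() if now is None else now)
    if master_check_code == 2:
        names = TAKE_OVER       # master's Nagios process is down
    elif master_check_code == 0:
        names = STAND_DOWN      # master is back, return to standby
    else:
        names = ()              # WARNING/UNKNOWN: assumed to leave things as they are
    return [f"[{timestamp}] {name}\n" for name in names]
```

A handler would append these lines to the slave's external command file, whose location depends on the installation. In practice it is also sensible to act only once the failure has been confirmed by several consecutive checks, so that a single missed check does not trigger a takeover.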

Nagios' great strength is its configurability and adaptability to many different monitoring scenarios. There are also plug-ins and add-ons for Nagios that can do things such as store all the monitoring data in a database, do extensive graphing with said data, and any number of other things. All of these options and configurations can also make the initial setup of Nagios a bit imposing. Proper planning, research and outlining of your deployment (things you do anyway, right?) will alleviate the pain of configuration and will ultimately reward you with a well-monitored environment.

source: Mark Keisler


Wednesday, September 3, 2008
