Xiaopei's DokuWiki


Nagios3

Comparison with other monitoring systems:

tips

commands

  • Check the configuration
    $ nagios -v nagios.cfg

web frontend

lighttpd config

60-nagios.conf
server.modules += ( "mod_alias", "mod_auth", "mod_cgi", "mod_setenv" )
 
alias.url += (
		"/cgi-bin/nagios3" => "/usr/lib/cgi-bin/nagios3",
		"/nagios3/stylesheets" => "/etc/nagios3/stylesheets",
		"/nagios3" => "/usr/share/nagios3/htdocs"
	    )
 
$HTTP["url"] =~ "^/cgi-bin"  {
	cgi.assign = ( "" => "" )
	# Nagios ships its CGIs as compiled binaries, so they are assigned to themselves (no interpreter)
}
 
$HTTP["url"] =~ "nagios3" {
	auth.backend = "htpasswd"
	auth.backend.htpasswd.userfile = "/etc/nagios3/htpasswd.users"
	auth.require = ( "" => (
		"method" => "basic",
		"realm" => "nagios3",
		"require" => "user=nagiosadmin"
	) )
}

ref

configuration

Configuration structure

nagios.cfg

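Main configuration file

  • log_file=/var/log/nagios3/nagios.log, the log file location; must be the first option in the file
  • resource_file=/etc/nagios3/resource.cfg, location of the file defining the $USERx$ macros, x = 1 to 32; $USER1$ defaults to the path to the plugins
  • cfg_file / cfg_dir, locations of the other configuration files; the recommended setup steps are:
    • Remove the existing cfg_file / cfg_dir entries;
    • Create the directories, move the files (lines starting with # are shell commands run from /etc/nagios3), and add the cfg_dir lines to nagios.cfg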
      # mkdir {timeperiods,contacts,commands,hosts,services,ext}
      # ln -s /etc/nagios-plugins/config/ commands/nagios-plugins
      # mv commands.cfg commands/default.cfg
      cfg_dir=/etc/nagios3/commands
      # mv conf.d/timeperiods_nagios2.cfg timeperiods/default.cfg
      cfg_dir=/etc/nagios3/timeperiods
      # mv conf.d/contacts_nagios2.cfg contacts/default.cfg
      cfg_dir=/etc/nagios3/contacts
      # mv conf.d/*host* hosts/
      cfg_dir=/etc/nagios3/hosts
      # mv conf.d/*service* services/
      cfg_dir=/etc/nagios3/services
      # mv conf.d/extinfo_nagios2.cfg ext/
      cfg_dir=/etc/nagios3/ext
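      • Time_Period defines date/time ranges; both when to check and when to notify are expressed with it
      • Contact is a person to notify; a contact's attributes include several contact methods and the times at which the contact may be reached. Contacts can be grouped into a contactgroup
      • Command is the method of checking ("how to check"); every Command is implemented through a plugin
      • Host is a physical host; it can be a server or a network device such as a router or switch. Its attributes include the host's contacts (Contact), how to check it (Command) and when to check it (Time_Period). A host can set parents to build a tree structure, can use hostdependency to define more complex dependencies, and can be grouped into a hostgroup
      • Service is any of the services running on a host. Its attributes include the service's contacts (Contact), how to check it (Command) and when to check it (Time_Period). servicedependency defines dependencies between services, and services can be grouped into a servicegroup
  1. Some documentation shows a service's check_command written with extra arguments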
    define service {
        use                         generic-service
        host_name                   foo.bar.com
        service_description         HTTP WITH PORT
        check_command               check_http!-p 8080
    }
  2. The check_http plugin itself does support the -p option
    $ /usr/lib/nagios/plugins/check_http -I 111.222.111.222 -p 8080
    HTTP OK: HTTP/1.0 302 Found - 203 bytes in 0.031 second response time |time=0.030661s;;;0.000000 size=203B;;;0
  3. But remember that a command layer sits between the service and check_http; any extra arguments must be declared explicitly in the command definition
    define command{
    	command_name	check_http_accept_external_opts
    	command_line	/usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$' -I '$HOSTADDRESS$' $ARG1$
            ; $ARG1$ lets arbitrary options be appended
    }
    # or write a new, dedicated command
    define command{
    	command_name	check_http_with_port
    	command_line	/usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$' -I '$HOSTADDRESS$' -p $ARG1$
            ; $ARG1$ is passed as the value of -p|--port
    }
  • The ext directory holds extension confs, such as the operating system icons shown in the web interface
  • Other recommended settings
    check_external_commands=1
    interval_length=60
    accept_passive_service_checks=1
    accept_passive_host_checks=1
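
To go with the two command definitions above, here is a minimal sketch of how a service might call each of them. Whatever follows the `!` in check_command is substituted for $ARG1$. The host name and the generic-service template come from the earlier example; the service descriptions are made up for illustration.

```
define service {
        use                         generic-service
        host_name                   foo.bar.com
        service_description         HTTP 8080 via options
        ; "-p 8080" replaces $ARG1$ at the end of the command_line
        check_command               check_http_accept_external_opts!-p 8080
}

define service {
        use                         generic-service
        host_name                   foo.bar.com
        service_description         HTTP 8080 via port
        ; "8080" replaces $ARG1$, which the command_line passes to -p
        check_command               check_http_with_port!8080
}
```

The first form is more flexible, since any plugin option can be passed through. The second keeps service definitions shorter and limits what a service can change.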

hosts and services

When configuring hosts and services, make good use of hostgroups:

define hostgroup {
        hostgroup_name  all
        alias           All Servers
        members         *
}
 
define hostgroup {
        hostgroup_name  http-servers
        alias           HTTP servers
        members         localhost
}
 
define service {
        hostgroup_name                  http-servers
        service_description             HTTP
        check_command                   check_http
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

Configuring Hosts

Prune the network topology into a form that suits Nagios
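
One way to express such a pruned topology is the parents directive on host objects. The sketch below is only an illustration: the host names gateway-sw and web01 and their addresses are hypothetical, and generic-host is assumed to be an existing host template (for example, the one shipped in the distribution's default configuration).

```
define host {
        use             generic-host
        host_name       gateway-sw      ; hypothetical switch next to the Nagios server
        address         192.0.2.1
}

define host {
        use             generic-host
        host_name       web01           ; hypothetical server reached through the switch
        address         192.0.2.10
        parents         gateway-sw      ; if gateway-sw is DOWN, web01 is reported UNREACHABLE instead of DOWN
}
```

Only the devices Nagios actually has to pass through to reach a host need to appear as parents; the rest of the physical topology can be left out.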

Notification

Notifications may be sent out in one of the following situations:

  1. The host has changed its state to DOWN or UNREACHABLE state; notification is sent out after first_notification_delay number of minutes specified in the corresponding host object
  2. The host remains in DOWN or UNREACHABLE state; notification is sent out every notification_interval number of minutes specified in the corresponding host object
  3. Host recovers to an UP state; notification is sent out immediately and only once
  4. Host starts or stops flapping; notification is sent out immediately
  5. Host remains flapping; notification is sent out every notification_interval number of minutes specified in the corresponding host object
  6. Service has changed its state to WARNING, CRITICAL or UNKNOWN state; notification is sent out after first_notification_delay number of minutes specified in the corresponding service object
  7. Service remains in WARNING, CRITICAL or UNKNOWN state; notification is sent out every notification_interval number of minutes specified in the corresponding service object
  8. Service recovers to an OK state; notification is sent out immediately and only once
  9. Service starts or stops flapping; notification is sent out immediately
  10. Service remains flapping; notification is sent out every notification_interval number of minutes specified in the corresponding service object
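
As a rough sketch of where these settings live, the service definition below sets both timing directives named in the list. The values are arbitrary examples, and check_ssh is assumed to be a command already defined by the installed plugin configuration. With interval_length=60, one time unit is one minute.

```
define service {
        use                         generic-service
        host_name                   foo.bar.com
        service_description         SSH
        check_command               check_ssh
        first_notification_delay    5           ; wait 5 minutes after the problem state before the first notification
        notification_interval       30          ; while the problem persists, notify again every 30 minutes
        notification_options        w,u,c,r,f   ; WARNING, UNKNOWN, CRITICAL, recovery, flapping start/stop
}
```

Setting notification_interval to 0 sends the problem notification only once, as in the hostgroup example earlier.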

flapping

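Once enable_flap_detection=1 is set in nagios.cfg, Nagios detects flapping.

Nagios keeps the results of the last 21 checks and computes the flapping ratio as (number of actual state changes across those 21 checks / 20) * 100%. The divisor is 20 because 21 checks can contain at most 20 state changes.

The flapping thresholds can be set in nagios.cfg: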

low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0

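A host or service is considered to be flapping once its flapping ratio rises above the high threshold, and to have stopped flapping once the ratio falls below the low threshold.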
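
The global thresholds can also be overridden per object. The following is a minimal sketch with arbitrary example values, reusing the foo.bar.com host and the check_http command from the earlier examples.

```
define service {
        use                         generic-service
        host_name                   foo.bar.com
        service_description         HTTP
        check_command               check_http
        flap_detection_enabled      1
        low_flap_threshold          10.0    ; overrides low_service_flap_threshold for this service
        high_flap_threshold         30.0    ; overrides high_service_flap_threshold for this service
}
```

Using the simple ratio described above, 7 state changes in the last 21 checks would give 7 / 20 * 100% = 35%, which is above the 30.0 high threshold set here.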

Detection and Handling of State Flapping

Summary

Our Nagios setup is now complete and is ready to be started! We took the road from source code to a working application. We have also configured it from scratch so that it monitors the machine it is running on, and it took very little time and effort to do so.

Our Nagios installation now uses three directories: /opt/nagios for binaries, /etc/nagios for configuration, and /var/nagios for storing data. All object definitions are stored in a categorized way in subdirectories of /etc/nagios. This allows much easier management of Nagios objects.

We have configured the server that Nagios is running on to be monitored. You might want to add more servers just to see how they work.

We told Nagios to monitor only the SSH server. But in all probability, you will also want to monitor other things such as a web server or email.

Chapter 4, Overview of Nagios Plugins, will help when it comes to setting up various types of checks. Make sure to read the /etc/nagios/commands/default.cfg file to see what commands Nagios already came configured with. Sometimes you will also need to set up your own check commands, either as custom scripts or by using Nagios plugins in a different way from the default command set.

You will also want to set up other users if you are working as part of a larger team. It will definitely help everyone in your team if you tell Nagios who is taking care of which parts of the infrastructure!

All that should be a good start for making sure everything works fine in your company. Of course, configuring Nagios for your needs might take a lot of time, but starting with monitoring just the essentials is a good thing. You will learn how it works and increase the number of monitorables over time.

The next step is to set up the web interface so that you will be able to see things from your favorite browser or even put it on your desktop. The next chapter provides the essential information on how to install, configure, and use it.

it/nagios/start.txt · Last modified: 2013/08/19 07:22 (external edit)