gmond -m, --metrics
module | metric | explanation |
---|---|---|
core_metrics | gexec | gexec available |
core_metrics | heartbeat | Last heartbeat |
core_metrics | location | Location of the machine |
cpu_module | cpu_aidle | Percent of time since boot idle CPU |
cpu_module | cpu_idle | Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request |
cpu_module | cpu_intr | cpu_intr |
cpu_module | cpu_nice | Percentage of CPU utilization that occurred while executing at the user level with nice priority |
cpu_module | cpu_num | Total number of CPUs |
cpu_module | cpu_sintr | cpu_sintr |
cpu_module | cpu_speed | CPU Speed in terms of MHz |
cpu_module | cpu_system | Percentage of CPU utilization that occurred while executing at the system level |
cpu_module | cpu_user | Percentage of CPU utilization that occurred while executing at the user level |
cpu_module | cpu_wio | Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request |
disk_module | disk_free | Total free disk space |
disk_module | disk_total | Total available disk space |
disk_module | part_max_used | Maximum percent used for all partitions |
load_module | load_fifteen | Fifteen minute load average |
load_module | load_five | Five minute load average |
load_module | load_one | One minute load average |
mem_module | mem_buffers | Amount of buffered memory |
mem_module | mem_cached | Amount of cached memory |
mem_module | mem_free | Amount of available memory |
mem_module | mem_shared | Amount of shared memory |
mem_module | mem_total | Total amount of memory displayed in KBs |
mem_module | swap_free | Amount of available swap memory |
mem_module | swap_total | Total amount of swap space displayed in KBs |
net_module | bytes_in | Number of bytes in per second |
net_module | bytes_out | Number of bytes out per second |
net_module | pkts_in | Packets in per second |
net_module | pkts_out | Packets out per second |
proc_module | proc_run | Total number of running processes |
proc_module | proc_total | Total number of processes |
sys_module | boottime | The last time that the system was started |
sys_module | machine_type | System architecture |
sys_module | mtu | Network maximum transmission unit |
sys_module | os_name | Operating system name |
sys_module | os_release | Operating system release date |
sys_module | sys_clock | Time as reported by the system clock |
gmond --default_config
Prints the default configuration. See man gmond.conf for the configuration documentation.
ganglia/gmetric - github: This is the official repository for hosting all user-contributed gmetric scripts.
The Ganglia Metric Tool (gmetric) allows you to easily monitor any arbitrary host metrics that you like, expanding on the core metrics that gmond measures by default.
Gmetric sends the metric specified on the command line to all udp_send_channels specified in the configuration file, /etc/ganglia/gmond.conf by default.
All metrics in ganglia have a name, a value, a type and, optionally, units. For example, say I wanted to measure the temperature of my CPU (something gmond doesn't do by default); I could then send this metric with name="temperature", value="63", type="int16" and units="Celsius".
Assume I have a program called cputemp which outputs the temperature of the CPU as text:
    % cputemp
    63
I could easily send this data to all listening gmonds by running
    % gmetric --name temperature --value `cputemp` --type int16 --units Celsius
Check the exit value of gmetric to see if it successfully sent the data: 0 on success and -1 on failure.
To sample this temperature metric constantly, you just need to add this command to your cron table.
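The gmetric call described above can also be wrapped in a script. The following is a minimal sketch, assuming gmetric is on the PATH; the helper names `gmetric_args` and `send_metric` are made up for illustration, and only the `--name`, `--value`, `--type` and `--units` options quoted above are used.

```python
import subprocess


def gmetric_args(name, value, type_, units=None):
    """Build the argument list for one gmetric invocation."""
    args = ["gmetric", "--name", name, "--value", str(value), "--type", type_]
    if units is not None:
        args += ["--units", units]
    return args


def send_metric(name, value, type_, units=None):
    """Run gmetric and return True when it exits with status 0."""
    completed = subprocess.run(gmetric_args(name, value, type_, units))
    return completed.returncode == 0
```

A cron job could then call `send_metric("temperature", 63, "int16", "Celsius")` and log a failure when it returns False.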
gmetric does not require any modules to be configured in gmond.conf.
This script uses gmetric to add diskio monitoring to gmond. Make the script executable with chmod +x, then add a crontab entry:
* * * * * root /etc/ganglia/diskio.pl sda
gmetad.conf
    # Format:
    # data_source "my cluster" [polling interval] address1:port address2:port ...
    # Examples:
    data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
    data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
    data_source "another source" 1.3.4.7:8655 1.3.4.8
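To make the data_source format concrete, here is a small sketch that splits such a line into its parts. `parse_data_source` is a hypothetical helper, not part of Ganglia; it returns None where the polling interval or a port is omitted instead of guessing a default, and it does not handle IPv6 addresses.

```python
import shlex


def parse_data_source(line):
    """Split a gmetad data_source line into (name, interval, addresses)."""
    tokens = shlex.split(line)  # shlex keeps the quoted name as one token
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = None
    if rest[0].isdigit():  # the optional polling interval comes first
        interval = int(rest[0])
        rest = rest[1:]
    if not rest:
        raise ValueError("data_source needs at least one address")
    addresses = []
    for item in rest:
        host, sep, port = item.partition(":")
        addresses.append((host, int(port) if sep else None))
    return name, interval, addresses
```

For the first example above this yields the name "my cluster", an interval of 10 and three addresses, the first of which has no explicit port.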
The lighttpd configuration is as follows; for the authentication part, see the mod_auth tutorial:
    alias.url += ("/ganglia" => "/usr/share/ganglia-webfrontend")
    # authentication
    server.modules += ( "mod_auth" )
    auth.debug = 2
    auth.backend = "plain"
    auth.backend.plain.userfile = "/path/to/ganglia.pass"
    auth.require = ( "/ganglia/" =>
        (
            "method" => "basic",
            "realm" => "Ganglia Access",
            "require" => "user=ganglia",
        )
    )
User and password:
    ganglia:ganglia
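With "method" => "basic", a client has to send the user and password in an Authorization header. As a rough sketch, this is how a monitoring script might build that header for the credentials above; `basic_auth_header` is an illustrative name, not something lighttpd or Ganglia provides.

```python
import base64


def basic_auth_header(user, password):
    """Return the value of an HTTP Basic Authorization header."""
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

The result can be passed as the `Authorization` header of a `urllib.request.Request` for the /ganglia/ URL of your own server.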
Unicast: designate a few receivers and have the other nodes actively send their data to those receivers, instead of listening for multicast and replying. The configuration is as follows.
Key points for a receiver (it listens on UDP and accepts TCP connections):
    globals {
      mute = no
      deaf = no
      send_metadata_interval = 30 /* secs */
    }
    cluster {
      name = "Production"
      owner = "unspecified"
      latlong = "unspecified"
      url = "unspecified"
    }
    udp_send_channel {
      host = mon1
      port = 8649
      ttl = 1
    }
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }
Key points for the other nodes (deaf, send only):
    globals {
      mute = no
      deaf = yes
      send_metadata_interval = 30 /* secs */
    }
    cluster {
      name = "Production"
      owner = "unspecified"
      latlong = "unspecified"
      url = "unspecified"
    }
    udp_send_channel {
      host = mon1
      port = 8649
      ttl = 1
    }
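ganglia was designed for cluster monitoring. By default it uses multicast mode (for example mcast_join = 239.2.11.71, port = 8649) to find the other nodes on the same network, then monitors and aggregates them. But if you have nothing as fancy as a cluster to play with, only a handful of servers scattered across different networks, and you learned about ganglia from WO: KTDOT and want to use it (plus nagios) to monitor those servers, the configuration below will suit you. It requires every node to have its own IP; if that is not the case, you can also forward a port to gmetad over ssh…
Although each site (project or customer) is usually just a single server, we follow ganglia's convention and still call it a cluster.
Key points of the configuration: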
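    globals {
      mute = no  /* must not be mute: a mute (but not deaf) gmond accepts the
                    connection but does not answer with
                    <GANGLIA_XML>...</GANGLIA_XML>. If you set mute = yes and
                    telnet still shows output, note that it is only the
                    description of the XML structure */
      deaf = no  /* must not be deaf: a deaf gmond cannot accept TCP connections */
      send_metadata_interval = 30  /* probably needed */
    }
    /* If the cluster has only one node, configure UDP as below so that the
       node listens only to itself and is not polluted by other nodes on the
       network */
    udp_send_channel {
      host = 127.0.0.1
      port = 8941
    }
    udp_recv_channel {
      port = 8941
    }
    tcp_accept_channel {
      /* on the Internet, be sure to take security precautions */
      acl {
        default = "deny"
        access {
          ip = 123.123.123.123  /* collector's IP; only this IP may connect */
          mask = 32             /* or use the mask to allow a whole subnet */
          action = "allow"
        }
      }
      port = 8941  /* also use a non-default port */
    }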
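The effect of the acl block can be pictured with a short sketch. It assumes that the first matching access entry decides and that `default` applies when nothing matches; that evaluation order is an assumption made for illustration, and `acl_allows` is a made-up helper rather than gmond code.

```python
import ipaddress


def acl_allows(client_ip, rules, default="deny"):
    """Decide whether client_ip may connect.

    rules is a list of (ip, mask, action) tuples mirroring the access
    blocks; the first rule whose network contains the client wins.
    """
    addr = ipaddress.ip_address(client_ip)
    for ip, mask, action in rules:
        network = ipaddress.ip_network("%s/%d" % (ip, mask), strict=False)
        if addr in network:
            return action == "allow"
    return default == "allow"
```

With the single rule `("123.123.123.123", 32, "allow")`, only the collector's address gets through; a mask such as 24 would admit the whole /24 subnet.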
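I used to use the following configuration:
    udp_send_channel {
      host = localhost
      port = 8941
    }
With that configuration, however, some nodes ran into the problem that the webfrontend listed the cluster but showed no graphs. Debugging according to debugging_tips gave these results:
<CLUSTER></CLUSTER> contained no <METRIC></METRIC>
/usr/bin/gstat -a reported Hosts: 0
Running /usr/sbin/gmond -d 2 in DEBUG mode showed "1 errors" for almost every metric (sent message 'metric' of length 56 with 1 errors), and even -d 10 did not show the actual error
Googling turned up nothing, and comparing the UDP sysctl settings with those of a healthy node showed no significant difference.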
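Checking by eye whether a <CLUSTER> contains any <METRIC> gets tedious, so the same check can be sketched in a few lines. The function below takes the XML text that gmond or gmetad returns (for example saved from nc or telnet) and counts hosts and metrics per cluster; `summarize` is an illustrative name, and the element and attribute names are the ones seen elsewhere in this post.

```python
import xml.etree.ElementTree as ET


def summarize(xml_text):
    """Map each cluster name to (number of hosts, number of metrics)."""
    root = ET.fromstring(xml_text)
    summary = {}
    for cluster in root.iter("CLUSTER"):
        hosts = cluster.findall("HOST")
        metrics = sum(len(host.findall("METRIC")) for host in hosts)
        summary[cluster.get("NAME")] = (len(hosts), metrics)
    return summary
```

A cluster that shows up as `(0, 0)` matches the symptom described above: the cluster is known, but no host is reporting into it.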
Finally I ran lsof on the gmond process and found that the problem node's local UDP connection used IPv6, while the healthy nodes used IPv4.
    $ lsof -p 21272 -n
    # ...
    gmond 21272 ganglia 6u IPv6 14551406 0t0 UDP [::1]:58281->[::1]:8649
IPv6 was used because localhost in /etc/hosts resolves to both an IPv4 and an IPv6 address.
So change gmond.conf to udp_send_channel { host = 127.0.0.1 }.
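To see in advance which address families a host name resolves to on a given node, a quick check like the following may help. It is only a sketch: `resolved_families` is a made-up helper, and it reports what the system resolver returns, which is not necessarily the address gmond ends up using.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def resolved_families(host, port=8649):
    """Return the distinct (family, address) pairs that host resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    seen = []
    for family, _type, _proto, _canonname, sockaddr in infos:
        label = (FAMILY_NAMES.get(family, str(family)), sockaddr[0])
        if label not in seen:
            seen.append(label)
    return seen
```

If `resolved_families("localhost")` lists an IPv6 entry on a node, using the literal 127.0.0.1 in udp_send_channel avoids the ambiguity.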
Set data_source:
    data_source "gin" 111.111.111.111:8941
    data_source "vodka" 111.111.111.222:8941
    data_source "whisky" 111.111.222.222:8941
If you do end up with a real cluster some day, see "Ganglia and Nagios, Part 1: Monitor enterprise clusters with Ganglia" for quick ways to configure the nodes.
Check that gmond is running with the ps aux|grep gmond command.
/etc/init.d/gmond stop; /usr/sbin/gmond -d 2. Look for errors near the top.
nc <hostname> 8649
Run nc -u -l 8653 on the host in question, then echo "hello"|nc -u <hostname> 8653 from the gmetad or another gmond.
/usr/bin/gstat -a
Check that gmetad is running with the ps aux|grep gmetad command.
tail /var/log/messages
/etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2. Look for errors near the top.
Make sure /var/lib/ganglia and its children are owned and writable by the nobody user (ganglia user on Debian/Ubuntu).
nc <hostname> 8650. This information is useful for submitting bug reports.
If you see fsockopen error: Connection refused, gmetad is most likely not running. If gmetad will not start, enable debug in gmetad.conf to find the problem.
tail -f /var/log/apache2/error_log
Make sure the variables in conf.php are correct. If you are installing from source, don't just copy the web/ folder and rename conf.php.in and version.php.in; they both have variables in them that need to be set. Run make -C web conf.php version.php or fill in the variables by hand (there are only 2, and both are enclosed by @'s).
contrib/check_ganglia.py
    #!/usr/bin/env python3
    import getopt
    import socket
    import sys
    import xml.parsers.expat


    class GParser:
        def __init__(self, host, metric):
            self.inhost = 0
            self.inmetric = 0
            self.value = None
            self.host = host
            self.metric = metric

        def parse(self, file):
            p = xml.parsers.expat.ParserCreate()
            # Bind the handlers to this instance
            p.StartElementHandler = self.start_element
            p.EndElementHandler = self.end_element
            p.ParseFile(file)
            if self.value is None:
                raise Exception('Host/value not found')
            return float(self.value)

        def start_element(self, name, attrs):
            if name == "HOST":
                # At least in gmond 3.1.7 a host cannot be renamed;
                # the gmond configuration can only set the host's location,
                # so matching on location is more convenient here
                if attrs["LOCATION"] == self.host:
                    self.inhost = 1
            elif self.inhost == 1 and name == "METRIC" and attrs["NAME"] == self.metric:
                self.value = attrs["VAL"]

        def end_element(self, name):
            if name == "HOST" and self.inhost == 1:
                self.inhost = 0


    def usage():
        print("""Usage: check_ganglia \
    -h|--host= -m|--metric= -w|--warning= \
    -c|--critical= [-s|--server=] [-p|--port=]
    """)
        sys.exit(3)


    if __name__ == "__main__":
        ganglia_host = '127.0.0.1'
        # 8649 is the default gmond port
        # ganglia_port = 8649
        # 8651 is the default gmetad port;
        # gmetad knows the hosts of every data_source
        ganglia_port = 8651
        host = None
        metric = None
        warning = None
        critical = None

        try:
            options, args = getopt.getopt(
                sys.argv[1:], "h:m:w:c:s:p:",
                ["host=", "metric=", "warning=", "critical=", "server=", "port="],
            )
        except getopt.GetoptError as err:
            print("check_gmond:", str(err))
            usage()
            sys.exit(3)

        for o, a in options:
            if o in ("-h", "--host"):
                host = a
            elif o in ("-m", "--metric"):
                metric = a
            elif o in ("-w", "--warning"):
                warning = float(a)
            elif o in ("-c", "--critical"):
                critical = float(a)
            elif o in ("-p", "--port"):
                ganglia_port = int(a)
            elif o in ("-s", "--server"):
                ganglia_host = a

        if critical is None or warning is None or metric is None or host is None:
            usage()
            sys.exit(3)

        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect((ganglia_host, ganglia_port))
            parser = GParser(host, metric)
            # expat's ParseFile needs a file object whose read() returns bytes
            value = parser.parse(s.makefile("rb"))
            s.close()
        except Exception as err:
            print("CHECKGANGLIA UNKNOWN: Error while getting value \"%s\"" % (err))
            sys.exit(3)

        # The original script only alerts when a value is too high.
        # To alert when a value is too low (as with disk_free), compare
        # critical and warning to decide between a "less than" check
        # and a "greater than" check
        if critical > warning:
            if value >= critical:
                print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
                sys.exit(2)
            elif value >= warning:
                print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
                sys.exit(1)
            else:
                print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
                sys.exit(0)
        else:
            if critical >= value:
                print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
                sys.exit(2)
            elif warning >= value:
                print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
                sys.exit(1)
            else:
                print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
                sys.exit(0)
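The threshold logic at the end of the script is the part most worth testing in isolation. Below is a sketch of the same decision as a pure function; `classify` is a name chosen here for illustration, and the return values are the script's exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
def classify(value, warning, critical):
    """Return the exit code for value, given the two thresholds.

    When critical > warning, high values are bad (e.g. load_one);
    otherwise low values are bad (e.g. disk_free).
    """
    if critical > warning:
        if value >= critical:
            return 2
        if value >= warning:
            return 1
        return 0
    if value <= critical:
        return 2
    if value <= warning:
        return 1
    return 0
```

For load_one with thresholds 4 and 5, a value of 4.5 is a warning; for disk_free with thresholds 10 and 5, a value of 3 is critical.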
Place check_ganglia.py in /usr/lib/nagios/plugins/ and make it executable:
    chmod a+x check_ganglia.py
    # define the command
    define command {
        command_name  check_ganglia
        command_line  $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
    }

    # define the service group
    define servicegroup {
        servicegroup_name  ganglia-metrics
        alias              Ganglia Metrics
    }
    # define a base service for the hosts / host groups to be monitored
    define service {
        use                  generic-service
        name                 ganglia-service
        # hostgroup_name     dallas-cloud-servers
        # nagios checks a host (not a cluster);
        # host_name in nagios must equal the location of the host in ganglia
        host_name            localhost
        service_groups       ganglia-metrics
        # notifications_enabled 0
        register             0
    }
    # define a concrete service for each metric;
    # see ganglia for the available metrics, and make sure the metric is enabled in ganglia
    define service {
        use                  ganglia-service
        service_description  load_one
        check_command        check_ganglia!load_one!4!5
    }
    define service {
        use                  ganglia-service
        service_description  disk_free
        # check how many GB of disk space are left
        check_command        check_ganglia!disk_free!10!5
    }
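How a check_command such as check_ganglia!disk_free!10!5 turns into the final command line can be sketched as follows: the parts after each "!" become $ARG1$, $ARG2$ and so on. `expand_check_command` is a hypothetical helper that only imitates this substitution, and the $USER1$ value used in the test is an assumption based on the plugin directory mentioned above.

```python
def expand_check_command(check_command, command_line, macros):
    """Imitate the macro substitution for one service check.

    check_command is "name!arg1!arg2...", command_line is the template
    from the command definition, macros maps names like "$HOSTNAME$"
    to their values.
    """
    parts = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(parts[1:], start=1):
        values["$ARG%d$" % index] = arg
    expanded = command_line
    for macro, value in values.items():
        expanded = expanded.replace(macro, value)
    return parts[0], expanded
```

This makes it easy to see which arguments check_ganglia.py actually receives for a given service definition before reloading nagios.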
These metrics are worth monitoring in nagios:
the same as man gmond.conf
2.3 Advantages and possible problems
can be found in gmond/modules/