Xiaopei's DokuWiki

Ganglia

Components

gmond

  • What can be monitored? See gmond -m, --metrics 1)
    module        metric         explanation
    core_metrics  gexec          gexec available
    core_metrics  heartbeat      Last heartbeat
    core_metrics  location       Location of the machine
    cpu_module    cpu_aidle      Percent of time since boot idle CPU
    cpu_module    cpu_idle       Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request
    cpu_module    cpu_intr       cpu_intr
    cpu_module    cpu_nice       Percentage of CPU utilization that occurred while executing at the user level with nice priority
    cpu_module    cpu_num        Total number of CPUs
    cpu_module    cpu_sintr      cpu_sintr
    cpu_module    cpu_speed      CPU Speed in terms of MHz
    cpu_module    cpu_system     Percentage of CPU utilization that occurred while executing at the system level
    cpu_module    cpu_user       Percentage of CPU utilization that occurred while executing at the user level
    cpu_module    cpu_wio        Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request
    disk_module   disk_free      Total free disk space
    disk_module   disk_total     Total available disk space
    disk_module   part_max_used  Maximum percent used for all partitions
    load_module   load_fifteen   Fifteen minute load average
    load_module   load_five      Five minute load average
    load_module   load_one       One minute load average
    mem_module    mem_buffers    Amount of buffered memory
    mem_module    mem_cached     Amount of cached memory
    mem_module    mem_free       Amount of available memory
    mem_module    mem_shared     Amount of shared memory
    mem_module    mem_total      Total amount of memory displayed in KBs
    mem_module    swap_free      Amount of available swap memory
    mem_module    swap_total     Total amount of swap space displayed in KBs
    net_module    bytes_in       Number of bytes in per second
    net_module    bytes_out      Number of bytes out per second
    net_module    pkts_in        Packets in per second
    net_module    pkts_out       Packets out per second
    proc_module   proc_run       Total number of running processes
    proc_module   proc_total     Total number of processes
    sys_module    boottime       The last time that the system was started
    sys_module    machine_type   System architecture
    sys_module    mtu            Network maximum transmission unit
    sys_module    os_name        Operating system name
    sys_module    os_release     Operating system release date
    sys_module    sys_clock      Time as reported by the system clock
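  • For gmetad to collect data correctly, hosts running gmond should keep their clocks synchronized
  • gmond --default_config prints the default configuration; man gmond.conf shows the configuration documentation
  • Default port: 8649 2); default multicast channel (mcast_join): 239.2.11.71 3)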

gmetric

ganglia/gmetric - github, This is the official repository for hosting all user-contributed gmetric scripts.

The Ganglia Metric Tool (gmetric) allows you to easily monitor any arbitrary host metrics that you like, expanding on the core metrics that gmond measures by default.

Gmetric sends the metric specified on the commandline to all udp_send_channels specified in the configuration file /etc/ganglia/gmond.conf by default.

All metrics in ganglia have a name, value, type and optionally units. For example, say I wanted to measure the temperature of my CPU (something gmond doesn't do by default); then I could send this metric with name="temperature", value="63", type="int16" and units="Celsius".

Assume I have a program called cputemp which outputs in text the temperature of the CPU

  % cputemp
  63

I could easily send this data to all listening gmonds by running

  % gmetric --name temperature --value `cputemp` --type int16 --units Celsius

Check the exit value of gmetric to see if it successfully sent the data: 0 on success and -1 on failure.

To constantly sample this temperature metric, you just need to add this command to your cron table.

gmetric does not require any modules to be configured in gmond.conf.
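The gmetric call above can also be wrapped in a small script, which is handy when the value comes from code rather than from a shell command. The following is only a sketch: the helper names build_gmetric_command and send_metric are made up for illustration, and it assumes gmetric is on the PATH. It uses only the --name, --value, --type and --units options shown above.

```python
import subprocess


def build_gmetric_command(name, value, metric_type, units=None):
    """Build the argument list for one gmetric invocation (hypothetical helper)."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


def send_metric(name, value, metric_type, units=None):
    """Run gmetric and report whether it exited with status 0.

    Assumes the gmetric binary is installed and on the PATH.
    """
    cmd = build_gmetric_command(name, value, metric_type, units)
    return subprocess.call(cmd) == 0


if __name__ == "__main__":
    # 63 is the sample reading from the cputemp example; only print the
    # command here so the sketch runs on hosts without gmetric.
    print(" ".join(build_gmetric_command("temperature", 63, "int16", "Celsius")))
```

Calling send_metric("temperature", 63, "int16", "Celsius") from a cron-driven script would then have the same effect as the command line above.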

diskio.pl

This script uses gmetric to add disk I/O monitoring to gmond. Make the script executable (chmod +x), then add a crontab entry:

* * * * * root  /etc/ganglia/diskio.pl sda

gmetad

  • Configuration file: gmetad.conf
  • data_source, what to monitor. The most important section of this file.
    # Format: 
    # data_source "my cluster" [polling interval] address1:port address2:port ...
    
    # Examples:
    data_source "my cluster" 10 localhost  my.machine.edu:8649  1.2.3.5:8655
    data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
    data_source "another source" 1.3.4.7:8655  1.3.4.8
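To make the data_source format concrete, here is a rough sketch of how such a line breaks down into a name, an optional polling interval and a list of host:port pairs. It is not gmetad's own parser. The function name parse_data_source is made up, and the 15 second fallback interval is an assumption about gmetad's default, as is treating a purely numeric first token as the interval. The 8649 fallback port is the gmond default mentioned earlier on this page.

```python
import shlex

DEFAULT_PORT = 8649      # gmond's default port
DEFAULT_INTERVAL = 15    # assumed default polling interval, in seconds


def parse_data_source(line):
    """Split one data_source line into name, interval and (host, port) pairs."""
    tokens = shlex.split(line, comments=True)
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = DEFAULT_INTERVAL
    # Heuristic: a bare number right after the name is the polling interval.
    if rest[0].isdigit():
        interval = int(rest[0])
        rest = rest[1:]
    if not rest:
        raise ValueError("data_source %r lists no addresses" % name)
    hosts = []
    for address in rest:
        host, _, port = address.partition(":")
        hosts.append((host, int(port) if port else DEFAULT_PORT))
    return {"name": name, "interval": interval, "hosts": hosts}


if __name__ == "__main__":
    print(parse_data_source('data_source "another source" 1.3.4.7:8655  1.3.4.8'))
```

Under these assumptions the third example above ("another source") has no interval, and its second address falls back to port 8649.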

Web Frontend

The lighttpd configuration is as follows; for the authentication part, see the mod_auth tutorial:

60-ganglia.conf
alias.url += ("/ganglia" => "/usr/share/ganglia-webfrontend")
 
# Authentication
server.modules               += ( "mod_auth" )
 
auth.debug                   = 2
auth.backend                 = "plain"
auth.backend.plain.userfile  = "/path/to/ganglia.pass"
 
auth.require                 = (
  "/ganglia/" => (
    "method"    => "basic",
    "realm"     => "Ganglia Access",
    "require"   => "user=ganglia",
  )
)

User and password:

/path/to/ganglia.pass
ganglia:ganglia

FIXME http://sourceforge.net/apps/trac/ganglia/wiki/ganglia-web-2

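Examples

Using unicast within a cluster

With unicast, a few receivers are designated and the other nodes actively send their data to those receivers instead of listening for multicast traffic. The configuration is as follows.

gmond on the receiver mon1

Key points:

  • send_metadata_interval = 30
  • Removed mcast_join from udp_send_channel and udp_recv_channel
  • Set host in udp_send_channel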
mon1:/etc/ganglia/gmond.conf
globals {
  mute = no
  deaf = no
  send_metadata_interval = 30 /* secs */
}
 
cluster {
  name = "Production"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
 
udp_send_channel {
  host = mon1
  port = 8649
  ttl = 1
}
 
udp_recv_channel { 
  port = 8649
}
 
tcp_accept_channel {
  port = 8649
}

gmond on the other nodes

Key points:

  • send_metadata_interval = 30
  • deaf = yes: telnetting to this node no longer works; the node pushes its data itself
  • Removed mcast_join from udp_send_channel
  • Set host in udp_send_channel
  • Removed udp_recv_channel and tcp_accept_channel
other_nodes:/etc/ganglia/gmond.conf
globals {
  mute = no
  deaf = yes
  send_metadata_interval = 30 /*secs */
}
 
cluster {
  name = "Production"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
 
udp_send_channel {
  host = mon1
  port = 8649
  ttl = 1
}

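Monitoring multiple hosts over the Internet

ganglia is designed for cluster monitoring. By default it uses multicast to find the other nodes on the same network (e.g. mcast_join = 239.2.11.71, port = 8649), then monitors and aggregates them. But if you have nothing as fancy as a cluster to play with, only a handful of servers scattered across different networks, and you learned about ganglia from WO: KTDOT and want to use it (+ nagios) to monitor those servers, the following configuration will suit you. It requires each node to have its own IP; if that is not the case, you can also forward the port to gmetad over ssh…

each cluster

Although each site (project or customer) is usually just a single server, by ganglia convention we still call it a cluster.

Key points: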

gmond.conf
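globals {
  mute = no /* Must not be mute: a mute (but not deaf) node still accepts connections but does not reply with <GANGLIA_XML>...</GANGLIA_XML>.
               If mute = yes is set and telnet still shows output, note that it is only the description of the XML structure */
  deaf = no /* Must not be deaf: a deaf node cannot accept TCP connections */
  send_metadata_interval = 30 /* probably needed */
}
 
/* If the cluster has only one node, configure UDP as below so the node only listens to itself and is not polluted by other nodes on the network */
udp_send_channel {
  host = 127.0.0.1
  port = 8941
}
udp_recv_channel {
  port = 8941
}
 
tcp_accept_channel {
  /* On the Internet, be sure to lock this down */
  acl {
    default = "deny"
    access {
      ip = 123.123.123.123 /* collector's IP; only allow specific IPs */
      mask = 32 /* or use the mask to allow a whole subnet */
      action = "allow"
    }
  }
 
  port = 8941 /* also use a non-default port */
}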
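The acl block above reads as "deny by default, allow exactly one address". A rough sketch of that matching logic, written with Python's ipaddress module, may make the ip/mask pair clearer. This illustrates the semantics only and is not how gmond implements it; acl_action and the rule dictionaries are hypothetical names.

```python
import ipaddress

# Mirrors the acl block above: default deny, one allow rule for the collector.
ACL_DEFAULT = "deny"
ACL_RULES = [
    {"ip": "123.123.123.123", "mask": 32, "action": "allow"},
]


def acl_action(client_ip, rules=ACL_RULES, default=ACL_DEFAULT):
    """Return the action of the first rule whose network contains client_ip."""
    client = ipaddress.ip_address(client_ip)
    for rule in rules:
        # mask is a prefix length: 32 matches one host, 24 a /24 subnet, etc.
        network = ipaddress.ip_network(
            "%s/%d" % (rule["ip"], rule["mask"]), strict=False
        )
        if client in network:
            return rule["action"]
    return default


if __name__ == "__main__":
    for ip in ("123.123.123.123", "123.123.123.124"):
        print(ip, "->", acl_action(ip))
```

With mask = 32 only the collector itself matches; lowering the mask to, say, 24 would admit the whole 123.123.123.0/24 network.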

I previously used the following configuration:

udp_send_channel {
  host = localhost
  port = 8941
}

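But with this configuration, some nodes hit a problem where the web frontend showed the cluster but no graphs. Debugging according to the debugging tips gave the following results:

  1. On gmetad, telnet $GMOND 8649: no <METRIC></METRIC> inside <CLUSTER></CLUSTER>
  2. On gmond, telnet localhost 8649: no <METRIC></METRIC> inside <CLUSTER></CLUSTER>
  3. On gmond, /usr/bin/gstat -a reports Hosts: 0
  4. Running /usr/sbin/gmond -d 2 in debug mode showed that nearly every metric was sent with 1 error (sent message 'metric' of length 56 with 1 errors), and even -d 10 did not show the actual error
  5. Testing with nc showed that local UDP works (of course)

Googling turned up nothing, and comparing the UDP sysctl settings with a healthy node showed no significant difference.

Finally, running lsof on the gmond process revealed that the problem node's local UDP connection used IPv6, while healthy nodes used IPv4.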

$ lsof -p 21272 -n 
# ...
gmond   21272 ganglia    6u  IPv6   14551406      0t0      UDP [::1]:58281->[::1]:8649 

IPv6 was used because localhost in /etc/hosts resolves to both an IPv4 and an IPv6 address.

The fix was to set udp_send_channel { host = 127.0.0.1 } in gmond.conf.
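A quick way to check whether a node is affected is to ask the resolver which address families a host name maps to for UDP. The sketch below does this with socket.getaddrinfo. The function name udp_address_families is made up, and port 8649 is just the default taken from the lsof output above; substitute whatever port the channel uses.

```python
import socket


def udp_address_families(host, port=8649):
    """Return the address families host resolves to for UDP, in resolver order."""
    infos = socket.getaddrinfo(host, port, 0, socket.SOCK_DGRAM)
    families = []
    for family, _type, _proto, _canonname, _sockaddr in infos:
        if family == socket.AF_INET6:
            name = "IPv6"
        elif family == socket.AF_INET:
            name = "IPv4"
        else:
            name = str(family)
        if name not in families:
            families.append(name)
    return families


if __name__ == "__main__":
    for host in ("localhost", "127.0.0.1"):
        try:
            print(host, "->", udp_address_families(host))
        except socket.gaierror as err:
            print(host, "-> resolution failed:", err)
```

If localhost lists IPv6, and especially if it lists it first, the node may show the symptom described above, whereas the literal 127.0.0.1 can only ever be IPv4.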

collector

Set data_source:

gmetad.conf
data_source "gin" 111.111.111.111:8941
data_source "vodka" 111.111.111.222:8941
data_source "whisky" 111.111.222.222:8941

Finally running a real cluster

If one day you finally get to run a real cluster, for quickly configuring the nodes see "Ganglia and Nagios, Part 1: Monitor enterprise clusters with Ganglia".

debugging tips

For gmond

  1. See if the gmond service is running, issue the ps aux|grep gmond command.
  2. Stop the gmond service and run it by hand with debug mode. /etc/init.d/gmond stop; /usr/sbin/gmond -d 2. Look for errors near the top.
  3. Attempt to retrieve the XML data by netcatting to the gmond daemon. nc <hostname> 8649
  4. Confirm that UDP connections can be established between the gmetad and gmond (or gmond and other gmonds for multicast purposes) by running nc -u -l 8653 on the host in question, then echo "hello"|nc -u <hostname> 8653 from the gmetad or another gmond.
  5. Check gmond data using /usr/bin/gstat -a
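Step 3 above (netcatting to gmond) can also be scripted, which makes it easier to spot hosts that report no metrics at all. The following is a sketch only: fetch_xml and summarize are made-up helper names, and the sample XML in the usage section is a hand-written stand-in for real gmond output, reduced to the HOST and METRIC elements.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8649, timeout=5.0):
    """Read everything the daemon writes before it closes the TCP connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def summarize(xml_bytes):
    """Map each HOST name to the number of METRIC elements it reports."""
    root = ET.fromstring(xml_bytes)
    return {
        host.get("NAME"): len(host.findall("METRIC"))
        for host in root.iter("HOST")
    }


if __name__ == "__main__":
    # Made-up sample; against a live daemon use summarize(fetch_xml("<hostname>")).
    sample = (
        b'<GANGLIA_XML><CLUSTER NAME="Production">'
        b'<HOST NAME="mon1"><METRIC NAME="load_one" VAL="0.10"/></HOST>'
        b'<HOST NAME="node2"></HOST>'
        b"</CLUSTER></GANGLIA_XML>"
    )
    print(summarize(sample))
```

A host that maps to 0 here is announced in the XML but carries no metric data.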

For gmetad

  1. See if the gmetad service is running, issue the ps aux|grep gmetad command.
  2. Check syslog for errors. tail /var/log/messages
  3. Stop the gmetad service and run it by hand with debug mode. /etc/init.d/gmetad stop; /usr/sbin/gmetad -d 2. Look for errors near the top.
  4. Ensure that /var/lib/ganglia and its children are owned and writable by the nobody user (ganglia user on Debian/Ubuntu).
  5. Retrieve the XML data by netcatting to the gmetad daemon. nc <hostname> 8651. This information is useful for submitting bug reports.

For the web interface

  1. If the page shows the error fsockopen error: Connection refused, gmetad is most likely not running. If gmetad fails to start, enable debug in gmetad.conf to investigate
  2. Monitor the web server error log. PHP errors will appear here. tail -f /var/log/apache2/error_log.
  3. Ensure that the settings in conf.php are correct. If you are installing from source, don't just copy the web/ folder and rename conf.php.in, and version.php.in, they both have variables in them that need to be set. Run make -C web conf.php version.php or fill in the variables by hand (there are only 2, and both are enclosed by @'s).

ganglia + nagios

  1. The ganglia source package 4) includes a check script for nagios: contrib/check_ganglia.py
  2. However, the script needs some modifications before it works properly, as follows
    check_ganglia.py
    #!/usr/bin/env python
    # Nagios check that reads one metric of one host from ganglia's XML output.
    # Uses syntax that runs under both Python 2.6+ and Python 3.
     
    import sys
    import getopt
    import socket
    import xml.parsers.expat
     
    class GParser:
      def __init__(self, host, metric):
        self.inhost = 0
        self.inmetric = 0
        self.value = None
        self.host = host
        self.metric = metric
     
      def parse(self, file):
        p = xml.parsers.expat.ParserCreate()
        # Bind the handlers to this instance, not to a global variable
        p.StartElementHandler = self.start_element
        p.EndElementHandler = self.end_element
        p.ParseFile(file)
        if self.value is None:
          raise Exception('Host/value not found')
        return float(self.value)
     
      def start_element(self, name, attrs):
        if name == "HOST":
          # At least in gmond 3.1.7 a host cannot be renamed;
          # the gmond configuration can only set the host's location,
          # so matching on location is more convenient here
          if attrs["LOCATION"] == self.host:
            self.inhost = 1
        elif self.inhost == 1 and name == "METRIC" and attrs["NAME"] == self.metric:
          self.value = attrs["VAL"]
     
      def end_element(self, name):
        if name == "HOST" and self.inhost == 1:
          self.inhost = 0
     
    def usage():
      print("""Usage: check_ganglia \
    -h|--host= -m|--metric= -w|--warning= \
    -c|--critical= [-s|--server=] [-p|--port=] """)
      sys.exit(3)
     
    if __name__ == "__main__":
    ##############################################################
      ganglia_host = '127.0.0.1'
     
      # 8649 is the default gmond port
      # ganglia_port = 8649 # gmond port
     
      # 8651 is the default gmetad port
      # gmetad knows the hosts of every data_source
      ganglia_port = 8651
     
      host = None
      metric = None
      warning = None
      critical = None
     
      try:
        options, args = getopt.getopt(sys.argv[1:],
          "h:m:w:c:s:p:",
          ["host=", "metric=", "warning=", "critical=", "server=", "port="],
          )
      except getopt.GetoptError as err:
        print("check_gmond: %s" % err)
        usage()
     
      for o, a in options:
        if o in ("-h", "--host"):
           host = a
        elif o in ("-m", "--metric"):
           metric = a
        elif o in ("-w", "--warning"):
           warning = float(a)
        elif o in ("-c", "--critical"):
           critical = float(a)
        elif o in ("-p", "--port"):
           ganglia_port = int(a)
        elif o in ("-s", "--server"):
           ganglia_host = a
     
      if critical is None or warning is None or metric is None or host is None:
        usage()
     
      try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ganglia_host, ganglia_port))
        parser = GParser(host, metric)
        # Binary mode: expat's ParseFile expects bytes under Python 3
        value = parser.parse(s.makefile("rb"))
        s.close()
      except Exception as err:
        print("CHECKGANGLIA UNKNOWN: Error while getting value \"%s\"" % (err))
        sys.exit(3)
     
      # The original script only warns when a value is too high.
      # To also warn when a value is too low (as with disk_free),
      # compare critical with warning to decide whether to run a
      # less-than check or a greater-than check
      if critical > warning:
        if value >= critical:
          print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
          sys.exit(2)
        elif value >= warning:
          print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
          sys.exit(1)
        else:
          print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
          sys.exit(0)
      else:
        if critical >= value:
          print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
          sys.exit(2)
        elif warning >= value:
          print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
          sys.exit(1)
        else:
          print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
          sys.exit(0)
  3. Copy this script to the nagios plugin directory: /usr/lib/nagios/plugins/
  4. Add execute permission: chmod a+x check_ganglia.py
  5. Edit the nagios configuration to add the ganglia checks (following the configuration layout of <Learning Nagios 3.0>)
    1. commands
      commands/ganglia_commands.cfg
      # Define the command
      define command {
        command_name check_ganglia
        command_line $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
      }
    2. services
      services/ganglia_services.cfg
      # Define the service group
      define servicegroup {
        servicegroup_name ganglia-metrics
        alias Ganglia Metrics
      }
       
      # Define a base (template) service for the hosts/host groups to monitor
      define service {
        use generic-service
        name ganglia-service
        # hostgroup_name dallas-cloud-servers
        # nagios checks hosts (not clusters)
        # host_name in nagios must match the location of the host in ganglia
        host_name localhost
        service_groups ganglia-metrics
        # notifications_enabled 0
        register       0 
      }
       
      # Define a concrete service for each metric
      # See ganglia for the available metrics; the metric must be enabled in ganglia
      define service {
        use ganglia-service
        service_description load_one
        check_command check_ganglia!load_one!4!5
      }
       
      define service {
        use ganglia-service
        service_description disk_free
        # Check how many GB of disk space are left
        check_command check_ganglia!disk_free!10!5
      }
  6. done!

These metrics are worth monitoring in nagios:

  • disk_free!10!5
  • load_one!4!5
  • FIXME
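The two threshold pairs above work in opposite directions: load_one!4!5 alerts when the value rises, disk_free!10!5 when it falls. The sketch below isolates the rule the modified check_ganglia.py uses to pick the direction, so it can be tried without a running ganglia. The function name classify is made up for illustration; the return values are the usual Nagios plugin exit codes.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def classify(value, warning, critical):
    """Return the Nagios state for value, inferring direction from the thresholds."""
    if critical > warning:
        # "High is bad" metrics such as load_one (warning=4, critical=5)
        if value >= critical:
            return CRITICAL
        if value >= warning:
            return WARNING
        return OK
    # "Low is bad" metrics such as disk_free (warning=10, critical=5)
    if value <= critical:
        return CRITICAL
    if value <= warning:
        return WARNING
    return OK


if __name__ == "__main__":
    print("load_one 4.5  ->", classify(4.5, 4, 5))
    print("disk_free 8.0 ->", classify(8.0, 10, 5))
```

One consequence of this rule is that equal warning and critical values are treated as a "low is bad" check.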

refs


1)
See also gmond.conf; more information can be found in gmond/modules/ in the source tree
2)
U*N*I*X on a phone key pad
3)
The author's birthday, 02/11/1971
4)
Download from the ganglia home page: http://ganglia.info/?page_id=66
it/ganglia.txt · Last modified: 2013/08/19 07:22 (external edit)