Xiaopei's DokuWiki


Elasticsearch

Choosing options before and while using ES

Installation

docker-compose.yml
version: '2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.4.1
    container_name: elasticsearch
    environment:
      - http.host=0.0.0.0
      - transport.host=127.0.0.1
    volumes:
      - ./es:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
  grafana:
    image: grafana/grafana
    ports:
      - 3000:3000

http://localhost:9200/ (default username / password: elastic / changeme)

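Tutorial

Inserting data

Watch out for an index that has no explicit mapping or template: once a field has been given a type (for example text), all later data for that field is inserted with that type.

This easily leads to fields that cannot be aggregated.

The type of existing data cannot be changed in place; you have to re-index into a new index.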
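
A minimal sketch of that workaround, not taken from the original notes: create a new index with an explicit mapping, then copy the old documents with the `_reindex` API. The index names `cars_v1` and `cars_v2`, the type name and the field types are assumptions made for illustration; the body layout assumes the 5.x format, where a type level sits under `mappings`. The script only builds and prints the request bodies.

```python
import json


def explicit_mapping(type_name, fields):
    """Build a create-index body with an explicit mapping (5.x layout)."""
    properties = {name: {"type": es_type} for name, es_type in fields.items()}
    return {"mappings": {type_name: {"properties": properties}}}


def reindex_body(source_index, dest_index):
    """Build the body for POST /_reindex."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# Hypothetical index, type and field names, chosen only for this example.
mapping = explicit_mapping(
    "transactions",
    {"color": "keyword", "make": "keyword", "price": "long", "sold": "date"},
)
copy_request = reindex_body("cars_v1", "cars_v2")

print("PUT /cars_v2")
print(json.dumps(mapping, indent=2))
print("POST /_reindex")
print(json.dumps(copy_request, indent=2))
```

Declaring `color` and `make` as keyword up front is one way to keep them aggregatable without enabling fielddata.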

Search

Initial test data: https://gist.github.com/clintongormley/8579281

Data hierarchy: /{index}/{type}/{id}

# List the existing indices
$ http get http://localhost:9200/_aliases/
{
   "gb" : {
      "aliases" : {}
   },
   "us" : {
      "aliases" : {}
   }
}
 
# Show which types an index contains and which fields the documents of each type have
$ http get http://localhost:9200/gb
{
   "gb" : {
      "mappings" : {
         "tweet" : {
            "properties" : {
               "name" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               ...
            }
         },
         "user" : {
            "properties" : {
               "name" : {
                  "fields" : {
                     "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                     }
                  },
                  "type" : "text"
               },
               ...
            }
         }
      },
      "settings" : {
         "index" : {
            "version" : {
               "created" : "5040199"
            },
            "provided_name" : "gb",
            "creation_date" : "1496644515265",
            "uuid" : "edAUIhTuR3K18TGe9lTrhQ",
            "number_of_shards" : "5",
            "number_of_replicas" : "1"
         }
      },
      "aliases" : {}
   }
}
 
# Empty search within one index
$ http http://localhost:9200/gb/_search
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":7,"max_score":1.0,"hits":[{"_index":"gb","_type":"tweet","_id":"5","_score":1.0,"_source":
{
   "date" : "2014-09-15",
   "name" : "Mary Jones",
   "tweet" : "However did I manage before Elasticsearch?",
   "user_id" : 2
}
...
},{"_index":"gb","_type":"tweet","_id":"11","_score":1.0,"_source":
{
   "date" : "2014-09-21",
   "name" : "Mary Jones",
   "tweet" : "Elasticsearch is built for the cloud, easy to scale",
   "user_id" : 2
}
}]}}
 
# Search the same type across several indices
$ http http://localhost:9200/gb,us/user/_search
{"took":8,"timed_out":false,"_shards":{"total":10,"successful":10,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"gb","_type":"user","_id":"2","_score":1.0,"_source":
{
   "email" : "mary@jones.com",
   "name" : "Mary Jones",
   "username" : "@mary"
}
},{"_index":"us","_type":"user","_id":"1","_score":1.0,"_source":
{
   "email" : "john@smith.com",
   "name" : "John Smith",
   "username" : "@john"
}
}]}}
 
# Pagination
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
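
The three requests above follow the rule `from = (page - 1) * size`. A small sketch of that arithmetic; the helper name `page_params` is hypothetical and the 1-based page numbering is an assumption of this example:

```python
from urllib.parse import urlencode


def page_params(page, size=5):
    """Return the query string for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return urlencode({"size": size, "from": (page - 1) * size})


for page in (1, 2, 3):
    print("GET /_search?" + page_params(page))
```

The first page comes out with an explicit `from=0`, which is the same as leaving `from` out.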
 
# Lightweight (query-string) search
# Find all documents of type tweet whose tweet field contains the word elasticsearch:
GET /_all/tweet/_search?q=tweet:elasticsearch
 
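# +name:john +tweet:mary
# Finds documents that contain john in the name field *and* mary in the tweet field.
# A + prefix means the condition must match; likewise, a - prefix means it must not match.
# Conditions without + or - are optional: the more of them match, the more relevant the document.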
$ http "localhost:9200/_search?q=%2Bname%3Ajohn+%2Btweet%3Amary"
{"took":7,"timed_out":false,"_shards":{"total":18,"successful":18,"failed":0},"hits":{"total":1,"max_score":0.80316985,"hits":[{"_index":"us","_type":"tweet","_id":"4","_score":0.80316985,"_source":
{
   "date" : "2014-09-14",
   "name" : "John Smith",
   "tweet" : "@mary it is not just text, it does everything",
   "user_id" : 1
}
}]}}
 
# +name:(mary john) +date:>2014-09-10 +(aggregations geo) means:
# the name field contains mary or john
# the date value is later than 2014-09-10
# the _all field contains aggregations or geo
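
Query strings such as `+name:john +tweet:mary` have to be percent-encoded before they go into a URL, because `+` and `:` are reserved there. A sketch using only the Python standard library; it is an illustration, not part of the original session:

```python
from urllib.parse import parse_qs, urlencode

query = "+name:john +tweet:mary"

# urlencode escapes "+" as %2B and ":" as %3A, and writes a space as "+".
encoded = urlencode({"q": query})
print("localhost:9200/_search?" + encoded)

# Decoding gives back the original query string.
decoded = parse_qs(encoded)["q"][0]
print(decoded)
```

The encoded form is the same string that is passed to HTTPie in the example above.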
 
 
# Send the query in the HTTP body
# Both GET and POST work, but some HTTP clients do not support a GET with a body
$ http -vvv --auth=elastic:changeme http://localhost:9200/_search < f.json
POST /_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 92
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    }
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 18,
        "total": 18
    },
    "hits": {
        "hits": [
            {
                "_id": "13",
                "_index": "gb",
                "_score": 0.7081689,
                "_source": {
                    "date": "2014-09-23",
                    "name": "Mary Jones",
                    "tweet": "So yes, I am an Elasticsearch fanboy",
                    "user_id": 2
                },
                "_type": "tweet"
            },
            ...
            {
                "_id": "12",
                "_index": "us",
                "_score": 0.15716305,
                "_source": {
                    "date": "2014-09-22",
                    "name": "John Smith",
                    "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her.",
                    "user_id": 1
                },
                "_type": "tweet"
            }
        ],
        "max_score": 0.7081689,
        "total": 7
    },
    "timed_out": false,
    "took": 5
}
 
 
# Combine conditions with a bool query
$ http -vvv --auth=elastic:changeme http://localhost:9200/_search < f.json
 
POST /_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 162
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "tweet": "elasticsearch"
                }
            },
            "must_not": {
                "match": {
                    "name": "mary"
                }
            }
        }
    }
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 18,
        "total": 18
    },
    "hits": {
        "hits": [
            {
                "_id": "6",
                "_index": "us",
                "_score": 0.6395861,
                "_source": {
                    "date": "2014-09-16",
                    "name": "John Smith",
                    "tweet": "The Elasticsearch API is really easy to use",
                    "user_id": 1
                },
                "_type": "tweet"
            },
            ...
            {
                "_id": "12",
                "_index": "us",
                "_score": 0.15716305,
                "_source": {
                    "date": "2014-09-22",
                    "name": "John Smith",
                    "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her.",
                    "user_id": 1
                },
                "_type": "tweet"
            }
        ],
        "max_score": 0.6395861,
        "total": 3
    },
    "timed_out": false,
    "took": 3
}

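Aggregations (aggs)

You may run into the "Fielddata is disabled on text fields by default" error; the fix is shown further down this page.

Pipeline Aggregations | Elasticsearch Reference [5.4] | Elastic: metrics that are computed from other metrics

  • derivative: the difference between the previous bucket's metric and the current bucket's metric
  • cumulative_sum: the running total of a metric up to the current bucket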
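
What the two pipeline aggregations compute can be sketched in plain Python. The per-bucket values below are hypothetical, and the functions only mimic the arithmetic; they do not call Elasticsearch:

```python
from itertools import accumulate


def derivative(values):
    """Current bucket minus previous bucket; the first bucket has no value."""
    if not values:
        return []
    return [None] + [cur - prev for prev, cur in zip(values, values[1:])]


def cumulative_sum(values):
    """Running total of the metric up to and including each bucket."""
    return list(accumulate(values))


bucket_sums = [80000, 25000, 0, 0, 30000]  # hypothetical per-bucket sums
print(derivative(bucket_sums))      # [None, -55000, -25000, 0, 30000]
print(cumulative_sum(bucket_sums))  # [80000, 105000, 105000, 105000, 135000]
```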

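Basics

Set up the car sales data: "尝试聚合" (Aggregation Test-Drive) | Elasticsearch: The Definitive Guide | Elastic

You may run into the "Fielddata is disabled on text fields by default" error

# Fix: enable fielddata on the text field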
$ http -vvv --auth=elastic:changeme put http://localhost:9200/cars/_mapping/transactions < f.json
PUT /cars/_mapping/transactions HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 98
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "properties": {
        "color": {
            "fielddata": true,
            "type": "text"
        }
    }
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "acknowledged": true
}
 
# If other fields raise the same error, update_all_types is needed
PUT /cars/_mapping/transactions?update_all_types HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 117
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "properties": {
        "make": {
            "fielddata": true,
            "type": "text"
        }
    }
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "acknowledged": true
}
# Aggregate by color
POST /cars/transactions/_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 153
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "aggs": {
        "popular_colors": {
            "terms": {
                "field": "color"
            }
        }
    },
    "size": 0
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "popular_colors": {
            "buckets": [
                {
                    "doc_count": 4,
                    "key": "red"
                },
                {
                    "doc_count": 2,
                    "key": "blue"
                },
                {
                    "doc_count": 2,
                    "key": "green"
                }
            ],
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 9
    },
    "timed_out": false,
    "took": 67
}
 
 
# Add a metric
POST /cars/transactions/_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 275
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "aggs": {
        "colors": {
            "aggs": {    // add a metric
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            },
            "terms": {
                "field": "color"
            }
        }
    },
    "size": 0
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "avg_price": {    // the metric comes back per bucket
                        "value": 32500.0
                    },
                    "doc_count": 4,
                    "key": "red"
                },
                {
                    "avg_price": {
                        "value": 20000.0
                    },
                    "doc_count": 2,
                    "key": "blue"
                },
                {
                    "avg_price": {
                        "value": 21000.0
                    },
                    "doc_count": 2,
                    "key": "green"
                }
            ],
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 9
    },
    "timed_out": false,
    "took": 1
}
 
 
# Nested buckets
# Nested results
POST /cars/transactions/_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 392
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
  "aggs": {
    "colors": {    // aggregation name
      "aggs": {
        "avg_price": {    // aggregation name
          "avg": {
            "field": "price"    // source field
          }
        },
        "make": {    // aggregation name
          "terms": {
            "field": "make"    // source field
          }
        }
      },
      "terms": {
        "field": "color"    // source field
      }
    }
  },
  "size": 0
}
 
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "avg_price": {
                        "value": 32500.0
                    },
                    "doc_count": 4,
                    "key": "red",
                    "make": {
                        "buckets": [
                            {
                                "doc_count": 3,
                                "key": "honda"
                            },
                            {
                                "doc_count": 1,
                                "key": "bmw"
                            }
                        ],
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0
                    }
                },
                {
                    "avg_price": {
                        "value": 20000.0
                    },
                    "doc_count": 2,
                    "key": "blue",
                    "make": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "ford"
                            },
                            {
                                "doc_count": 1,
                                "key": "toyota"
                            }
                        ],
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0
                    }
                },
                {
                    "avg_price": {
                        "value": 21000.0
                    },
                    "doc_count": 2,
                    "key": "green",
                    "make": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "ford"
                            },
                            {
                                "doc_count": 1,
                                "key": "toyota"
                            }
                        ],
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0
                    }
                }
            ],
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 9
    },
    "timed_out": false,
    "took": 2
}
 
# Compute the lowest and highest price for each car manufacturer
GET /cars/transactions/_search
{
   "size" : 0,
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": { "avg": { "field": "price" }
            },
            "make" : {
                "terms" : {
                    "field" : "make"
                },
                "aggs" : {
                    "min_price" : { "min": { "field": "price"} },
                    "max_price" : { "max": { "field": "price"} }
                }
            }
         }
      }
   }
}
 
{
...
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "red",
               "doc_count": 4,
               "make": {
                  "buckets": [
                     {
                        "key": "honda",
                        "doc_count": 3,
                        "min_price": {
                           "value": 10000
                        },
                        "max_price": {
                           "value": 20000
                        }
                     },
                     {
                        "key": "bmw",
                        "doc_count": 1,
                        "min_price": {
                           "value": 80000
                        },
                        "max_price": {
                           "value": 80000
                        }
                     }
                  ]
               },
               "avg_price": {
                  "value": 32500
               }
            },
...
 
 
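# With these two buckets, the query result can be read as follows:
# - There are four red cars.
# - The average price of a red car is $32,500.
# - Three of the red cars are made by Honda and one by BMW.
# - The cheapest red Honda sold for $10,000.
# - The most expensive red Honda sold for $20,000.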
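
To make the nesting concrete, the same grouping can be sketched in plain Python. The records below are an assumed copy of the guide's sample transactions (they reproduce the averages shown above) and are not read from the author's index:

```python
from collections import defaultdict
from statistics import mean

# Assumed sample records: (price, color, make).
CARS = [
    (10000, "red", "honda"), (20000, "red", "honda"), (30000, "green", "ford"),
    (15000, "blue", "toyota"), (12000, "green", "toyota"), (20000, "red", "honda"),
    (80000, "red", "bmw"), (25000, "blue", "ford"),
]


def colors_report(cars):
    """Mimic terms(color) > [avg(price), terms(make) > [min, max]]."""
    by_color = defaultdict(list)
    for price, color, make in cars:
        by_color[color].append((price, make))
    report = {}
    for color, rows in by_color.items():
        by_make = defaultdict(list)
        for price, make in rows:
            by_make[make].append(price)
        report[color] = {
            "doc_count": len(rows),
            "avg_price": mean([price for price, _ in rows]),
            "make": {
                make: {"doc_count": len(ps), "min_price": min(ps), "max_price": max(ps)}
                for make, ps in by_make.items()
            },
        }
    return report


report = colors_report(CARS)
print(report["red"])
```

Each level of `aggs` in the request corresponds to one level of grouping here: the outer loop is the `colors` bucket, the inner dictionary is the `make` bucket.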

Building data for charts

histogram: bar charts (by interval or category)

# Bar chart (histogram)
POST /cars/transactions/_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 302
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "aggs": {
        "price": {
            "aggs": {
                "revenue": {    // metric
                    "sum": {
                        "field": "price"
                    }
                }
            },
            "histogram": {    // bar-chart buckets
                "field": "price",
                "interval": 20000    // bucket width
            }
        }
        // doc_count is returned as the default metric
 
    },
    "size": 0
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "price": {
            "buckets": [
                {
                    "doc_count": 3,    // document count, the default metric
                    "key": 0.0,    // covers 0 to 20,000
                    "revenue": {
                        "value": 37000.0    // total sales amount
                    }
                },
                {
                    "doc_count": 4,
                    "key": 20000.0,    // covers 20,000 to 40,000
                    "revenue": {
                        "value": 95000.0
                    }
                },
                {
                    "doc_count": 0,
                    "key": 40000.0,
                    "revenue": {
                        "value": 0.0
                    }
                },
                {
                    "doc_count": 0,
                    "key": 60000.0,
                    "revenue": {
                        "value": 0.0
                    }
                },
                {
                    "doc_count": 1,
                    "key": 80000.0,
                    "revenue": {
                        "value": 80000.0
                    }
                }
            ]
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 9
    },
    "timed_out": false,
    "took": 16
}
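
The bucket keys in the response come from rounding each price down to a multiple of the interval. A sketch of that arithmetic in plain Python; the price list is an assumed copy of the guide's sample data, and empty buckets between the lowest and highest key are filled in to match the response above:

```python
# Assumed sample prices; not read from a live index.
PRICES = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def histogram(prices, interval):
    """Bucket prices by interval and sum the revenue per bucket."""
    if not prices:
        return []
    counts, sums = {}, {}
    for price in prices:
        key = (price // interval) * interval  # round down to the bucket start
        counts[key] = counts.get(key, 0) + 1
        sums[key] = sums.get(key, 0) + price
    keys = range(min(counts), max(counts) + interval, interval)
    return [
        {"key": k, "doc_count": counts.get(k, 0), "revenue": sums.get(k, 0)}
        for k in keys
    ]


buckets = histogram(PRICES, 20000)
for bucket in buckets:
    print(bucket)
```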
 
 
# Use the built-in extended_stats metric
GET /cars/transactions/_search
{
  "size" : 0,
  "aggs": {
    "makes": {
      "terms": {
        "field": "make",
        "size": 10
      },
      "aggs": {
        "stats": {
          "extended_stats": {
            "field": "price"
          }
        }
      }
    }
  }
}
 
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "makes": {
            "buckets": [
                {
                    "doc_count": 3,
                    "key": "honda",
                    "stats": {    // a single extended_stats
                        "avg": 16666.666666666668,    // returns a whole set of metrics,
                        "count": 3,    // enough to draw a bar chart per make
                        "max": 20000.0,
                        "min": 10000.0,
                        "std_deviation": 4714.045207910315,
                        "std_deviation_bounds": {
                            "lower": 7238.5762508460375,
                            "upper": 26094.757082487296
                        },
                        "sum": 50000.0,
                        "sum_of_squares": 900000000.0,
                        "variance": 22222222.22222221
                        // the standard error can be derived: std_err = std_deviation / sqrt(count)
                    }
                },
                ...
            ],
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 9
    },
    "timed_out": false,
    "took": 4
}
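
The numbers in the honda bucket can be checked by hand. The sketch below assumes population (not sample) variance and bounds of two standard deviations around the mean, which is what the figures above are consistent with; the three prices follow from the count, min, max and sum in the response:

```python
from math import sqrt


def extended_stats(values, sigma=2.0):
    """Population statistics over a non-empty list of numbers."""
    count = len(values)
    avg = sum(values) / count
    variance = sum((v - avg) ** 2 for v in values) / count  # population variance
    std = sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": sum(values),
        "sum_of_squares": sum(v * v for v in values),
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {
            "lower": avg - sigma * std,
            "upper": avg + sigma * std,
        },
    }


honda = extended_stats([10000, 20000, 20000])
print(honda["std_deviation"])
print(honda["std_deviation_bounds"])
```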

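date_histogram: statistics over time

Date format: format | Elasticsearch Reference [5.4] | Elastic

Can a regular histogram be used for time analysis?

Technically, yes. A regular histogram bucket can handle dates, but it is not date-aware. With date_histogram you can specify an interval such as 1 month, and it knows that February has fewer days than December. date_histogram has another advantage: it handles time zones sensibly, so charts can be tailored to the client's time zone rather than the server's.

A regular histogram treats dates as numbers, which means the interval has to be given in milliseconds. The aggregation also knows nothing about calendar intervals, which makes it nearly useless for dates.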
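
A quick standard-library check of why a fixed numeric interval cannot stand in for "one month": the gap between consecutive month starts, in epoch milliseconds, is not constant. The helper `epoch_ms` is a hypothetical name used only in this sketch:

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000


def epoch_ms(year, month, day=1):
    """Epoch milliseconds for midnight UTC on the given date."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)


jan, feb, mar = epoch_ms(2014, 1), epoch_ms(2014, 2), epoch_ms(2014, 3)
print(jan)                    # 1388534400000
print((feb - jan) // DAY_MS)  # 31
print((mar - feb) // DAY_MS)  # 28
```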

# How many cars were sold each month?
POST /cars/transactions/_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 201
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "aggs": {
        "sales": {
            "date_histogram": {
                "field": "sold",
                "format": "yyyy-MM-dd",
                "interval": "month"
            }
        }
    },
    "size": 0
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "sales": {
            "buckets": [
                {
                    "doc_count": 1,
                    "key": 1388534400000,
                    "key_as_string": "2014-01-01"
                },
                {
                    "doc_count": 1,
                    "key": 1391212800000,
                    "key_as_string": "2014-02-01"
                },
                {
                    "doc_count": 0,
                    "key": 1393632000000,
                    "key_as_string": "2014-03-01"
                },
                {
                    "doc_count": 0,
                    "key": 1396310400000,
                    "key_as_string": "2014-04-01"
                },
                {
                    "doc_count": 1,
                    "key": 1398902400000,
                    "key_as_string": "2014-05-01"
                },
                {
                    "doc_count": 0,
                    "key": 1401580800000,
                    "key_as_string": "2014-06-01"
                },
                {
                    "doc_count": 1,
                    "key": 1404172800000,
                    "key_as_string": "2014-07-01"
                },
                {
                    "doc_count": 1,
                    "key": 1406851200000,
                    "key_as_string": "2014-08-01"
                },
                {
                    "doc_count": 0,
                    "key": 1409529600000,
                    "key_as_string": "2014-09-01"
                },
                {
                    "doc_count": 1,
                    "key": 1412121600000,
                    "key_as_string": "2014-10-01"
                },
                {
                    "doc_count": 2,
                    "key": 1414800000000,
                    "key_as_string": "2014-11-01"
                }
            ]
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 9
    },
    "timed_out": false,
    "took": 12
}
 
 
 
# Compute the total sales amount for each make (per quarter).
# Also compute the total sales amount across all makes.
GET /cars/transactions/_search
{
   "size" : 0,
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "quarter",
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0,
            "extended_bounds" : {
                "min" : "2014-01-01",
                "max" : "2014-12-31"
            }
         },
         "aggs": {
            "per_make_sum": {
               "terms": {
                  "field": "make"
               },
               "aggs": {
                  "sum_price": {
                     "sum": { "field": "price" }
                  }
               }
            },
            "total_sum": {
               "sum": { "field": "price" }
            }
         }
      }
   }
}
 
{
....
"aggregations": {
   "sales": {
      "buckets": [
         {
            "key_as_string": "2014-01-01",
            "key": 1388534400000,
            "doc_count": 2,
            "total_sum": {
               "value": 105000
            },
            "per_make_sum": {
               "buckets": [
                  {
                     "key": "bmw",
                     "doc_count": 1,
                     "sum_price": {
                        "value": 80000
                     }
                  },
                  {
                     "key": "ford",
                     "doc_count": 1,
                     "sum_price": {
                        "value": 25000
                     }
                  }
               ]
            }
         },
...
}

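Scoping aggregations

The aggregation examples so far left out the query; the whole request was nothing but an aggregation.

Aggregations can run together with a search request, but that requires one new concept: scope. By default, aggregations operate in the same scope as the query; in other words, they are computed over the set of documents that the query matches.

Leaving out the query is equivalent to querying all documents:

# ES turns the following request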
GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}
 
# into
GET /cars/transactions/_search
{
    "size" : 0,
    "query" : {
        "match_all" : {}
    },
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}

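Because an aggregation always operates on the results within the query scope, a standalone aggregation effectively operates on the results of match_all, that is, on all documents.

With scope we can ask questions such as "How many colors do the Ford cars on sale come in?" Simply add a query to the request (a match query in this example):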

GET /cars/transactions/_search
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}
 
# Because "size" : 0 was not specified, both the search hits and the aggregation results are returned:
{
...
   "hits": {
      "total": 2,
      "max_score": 1.6931472,
      "hits": [
         {
            "_source": {
               "price": 25000,
               "color": "blue",
               "make": "ford",
               "sold": "2014-02-12"
            }
         },
         {
            "_source": {
               "price": 30000,
               "color": "green",
               "make": "ford",
               "sold": "2014-05-18"
            }
         }
      ]
   },
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "blue",
               "doc_count": 1
            },
            {
               "key": "green",
               "doc_count": 1
            }
         ]
      }
   }
}

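Returning hits and aggregations together is essential for polished dashboards. Adding a search bar turns any static dashboard into a real-time data exploration tool: users can search the data and watch all the charts update in real time.

global: {} (the global bucket)

The global bucket contains all documents and ignores the scope of the query. Because it is still a bucket, aggregations can be nested inside it as usual.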

POST /cars/transactions/_search HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==
Connection: keep-alive
Content-Length: 399
Content-Type: application/json
Host: localhost:9200
User-Agent: HTTPie/0.9.9
 
{
    "aggs": {
        "all": {
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            },
            "global": {}
        },
        "single_avg_price": {
            "avg": {
                "field": "price"
            }
        }
    },
    "query": {
        "match": {
            "make": "ford"
        }
    },
    "size": 0
}
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "all": {
            "avg_price": {
                "value": 26500.0
            },
            "doc_count": 9
        },
        "single_avg_price": {
            "value": 27500.0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 2
    },
    "timed_out": false,
    "took": 11
}
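
The difference between the two averages can be reproduced in plain Python. The records are an assumed copy of the guide's sample data with eight priced transactions, so document counts need not match the response above; the averages do:

```python
from statistics import mean

# Assumed sample records: (price, make); not read from a live index.
CARS = [
    (10000, "honda"), (20000, "honda"), (30000, "ford"), (15000, "toyota"),
    (12000, "toyota"), (20000, "honda"), (80000, "bmw"), (25000, "ford"),
]


def scoped_and_global_avg(cars, make):
    """Average price inside the query scope and across all documents."""
    in_scope = [price for price, m in cars if m == make]  # what the query matches
    everything = [price for price, _ in cars]              # what the global bucket sees
    return mean(in_scope), mean(everything)


single_avg_price, global_avg_price = scoped_and_global_avg(CARS, "ford")
print(single_avg_price, global_avg_price)
```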

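Filtering: filter inside query (affects both search results and aggregations)

To find all cars priced at $10,000 or more and also compute their average price, simply use a constant_score query with a filter clause.

As the guide discusses in earlier chapters, using a non-scoring query is fundamentally no different from using a match query here: the query (which includes a filter) returns a subset of documents, and the aggregation operates on exactly those documents. A filtering query skips scoring and its results may be cached, which is a plus.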

GET /cars/transactions/_search
{
    "size" : 0,
    "query" : {
        "constant_score": {    // unlike match, no scoring
            "filter": {
                "range": {
                    "price": {
                        "gte": 10000
                    }
                }
            }
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }
        }
    }
}

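Filter bucket: filter inside aggs (affects aggregations only)

Suppose we are building a search page for a car dealer. We want to show the user's search results (all time), but also enrich the page with extra information, including the average price of (matching) cars sold in the last month.

Simple scoping does not work here because there are two different criteria: the search results must match ford, while the aggregation must match ford AND sold > now - 1M.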

GET /cars/transactions/_search
{
   "size" : 0,
   "query":{
      "match": {
         "make": "ford"
      }
   },
   "aggs":{
      "recent_sales": {
         "filter": {     # the filter bucket applies a filter on top of the query scope
            "range": {
               "sold": {
                  "from": "now-1M"
               }
            }
         },
         "aggs": {
            "average_price":{
               "avg": {   # avg only covers documents that are ford and were sold in the last month
                  "field": "price"
               }
            }
         }
      }
   }
}
 
 
 
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {  # last month's data
        "recent_sales": {
            "average_price": {
                "value": null
            },
            "doc_count": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 2     # data for all time
    },
    "timed_out": false,
    "took": 13
}

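Post filter (the opposite of the filter bucket): post_filter (affects search results only)

So far we can filter both search results and aggregations (a non-scoring filter query), and filter just part of the aggregations (the filter bucket).

You may wonder: "What about filtering only the search results, not the aggregations?" The answer is post_filter.

Let's design another search page for the car dealer, one that lets users search for cars and also filter by color. The color options are obtained through an aggregation: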

GET /cars/transactions/_search
{
    "size" : 0,
    "query": {
        "match": {
            "make": "ford"
        }
    },
    "post_filter": {
        "term" : {
            "color" : "green"
        }
    },
    "aggs" : {
        "all_colors": {
            "terms" : { "field" : "color" }
        }
    }
}
 
{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "all_colors": {
            "buckets": [
                {
                    "doc_count": 1,
                    "key": "blue"
                },
                {
                    "doc_count": 1,
                    "key": "green"
                }
            ],
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 1
    },
    "timed_out": false,
    "took": 6
}
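
The order of operations behind that response can be sketched in plain Python: the aggregation sees everything the query matched, and the post filter trims only the hits afterwards. The records are an assumed copy of the guide's sample data, and the function is a simulation, not an Elasticsearch client call:

```python
from collections import Counter

# Assumed sample records; not read from a live index.
CARS = [
    {"price": 10000, "color": "red", "make": "honda"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue", "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 80000, "color": "red", "make": "bmw"},
    {"price": 25000, "color": "blue", "make": "ford"},
]


def search_with_post_filter(cars, make, color):
    """Query by make, aggregate colors, then post-filter the hits by color."""
    matched = [car for car in cars if car["make"] == make]  # query scope
    all_colors = dict(Counter(car["color"] for car in matched))  # aggs use the query scope
    hits = [car for car in matched if car["color"] == color]  # post_filter runs last
    return {"hits": hits, "all_colors": all_colors}


result = search_with_post_filter(CARS, "ford", "green")
print(len(result["hits"]), result["all_colors"])
```

The color facet still lists blue even though the hit list only contains green cars, which is what a color picker on a search page needs.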

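Sorting multi-value buckets

Multi-value buckets (terms, histogram and date_histogram) generate many buckets dynamically. How does Elasticsearch decide the order in which they are presented?

By default, terms buckets are ordered by doc_count, descending (the histogram responses above come back in key order). This is a sensible default because we usually want the largest values for the criterion at hand: price, population, frequency. Sometimes, though, we want to change the order, and different bucket types handle this differently.

Built-in sort orders (order)

  • _count: sort by document count. Works with terms, histogram and date_histogram.
  • _term: sort alphabetically by the string value of the term. Only for terms.
  • _key: sort by the numeric value of each bucket's key (conceptually similar to _term). Only for histogram and date_histogram.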
GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "_count" : "asc" 
              }
            }
        }
    }
}
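
As a rough illustration of what the `order` option changes, the sketch below buckets values in memory and sorts the buckets by count or by term. The `COLORS` data and the `terms_buckets` helper are hypothetical and only mimic the `_count` and `_term` options described above.

```python
from collections import Counter

# Hypothetical color values, one per sold car (not the wiki's index data).
COLORS = ["red", "red", "green", "blue", "green", "red", "red"]


def terms_buckets(values, order_by="_count", direction="desc"):
    """Group values into buckets, then order them like the terms `order` option."""
    counts = Counter(values)
    if order_by == "_count":
        sort_key = lambda item: item[1]  # doc_count
    elif order_by == "_term":
        sort_key = lambda item: item[0]  # the term string
    else:
        raise ValueError("unsupported order: " + order_by)
    ordered = sorted(counts.items(), key=sort_key, reverse=(direction == "desc"))
    return [{"key": key, "doc_count": count} for key, count in ordered]


default_order = terms_buckets(COLORS)               # doc_count descending
ascending = terms_buckets(COLORS, "_count", "asc")  # like "_count": "asc"
print([b["key"] for b in default_order])
print([b["key"] for b in ascending])
```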

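Sorting by a metric

Sometimes we want to sort by the computed value of a metric. In our car sales analytics dashboard, we may want a bar chart of sales by car color, with the bars ordered by average price, ascending.

We add a metric to the bucket and then reference that metric from the order parameter: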

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "avg_price" : "asc" 
              }
            },
            "aggs": {
                "avg_price": {
                    "avg": {"field": "price"} 
                }
            }
        }
    }
}
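
The same idea as an in-memory sketch: compute a metric per bucket first, then sort the buckets by it. The `CARS` rows and the `colors_by_avg_price` helper are hypothetical and only illustrate the ordering, not how Elasticsearch executes it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows, not the wiki's index data.
CARS = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
    {"color": "green", "price": 12000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 25000},
]


def colors_by_avg_price(docs, direction="asc"):
    # Bucket step: collect the prices of each color.
    prices = defaultdict(list)
    for doc in docs:
        prices[doc["color"]].append(doc["price"])
    # Metric step: one avg_price value per bucket.
    buckets = [
        {"key": color, "doc_count": len(vals), "avg_price": mean(vals)}
        for color, vals in prices.items()
    ]
    # Order step: sort buckets by the metric instead of by doc_count.
    buckets.sort(key=lambda b: b["avg_price"], reverse=(direction == "desc"))
    return buckets


for bucket in colors_by_avg_price(CARS):
    print(bucket)
```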

To sort by a multivalue metric (such as extended_stats), use a dot path that names the value of interest:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "stats.variance" : "asc" 
              }
            },
            "aggs": {
                "stats": {
                    "extended_stats": {"field": "price"}
                }
            }
        }
    }
}

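Sorting based on "deep" metrics

Create a histogram of car prices, but order the buckets by the price variance of the red and green (not blue) cars in each bucket: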

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "histogram" : {
              "field" : "price",
              "interval": 20000,
              "order": {
                "red_green_cars>stats.variance" : "asc"    # sort the histogram buckets by the variance of a nested metric
              }
            },
            "aggs": {
                "red_green_cars": {
                    "filter": { "terms": {"color": ["red", "green"]}},     # filter is a single-value bucket, so nested sorting is allowed
                    "aggs": {
                        "stats": {"extended_stats": {"field" : "price"}}   # sort on the stats generated by this metric
                    }
                }
            }
        }
    }
}

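This example shows how to reach a nested metric. The stats metric is a child of the red_green_cars aggregation, which in turn is a child of the colors aggregation. To sort on that metric we define the path red_green_cars>stats.variance. This works because the filter bucket is a single-value bucket.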
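
A minimal in-memory sketch of the same nesting, assuming hypothetical sample rows: price buckets on the outside, a red/green filter inside each bucket, and the population variance of the filtered prices as the sort key. The helper name `histogram_by_red_green_variance` is made up for illustration, and unlike a real histogram this sketch only emits non-empty buckets.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical sample rows, not the wiki's index data.
CARS = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
    {"color": "green", "price": 12000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 25000},
]


def histogram_by_red_green_variance(docs, interval=20000):
    # Outer bucket: histogram on price.
    by_bucket = defaultdict(list)
    for doc in docs:
        by_bucket[doc["price"] // interval * interval].append(doc)
    buckets = []
    for key, members in by_bucket.items():
        # Inner single-value bucket: keep only red and green cars.
        prices = [d["price"] for d in members if d["color"] in ("red", "green")]
        # Metric on the inner bucket; None when the filter matched nothing.
        variance = pvariance(prices) if prices else None
        buckets.append({"key": key, "doc_count": len(members), "variance": variance})
    # Sort ascending by the nested metric, buckets without a value last.
    buckets.sort(key=lambda b: (b["variance"] is None, b["variance"] or 0))
    return buckets


for bucket in histogram_by_red_green_variance(CARS):
    print(bucket)
```

The path only makes sense because each histogram bucket contains exactly one red_green_cars bucket; with a multivalue inner bucket there would be several variances per histogram bucket and no single value to sort on.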

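Approximate aggregations

ES is distributed, and complex distributed computations force a trade-off between algorithm performance and memory usage. For this problem there is a triangle of factors: big data, exactness, and real-time latency.

  • Exact + real time: the data fits in the RAM of a single machine, so we can use any algorithm we like. Results are 100% exact and responses are relatively fast.
  • Big data + exact: classic Hadoop. It can process petabytes of data and give exact answers, but it may take weeks to produce them.
  • Big data + real time: approximate algorithms that give accurate, but not exact, results.

Elasticsearch currently supports two approximate algorithms (cardinality and percentiles). They give accurate results that are not 100% exact. In exchange for a small estimation error, they offer fast execution and a very small memory footprint.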

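This is similar to the CAP theorem:

  • Consistency (every node sees the same, most recent copy of the data)
  • Availability (every request gets a non-error response, with no guarantee that the data is the most recent)
  • Partition tolerance (in practice, a partition amounts to a time bound on communication: if the system cannot reach consistency within that bound, a partition has occurred and the current operation must choose between C and A)

A distributed system can satisfy at most two of these three properties, never all three.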

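Finding distinct counts

The first approximate aggregation Elasticsearch provides is the cardinality metric. It returns the cardinality of a field, that is, the number of distinct (unique) values of that field. You may be more familiar with the SQL form: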

SELECT COUNT(DISTINCT color)
FROM cars

We can use the cardinality metric to find how many car colors the dealer sells:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color"
            }
        }
    }
}

The same count broken down by month:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
      "months" : {
        "date_histogram": {
          "field": "sold",
          "interval": "month"
        },
        "aggs": {
          "distinct_colors" : {
              "cardinality" : {
                "field" : "color"
              }
          }
        }
      }
  }
}

For more details on the algorithm, see the original: Finding Distinct Counts | Elasticsearch: The Definitive Guide | Elastic
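
The cardinality metric is based on the HyperLogLog++ algorithm. The sketch below is a plain, simplified HyperLogLog, not Elasticsearch's implementation; the function name `hll_estimate`, the use of SHA-1 as the hash, and the register count `2**p` are all assumptions made for illustration. It shows the trade described above: a fixed, small number of registers gives an estimate close to the true distinct count.

```python
import hashlib
import math


def hll_estimate(values, p=14):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        # 64-bit hash of the value (SHA-1 is just a convenient choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - p)                  # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Position of the first 1-bit in `rest`, counting from 1.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


colors = ["red", "green", "blue", "red", "green", "red"]
print(hll_estimate(colors))
```

Memory use depends only on `p`, not on how many values are fed in, which is why this family of algorithms suits the "big data + real time" corner of the triangle.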

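Calculating percentiles

Percentiles are often used to find outliers.

For more details on the algorithm, see the original: Calculating Percentiles | Elasticsearch: The Definitive Guide | Elastic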
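
A small sketch of why percentiles expose outliers that an average hides. It computes exact nearest-rank percentiles in memory, which is not how the approximate percentiles metric works internally; the `latencies` data and the `percentile` helper are hypothetical.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in ms: mostly fast, with two slow outliers.
latencies = [20] * 95 + [25] * 3 + [900, 1200]

print("mean:", mean(latencies))
print("p50: ", percentile(latencies, 50))
print("p99: ", percentile(latencies, 99))
```

The mean and the median both look healthy here, while the 99th percentile lands on one of the slow requests.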

Finding anomalies through aggregations - significant_terms

Finding anomalies through aggregations | Elasticsearch: The Definitive Guide | Elastic

  • Recommending Based on Popularity
  • Recommending Based on Statistics

Common problems

Distinct count across multiple fields

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script": {
                    "lang": "painless",
                    "inline": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
                }
            }
        }
    }
}
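
In plain terms, the script builds one combined key per document and the cardinality metric counts the distinct keys. A minimal in-memory equivalent, with hypothetical `AUTHORS` rows and an exact count instead of an approximate one:

```python
# Hypothetical author documents, flattened for the sketch.
AUTHORS = [
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "John", "last_name": "Doe"},
    {"first_name": "Jane", "last_name": "Smith"},
]


def distinct_authors(docs):
    # Same key as the script: first name, a space, last name.
    keys = {doc["first_name"] + " " + doc["last_name"] for doc in docs}
    return len(keys)


print(distinct_authors(AUTHORS))
```

One caveat of joining with a space: different pairs can produce the same key when a name itself contains a space (for example "Mary Ann" + "Lee" and "Mary" + "Ann Lee"), so a separator that cannot occur in the data is a safer choice.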

kibana

Tutorial: Visualizing Your Data | Kibana User Guide [5.4] | Elastic

Kibana cannot chart raw ES aggs directly:

advanced query option for building custom aggregations · Issue #5282 · elastic/kibana

Grafana can, though (with a version caveat; see the grafana section below).

In Kibana, an ES query can only use the query/filter part, which filters down the set of returned docs:

Only the query/filter part of the query DSL works in the Kibana search bar - it allows you to filter down the set of returned documents. To apply aggregations in Kibana, you have to use the visualization builder in the Visualize tab. Under the “buckets” section, look for the “terms” aggregation.

Kibana also has Dev Tools, which can be used to debug ES queries.

grafana

Grafana currently cannot visualize arbitrary aggs either (version 2.5.0 could, but 2.5.0 no longer supports the current ES).

it/es.txt · Last modified: 2017/07/14 13:49 by admin