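Derived field type
Derived fields allow you to create new fields dynamically by running scripts on existing fields. The values of the existing fields can be retrieved either from the _source field, which contains the original document, or from the fields' doc values. Once you define a derived field, either in an index mapping or in a search request, you can use it in queries in the same way as a regular field.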
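When to use derived fields
Derived fields offer flexibility in field manipulation and prioritize storage efficiency. However, because they are computed at query time, they can reduce query performance. Derived fields are particularly useful in scenarios that require real-time data transformation, such as:
Log analysis: Extracting timestamps and log levels from log messages.
Performance metrics: Calculating response times from start and end timestamps.
Security analytics: Real-time IP geolocation and user agent parsing for threat detection.
Experimental use cases: Testing new data transformations, creating temporary fields for A/B testing, or generating one-time reports without altering mappings or reindexing data.
Despite the potential performance impact of query-time computation, the flexibility and storage efficiency of derived fields make them a valuable tool for these applications.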
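Current limitations
Currently, derived fields have the following limitations:
Scoring and sorting: Not yet supported.
Aggregations: Derived fields support most aggregation types. The following aggregation types are not supported: geographic (geodistance, geohash grid, geohex grid, geotile grid, geobounds, geocentroid), significant terms, significant text, and scripted metric.
Dashboard support: These fields are not displayed in the list of available fields in UDB-SX Dashboards. However, if you know the derived field name, you can still use it for filtering.
Chained derived fields: One derived field cannot be used to define another derived field.
Join field type: Derived fields are not supported for the join field type.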
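Prerequisites
Before using derived fields, make sure that the following prerequisites are met:
Enable _source or doc_values: Ensure that either the _source field or doc values are enabled for the fields used in your script.
Enable expensive queries: Ensure that search.allow_expensive_queries is set to true.
Feature control: Derived fields are enabled by default. You can enable or disable derived fields using the following settings:
Index level: Update the index.query.derived_field.enabled setting.
Cluster level: Update the search.derived_field.enabled setting. Both settings are dynamic, so they can be changed without reindexing or restarting nodes.
Performance considerations: Before using derived fields, evaluate their performance impact to ensure that they meet your scale requirements.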
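The settings named above could be applied with request bodies along the following lines. This is a minimal sketch that only builds and prints the JSON. It assumes that UDB-SX exposes OpenSearch-style _cluster/settings and <index>/_settings endpoints and that the persistent scope is appropriate, neither of which is stated on this page:

```python
import json

# Cluster-level body. The setting names come from the prerequisites above;
# the "persistent" scope is an assumption about your deployment.
cluster_settings = {
    "persistent": {
        "search.allow_expensive_queries": True,
        "search.derived_field.enabled": True,
    }
}

# Index-level body for a single index (for example, logs).
index_settings = {"index.query.derived_field.enabled": True}

print("PUT _cluster/settings")
print(json.dumps(cluster_settings, indent=2))
print("PUT /logs/_settings")
print(json.dumps(index_settings, indent=2))
```

Because both derived field settings are dynamic, sending the same bodies with false would be expected to disable the feature without a restart.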
Defining derived fields
Example setup
To try the examples on this page, first create the following logs index:
PUT logs
{
"mappings": {
"properties": {
"request": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"clientip": {
"type": "keyword"
}
}
}
}
Add sample documents to the index:
POST _bulk
{ "index" : { "_index" : "logs", "_id" : "1" } }
{ "request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778", "clientip": "61.177.2.0" }
{ "index" : { "_index" : "logs", "_id" : "2" } }
{ "request": "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474", "clientip": "185.92.2.0" }
{ "index" : { "_index" : "logs", "_id" : "3" } }
{ "request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711", "clientip": "61.177.2.0" }
{ "index" : { "_index" : "logs", "_id" : "4" } }
{ "request": "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140", "clientip": "129.178.2.0" }
{ "index" : { "_index" : "logs", "_id" : "5" } }
{ "request": "894470400 DELETE /images/102384s.gif HTTP/1.0 200 785", "clientip": "227.177.2.0" }
Defining derived fields in index mappings
To extract the timestamp, method, and size fields from the request field indexed in the logs index, configure the following mapping:
PUT /logs/_mapping
{
"derived": {
"timestamp": {
"type": "date",
"format": "MM/dd/yyyy",
"script": {
"source": """
emit(Long.parseLong(doc["request.keyword"].value.splitOnToken(" ")[0]))
"""
}
},
"method": {
"type": "keyword",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[1])
"""
}
},
"size": {
"type": "long",
"script": {
"source": """
emit(Long.parseLong(doc["request.keyword"].value.splitOnToken(" ")[5]))
"""
}
}
}
}
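Note that the timestamp field has an additional format parameter that specifies the format in which date fields are displayed. If you don't include a format parameter, the format defaults to strict_date_time_no_millis. For more information about supported date formats, see Parameters.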
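To make the three scripts easier to follow, here is a small Python sketch of the same token positions applied to one sample request line. The split_on_token and derive helpers are hypothetical stand-ins written for this illustration; they are not part of Painless or of UDB-SX:

```python
def split_on_token(value, token):
    """Rough stand-in for the splitOnToken call used in the scripts above."""
    return value.split(token)


def derive(request):
    """Return the values the timestamp, method, and size scripts would emit."""
    parts = split_on_token(request, " ")
    return {
        "timestamp": int(parts[0]),  # date, emitted as epoch milliseconds
        "method": parts[1],          # keyword
        "size": int(parts[5]),       # long
    }


sample = "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778"
fields = derive(sample)
print(fields)
```

Positions 2, 3, and 4 of the same split hold the URL, the protocol version, and the status code.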
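Parameters
The following table lists the parameters accepted by derived field types. All parameters are dynamic and can be modified without reindexing documents.
| Parameter | Required/Optional | Description |
|---|---|---|
| type | Required | The type of the derived field. Supported types are boolean, date, geo_point, ip, keyword, text, long, double, float, and object. |
| script | Required | The script associated with the derived field. Any value emitted from the script must be emitted using emit(). The type of the emitted value must match the type of the derived field. Scripts have access to both the doc_values and _source fields if those are enabled. The doc value of a field can be accessed using doc['field_name'].value, and the source can be accessed using params._source["field_name"]. |
| format | Optional | The format used for parsing dates. Only applicable to date fields. Valid values are strict_date_time_no_millis, strict_date_optional_time, and epoch_millis. For more information, see Formats. |
| ignore_malformed | Optional | A Boolean value that specifies whether to ignore malformed values when running a query on a derived field. Default is false (an exception is thrown when a malformed value is encountered). |
| prefilter_field | Optional | An indexed text field used to improve the performance of derived fields. Specifies an existing indexed field on which to filter before filtering on the derived field. For more information, see Prefilter field. |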
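Emitting values in scripts
The emit() function is available only within the derived field script context. It is used to emit one or more (for a multi-valued field) script values when the script runs on a document.
The following table lists the emit() function formats for the supported field types.
| Type | Emit format | Multi-valued fields supported |
|---|---|---|
| boolean | emit(boolean) | No |
| double | emit(double) | Yes |
| date | emit(long timeInMillis) | Yes |
| float | emit(float) | Yes |
| geo_point | emit(double lat, double lon) | Yes |
| ip | emit(String ip) | Yes |
| keyword | emit(String) | Yes |
| long | emit(long) | Yes |
| object | emit(String json) (valid JSON) | Yes |
| text | emit(String) | Yes |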
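By default, if there is a type mismatch between a derived field and its emitted value, the search request fails with an error. If ignore_malformed is set to true, the failing document is skipped and the search request succeeds.
The size limit of the emitted values is 1 MB per document.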
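The effect of ignore_malformed can be pictured with a toy model of a long-typed derived field. This is only a sketch of the behavior described above, not the engine's implementation; the run_long_field helper and the malformed second document are made up for illustration:

```python
def run_long_field(docs, ignore_malformed=False):
    """Toy model: parse the sixth token of each request as a long value."""
    values = {}
    for doc_id, request in docs.items():
        try:
            values[doc_id] = int(request.split(" ")[5])
        except (ValueError, IndexError):
            if not ignore_malformed:
                raise  # default behavior: the whole request fails
            # ignore_malformed=True: skip only the failing document
    return values


docs = {
    "1": "894030400 GET /a.gif HTTP/1.0 200 778",
    "2": "894140400 GET /b.gif HTTP/1.1 200 n/a",  # hypothetical malformed size
}
lenient = run_long_field(docs, ignore_malformed=True)
print(lenient)
```

Calling run_long_field(docs) without the flag raises an error on the second document, which mirrors a search request that fails as a whole.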
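Searching derived fields defined in index mappings
To search derived fields, use the same syntax as when searching regular fields. For example, the following request searches for documents whose derived timestamp field falls within the specified range: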
POST /logs/_search
{
"query": {
"range": {
"timestamp": {
"gte": "1970-01-11T08:20:30.400Z",
"lte": "1970-01-11T08:26:00.400Z"
}
}
},
"fields": ["timestamp"]
}
The response contains the matching documents:
{
"took": 315,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "logs",
"_id": "1",
"_score": 1,
"_source": {
"request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
"clientip": "61.177.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:20:30.400Z"
]
}
},
{
"_index": "logs",
"_id": "2",
"_score": 1,
"_source": {
"request": "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474",
"clientip": "185.92.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:22:20.400Z"
]
}
},
{
"_index": "logs",
"_id": "3",
"_score": 1,
"_source": {
"request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
"clientip": "61.177.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:24:10.400Z"
]
}
},
{
"_index": "logs",
"_id": "4",
"_score": 1,
"_source": {
"request": "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140",
"clientip": "129.178.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:26:00.400Z"
]
}
}
]
}
}
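The timestamps in the response are the leading numbers of each request line interpreted as epoch milliseconds. The following Python sketch reproduces that conversion and the range check with the standard library only; the to_iso helper is a name chosen for this illustration:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_iso(millis):
    """Format epoch milliseconds as an ISO 8601 UTC timestamp with milliseconds."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{moment.microsecond // 1000:03d}Z"


# Leading tokens of the five sample documents.
raw = [894030400, 894140400, 894250400, 894360400, 894470400]
low, high = 894030400, 894360400  # the gte and lte bounds as epoch milliseconds
matches = [to_iso(ms) for ms in raw if low <= ms <= high]
print(matches)
```

The fifth document falls just outside the upper bound, which is why the response reports four hits.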
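Defining and searching derived fields in a search request
You can also define derived fields directly in a search request and query them along with regular indexed fields. For example, the following request creates the url and status derived fields and searches those fields along with the regular request and clientip fields: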
POST /logs/_search
{
"derived": {
"url": {
"type": "text",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[2])
"""
}
},
"status": {
"type": "keyword",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[4])
"""
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"clientip": "61.177.2.0"
}
},
{
"match": {
"url": "images"
}
},
{
"term": {
"status": "200"
}
}
]
}
},
"fields": ["request", "clientip", "url", "status"]
}
The response contains the matching documents:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2.8754687,
"hits": [
{
"_index": "logs",
"_id": "1",
"_score": 2.8754687,
"_source": {
"request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/images/france98_venues.gif"
],
"status": [
"200"
]
}
},
{
"_index": "logs",
"_id": "3",
"_score": 2.8754687,
"_source": {
"request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/venues/images/venue_header.gif"
],
"status": [
"200"
]
}
}
]
}
}
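During search, derived fields use the default analyzer specified in the index analysis settings. You can override the default analyzer or specify a search analyzer within a search request in the same way as with regular fields.
When a field is defined in both the index mapping and the search request, the search definition takes precedence.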
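The precedence rule can be pictured as a dictionary merge in which the search-time definitions are applied last. This is a conceptual sketch only; the text type given to method in the search request is a hypothetical override, and the merge is not the engine's actual code:

```python
# Derived fields defined in the index mapping (scripts abbreviated).
mapping_derived = {
    "method": {"type": "keyword", "script": {"source": "emit(...)"}},
    "size": {"type": "long", "script": {"source": "emit(...)"}},
}

# Hypothetical search-time redefinition of the same field name.
search_derived = {
    "method": {"type": "text", "script": {"source": "emit(...)"}},
}

# Later entries win, so the search definition replaces the mapping definition.
effective = {**mapping_derived, **search_derived}
print(effective["method"]["type"], effective["size"]["type"])
```

Fields that appear only in the mapping, such as size here, remain available to the request unchanged.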
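Retrieving fields
You can retrieve derived fields using the fields parameter in the search request in the same way as regular fields, as shown in the preceding examples. You can also use wildcards to retrieve all derived fields that match a given pattern.
Highlighting
Derived fields of type text support highlighting using the unified highlighter. For example, the following request specifies highlighting for the derived url field: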
POST /logs/_search
{
"derived": {
"url": {
"type": "text",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[2])
"""
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"clientip": "61.177.2.0"
}
},
{
"match": {
"url": "images"
}
}
]
}
},
"fields": ["request", "clientip", "url"],
"highlight": {
"fields": {
"url": {}
}
}
}
The response contains highlighting in the url field:
{
"took": 45,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.8754687,
"hits": [
{
"_index": "logs",
"_id": "1",
"_score": 1.8754687,
"_source": {
"request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/images/france98_venues.gif"
]
},
"highlight": {
"url": [
"/english/<em>images</em>/france98_venues.gif"
]
}
},
{
"_index": "logs",
"_id": "3",
"_score": 1.8754687,
"_source": {
"request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/venues/images/venue_header.gif"
]
},
"highlight": {
"url": [
"/english/venues/<em>images</em>/venue_header.gif"
]
}
}
]
}
}
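The shape of the highlight fragments can be imitated with a few lines of Python. This toy function is not the unified highlighter and ignores analyzers and scoring; it only wraps whole-word matches of the query term in em tags, as seen in the response:

```python
import re


def highlight(text, term):
    """Wrap whole-word, case-insensitive matches of term in <em> tags."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"<em>{m.group(0)}</em>", text)


urls = [
    "/english/images/france98_venues.gif",
    "/english/venues/images/venue_header.gif",
]
fragments = [highlight(url, "images") for url in urls]
print(fragments)
```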
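Aggregations
Derived fields support most aggregation types. Geographic, significant terms, significant text, and scripted metric aggregations are not supported.
For example, the following request creates a simple terms aggregation on the method derived field: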
POST /logs/_search
{
"size": 0,
"aggs": {
"methods": {
"terms": {
"field": "method"
}
}
}
}
The response contains the following buckets:
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"methods" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "GET",
"doc_count" : 2
},
{
"key" : "POST",
"doc_count" : 2
},
{
"key" : "DELETE",
"doc_count" : 1
}
]
}
}
}
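The bucket counts can be cross-checked outside the cluster by applying the same token extraction to the five sample request lines. The following sketch uses collections.Counter as a stand-in for the terms aggregation and makes no claim about how the engine orders or computes buckets:

```python
from collections import Counter

requests = [
    "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
    "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474",
    "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
    "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140",
    "894470400 DELETE /images/102384s.gif HTTP/1.0 200 785",
]

# Same extraction as the method derived field: the second token of the line.
methods = Counter(request.split(" ")[1] for request in requests)
buckets = [{"key": key, "doc_count": count} for key, count in methods.most_common()]
print(buckets)
```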
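Performance
Derived fields are not indexed. Instead, they are computed dynamically by retrieving values from the _source field or from doc values, so queries on them run more slowly than queries on indexed fields. To improve performance, try the following:
Prune the search space by adding query filters on indexed fields in conjunction with derived fields.
Where applicable, use doc values instead of _source in the script for faster access.
Consider using a prefilter_field to automatically prune the search space without adding explicit filters to the search request.
Prefilter field
Specifying a prefilter field helps to prune the search space without adding explicit filters to the search request. The prefilter field specifies an existing indexed field (prefilter_field) on which to filter automatically when the query is constructed. The prefilter_field must be a text field (either text or match_only_text).
For example, you can add a prefilter_field to the method derived field. Update the index mapping, specifying to prefilter on the request field: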
PUT /logs/_mapping
{
"derived": {
"method": {
"type": "keyword",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[1])
"""
},
"prefilter_field": "request"
}
}
}
Now search using a query on the method derived field:
POST /logs/_search
{
"profile": true,
"query": {
"term": {
"method": {
"value": "GET"
}
}
},
"fields": ["method"]
}
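UDB-SX automatically adds a filter on the request field to your query.
You can use the profile option to analyze derived field performance, as shown in the preceding example.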
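Why prefiltering helps can be shown with a small model that counts how often the derived field script has to run. This is a conceptual sketch under stated assumptions: the search_method helper is hypothetical, the lowercase whitespace split only approximates a text analyzer, and a real prefilter would use the inverted index rather than scanning every document:

```python
docs = {
    "1": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
    "2": "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474",
    "3": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
    "4": "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140",
    "5": "894470400 DELETE /images/102384s.gif HTTP/1.0 200 785",
}


def search_method(value, use_prefilter):
    """Return matching IDs and the number of script evaluations."""
    script_runs = 0
    hits = []
    for doc_id, request in docs.items():
        # Cheap check on the indexed request field (toy tokenizer).
        if use_prefilter and value.lower() not in request.lower().split(" "):
            continue
        script_runs += 1  # the derived field script runs for this document
        if request.split(" ")[1] == value:
            hits.append(doc_id)
    return hits, script_runs


print(search_method("GET", use_prefilter=False))
print(search_method("GET", use_prefilter=True))
```

In this model both searches return the same documents, but the prefiltered one evaluates the script on two documents instead of five.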
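Derived object fields
A script can emit a valid JSON object so that you can query its subfields without indexing them, in the same way as with regular fields. This is useful for large JSON objects in which only some subfields need to be searched occasionally. In this case, indexing the subfields is expensive, and defining a derived field for each subfield also adds significant resource overhead. If you don't explicitly provide a subfield type, the subfield type is inferred.
For example, the following request defines the derived_request_object derived field as an object type: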
PUT logs_object
{
"mappings": {
"properties": {
"request_object": { "type": "text" }
},
"derived": {
"derived_request_object": {
"type": "object",
"script": {
"source": "emit(params._source[\"request_object\"])"
}
}
}
}
}
Consider the following documents, in which request_object is a string representation of a JSON object:
POST _bulk
{ "index" : { "_index" : "logs_object", "_id" : "1" } }
{ "request_object": "{\"@timestamp\": 894030400, \"clientip\":\"61.177.2.0\", \"request\": \"GET /english/venues/images/venue_header.gif HTTP/1.0\", \"status\": 200, \"size\": 711}" }
{ "index" : { "_index" : "logs_object", "_id" : "2" } }
{ "request_object": "{\"@timestamp\": 894140400, \"clientip\":\"129.178.2.0\", \"request\": \"GET /images/home_fr_button.gif HTTP/1.1\", \"status\": 200, \"size\": 2140}" }
{ "index" : { "_index" : "logs_object", "_id" : "3" } }
{ "request_object": "{\"@timestamp\": 894240400, \"clientip\":\"227.177.2.0\", \"request\": \"GET /images/102384s.gif HTTP/1.0\", \"status\": 400, \"size\": 785}" }
{ "index" : { "_index" : "logs_object", "_id" : "4" } }
{ "request_object": "{\"@timestamp\": 894340400, \"clientip\":\"61.177.2.0\", \"request\": \"GET /english/images/venue_bu_city_on.gif HTTP/1.0\", \"status\": 400, \"size\": 1397}\n" }
{ "index" : { "_index" : "logs_object", "_id" : "5" } }
{ "request_object": "{\"@timestamp\": 894440400, \"clientip\":\"132.176.2.0\", \"request\": \"GET /french/news/11354.htm HTTP/1.0\", \"status\": 200, \"size\": 3460, \"is_active\": true}" }
The following query searches the @timestamp subfield of derived_request_object:
POST /logs_object/_search
{
"query": {
"range": {
"derived_request_object.@timestamp": {
"gte": "894030400",
"lte": "894140400"
}
}
},
"fields": ["derived_request_object.@timestamp"]
}
The response contains the matching documents:
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "logs_object",
"_id": "1",
"_score": 1,
"_source": {
"request_object": """{"@timestamp": 894030400, "clientip":"61.177.2.0", "request": "GET /english/venues/images/venue_header.gif HTTP/1.0", "status": 200, "size": 711}"""
},
"fields": {
"derived_request_object.@timestamp": [
894030400
]
}
},
{
"_index": "logs_object",
"_id": "2",
"_score": 1,
"_source": {
"request_object": """{"@timestamp": 894140400, "clientip":"129.178.2.0", "request": "GET /images/home_fr_button.gif HTTP/1.1", "status": 200, "size": 2140}"""
},
"fields": {
"derived_request_object.@timestamp": [
894140400
]
}
}
]
}
}
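The idea behind querying a subfield of a derived object can be sketched in Python: each stored string is parsed as JSON at query time, and the range condition is applied to the parsed @timestamp value. The documents below are abbreviated copies of the sample data, and the code illustrates the concept rather than the engine's implementation:

```python
import json

# request_object strings, shortened to the subfields used here.
docs = {
    "1": '{"@timestamp": 894030400, "clientip": "61.177.2.0", "status": 200}',
    "2": '{"@timestamp": 894140400, "clientip": "129.178.2.0", "status": 200}',
    "3": '{"@timestamp": 894240400, "clientip": "227.177.2.0", "status": 400}',
    "4": '{"@timestamp": 894340400, "clientip": "61.177.2.0", "status": 400}',
    "5": '{"@timestamp": 894440400, "clientip": "132.176.2.0", "status": 200}',
}

gte, lte = 894030400, 894140400
hits = {
    doc_id: json.loads(raw)["@timestamp"]
    for doc_id, raw in docs.items()
    if gte <= json.loads(raw)["@timestamp"] <= lte
}
print(hits)
```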
You can also specify highlighting on a derived object field:
POST /logs_object/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"derived_request_object.clientip": "61.177.2.0"
}
},
{
"match": {
"derived_request_object.request": "images"
}
}
]
}
},
"fields": ["derived_request_object.*"],
"highlight": {
"fields": {
"derived_request_object.request": {}
}
}
}
The response adds highlighting for the derived_request_object.request field:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2,
"hits": [
{
"_index": "logs_object",
"_id": "1",
"_score": 2,
"_source": {
"request_object": """{"@timestamp": 894030400, "clientip":"61.177.2.0", "request": "GET /english/venues/images/venue_header.gif HTTP/1.0", "status": 200, "size": 711}"""
},
"fields": {
"derived_request_object.request": [
"GET /english/venues/images/venue_header.gif HTTP/1.0"
],
"derived_request_object.clientip": [
"61.177.2.0"
]
},
"highlight": {
"derived_request_object.request": [
"GET /english/venues/<em>images</em>/venue_header.gif HTTP/1.0"
]
}
},
{
"_index": "logs_object",
"_id": "4",
"_score": 2,
"_source": {
"request_object": """{"@timestamp": 894340400, "clientip":"61.177.2.0", "request": "GET /english/images/venue_bu_city_on.gif HTTP/1.0", "status": 400, "size": 1397}
"""
},
"fields": {
"derived_request_object.request": [
"GET /english/images/venue_bu_city_on.gif HTTP/1.0"
],
"derived_request_object.clientip": [
"61.177.2.0"
]
},
"highlight": {
"derived_request_object.request": [
"GET /english/<em>images</em>/venue_bu_city_on.gif HTTP/1.0"
]
}
}
]
}
}
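Inferred subfield type
Type inference is based on the same logic as dynamic mapping. However, instead of inferring the subfield type from the first document, a random sample of documents is used to infer the type. If the subfield isn't found in any document in the random sample, type inference fails and a warning is logged. For subfields that seldom occur in documents, consider defining an explicit field type. Relying on dynamic type inference for such subfields may cause a query to return no results, as it would for a missing field.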
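The following sketch shows why a rarely occurring subfield can defeat inference. The infer_type and infer_subfield helpers are hypothetical, and the type rules are a simplified guess at dynamic-mapping-style behavior, not the exact rules the engine applies:

```python
import json


def infer_type(value):
    """Simplified type rules, checked in order (bool before int)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    return "object"


def infer_subfield(sample, subfield):
    """Return the inferred type, or None if no sampled document has the subfield."""
    for raw in sample:
        parsed = json.loads(raw)
        if subfield in parsed:
            return infer_type(parsed[subfield])
    return None  # inference fails; the engine would log a warning here


common = '{"@timestamp": 894030400, "status": 200}'
rare = '{"@timestamp": 894440400, "status": 200, "is_active": true}'

print(infer_subfield([common, common], "is_active"))
print(infer_subfield([common, rare], "is_active"))
```

When the sample happens to miss the only document that contains is_active, the first call returns None, which corresponds to the failed inference described above.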
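Explicit subfield type
To define an explicit subfield type, provide the type parameter in the properties object. In the following example, the derived_request_object.is_active field is defined as boolean. Because this field is present in only one of the documents, its type inference might fail, so it's important to define the explicit type: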
POST /logs_object/_search
{
"derived": {
"derived_request_object": {
"type": "object",
"script": {
"source": "emit(params._source[\"request_object\"])"
},
"properties": {
"is_active": "boolean"
}
}
},
"query": {
"term": {
"derived_request_object.is_active": true
}
},
"fields": ["derived_request_object.is_active"]
}
The response contains the matching document:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "logs_object",
"_id": "5",
"_score": 1,
"_source": {
"request_object": """{"@timestamp": 894440400, "clientip":"132.176.2.0", "request": "GET /french/news/11354.htm HTTP/1.0", "status": 200, "size": 3460, "is_active": true}"""
},
"fields": {
"derived_request_object.is_active": [
true
]
}
}
]
}
}