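Autocomplete functionality
Autocomplete provides suggestions while the user types.
For example, if a user types "pop", UDB-SX suggests terms such as "popcorn" or "popsicles". These suggestions anticipate the user's intent and lead them to a likely search term more quickly.
UDB-SX lets you design autocomplete that updates with each keystroke, returns a small number of relevant suggestions, and tolerates typos.
You can implement autocomplete using one of the following methods: prefix matching, edge n-gram matching, search as you type, or completion suggesters.
Prefix matching happens at query time, while the other three methods happen at index time. All four methods are described in the following sections.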
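Prefix matching
Prefix matching finds documents that match the last term in the query string.
For example, assume that a user types "qui" into a search UI. To autocomplete this phrase, use the match_phrase_prefix query to search for all text_entry field values that begin with "qui":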
GET shakespeare/_search
{
"query": {
"match_phrase_prefix": {
"text_entry": {
"query": "qui",
"slop": 3
}
}
}
}
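To make the word order and relative positions more flexible, specify a slop value. To learn about the slop option, see Slop.
Prefix matching does not require any special mappings. It works with your data as is.
However, it is a fairly resource-intensive operation. The prefix a could match hundreds of thousands of terms, which is of no use to the user.
To limit the impact of prefix expansion, set max_expansions to a reasonable number: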
GET shakespeare/_search
{
"query": {
"match_phrase_prefix": {
"text_entry": {
"query": "qui",
"slop": 3,
"max_expansions": 10
}
}
}
}
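max_expansions is the maximum number of terms to which the query can expand. In this query, the prefix of the last term, "qui", is expanded into at most that many matching terms from the index.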
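The effect of capping prefix expansion can be sketched outside the search engine. The following Python snippet is an illustration only: the term list and the expand_prefix helper are hypothetical, and the real engine works on its own term dictionary rather than a Python list.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix, max_expansions):
    """Return at most max_expansions terms from sorted_terms that start with prefix."""
    # Binary search finds the first term that is >= the prefix.
    start = bisect_left(sorted_terms, prefix)
    expansions = []
    for term in sorted_terms[start:]:
        # Stop at the first non-matching term or once the cap is reached.
        if not term.startswith(prefix) or len(expansions) >= max_expansions:
            break
        expansions.append(term)
    return expansions


# Hypothetical term dictionary; a real index may hold far more terms.
terms = sorted(["queen", "quick", "quiet", "quill", "quince", "quit", "quote"])
print(expand_prefix(terms, "qui", max_expansions=3))
```

With a cap of 3, only the first three matching terms in sorted order are considered, so later matches such as "quince" and "quit" are never expanded.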
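The ease of implementing query-time autocomplete comes at the cost of performance. When implementing this feature at scale, we recommend an index-time solution. With an index-time solution, you may see slower indexing, but that cost is paid once rather than on every query. The edge n-gram, search-as-you-type, and completion suggester methods are all index-time solutions.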
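Edge n-gram matching
During indexing, edge n-grams split a word into sequences of n characters to support faster lookup of partial search terms.
If you n-gram the word "quick", the results depend on the value of n.
| n | Type | N-grams |
|---|---|---|
| 1 | Unigram | [ q, u, i, c, k ] |
| 2 | Bigram | [ qu, ui, ic, ck ] |
| 3 | Trigram | [ qui, uic, ick ] |
| 4 | Four-gram | [ quic, uick ] |
| 5 | Five-gram | [ quick ] |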
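The table can be reproduced with a few lines of Python. This is a minimal sketch for illustration; char_ngrams is a hypothetical helper, not part of UDB-SX.

```python
def char_ngrams(word, n):
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in range(1, 6):
    print(n, char_ngrams("quick", n))
```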
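Autocomplete needs only the n-grams at the beginning of a search phrase, so UDB-SX uses a special type of n-gram called an edge n-gram.
Edge n-gramming the word "quick" produces the following:
q
qu
qui
quic
quick
This follows the same sequence in which the user types.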
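As a rough sketch, edge n-grams are simply the prefixes of a word between a minimum and a maximum length. The edge_ngrams helper below is hypothetical; its parameter names are chosen to mirror the min_gram and max_gram filter settings.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of word whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


print(edge_ngrams("quick"))
print(edge_ngrams("quick", min_gram=2, max_gram=3))
```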
To configure a field to use edge n-grams, create an autocomplete analyzer with an edge_ngram filter:
PUT shakespeare
{
"mappings": {
"properties": {
"text_entry": {
"type": "text",
"analyzer": "autocomplete"
}
}
},
"settings": {
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"edge_ngram_filter"
]
}
}
}
}
}
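This example creates the index and instantiates the edge n-gram filter and analyzer.
The edge_ngram_filter produces edge n-grams with a minimum length of 1 (a single letter) and a maximum length of 20. It therefore offers suggestions for words of up to 20 letters.
The autocomplete analyzer tokenizes a string into individual terms, lowercases the terms, and then uses the edge_ngram_filter to produce edge n-grams for each term.
Use the analyze operation to test this analyzer: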
POST shakespeare/_analyze
{
"analyzer": "autocomplete",
"text": "quick"
}
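It returns the edge n-grams as tokens:
q
qu
qui
quic
quick
Use the standard analyzer at search time. Otherwise, the search query is itself split into edge n-grams, and you get results for everything that matches q, qu, or qui.
This is one of the few cases in which you use different analyzers at index time and at query time: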
GET shakespeare/_search
{
"query": {
"match": {
"text_entry": {
"query": "qui",
"analyzer": "standard"
}
}
}
}
The response contains the matching documents:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 533,
"relation": "eq"
},
"max_score": 9.712725,
"hits": [
{
"_index": "shakespeare",
"_id": "22006",
"_score": 9.712725,
"_source": {
"type": "line",
"line_id": 22007,
"play_name": "Antony and Cleopatra",
"speech_number": 12,
"line_number": "5.2.44",
"speaker": "CLEOPATRA",
"text_entry": "Quick, quick, good hands."
}
},
{
"_index": "shakespeare",
"_id": "54665",
"_score": 9.712725,
"_source": {
"type": "line",
"line_id": 54666,
"play_name": "Loves Labours Lost",
"speech_number": 21,
"line_number": "5.1.52",
"speaker": "HOLOFERNES",
"text_entry": "Quis, quis, thou consonant?"
}
}
...
]
}
}
Alternatively, specify the search_analyzer in the mapping itself:
"mappings": {
"properties": {
"text_entry": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
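To see why the search-time analyzer matters, the following Python sketch imitates the two analyzers on a tiny in-memory index. It is a simplification under stated assumptions: standard_tokens only approximates the standard analyzer with a regular expression, the second document is made up for the example, and a match is counted when any query token appears among a document's indexed tokens.

```python
import re


def standard_tokens(text):
    # Rough stand-in for the standard analyzer: split on non-word characters, lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]


def autocomplete_tokens(text, min_gram=1, max_gram=20):
    # Edge n-grams of every term, like the autocomplete analyzer in this section.
    tokens = []
    for term in standard_tokens(text):
        upper = min(max_gram, len(term))
        tokens.extend(term[:n] for n in range(min_gram, upper + 1))
    return tokens


docs = {
    "a": "Quick, quick, good hands.",
    "b": "The queen is here.",  # hypothetical document, for illustration only
}
# Index time: every document is stored as edge n-grams.
index = {doc_id: set(autocomplete_tokens(text)) for doc_id, text in docs.items()}


def search(query, query_analyzer):
    query_terms = set(query_analyzer(query))
    return sorted(doc_id for doc_id, terms in index.items() if terms & query_terms)


print(search("qui", standard_tokens))      # query kept whole
print(search("qui", autocomplete_tokens))  # query split into edge n-grams
```

In this sketch, analyzing the query with the autocomplete analyzer also returns the "queen" document, because the query's single-letter token q matches it. Keeping the query whole restricts the results to terms that actually start with "qui".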
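Completion suggester
The completion suggester accepts a list of suggestions and builds them into a finite-state transducer (FST), a highly optimized data structure that is essentially a graph. This data structure lives in memory and is optimized for fast prefix lookups. To learn more about FSTs, see Wikipedia.
As the user types, the completion suggester moves through the FST graph one character at a time along a matching path. When it runs out of user input, it examines the remaining endings to produce a list of suggestions.
The completion suggester makes your autocomplete solution as efficient as possible and gives you explicit control over its suggestions.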
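The character-by-character traversal can be illustrated with a plain trie. A trie is a simpler structure than an FST (it shares prefixes but not suffixes), so treat the following Python sketch as an analogy rather than the actual implementation. The class names are hypothetical, and case handling and weights are left out.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False  # True if a stored suggestion ends at this node


class SuggestionTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, suggestion):
        node = self.root
        for ch in suggestion:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def suggest(self, prefix, size=5):
        # Follow the path that matches the user input, one character at a time.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Input exhausted: walk the remaining endings in alphabetical order.
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < size:
            current, text = stack.pop()
            if current.is_end:
                results.append(text)
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = SuggestionTrie()
for line in ["to be a party", "to be a public", "to nature"]:
    trie.add(line)
print(trie.suggest("to be"))
```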
Use a dedicated field type called completion, which stores the FST-like data structures in the index:
PUT shakespeare
{
"mappings": {
"properties": {
"text_entry": {
"type": "completion"
}
}
}
}
To get suggestions, use the search endpoint with the suggest parameter:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "To be",
"completion": {
"field": "text_entry"
}
}
}
}
The phrase "to be" is prefix matched against the FST of the text_entry field:
{
"took" : 29,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"autocomplete" : [
{
"text" : "To be",
"offset" : 0,
"length" : 5,
"options" : [
{
"text" : "To be a comrade with the wolf and owl,--",
"_index" : "shakespeare",
"_id" : "50652",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 50653,
"play_name" : "King Lear",
"speech_number" : 68,
"line_number" : "2.4.230",
"speaker" : "KING LEAR",
"text_entry" : "To be a comrade with the wolf and owl,--"
}
},
{
"text" : "To be a make-peace shall become my age:",
"_index" : "shakespeare",
"_id" : "78566",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 78567,
"play_name" : "Richard II",
"speech_number" : 20,
"line_number" : "1.1.160",
"speaker" : "JOHN OF GAUNT",
"text_entry" : "To be a make-peace shall become my age:"
}
},
{
"text" : "To be a party in this injury.",
"_index" : "shakespeare",
"_id" : "75259",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 75260,
"play_name" : "Othello",
"speech_number" : 57,
"line_number" : "5.1.93",
"speaker" : "IAGO",
"text_entry" : "To be a party in this injury."
}
},
{
"text" : "To be a preparation gainst the Polack;",
"_index" : "shakespeare",
"_id" : "33591",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 33592,
"play_name" : "Hamlet",
"speech_number" : 17,
"line_number" : "2.2.67",
"speaker" : "VOLTIMAND",
"text_entry" : "To be a preparation gainst the Polack;"
}
},
{
"text" : "To be a public spectacle to all:",
"_index" : "shakespeare",
"_id" : "3709",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 3710,
"play_name" : "Henry VI Part 1",
"speech_number" : 6,
"line_number" : "1.4.41",
"speaker" : "TALBOT",
"text_entry" : "To be a public spectacle to all:"
}
}
]
}
]
}
}
To specify the number of suggestions to return, use the size parameter:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "To n",
"completion": {
"field": "text_entry",
"size": 3
}
}
}
}
A maximum of three documents is returned:
{
"took" : 4109,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"autocomplete" : [
{
"text" : "To n",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "To NESTOR",
"_index" : "shakespeare",
"_id" : "99707",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 99708,
"play_name" : "Troilus and Cressida",
"speech_number" : 3,
"line_number" : "",
"speaker" : "ULYSSES",
"text_entry" : "To NESTOR"
}
},
{
"text" : "To name the bigger light, and how the less,",
"_index" : "shakespeare",
"_id" : "91884",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 91885,
"play_name" : "The Tempest",
"speech_number" : 91,
"line_number" : "1.2.394",
"speaker" : "CALIBAN",
"text_entry" : "To name the bigger light, and how the less,"
}
},
{
"text" : "To nature none more bound; his training such,",
"_index" : "shakespeare",
"_id" : "40510",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 40511,
"play_name" : "Henry VIII",
"speech_number" : 18,
"line_number" : "1.2.126",
"speaker" : "KING HENRY VIII",
"text_entry" : "To nature none more bound; his training such,"
}
}
]
}
]
}
}
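The suggest parameter finds suggestions using prefix matching only.
For example, the document "To be, or not to be" is not part of the results. If you want specific documents returned as suggestions, you can manually add curated suggestions and assign weights to prioritize them.
Index a document with input suggestions and assign a weight: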
PUT shakespeare/_doc/1?refresh=true
{
"text_entry": {
"input": [
"To n", "To be, or not to be: that is the question:"
],
"weight": 10
}
}
Run the same search:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "To n",
"completion": {
"field": "text_entry",
"size": 3
}
}
}
}
The indexed document appears as the first result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"autocomplete" : [
{
"text" : "To n",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "To n",
"_index" : "shakespeare",
"_id" : "1",
"_score" : 10.0,
"_source" : {
"text_entry" : {
"input" : [
"To n",
"To be, or not to be: that is the question:"
],
"weight" : 10
}
}
},
{
"text" : "To NESTOR",
"_index" : "shakespeare",
"_id" : "99707",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 99708,
"play_name" : "Troilus and Cressida",
"speech_number" : 3,
"line_number" : "",
"speaker" : "ULYSSES",
"text_entry" : "To NESTOR"
}
},
{
"text" : "To name the bigger light, and how the less,",
"_index" : "shakespeare",
"_id" : "91884",
"_score" : 1.0,
"_source" : {
"type" : "line",
"line_id" : 91885,
"play_name" : "The Tempest",
"speech_number" : 91,
"line_number" : "1.2.394",
"speaker" : "CALIBAN",
"text_entry" : "To name the bigger light, and how the less,"
}
}
]
}
]
}
}
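The ordering in this response can be mimicked with a small Python sketch. It assumes that suggestions matching the prefix are ordered by weight, highest first, and that entries indexed without an explicit weight behave as if they had weight 1 (mirroring the _score values above). The top_suggestions helper is hypothetical, and how the engine orders entries of equal weight is not modeled.

```python
def top_suggestions(entries, prefix, size=3):
    """entries is a list of (text, weight) pairs; return the top matches by weight."""
    matches = [
        (text, weight)
        for text, weight in entries
        if text.lower().startswith(prefix.lower())
    ]
    # sorted() is stable, so entries with equal weight keep their input order.
    return sorted(matches, key=lambda item: -item[1])[:size]


entries = [
    ("To NESTOR", 1),
    ("To name the bigger light, and how the less,", 1),
    ("To nature none more bound; his training such,", 1),
    ("To n", 10),  # curated suggestion with an explicit weight
]
for text, weight in top_suggestions(entries, "To n"):
    print(weight, text)
```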
You can also allow for misspellings in your query by specifying the fuzzy parameter:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "rosenkrantz",
"completion": {
"field": "text_entry",
"size": 3,
"fuzzy" : {
"fuzziness" : "AUTO"
}
}
}
}
}
The result matches the correct spelling:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"autocomplete" : [
{
"text" : "rosenkrantz",
"offset" : 0,
"length" : 11,
"options" : [
{
"text" : "ROSENCRANTZ:",
"_index" : "shakespeare",
"_id" : "35196",
"_score" : 5.0,
"_source" : {
"type" : "line",
"line_id" : 35197,
"play_name" : "Hamlet",
"speech_number" : 2,
"line_number" : "4.2.1",
"speaker" : "HAMLET",
"text_entry" : "ROSENCRANTZ:"
}
}
]
}
]
}
}
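Fuzzy matching is based on edit distance, that is, the number of single-character changes needed to turn one string into another. The Python sketch below computes the Levenshtein distance between the misspelled query and the stored spelling. The auto_fuzziness thresholds reflect how the AUTO setting is commonly documented (0 edits for up to 2 characters, 1 edit for 3 to 5, 2 edits beyond that); treat them as an assumption, and note that the engine's exact fuzzy prefix handling is not modeled here.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    # Assumed AUTO thresholds; see the lead-in above.
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


query = "rosenkrantz"
candidate = "ROSENCRANTZ:".lower()[:len(query)]
print(edit_distance(query, candidate), auto_fuzziness(query))
```

Here the query differs from the stored spelling by one substitution (k for c), which is within the allowed number of edits for a term of this length.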
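You can use a regular expression to define the prefix for the completion suggester query. To do so, specify regex in place of prefix:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"regex": "rosen.*",
"completion": {
"field": "text_entry",
"size": 3
}
}
}
}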
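For more information, see the completion field type documentation.
Search as you type
UDB-SX has a dedicated search_as_you_type field type that is optimized for search-as-you-type functionality and can match terms using both prefix and infix completion. The search_as_you_type field does not require you to set up a custom analyzer or index suggestions beforehand.
First, map the field as search_as_you_type: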
PUT shakespeare
{
"mappings": {
"properties": {
"text_entry": {
"type": "search_as_you_type"
}
}
}
}
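After you index a document, UDB-SX automatically creates and stores its n-grams and edge n-grams. For example, consider the string that is the question. First, it is split into terms using the standard analyzer, and the terms are stored in the text_entry field: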
[
"that",
"is",
"the",
"question"
]
In addition to storing these terms, the following 2-grams for this field are stored in the text_entry._2gram field:
[
"that is",
"is the",
"the question"
]
The following 3-grams for this field are stored in the text_entry._3gram field:
[
"that is the",
"is the question"
]
Finally, after an edge n-gram token filter is applied, the resulting terms are stored in the text_entry._index_prefix field:
[
"t",
"th",
"tha",
"that",
...
]
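The word-level 2-grams and 3-grams listed above can be reproduced with a short Python sketch. The word_ngrams helper is hypothetical, and splitting on whitespace only approximates the standard analyzer.

```python
def word_ngrams(tokens, size):
    """Join every run of `size` consecutive tokens with a single space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = "that is the question".split()
print(tokens)
print(word_ngrams(tokens, 2))
print(word_ngrams(tokens, 3))
```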
You can then match terms in any order using the bool_prefix type of a multi-match query:
GET shakespeare/_search
{
"query": {
"multi_match": {
"query": "uncle what",
"type": "bool_prefix",
"fields": [
"text_entry",
"text_entry._2gram",
"text_entry._3gram"
]
}
},
"size": 3
}
Documents in which the words appear in the same order as in the query are ranked higher in the results:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4759,
"relation" : "eq"
},
"max_score" : 10.437667,
"hits" : [
{
"_index" : "shakespeare",
"_id" : "2817",
"_score" : 10.437667,
"_source" : {
"type" : "line",
"line_id" : 2818,
"play_name" : "Henry IV",
"speech_number" : 5,
"line_number" : "5.2.31",
"speaker" : "HOTSPUR",
"text_entry" : "Uncle, what news?"
}
},
{
"_index" : "shakespeare",
"_id" : "37085",
"_score" : 9.437667,
"_source" : {
"type" : "line",
"line_id" : 37086,
"play_name" : "Henry V",
"speech_number" : 26,
"line_number" : "1.2.262",
"speaker" : "KING HENRY V",
"text_entry" : "What treasure, uncle?"
}
},
{
"_index" : "shakespeare",
"_id" : "79274",
"_score" : 9.358302,
"_source" : {
"type" : "line",
"line_id" : 79275,
"play_name" : "Richard II",
"speech_number" : 29,
"line_number" : "2.1.187",
"speaker" : "KING RICHARD II",
"text_entry" : "Why, uncle, whats the matter?"
}
}
]
}
}
To match terms in order, you can use a match_phrase_prefix query:
GET shakespeare/_search
{
"query": {
"match_phrase_prefix": {
"text_entry": "uncle wha"
}
},
"size": 3
}
The response contains documents that match the prefix:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : 16.37664,
"hits" : [
{
"_index" : "shakespeare",
"_id" : "2817",
"_score" : 16.37664,
"_source" : {
"type" : "line",
"line_id" : 2818,
"play_name" : "Henry IV",
"speech_number" : 5,
"line_number" : "5.2.31",
"speaker" : "HOTSPUR",
"text_entry" : "Uncle, what news?"
}
},
{
"_index" : "shakespeare",
"_id" : "6789",
"_score" : 16.37664,
"_source" : {
"type" : "line",
"line_id" : 6790,
"play_name" : "Henry VI Part 2",
"speech_number" : 60,
"line_number" : "1.3.202",
"speaker" : "KING HENRY VI",
"text_entry" : "Uncle, what shall we say to this in law?"
}
},
{
"_index" : "shakespeare",
"_id" : "7877",
"_score" : 16.37664,
"_source" : {
"type" : "line",
"line_id" : 7878,
"play_name" : "Henry VI Part 2",
"speech_number" : 13,
"line_number" : "3.2.28",
"speaker" : "KING HENRY VI",
"text_entry" : "Where is our uncle? whats the matter, Suffolk?"
}
}
]
}
}
Finally, to match the last term exactly and not as a prefix, you can use a match_phrase query:
GET shakespeare/_search
{
"query": {
"match_phrase": {
"text_entry": "uncle what"
}
},
"size": 5
}
The response contains exact matches:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 14.437452,
"hits" : [
{
"_index" : "shakespeare",
"_id" : "2817",
"_score" : 14.437452,
"_source" : {
"type" : "line",
"line_id" : 2818,
"play_name" : "Henry IV",
"speech_number" : 5,
"line_number" : "5.2.31",
"speaker" : "HOTSPUR",
"text_entry" : "Uncle, what news?"
}
},
{
"_index" : "shakespeare",
"_id" : "6789",
"_score" : 9.461917,
"_source" : {
"type" : "line",
"line_id" : 6790,
"play_name" : "Henry VI Part 2",
"speech_number" : 60,
"line_number" : "1.3.202",
"speaker" : "KING HENRY VI",
"text_entry" : "Uncle, what shall we say to this in law?"
}
},
{
"_index" : "shakespeare",
"_id" : "100955",
"_score" : 8.947967,
"_source" : {
"type" : "line",
"line_id" : 100956,
"play_name" : "Troilus and Cressida",
"speech_number" : 28,
"line_number" : "3.2.98",
"speaker" : "CRESSIDA",
"text_entry" : "Well, uncle, what folly I commit, I dedicate to you."
}
}
]
}
}
If you modify the text in the previous match_phrase query and omit the last letter, none of the documents in the previous response are returned:
GET shakespeare/_search
{
"query": {
"match_phrase": {
"text_entry": "uncle wha"
}
}
}
The result is empty:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
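The difference between the two phrase queries can be sketched in Python. The phrase_matches helper below is hypothetical and deliberately simple: it tokenizes with a regular expression, requires the query terms to appear consecutively and in order, and treats the last query term as a prefix only when prefix_last is set. Slop, scoring, and max_expansions are not modeled.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer.
    return [t.lower() for t in re.findall(r"\w+", text)]


def phrase_matches(text, query, prefix_last=False):
    doc = tokenize(text)
    terms = tokenize(query)
    if not terms:
        return False
    for start in range(len(doc) - len(terms) + 1):
        window = doc[start:start + len(terms)]
        # All terms except the last must match exactly and in order.
        if window[:-1] != terms[:-1]:
            continue
        if prefix_last:
            last_ok = window[-1].startswith(terms[-1])
        else:
            last_ok = window[-1] == terms[-1]
        if last_ok:
            return True
    return False


line = "Uncle, what news?"
print(phrase_matches(line, "uncle wha", prefix_last=True))  # like match_phrase_prefix
print(phrase_matches(line, "uncle what"))                   # like match_phrase
print(phrase_matches(line, "uncle wha"))                    # like match_phrase, last letter omitted
```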
For more information, see the search_as_you_type field type documentation.