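Search suggestions (suggest)
Concept
While the user is typing, suggesters offer completions or corrections for the input, which makes searches more precise and improves the search experience.
Types
term suggester
It matches suggestions only against the individual terms produced by the tokenizer and does not consider the relationship between multiple terms.
Structure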
POST <index>/_search
{
  "suggest": {
    "<suggest_name>": {
      "text": "<search_content>",
      "term": {
        "suggest_mode": "missing",
        "field": "<field_name>",
        "analyzer": "<analyzer_name>",
        "size": 20,
        "sort": "score",
        "max_edits": 2
      }
    }
  }
}
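Parameters
- text: the text the user searched for
- field: the field to fetch candidate suggestions from
- analyzer: the analyzer used to analyze the suggest text
- size: the maximum number of suggestions returned for each term of the text
- sort: how suggestions are sorted per term; one of two values:
  - score: by score first, then document frequency, then the term itself
  - frequency: by document frequency first, then score, then the term itself
- suggest_mode: which suggestions are generated; also an enumeration:
  - missing: the default; only generate suggestions for terms that are not in the index
  - popular: only return suggestions that occur in more documents than the original term
  - always: suggest any matching term for each term in the suggest text
- max_edits: the maximum edit distance a candidate may have to be considered a suggestion. Only 1 or 2 is allowed; any other value results in a bad request error. Defaults to 2.
- prefix_length: the minimum number of leading characters that must match for a term to be a candidate. Defaults to 1.
- min_word_length: the minimum length a term of the suggest text must have to be included. Defaults to 4.
- min_doc_freq: the minimum number of documents a suggestion must appear in, given as an absolute count or as a fraction of the documents. Defaults to 0 (disabled).
- max_term_freq: the maximum number of documents a term of the suggest text may appear in and still be checked, given as an absolute count or as a fraction of the documents. Defaults to 0.01.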
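To make max_edits and prefix_length more concrete, here is a minimal Python sketch of edit-distance filtering. It is an illustration under stated assumptions, not Elasticsearch's implementation: the function names edit_distance and term_candidates are hypothetical, the distance is the optimal-string-alignment variant (insert, delete, substitute, and adjacent swap each cost 1), and the dictionary is simply the distinct "baoqi..." terms from the news example in these notes.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute and adjacent swap cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "hc" -> "ch"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def term_candidates(term, dictionary, max_edits=2, prefix_length=1):
    """Hypothetical helper: (candidate, distance) pairs, closest first."""
    if max_edits not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    prefix = term[:prefix_length]
    found = []
    for word in dictionary:
        # skip the term itself and words that do not share the required prefix
        if word == term or not word.startswith(prefix):
            continue
        distance = edit_distance(term, word)
        if distance <= max_edits:
            found.append((word, distance))
    return sorted(found, key=lambda pair: (pair[1], pair[0]))


# Distinct terms taken from the example documents (assumed dictionary)
terms = ["baoqiang", "baoqiangge", "baoqiangba", "baoqiangda",
         "baoqiangdada", "baoqian", "baoqia"]
print(term_candidates("baoqing", terms))
print(term_candidates("baoqing", terms, max_edits=1))
```

Lowering max_edits to 1 keeps only the closest candidates, and raising prefix_length shrinks the set of words that are compared at all.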
Examples
#term suggest
DELETE news
POST _bulk
{ "index" : { "_index" : "news","_id":1 } }
{ "title": "baoqiang bought a new hat with the same color of this font, which is very beautiful baoqiangba baoqiangda baoqiangdada baoqian baoqia"}
{ "index" : { "_index" : "news","_id":2 } }
{ "title": "baoqiangge gave birth to two children, one is upstairs, one is downstairs baoqiangba baoqiangda baoqiangdada baoqian baoqia"}
{ "index" : { "_index" : "news","_id":3} }
{ "title": "baoqiangge 's money was rolled away baoqiangba baoqiangda baoqiangdada baoqian baoqia"}
{ "index" : { "_index" : "news","_id":4} }
{ "title": "baoqiangda baoqiangda baoqiangda baoqiangda baoqiangda baoqian baoqia"}
GET news/_mapping
# No analyzer is specified, so the standard analyzer is used
POST _analyze
{
  "text": [
    "BaoQiang bought a new hat with the same color of this font, which is very beautiful",
    "BaoQiangGe gave birth to two children, one is upstairs, one is downstairs",
    "BaoQiangGe 's money was rolled away"
  ]
}
# always: suggest for every term; min_doc_freq keeps only candidates found in at least 3 documents
POST /news/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "baoqing baoqiang",
      "term": {
        "suggest_mode": "always",
        "field": "title",
        "min_doc_freq": 3
      }
    }
  }
}
# popular: only suggest terms that occur in more documents than the original term
GET /news/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "baoqing baoqiang",
      "term": {
        "suggest_mode": "popular",
        "field": "title"
      }
    }
  }
}
# popular with max_term_freq 1: input terms found in more than 1 document are not checked
GET /news/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "baoqing baoqiang",
      "term": {
        "suggest_mode": "popular",
        "field": "title",
        "max_edits": 2,
        "max_term_freq": 1
      }
    }
  }
}
# always with max_edits 2 (the default value)
GET /news/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "baoqing baoqiang",
      "term": {
        "suggest_mode": "always",
        "field": "title",
        "max_edits": 2
      }
    }
  }
}
phrase suggester
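Compared with the term suggester, the phrase suggester takes context into account, that is, the other tokens of the sentence, rather than matching tokens by edit distance alone. It can pick better suggestions based on co-occurrence and frequency.
Structure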
POST <index>/_search
{
  "suggest": {
    "<suggest_name>": {
      "text": "<search_content>",
      "phrase": {
        "field": "<field_name>",
        "max_errors": 2,
        "gram_size": 1,
        "confidence": 0,
        "direct_generator": [
          {
            "field": "<field_name>",
            "suggest_mode": "missing"
          }
        ]
      }
    }
  }
}
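Parameters
- real_word_error_likelihood: the likelihood that a term is misspelled even though it exists in the index. The default is 0.95, which tells Elasticsearch that 5% of the terms in the index are misspelled. The lower the value, the more terms that exist in the index are treated as misspellings, even when they are correct.
- max_errors: the maximum share of terms that may be treated as misspellings in order to form a correction. A value below 1 is a fraction of the query terms; a value of 1 or more is an absolute number of terms. Defaults to 1, so only corrections with at most one misspelled term are returned.
- confidence: a factor applied to the score of the input phrase and used as a threshold for candidates. Only suggestions that score above the threshold are returned. For example, 1.0 returns only suggestions that score higher than the input phrase, and 0 returns the top candidates regardless of score. Defaults to 1.0.
- collate: tells Elasticsearch to check each suggestion against a specified query and to prune suggestions for which no matching document exists in the index. The query is a template query (for example a match query): the current suggestion is available as the {{suggestion}} variable, and more fields can be passed in the params object. When prune is set to true, suggestions are not dropped; instead each one gets an extra collate_match field in the response that indicates whether it matched any document.
- direct_generator: the phrase suggester uses candidate generators to produce a list of possible terms for each term in the given text. A single candidate generator is similar to calling the term suggester for each individual term in the text. The output of the generators is then scored in combination with the candidates for the other terms. Currently only one type of candidate generator is supported, direct_generator. The suggest API accepts a list of generators under the key direct_generator; each generator in the list is called per term in the original text.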
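The interplay of per-term candidates, max_errors and co-occurrence can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the scoring Elasticsearch uses: the names ngram_counts, score and best_phrase are hypothetical, the per-term candidates are passed in by hand instead of coming from a real generator, and the score is just the sum of unigram and bigram counts. The three titles mirror the example documents used for the phrase suggester in these notes.

```python
from collections import Counter
from itertools import product


def ngram_counts(docs):
    """Count single tokens and adjacent token pairs over all documents."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def score(tokens, unigrams, bigrams):
    """Toy score: frequent terms and frequent neighbours score higher."""
    return (sum(unigrams[t] for t in tokens)
            + sum(bigrams[pair] for pair in zip(tokens, tokens[1:])))


def best_phrase(text, candidates, docs, max_errors=2):
    """Pick the best-scoring phrase that changes at most max_errors terms."""
    unigrams, bigrams = ngram_counts(docs)
    original = text.lower().split()
    # every term may stay as it is or be replaced by one of its candidates
    options = [[t] + [c for c in candidates.get(t, []) if c != t]
               for t in original]
    best = None
    for combo in product(*options):
        changed = sum(1 for old, new in zip(original, combo) if old != new)
        if changed > max_errors:
            continue
        current = score(combo, unigrams, bigrams)
        if best is None or current > best[0]:
            best = (current, combo)
    return " ".join(best[1])


docs = ["lucene and elasticsearch",
        "lucene and elasticsearhc",
        "luceen and elasticsearch"]
# Hand-written candidates standing in for a candidate generator (assumption)
candidates = {"luceen": ["lucene"], "elasticsearhc": ["elasticsearch"]}
print(best_phrase("Luceen and elasticsearhc", candidates, docs))
```

With max_errors set to 0 no term may be changed, so the input phrase comes back unchanged, which is the same idea as the limit described above.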
Examples
PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "shingle"
            ]
          }
        },
        "filter": {
          "shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "trigram": {
            "type": "text",
            "analyzer": "trigram"
          }
        }
      }
    }
  }
}
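The trigram analyzer above lowercases the tokens and then emits shingles of 2 to 3 words in addition to the single words. A rough Python sketch of what ends up in title.trigram follows. It makes some assumptions: the function name shingles is hypothetical, the standard tokenizer is approximated by splitting on whitespace, and the order of the returned tokens is not the order Elasticsearch emits them in.

```python
def shingles(text, min_size=2, max_size=3, output_unigrams=True):
    """Approximate the lowercase + shingle filter chain for simple text."""
    tokens = text.lower().split()
    # the shingle filter keeps the single words unless told otherwise
    result = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result


print(shingles("Lucene and Elasticsearch"))
```

These word n-grams are what lets the phrase suggester judge how often neighbouring terms occur together.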
POST test/_bulk
{ "index" : { "_id":1} }
{"title": "lucene and elasticsearch"}
{ "index" : {"_id":2} }
{"title": "lucene and elasticsearhc"}
{ "index" : { "_id":3} }
{"title": "luceen and elasticsearch"}
POST test/_search
{
  "suggest": {
    "text": "Luceen and elasticsearhc",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "max_errors": 2,
        "gram_size": 1,
        "confidence": 0,
        "direct_generator": [
          {
            "field": "title.trigram",
            "suggest_mode": "always"
          }
        ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}
completion suggester
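Autocomplete. It supports three kinds of queries: prefix, fuzzy, and regular expression (regex). Its main use case is "Auto Completion". In this scenario a query request is sent to the backend for every character the user types, so when the user types quickly the backend has to respond very fast. For that reason it uses a different data structure from the two suggesters above: suggestions are not served from the inverted index; instead the analyzed data is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, Elasticsearch loads the whole FST into memory, so prefix lookups are extremely fast. However, an FST can only serve prefix lookups, which is the main limitation of the completion suggester.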
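The prefix-only behaviour can be illustrated with a small Python sketch. It is a stand-in, not an FST and not how Elasticsearch stores completions: a sorted list searched with bisect plays the role of the in-memory structure, the function name complete is hypothetical, and ranking by weight is left out.

```python
import bisect


def complete(prefix, sorted_terms, size=5):
    """Return up to `size` terms that start with `prefix`."""
    # in a sorted list, all terms sharing a prefix sit next to each other
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if len(matches) >= size or not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Assumed sample terms, sorted once up front
terms = sorted(["baoqiang", "baoqiangge", "baoqiangda",
                "lucene", "elasticsearch"])
print(complete("baoq", terms))   # terms that begin with "baoq"
print(complete("qiang", terms))  # nothing: "qiang" only occurs mid-word
```

The second call returns no matches even though several terms contain "qiang", which is the limitation described above: only the beginning of an input can be matched.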
Differences