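I am trying to index strings that contain hyphens but no spaces, periods, or any other punctuation. I do not want the words to be split on the hyphens; instead, I want the hyphens to be part of the indexed text.
For example, my 6 text strings would be:
magazineplayon
magazineofhorses
online-magazine
best-magazine
friend-of-magazines
magazineplaygames
I would like to be able to search these strings for text that contains "play" or starts with "magazine".
I have been able to use ngram to get the contains "play" search working. However, the hyphen causes the text to be split, so the results include strings where "magazine" appears after a hyphen. I only want strings that begin with "magazine" to appear.
Based on the sample above, only these 3 should appear when searching for strings that start with "magazine":
magazineplayon
magazineofhorses
magazineplaygames
Please help with my Elasticsearch index sample: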
    DELETE /sample

    PUT /sample
    {
      "settings": {
        "index.number_of_shards": 5,
        "index.number_of_replicas": 0,
        "analysis": {
          "filter": {
            "nGram_filter": {
              "type": "nGram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [ "letter", "digit" ]
            },
            "word_delimiter_filter": {
              "type": "word_delimiter",
              "preserve_original": true,
              "catenate_all": true
            }
          },
          "analyzer": {
            "ngram_index_analyzer": {
              "type": "custom",
              "tokenizer": "lowercase",
              "filter": [ "nGram_filter", "word_delimiter_filter" ]
            }
          }
        }
      }
    }

    PUT /sample/1/_create
    { "name" : "magazineplayon" }

    PUT /sample/3/_create
    { "name" : "magazineofhorses" }

    PUT /sample/4/_create
    { "name" : "online-magazine" }

    PUT /sample/5/_create
    { "name" : "best-magazine" }

    PUT /sample/6/_create
    { "name" : "friend-of-magazines" }

    PUT /sample/7/_create
    { "name" : "magazineplaygames" }

    GET /sample/_search
    { "query": { "wildcard": { "name": "*play*" } } }

    GET /sample/_search
    { "query": { "wildcard": { "name": "magazine*" } } }
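As a rough illustration of the symptom (plain Python, not Elasticsearch code): assuming the text is tokenized by splitting on non-letter characters, which is what the `lowercase` tokenizer in the sample above does, each hyphenated name yields its own `magazine` token, so a prefix check against individual tokens matches every name. The helper name `split_on_non_letters` and the ASCII-only regex are illustrative simplifications, not part of any Elasticsearch API.

```python
import re

def split_on_non_letters(text):
    """Lowercase the text and split on anything that is not a letter.

    This only approximates the `lowercase` tokenizer (ASCII letters only).
    """
    return re.findall(r"[a-z]+", text.lower())

NAMES = [
    "magazineplayon",
    "magazineofhorses",
    "online-magazine",
    "best-magazine",
    "friend-of-magazines",
    "magazineplaygames",
]

# A prefix search runs against individual tokens, not against the whole string,
# so a name matches as soon as any of its tokens starts with "magazine".
prefix_matches = [
    name for name in NAMES
    if any(token.startswith("magazine") for token in split_on_non_letters(name))
]

for name in NAMES:
    print(name, "->", split_on_non_letters(name))
print("matches for magazine*:", prefix_matches)
```

Under this assumption all six names match, because `online-magazine` is seen as the two tokens `online` and `magazine` rather than as one string.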
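Update 1: I updated all the create statements to use the test type after sample:

    PUT /sample/test/7/_create
    { "name" : "magazinefairplay" }

Then I ran the following to return only the names that contain the word "play", without doing a wildcard search. This works correctly and returns only two records.

    POST /sample/test/_search
    {
      "query": {
        "bool": {
          "minimum_should_match": 1,
          "should": [
            { "match": { "name.substrings": "play" } }
          ]
        }
      }
    }

I ran the following to return only the names that start with "magazine". My expectation was that "online-magazine", "best-magazine" and "friend-of-magazines" would not appear. However, all seven records were returned, including those three.

    POST /sample/test/_search
    {
      "query": {
        "bool": {
          "minimum_should_match": 1,
          "should": [
            { "match": { "name.prefixes": "magazine" } }
          ]
        }
      }
    }

Is there a way to filter out the prefix matches that come from the hyphens?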
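You're on the right path. However, you also need to add another analyzer that uses the edge-ngram token filter to make the "starts with" constraint work. You can keep the ngram filter for the field that checks whether a name "contains" a given word, but you need edge-ngram for the field that checks whether a name "starts with" some token.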
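To make the difference concrete, here is a minimal Python sketch (not Elasticsearch code) of what the two filters produce. It assumes the whole name is kept as a single lowercased token, as a `keyword` tokenizer followed by a `lowercase` filter would do, and that the search term is looked up as one exact token. The function names `ngrams`, `edge_ngrams` and `index_tokens` are made up for this illustration.

```python
def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of `token` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return {
        token[start:start + size]
        for size in range(min_gram, upper + 1)
        for start in range(len(token) - size + 1)
    }

def edge_ngrams(token, min_gram=2, max_gram=20):
    """Only the substrings anchored at the start of `token` (its prefixes)."""
    upper = min(max_gram, len(token))
    return {token[:size] for size in range(min_gram, upper + 1)}

def index_tokens(name, gram_fn):
    # Assumption: the whole value is one token (hyphens included), lowercased,
    # and only then expanded into grams.
    return gram_fn(name.lower())

NAMES = [
    "magazineplayon",
    "magazineofhorses",
    "online-magazine",
    "best-magazine",
    "friend-of-magazines",
    "magazineplaygames",
]

# "contains play": the search term must be one of the indexed n-grams.
contains_play = [n for n in NAMES if "play" in index_tokens(n, ngrams)]

# "starts with magazine": the search term must be one of the indexed edge n-grams.
starts_with_magazine = [n for n in NAMES if "magazine" in index_tokens(n, edge_ngrams)]

print(contains_play)
print(starts_with_magazine)
```

Because the edge n-grams of `online-magazine` are `on`, `onl`, `onli` and so on, `magazine` is never among them, which is why the hyphenated names drop out of the "starts with" results in this sketch. Note that with `max_gram` set to 20, a search term longer than 20 characters would not be among the indexed grams at all.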
    PUT /sample
    {
      "settings": {
        "index.number_of_shards": 5,
        "index.number_of_replicas": 0,
        "analysis": {
          "filter": {
            "nGram_filter": {
              "type": "nGram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [ "letter", "digit" ]
            },
            "edgenGram_filter": {
              "type": "edgeNGram",
              "min_gram": 2,
              "max_gram": 20
            }
          },
          "analyzer": {
            "ngram_index_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [ "lowercase", "nGram_filter" ]
            },
            "edge_ngram_index_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [ "lowercase", "edgenGram_filter" ]
            }
          }
        }
      },
      "mappings": {
        "test": {
          "properties": {
            "name": {
              "type": "string",
              "fields": {
                "prefixes": {
                  "type": "string",
                  "analyzer": "edge_ngram_index_analyzer",
                  "search_analyzer": "standard"
                },
                "substrings": {
                  "type": "string",
                  "analyzer": "ngram_index_analyzer",
                  "search_analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
Then your query becomes the following (i.e. search for all documents whose name field contains play or starts with magazine):
    POST /sample/test/_search
    {
      "query": {
        "bool": {
          "minimum_should_match": 1,
          "should": [
            { "match": { "name.substrings": "play" } },
            { "match": { "name.prefixes": "magazine" } }
          ]
        }
      }
    }
Note: do not use wildcard to search for substrings, as it will kill the performance of your cluster (more info here and here).