How to tune Elasticsearch full-text search to match strings like 'C++'

Problem description

We have a search engine for text content that contains strings such as c++ and c#. After switching to Elasticsearch, searches no longer match terms like 'c++': the ++ is removed.

How can we teach Elasticsearch to match such terms correctly in a full-text search without removing these special characters? Characters like the comma (,) should of course still be removed.

Recommended answer

The default standard analyzer drops symbols such as + and #, so you need to create your own custom analyzer that generates tokens the way you require. For your example, I created the custom analyzer below, applied it to a text field named language, and indexed some sample docs:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "replace_comma"
                    ]
                }
            },
            "char_filter": {
                "replace_comma": {
                    "type": "mapping",
                    "mappings": [
                        ", => \\u0020"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "language": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}
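As a minimal sketch of how these settings might be applied, the Python snippet below builds the create-index request with the standard library only. The base URL `http://localhost:9200`, the index name `languages`, and the helper name `build_create_index_request` are assumptions for illustration and do not come from the original answer. The request is only constructed, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Settings and mappings copied from the answer above.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "char_filter": ["replace_comma"],
                }
            },
            "char_filter": {
                "replace_comma": {
                    "type": "mapping",
                    # Serialized to JSON this becomes ", => \\u0020",
                    # i.e. "replace a comma with a space".
                    "mappings": [", => \\u0020"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "language": {"type": "text", "analyzer": "my_analyzer"}
        }
    },
}


def build_create_index_request(base_url, index):
    """Build (but do not send) a PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(INDEX_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host and index name; adjust to your environment.
request = build_create_index_request("http://localhost:9200", "languages")
print(request.get_method(), request.full_url)
# To actually create the index: urllib.request.urlopen(request)
```

An analyzer defined this way is normally set when the index is created, so an existing index would most likely have to be recreated and reindexed to pick it up.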

Tokens generated for text such as c++, c# and c,java:

POST http://{{hostname}}:{{port}}/{{index}}/_analyze

{
  "text" : "c#",
  "analyzer": "my_analyzer"
}

{
    "tokens": [
        {
            "token": "c#",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        }
    ]
}

For c, java it generates two separate tokens, c and java, because the char filter replaces the comma with whitespace, as shown below:

{
  "text" : "c, java",
  "analyzer":"my_analyzer"
}

{
    "tokens": [
        {
            "token": "c",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "java",
            "start_offset": 3,
            "end_offset": 7,
            "type": "word",
            "position": 1
        }
    ]
}
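To make the two analysis stages easier to follow, here is a rough pure-Python approximation of what my_analyzer does: first the mapping char filter, then whitespace tokenization. It is not Elasticsearch code; the function name `analyze` is hypothetical, and Python's notion of whitespace may differ slightly from the one the whitespace tokenizer uses.

```python
def analyze(text, mappings=None):
    """Approximate my_analyzer: mapping char filter, then whitespace tokenizer."""
    if mappings is None:
        mappings = {",": " "}  # mirrors ", => \\u0020"

    # Stage 1: char filter. Each mapped character becomes one space,
    # so offsets in the filtered text still line up with the input.
    filtered = "".join(mappings.get(ch, ch) for ch in text)

    # Stage 2: split on whitespace, recording offsets and positions.
    tokens = []
    start = None
    for i, ch in enumerate(filtered + " "):  # trailing space flushes the last token
        if ch.isspace():
            if start is not None:
                tokens.append({
                    "token": filtered[start:i],
                    "start_offset": start,
                    "end_offset": i,
                    "position": len(tokens),
                })
                start = None
        elif start is None:
            start = i
    return tokens


for sample in ("c++", "c#", "c, java"):
    print(sample, "->", [t["token"] for t in analyze(sample)])
```

Because + and # are neither mapped away nor treated as separators, c++ and c# survive as single tokens, while the comma is the only character removed.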

Note: You need to understand the analysis process and adapt the custom analyzer to your own use cases. My example might not cover all of your edge cases, but hopefully it gives you an idea of how to handle such requirements.
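Two edge cases are worth checking. The whitespace tokenizer does not lowercase, so C++ and c++ end up as different terms, and punctuation other than the comma stays attached to tokens. The sketch below shows one possible way to generate an adjusted analysis section that adds the built-in lowercase token filter and maps a configurable set of characters to spaces. The helper name `build_analysis`, the char filter name `strip_punctuation`, and the choice of characters are assumptions, not part of the original answer.

```python
def build_analysis(strip_chars=(",",), lowercase=True):
    """Return an "analysis" settings section for a variant of my_analyzer."""
    analyzer = {
        "tokenizer": "whitespace",
        "char_filter": ["strip_punctuation"],
    }
    if lowercase:
        # Built-in token filter, so that "C++" and "c++" index as the same term.
        analyzer["filter"] = ["lowercase"]
    return {
        "analyzer": {"my_analyzer": analyzer},
        "char_filter": {
            "strip_punctuation": {
                "type": "mapping",
                # One "X => \\u0020" rule per character to replace with a space.
                "mappings": [f"{ch} => \\u0020" for ch in strip_chars],
            }
        },
    }


# Example: also treat semicolons as separators (an illustrative choice).
analysis = build_analysis(strip_chars=(",", ";"))
print(analysis["char_filter"]["strip_punctuation"]["mappings"])
```

Be careful which characters you strip: mapping the period to a space, for instance, would also split terms such as node.js or .net, so run each candidate rule against real samples with the _analyze API first.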

