ElasticSearch: impact of setting a "not_analyzed" field as "store":"yes"?

Problem description

Suppose I have a string field specified as not_analyzed in the mapping. If I then add "store":"yes" to the mapping, will ElasticSearch duplicate the storage? My understanding of not_analyzed fields is that they are not run through an analyzer and are indexed as is, but a client is still able to match against them. So, if a field is both not_analyzed and store:yes, this could cause ElasticSearch to keep two copies of the string.

My question:

If a string field is stored as both not_analyzed and store:yes, will there be duplicate storage of the string?

I hope that's clear enough. Thanks!

Solution

You're mixing up the concepts of indexed fields and stored fields in Lucene, the library that Elasticsearch is built on top of.

A field is indexed when it goes into the inverted index, the data structure that Lucene uses to provide its fast full-text search capabilities. If you want to search on a field, you have to index it. When you index a field you can decide whether to index it as it is or to analyze it. Analyzing means choosing a tokenizer, which generates a list of tokens (words), and a list of token filters, which can modify the generated tokens (and even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only by searching for that exact text, whitespace included.
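The difference between analyzed and not-analyzed indexing can be sketched with a toy inverted index. This is only an illustration, not Lucene code: the `analyze` function (a whitespace tokenizer plus a lowercase filter) and the sample documents are assumptions made up for the example.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: whitespace tokenizer followed by a lowercase token filter."""
    return [token.lower() for token in text.split()]


def build_inverted_index(docs, analyzed):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # A not-analyzed field is indexed as one single term: the exact text.
        terms = analyze(text) if analyzed else [text]
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "Quick brown fox", 2: "Lazy brown dog"}

analyzed_index = build_inverted_index(docs, analyzed=True)
exact_index = build_inverted_index(docs, analyzed=False)

print(analyzed_index.get("brown"))         # [1, 2]
print(exact_index.get("brown"))            # None
print(exact_index.get("Quick brown fox"))  # [1]
```

With the not-analyzed index, only the complete original string is a term, so a search for a single word finds nothing.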

A field is stored when you want to be able to retrieve it. Let's say Lucene provides some kind of storage too, which doesn't have anything to do with the inverted index itself. When you search using Lucene you get back a list of the document ids that match. Then you can retrieve some text from their stored fields, which is what you actually show as search results. If you don't store a field you'll never be able to get it back from Lucene (this is not true for Elasticsearch though, as I'm going to explain below).

You can have fields that you only want to search on, and never show: indexed and not stored (default in Lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but that you do want to retrieve in order to show them: not indexed and stored.

Therefore the two data structures are not related to each other. If you both index and store a field in Lucene, its content will not be present twice in the same form. Stored fields are kept exactly as you send them to Lucene, while indexed fields might be analyzed and become part of the inverted index, which is something else. Stored fields are made to be retrieved for a specific document (by Lucene document id), while indexed fields are made for searching: the structure literally inverts the text, so that each term is a key pointing to the list of ids of the documents that contain it (the postings list).
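As a rough sketch of why the two structures are independent, the toy class below keeps a postings map and a stored-fields map side by side. `ToyIndex`, its methods, and the field names are all hypothetical; real Lucene is far more involved.

```python
class ToyIndex:
    """Toy model of an index with two independent structures."""

    def __init__(self):
        self.postings = {}  # (field, term) -> list of doc ids (inverted index)
        self.stored = {}    # doc id -> {field: original value} (stored fields)

    def add(self, doc_id, field, value, index=True, store=False):
        if index:
            # Indexed as-is (not analyzed): the whole value is a single term.
            self.postings.setdefault((field, value), []).append(doc_id)
        if store:
            self.stored.setdefault(doc_id, {})[field] = value

    def search(self, field, term):
        """Return the ids of the documents whose field contains the term."""
        return self.postings.get((field, term), [])

    def retrieve(self, doc_id, field):
        """Return the stored value of a field, or None if it was not stored."""
        return self.stored.get(doc_id, {}).get(field)


idx = ToyIndex()
idx.add(1, "tag", "New York", index=True, store=False)      # search only
idx.add(1, "title", "City guide", index=True, store=True)   # search and retrieve
idx.add(1, "thumbnail", "ny.png", index=False, store=True)  # retrieve only

print(idx.search("tag", "New York"))      # [1]
print(idx.retrieve(1, "tag"))             # None
print(idx.retrieve(1, "title"))           # City guide
print(idx.search("thumbnail", "ny.png"))  # []
```

The `tag` field can be searched but not read back, and the `thumbnail` field can be read back but not searched, which matches the three combinations listed above.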

When it comes to Elasticsearch, things change a little though. When you don't configure a field as stored in your mapping (the default is store:no), you are still able to retrieve it by default. This happens because Elasticsearch always stores in Lucene the whole source document that you send to it (unless you disable this feature), within a special Lucene field called _source.
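For reference, a mapping of the kind discussed in the question might look like the sketch below, written as a Python dict and printed as JSON. It assumes the legacy mapping syntax from the period of this question (a `string` type with `"index": "not_analyzed"`); later Elasticsearch versions changed this syntax, so treat it as an illustration only. The type name `my_type` and the field name `tag` are hypothetical.

```python
import json

# Legacy-style mapping (assumed syntax from the era of this question).
# "my_type" and "tag" are hypothetical names used only for illustration.
mapping = {
    "my_type": {
        "_source": {"enabled": True},  # the default: keep the original JSON
        "properties": {
            "tag": {
                "type": "string",
                "index": "not_analyzed",  # index the exact value as one term
                "store": "yes",           # also keep a separate stored copy
            }
        },
    }
}

print(json.dumps(mapping, indent=2))
```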

When you search using Elasticsearch you get back the whole _source field by default, but you can also ask for specific fields. What happens in that case is that Elasticsearch checks whether those specific fields are stored in Lucene. If they are, their content is retrieved from Lucene directly; otherwise the _source stored field is retrieved from Lucene, parsed as JSON (pull parsing), and those specific fields are extracted from it. The first case might be a little faster, but not necessarily. If your source is really big and you only want to load a couple of fields, configuring them as stored in Lucene would probably make the loading process faster. On the other hand, if your _source is not that big and you want to load many fields, then it's probably better to load only one stored field (the _source), which leads to a single disk seek, and then parse it. In most cases using the _source field works just fine.
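The lookup order described above can be modelled in a few lines. This is a simplified sketch of the described behaviour, not Elasticsearch's actual implementation; `load_fields` and the sample document are hypothetical.

```python
import json


def load_fields(stored_fields, requested):
    """Return the requested fields, preferring individually stored copies."""
    result = {}
    source = None  # parsed lazily, at most once
    for name in requested:
        if name in stored_fields:
            # The field was stored on its own (store: yes in the mapping).
            result[name] = stored_fields[name]
        else:
            # Fall back to the _source stored field and extract from its JSON.
            if source is None:
                source = json.loads(stored_fields["_source"])
            if name in source:
                result[name] = source[name]
    return result


doc = {
    "_source": '{"tag": "New York", "title": "City guide"}',
    "tag": "New York",  # stored separately: a second copy of the same data
}

print(load_fields(doc, ["tag", "title"]))
# {'tag': 'New York', 'title': 'City guide'}
```

Note that `tag` appears twice in the toy document, once inside `_source` and once as its own stored field.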

To answer your question: the inverted index and Lucene storage are two completely different things. You end up with two copies of the same data in Lucene only if you decide to store a field (store:yes in the mapping), since Elasticsearch already keeps that same content within the JSON _source. This has nothing to do with whether you index or analyze the field.
