Question

I'm adding a DocValue to a document with

doc.add(new BinaryDocValuesField("foo",new BytesRef("bar")));

To retrieve that value for a specific document with ID docId, I call

DocValues.getBinary(reader,"foo").get(docId).utf8ToString();

The get function in BinaryDocValues is supported up to Lucene 6.6, but for Lucene 7.0 and up it does not seem to be available anymore.

So, how do I get the DocValue by document ID in Lucene 7+ (without having to iterate over BinaryDocValues / DocIdSetIterator, and without having to re-get BinaryDocValues and call advanceExact every time)?

Recommended answer

Theory

Doc values are Lucene's column-stride field value storage. They were designed for fast random access at query time, mainly for faceting and sorting. LUCENE-7407 switched the access pattern from random access to an iterator. Because an iterator API is a much more restrictive access pattern than an arbitrary random-access API, this change gives Lucene more freedom to use aggressive compression and other optimizations:

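  • Reduced disk space usage when the data is sparse
  • Better compression ratios and faster doc value decoding, even in the non-sparse case
  • Removal of the special column for missing values (getDocsWithField) and of thread-local codec readers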
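
To make the points in this list concrete, here is a minimal, self-contained Java sketch of a sparse column read through a forward-only iterator. It is a toy model, not Lucene's codec: the class SparseColumn and its fields are hypothetical names chosen for illustration. Only the advanceExact contract (targets must not go backwards, and the return value says whether the document has a value) mirrors the Lucene 7+ API.

```java
/** Toy model of a sparse doc-values column (hypothetical, not Lucene code). */
public class SparseColumn {
    private final int[] docs;      // sorted IDs of the documents that have a value
    private final String[] values; // values[i] belongs to docs[i]
    private int cursor = 0;        // index of the first entry not yet passed
    private int current = -1;      // last target passed to advanceExact

    public SparseColumn(int[] docs, String[] values) {
        this.docs = docs;
        this.values = values;
    }

    /** Moves forward to target and reports whether that document has a value. */
    public boolean advanceExact(int target) {
        if (target < current) {
            throw new IllegalArgumentException("forward-only: " + target + " < " + current);
        }
        current = target;
        // Skip entries below target; the cursor never moves backwards.
        while (cursor < docs.length && docs[cursor] < target) {
            cursor++;
        }
        return cursor < docs.length && docs[cursor] == target;
    }

    /** Value of the current document; valid only after advanceExact returned true. */
    public String value() {
        return values[cursor];
    }

    public static void main(String[] args) {
        // Only three documents carry a value, so only three entries are stored.
        SparseColumn col = new SparseColumn(new int[] {4, 17, 900}, new String[] {"a", "b", "c"});
        System.out.println(col.advanceExact(4));   // true
        System.out.println(col.value());           // a
        System.out.println(col.advanceExact(5));   // false
        System.out.println(col.advanceExact(900)); // true
        System.out.println(col.value());           // c
    }
}
```

Because the reader only ever moves forward, the column can store just the documents that have a value, and the boolean returned by advanceExact takes over the role of a separate "docs with field" structure.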

You can read about this change in the following blogs:

  • Doc values as iterators
  • Sparse versus dense document values with Apache Lucene

In practice, this change causes performance degradation in some cases, for example SOLR-9599. In the main use cases (faceting and sorting), an iterator API works well when used properly and even enables some optimizations. However, there are many cases where this API is not a good fit. All of these cases were dismissed as incorrect usage (the same problem we had in the Java world with sun.misc.Unsafe).

In practice, org.apache.lucene.index.DocValuesIterator#advanceExact is quite fast, and in some implementations its performance and complexity are similar to those of the old random-access lookup.
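
Building on that point, one possible way to approximate the old get(docId) call is a small wrapper that reuses the iterator while doc IDs ascend and asks for a fresh one only on a backwards jump. The sketch below is plain Java so that it runs without Lucene on the classpath. The ForwardValues interface, the RandomAccessValues class and the overArray helper are hypothetical stand-ins, not Lucene types. In real Lucene 7+ code the supplier would presumably call DocValues.getBinary(leafReader, "foo"), and the value would come from binaryValue().utf8ToString() after advanceExact(docId) returns true. Those calls throw IOException, which the sketch leaves out, and the doc IDs are relative to the leaf reader.

```java
import java.util.function.Supplier;

/** Random-access facade over a forward-only doc-values iterator (illustrative sketch). */
public class RandomAccessValues {

    /** Hypothetical stand-in for the forward-only part of Lucene's BinaryDocValues. */
    public interface ForwardValues {
        boolean advanceExact(int target); // target must not be smaller than the previous target
        String value();                   // valid only after advanceExact returned true
    }

    private final Supplier<ForwardValues> source; // yields a fresh iterator positioned before doc 0
    private ForwardValues values;                  // iterator currently in use, or null
    private int lastDoc = -1;                      // last doc ID passed to the iterator
    private int reopenCount = 0;                   // how often a fresh iterator was requested

    public RandomAccessValues(Supplier<ForwardValues> source) {
        this.source = source;
    }

    /** Returns the value for docId, or null if the document has none. */
    public String get(int docId) {
        if (values == null || docId < lastDoc) {
            // First call, or a backwards jump: the old iterator cannot rewind, so get a new one.
            values = source.get();
            reopenCount++;
        }
        lastDoc = docId;
        return values.advanceExact(docId) ? values.value() : null;
    }

    public int reopenCount() {
        return reopenCount;
    }

    /** Array-backed ForwardValues for the demo; a null slot means "no value". */
    static ForwardValues overArray(String[] column) {
        return new ForwardValues() {
            private int doc = -1;

            @Override
            public boolean advanceExact(int target) {
                if (target < doc) {
                    throw new IllegalArgumentException("forward-only iterator");
                }
                doc = target;
                return column[doc] != null;
            }

            @Override
            public String value() {
                return column[doc];
            }
        };
    }

    public static void main(String[] args) {
        String[] column = {"bar", null, "baz"};
        RandomAccessValues reader = new RandomAccessValues(() -> overArray(column));
        System.out.println(reader.get(0));        // bar
        System.out.println(reader.get(2));        // baz (same iterator, moving forward)
        System.out.println(reader.get(1));        // null (backwards jump, fresh iterator)
        System.out.println(reader.reopenCount()); // 2
    }
}
```

With this design, lookups in ascending doc ID order never re-obtain the iterator, so sorting the doc IDs before reading them keeps the number of re-opens to one per segment.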
