Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

Problem description

When I use an analyzer with edgengram (min=3, max=7, front) together with term_vector=with_positions_offsets,

with a document whose text is "CouchDB",

and I search for "couc",

the highlight is on "cou" and not on "couc".


It seems the highlight only covers the shortest matching token, "cou", while I would expect it to cover the exact token (if possible) or at least the longest token found.
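To make the situation concrete, here is a small Python sketch that simulates a front edge n-gram tokenization with min=3 and max=7. It is only a simulation, not Elasticsearch code: the helper name `edge_ngrams` is hypothetical, and it assumes that tokens are lowercased and that the query is analyzed with the same analyzer as the document, which the question does not state.

```python
def edge_ngrams(text, min_gram=3, max_gram=7):
    """Return the front edge n-grams of `text`, lowercased (simulation only)."""
    token = text.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Tokens that would be indexed for the document text.
indexed = edge_ngrams("CouchDB")
# Tokens produced from the search input (assumption: same analyzer at query time).
queried = edge_ngrams("couc")
# Query tokens that also exist in the index.
matching = [t for t in queried if t in indexed]

print(indexed)
print(queried)
print(matching)
```

Under these assumptions both "cou" and "couc" match, and both start at offset 0 of the original text, so a highlighter that settles on "cou" marks only the first three characters.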

It works fine when the text is analyzed without term_vector=with_positions_offsets.

What is the performance impact of removing term_vector=with_positions_offsets?

Solution

When you set term_vector=with_positions_offsets on a specific field, it means that term vectors, including positions and offsets, are stored per document for that field.
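As an illustration, an index definition along these lines might look like the sketch below, written as a Python dict that can be serialized to JSON. The names `front_edge_ngram`, `edge_ngram_analyzer` and the field name `text` are hypothetical, and the choice of the standard tokenizer plus a lowercase filter is an assumption. The syntax follows recent Elasticsearch versions; older versions spell the filter type `edgeNGram`, use the `string` field type, and accept a `side` parameter, so adjust for the version in use.

```python
import json

# Sketch of index settings and mappings (all custom names are hypothetical).
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "front_edge_ngram": {
                    "type": "edge_ngram",  # "edgeNGram" in older versions
                    "min_gram": 3,
                    "max_gram": 7,
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # assumption, not from the question
                    "filter": ["lowercase", "front_edge_ngram"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "edge_ngram_analyzer",
                # Store term vectors with positions and offsets for this field.
                "term_vector": "with_positions_offsets",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```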

When it comes to highlighting, term vectors allow you to use the Lucene fast vector highlighter, which is faster than the standard highlighter. The standard highlighter has no fast way to highlight, because the index doesn't contain enough information (positions and offsets). It can only re-analyze the field content, collect the offsets and positions, and highlight based on that information. This can take quite a while, especially with long text fields.

With term vectors you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which will increase notably. I must add, though, that since Lucene 4.2 term vectors are better compressed and stored in an optimized way. There is also the new PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.
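The two storage options mentioned above correspond to two different field mapping parameters. The sketch below puts them side by side as Python dicts; the variable names are hypothetical and the `text` field type assumes a recent Elasticsearch version. Which highlighter actually consumes the postings offsets depends on the version in use.

```python
import json

# Option 1: store per-document term vectors with positions and offsets.
field_with_term_vectors = {
    "type": "text",
    "term_vector": "with_positions_offsets",
}

# Option 2: store offsets in the postings list instead of term vectors.
field_with_postings_offsets = {
    "type": "text",
    "index_options": "offsets",
}

print(json.dumps(field_with_term_vectors, indent=2))
print(json.dumps(field_with_postings_offsets, indent=2))
```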

Elasticsearch automatically picks the best way to highlight based on the information available. If term vectors are stored, it uses the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done by the standard highlighter. It will be slower, but the index will be smaller.
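Besides relying on the automatic choice, the highlighter can also be requested explicitly through the `type` option of a highlighted field, which makes it possible to compare both highlighters on the same index without reindexing. The sketch below builds such search bodies in Python; the helper `highlight_query` and the field name `text` are hypothetical, and the `match` query is an assumption about how the search is issued. `fvh` selects the fast vector highlighter and `plain` the standard one.

```python
import json


def highlight_query(term, highlighter_type=None, field="text"):
    """Build a search body that highlights `field` (hypothetical helper).

    If `highlighter_type` is None, no type is sent and Elasticsearch chooses.
    """
    field_options = {}
    if highlighter_type is not None:
        field_options["type"] = highlighter_type
    return {
        "query": {"match": {field: term}},
        "highlight": {"fields": {field: field_options}},
    }


# Force the fast vector highlighter (requires stored term vectors).
fvh_body = highlight_query("couc", "fvh")
# Force the standard highlighter, which re-analyzes the field content.
plain_body = highlight_query("couc", "plain")

print(json.dumps(fvh_body, indent=2))
print(json.dumps(plain_body, indent=2))
```

Running both bodies against the same document would show whether the short "cou" highlight comes from the fast vector highlighter alone.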

Regarding ngram fields, the described behaviour is odd, since the fast vector highlighter should have better support for ngram fields; I would expect exactly the opposite result.


09-27 02:44