MongoDB full-text search score: "What does the score mean?"

Problem description

I'm working on a MongoDB project for my school. I have a collection of sentences, and I run a normal text search to find the most similar sentence in the collection, based on the scoring.

I run this query:

db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

Take a look at these results when I query sentences:

"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*


"This sentence have nothing to do with any other"
----Matched With
"Who is the "He" in this sentence?"
----Give a result of:
*Score: 1.0* 

What is the score value, and what does it mean? What if I want to show only the results that have a similarity of 70% or above?

How can I interpret the score so that I can display a similarity percentage? I'm using C# for this, but don't worry about the implementation; I don't mind a pseudo-code solution!

Recommended answer

When you use a MongoDB text index, it generates a score for every matching document. This score indicates how strongly your search string matches the document: the higher the score, the greater the chance of resemblance to the searched text. The score is calculated as follows:

Step 1: Let the search text be S.
Step 2: Break S into tokens T1, T2, ..., Tn (unless you are doing a phrase search) and apply stemming to each token.
Step 3: For every search token, calculate the score per indexed field of the text index as follows:

score = (weight * data.freq * coeff * adjustment);

Where:
weight = user-defined weight of the field; defaults to 1 when no weight is specified
data.freq = how frequently the search token appears in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = number of matching tokens
numTokens = total number of tokens in the text
adjustment = 1 by default; 1.1 if the search token is exactly equal to the document field
Step 4: The final score of the document is the sum of all token scores per text index field:
Total Score = score(T1) + score(T2) + ... + score(Tn)
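
The steps above can be sketched in JavaScript roughly as follows. This is only an illustration of the formula as described here, not MongoDB's actual implementation: the function names scoreToken and scoreDocument and the exactMatch flag are hypothetical, and the tokens are assumed to be already stemmed and stripped of stop words.

```javascript
// Sketch of the per-field scoring formula described above.
// Assumption: tokens are already lower-cased, stemmed, and free of stop words.

function scoreToken(token, docTokens, weight = 1, exactMatch = false) {
  const count = docTokens.filter((t) => t === token).length; // data.count
  if (count === 0) return 0; // token absent: data.freq is 0, so the score is 0

  // data.freq: the i-th occurrence (i starting at 0) contributes 1 / 2^i
  let freq = 0;
  for (let i = 0; i < count; i++) freq += 1 / Math.pow(2, i);

  const coeff = (0.5 * count) / docTokens.length + 0.5;
  const adjustment = exactMatch ? 1.1 : 1; // hypothetical flag for the exact-match case
  return weight * freq * coeff * adjustment;
}

function scoreDocument(searchTokens, docTokens, weight = 1) {
  // Total score = sum of the scores of all search tokens
  return searchTokens.reduce(
    (total, token) => total + scoreToken(token, docTokens, weight),
    0
  );
}

// Search tokens "sentence" and "nothing" against a one-token field.
console.log(scoreDocument(["sentence", "nothing"], ["sentence"])); // 1
```

A field with several weighted index fields would call scoreDocument once per field and add the results.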

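So, as we can see above, a score is influenced by the following factors:

  1. The number of terms that match the searched text; the more matches, the higher the score
  2. The number of tokens in the document field
  3. Whether the searched text matches the document field exactly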
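
To illustrate the second factor, the coefficient from the formula can be evaluated for a single match in fields of different lengths. This is a small sketch of the formula as given above; coeff is a hypothetical helper name and the field lengths are arbitrary.

```javascript
// coeff = (0.5 * data.count / numTokens) + 0.5, as given in the formula above.
function coeff(count, numTokens) {
  return (0.5 * count) / numTokens + 0.5;
}

// One matching token in fields of 1, 5, and 20 tokens.
for (const numTokens of [1, 5, 20]) {
  console.log(numTokens, coeff(1, numTokens).toFixed(3));
}
```

A single match counts in full when it is the only token in the field, and the coefficient falls towards 0.5 as the field gets longer.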

Following is the derivation for one of your documents:

Search String = This sentence have nothing to do with any other
Document = Who is the "He" in this sentence?

Score Calculation:
Step 1: Tokenize the search string. Apply stemming and remove stop words.
    Token 1: "sentence"
    Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
        
      Step 3: Take Sample Document and Remove Stop Words
            Input Document:  Who is the "He" in this sentence?
            Document after stop word removal: "sentence"
      Step 4: Apply Stemming 
        Document in Step 3: "sentence"
        After Stemming : "sentence"
      Step 5: Calculate data.count per search token 
              data.count(sentence)= 1
              data.count(nothing)= 1
      Step 6: Calculate the total number of tokens in the document
              numTokens = 1
      Step 7: Calculate coefficient per search token
              coeff = (0.5 * data.count / numTokens) + 0.5
              coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
              coeff(nothing) = 0.5*(1/1) + 0.5 = 1.0
      Step 8: Calculate adjustment per search token (adjustment is 1 by default; only if the search text matches the raw document exactly is adjustment = 1.1)
              adjustment(sentence) = 1
              adjustment(nothing) = 1
      Step 9: Weight of field (1 is the default weight)
              weight = 1
      Step 10: Calculate frequency of search token in document (data.freq)
           For every ith occurrence (counting from 0), the data frequency = 1/(2^i). All occurrences are summed.
            a. data.freq(sentence) = 1/(2^0) = 1
            b. data.freq(nothing) = 0
      Step 11: Calculate score per search token per field:
         score = (weight * data.freq * coeff * adjustment);
         score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
         score(nothing) = (1 * 0 * 1.0 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0 

In the same way, you can derive the other one.
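
As a sketch of that second derivation, the snippet below applies the same formula to the "kicking a dog" result. The token lists are an assumption, not output captured from MongoDB: stop-word removal is assumed to leave four tokens in the search string and five in the document, and stemming is left out because it would change both sides in the same way.

```javascript
// Assumed token lists after stop-word removal (an assumption for illustration).
const searchTokens = ["kicking", "dog", "causes", "pain"];
const docTokens = ["kicking", "dog", "causes", "pain", "controversial"];

let total = 0;
for (const token of searchTokens) {
  const count = docTokens.filter((t) => t === token).length; // data.count
  let freq = 0; // data.freq: sum of 1 / 2^i over occurrences, i starting at 0
  for (let i = 0; i < count; i++) freq += 1 / 2 ** i;
  const coeff = (0.5 * count) / docTokens.length + 0.5; // 0.6 for one match in five tokens
  total += 1 * freq * coeff * 1; // weight = 1, adjustment = 1
}

console.log(total.toFixed(1)); // 2.4
```

Under these assumptions each of the four matching tokens contributes 0.6, which agrees with the reported score of 2.4.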

For a more detailed MongoDB analysis, see the Mongo Scoring Algorithm Blog.

