PostgreSQL: Find the sentences closest to a given sentence

Question

I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.

I know that I can use the @@ operator with to_tsquery(), but a tsquery only accepts specific words as queries.

One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.

However, I also feel that some kind of cosine similarity is what I need, but I don't know how to get that out of PostgreSQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.

Answer

Full Text Search (FTS)

You could use plainto_tsquery() to (per documentation) ...

SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.');

 plainto_tsquery
------------------
 'sentenc' & 'irrelev' & 'word' & 'punctuat'

Use it like:

SELECT *
FROM   tbl
WHERE  to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');

But that is still rather strict and only provides very limited tolerance for similarity.
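
To order the FTS matches by relevance instead of only filtering them, one option is ts_rank(). This is a minimal sketch, assuming the table tbl and column sentence from the query above. The index name is hypothetical, and the expression index is optional. Since plainto_tsquery() joins all surviving words with &, only sentences containing every one of them match.

```sql
-- Optional expression index so the @@ filter can use an index scan.
-- The index name is hypothetical.
CREATE INDEX tbl_sentence_fts_idx
ON     tbl USING gin (to_tsvector('english', sentence));

-- Filter with @@, then order the matches by ts_rank().
SELECT t.*, ts_rank(to_tsvector('english', t.sentence), q) AS rank
FROM   tbl t
     , plainto_tsquery('english', 'My new sentence') AS q
WHERE  to_tsvector('english', t.sentence) @@ q
ORDER  BY rank DESC
LIMIT  10;
```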

Trigram similarity

Trigram similarity might be better suited to searching for similar sentences, and it can even tolerate typos to some degree.

Install the additional module pg_trgm, create a GiST index and use the similarity operator % in a nearest neighbour search:

Basically, with a trigram GiST index on sentence:

-- SELECT set_limit(0.3);  -- adjust tolerance if needed

SELECT *
FROM   tbl
WHERE  sentence % 'My new sentence'
ORDER  BY sentence <-> 'My new sentence'
LIMIT  10;
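
The setup that the query above relies on could look like the following minimal sketch. The index name is hypothetical; tbl and sentence are the names used in the answer. In PostgreSQL 9.6 and later, set_limit() is deprecated in favor of the pg_trgm.similarity_threshold setting, which is used here.

```sql
-- Install the module (once per database).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GiST index; it supports both the % filter and <-> ordering.
-- The index name is hypothetical.
CREATE INDEX tbl_sentence_trgm_idx
ON     tbl USING gist (sentence gist_trgm_ops);

-- Threshold used by the % operator (replaces set_limit() in 9.6+).
SET pg_trgm.similarity_threshold = 0.3;

-- Inspect the raw scores: <-> returns 1 - similarity().
SELECT sentence, similarity(sentence, 'My new sentence') AS sml
FROM   tbl
ORDER  BY sml DESC
LIMIT  10;
```

A lower threshold lets more loosely related sentences through the % filter, at the cost of more rows to sort.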

Combine both

You can even combine FTS and trigram similarity.
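
One possible way to combine them, offered as an assumption rather than as the answer's own method, is to use FTS as the filter and trigram distance for the ordering. The sketch below reuses tbl and sentence from the earlier queries and requires pg_trgm to be installed.

```sql
-- FTS decides which rows qualify; trigram distance decides their order.
SELECT *
FROM   tbl
WHERE  to_tsvector('english', sentence)
       @@ plainto_tsquery('english', 'My new sentence')
ORDER  BY sentence <-> 'My new sentence'
LIMIT  10;
```

Whether this beats either technique alone depends on the data, so it is worth checking the plan with EXPLAIN ANALYZE on real captions.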

