Spacy与Word2Vec中的文档相似性

本文介绍了Spacy与Word2Vec中的文档相似性的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个约有1.2万个文档的利基语料库，并且我想测试具有相似含义的几乎重复的文档-考虑有关不同新闻机构报道的同一事件的文章.

I have a niche corpus of ~12k docs, and I want to test near-duplicate documents with similar meanings across it - think article about the same event covered by different news organisations.

我尝试过gensim的Word2Vec，即使测试文档在语料库中，它也给我带来了惊人的相似度得分(< 0.3)，并且我尝试了SpaCy，它给我提供了超过5k的文档相似度> 0.9.我测试了SpaCy最相似的文档，但是它几乎没有用.

I have tried gensim's Word2Vec, which gives me terrible similarity score(<0.3) even when the test document is within the corpus, and I have tried SpaCy, which gives me >5k documents with similarity > 0.9. I tested SpaCy's most similar documents, and it was mostly useless.

这是相关的代码.

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=40)
doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]] # convert the query to LSI space
index = similarities.Similarity(corpus = corpus, num_features = len(dictionary), output_prefix = "pqr")
sims = index[vec_lsi_tfidf] # perform a similarity query against the corpus
most_similar = sorted(list(enumerate(sims)), key = lambda x:x[1])

for mid in most_similar[-100:]:
    print(mid, file_list[mid[0]])

使用gensim，我发现了一种不错的方法，并进行了一些预处理，但是相似性评分仍然很低.有没有人遇到过这样的问题，是否有一些有用的资源或建议?

Using gensim I have found a decent approach, with some preprocessing, but the similarity score is still quite low. Has anyone faced such a problem, and are there are some resources or suggestions that could be useful?

score

Spacy与Word2Vec中的文档相似性

问题描述

推荐答案