Document similarity in SpaCy vs. Word2Vec

Problem description

I have a niche corpus of ~12k docs, and I want to detect near-duplicate documents with similar meanings across it - think articles about the same event covered by different news organisations.

I have tried gensim's Word2Vec, which gives me terrible similarity scores (< 0.3) even when the test document is within the corpus, and I have tried SpaCy, which gives me more than 5k documents with similarity > 0.9. I tested SpaCy's most similar documents, and they were mostly useless.

Here is the relevant code.

from gensim import models, similarities

# `corpus` is the bag-of-words corpus and `dictionary` the gensim Dictionary built from it
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=40)  # train LSI on the TF-IDF corpus, not raw counts
doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]]  # convert the query to LSI space
index = similarities.Similarity(output_prefix="pqr", corpus=lsi[corpus_tfidf], num_features=lsi.num_topics)
sims = index[vec_lsi_tfidf]  # perform a similarity query against the corpus
most_similar = sorted(enumerate(sims), key=lambda x: x[1])

for mid in most_similar[-100:]:  # the 100 highest-scoring documents
    print(mid, file_list[mid[0]])

Using gensim I have found a decent approach, with some preprocessing, but the similarity scores are still quite low. Has anyone faced such a problem, and are there some resources or suggestions that could be useful?

Recommended answer

I would post a comment, but I don't have enough reputation! In NLP it is easy to get caught up in the methods and forget about the preprocessing.

1) Remove stopwords / most frequent words

2) Merge word pairs - look at SpaCy's documentation

i.e. "New York City" becomes its own unique token instead of "New", "York", "City"

https://spacy.io/usage/linguistic-features

3) Use Doc2Vec instead of Word2Vec (since you are already using gensim, this shouldn't be too hard to figure out; they have their own implementation)

Then, once you have done all of these things, you will have document vectors, which will likely give you a better score. Also, keep in mind that the 12k docs you have are a small number of samples in the grand scheme of things.

