How to align two GloVe models in text2vec?

Problem description

Let's say I have trained two separate GloVe vector space models (using text2vec in R) based on two different corpora. There could be different reasons for doing so: the two base corpora may come from two different time periods, or two very different genres, for example. I would be interested in comparing the usage/meaning of words between these two corpora. If I simply concatenated the two corpora and their vocabularies, that would not work (the location in the vector space for word pairs with different usages would just be somewhere in the "middle").

My initial idea was to train just one model, but when preparing the texts, append a suffix (_x, _y) to each word (where x and y stand for the usage of word A in corpus x/y), as well as keep a separate copy of each corpus without the suffixes, so that the vocabulary of the final concatenated training corpus would consist of: A, A_x, A_y, B, B_x, B_y ... etc, e.g.:

this is an example of corpus X
this be corpus Y yo
this_x is_x an_x example_x of_x corpus_x X_x
this_y be_y corpus_y Y_y yo_y

I figured the "mean" usages of A and B would serve as a sort of "coordinates" of the space, and I could measure the distance between A_x and A_y in the same space. But then I realized that since A_x and A_y never occur in the same context (because every word is suffixed, including the ones around them), this would probably distort the space and not work. I also know there is something called the orthogonal Procrustes problem, which relates to aligning matrices, but I wouldn't know how to implement it for my case.

What would be a reasonable way to fit two GloVe models (preferably in R and so that they work with text2vec) into a common vector space, if my final goal is to measure the cosine similarity of word pairs, which are orthographically identical, but occur in two different corpora?

Recommended answer

I see 2 possible solutions:

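  1. Try initializing the second GloVe model with the solution from the first, and hope the coordinate system does not change too much while the second model is being fitted.
  2. Fit the two models and obtain the word-vector matrices A and B. Then find the rotation matrix that minimizes the sum of angles between the rows of A and B (I don't yet know how to do that).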
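
For the second option, one possible starting point is the orthogonal Procrustes problem mentioned in the question. The sketch below is an illustration in base R, not a text2vec feature: `align_procrustes` and `row_cosine` are hypothetical helper names, and it assumes each model's word vectors are available as a numeric matrix with the words as row names. It minimizes the Frobenius distance between matched rows rather than the sum of angles; if the rows are first scaled to unit length, that is the same as maximizing the summed cosine similarity. The resulting matrix is orthogonal, so it may include a reflection as well as a rotation.

```r
# Hypothetical helper (not part of text2vec): map embedding matrix B into the
# coordinate system of A, using only the words the two models share.
align_procrustes <- function(A, B) {
  shared <- intersect(rownames(A), rownames(B))
  A_s <- A[shared, , drop = FALSE]
  B_s <- B[shared, , drop = FALSE]
  # Orthogonal Procrustes: minimize ||B_s %*% Q - A_s||_F over orthogonal Q.
  # With t(B_s) %*% A_s = U D V^T (SVD), the minimizer is Q = U V^T.
  s <- svd(crossprod(B_s, A_s))
  Q <- s$u %*% t(s$v)
  list(Q = Q, A = A_s, B_aligned = B_s %*% Q)
}

# Hypothetical helper: cosine similarity between matching rows of X and Y.
row_cosine <- function(X, Y) {
  rowSums(X * Y) / (sqrt(rowSums(X^2)) * sqrt(rowSums(Y^2)))
}

# Toy data standing in for two trained models: B is A expressed in a
# randomly rotated coordinate system, so a perfect alignment exists.
set.seed(1)
words <- c("this", "is", "an", "example", "corpus")
A <- matrix(rnorm(5 * 3), nrow = 5, dimnames = list(words, NULL))
R <- qr.Q(qr(matrix(rnorm(9), 3, 3)))  # random orthogonal matrix
B <- A %*% R

fit <- align_procrustes(A, B)
sims <- row_cosine(fit$A, fit$B_aligned)  # one similarity per shared word
```

With real models, words whose similarity after alignment stays low would be the candidates for a change in usage between the two corpora. Because the alignment is fitted on all shared words, it implicitly assumes that most of them are used in roughly the same way in both corpora.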

Also check http://nlp.stanford.edu/projects/histwords/; it may help with the methodology.

This may also be a question for https://math.stackexchange.com/.
