This article covers the question "what do the vectors of a word in word2vec represent?"; the recommended answer below should be a useful reference for anyone looking into the same problem.

Problem description

word2vec is an open source tool by Google:

  • For each word it provides a vector of float values. What exactly do they represent?

There is also a paper on paragraph vectors. Can anyone explain how they use word2vec to obtain a fixed-length vector for a paragraph?

Recommended answer

TLDR: Word2Vec builds word projections (embeddings) in a latent space of N dimensions (N being the size of the word vectors obtained). The float values represent the coordinates of the words in this N-dimensional space.

The major idea behind latent space projections, i.e. putting objects into a different, continuous-dimensional space, is that your objects end up with a representation (a vector) that has more interesting computational properties than the raw objects themselves.
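As a rough illustration of what "coordinates in an N-dimensional space" means in practice, here is a minimal sketch using the gensim library; the toy corpus and hyperparameter values are arbitrary choices for illustration, not something from the original answer:

    # Minimal sketch: train word2vec on a toy corpus and inspect one word's vector.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "dog", "goes", "outside", "for", "a", "walk"],
        ["the", "cat", "goes", "out", "for", "a", "walk"],
        ["we", "eat", "a", "carrot", "for", "dinner"],
    ]

    model = Word2Vec(sentences, vector_size=10, window=3, min_count=1, epochs=50)

    vec = model.wv["outside"]   # a numpy array of 10 floats
    print(vec.shape)            # (10,) -> coordinates in a 10-dimensional latent space
    print(vec)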

For words, what's useful is that you get a dense vector space which encodes similarity (i.e. the vector for tree is more similar to the vector for wood than to the one for dancing). This is the opposite of classical sparse one-hot or "bag-of-words" encodings, which treat each word as its own dimension and therefore make all words orthogonal by design (i.e. tree, wood and dancing are all at the same distance from each other).
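To make the contrast concrete, here is a small numpy sketch; the "dense" vectors below are hand-made toy values, not vectors taken from a trained model:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # One-hot: every word is its own dimension, so all distinct words are
    # orthogonal by design (cosine 0, i.e. the same "distance" everywhere).
    tree_oh, wood_oh, dancing_oh = np.eye(3)
    print(cosine(tree_oh, wood_oh), cosine(tree_oh, dancing_oh))  # 0.0 0.0

    # Dense embeddings (toy values): similarity can now be encoded,
    # e.g. tree ends up closer to wood than to dancing.
    tree_d    = np.array([0.9, 0.1, 0.0])
    wood_d    = np.array([0.8, 0.2, 0.1])
    dancing_d = np.array([0.0, 0.1, 0.9])
    print(cosine(tree_d, wood_d))     # high (~0.98)
    print(cosine(tree_d, dancing_d))  # low  (~0.01)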

Word2Vec algorithms do this:

Imagine that you have a sentence with a blank to fill, for example:

"The dog has to go ___ for a walk."

You obviously want to fill the blank with the word "outside", but you could also use "out". The w2v algorithms are inspired by this idea: you'd like all the words that could fill in the blank to end up near each other, because they belong together. This is called the Distributional Hypothesis. Therefore the words "out" and "outside" will be closer together, whereas a word like "carrot" will be farther away.
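Assuming a Word2Vec model trained on a reasonably large corpus (the file name below is a hypothetical placeholder), you can query this effect directly in gensim; on a toy corpus the numbers would not be meaningful:

    from gensim.models import Word2Vec

    model = Word2Vec.load("my_word2vec.model")  # hypothetical pre-trained model

    # Words that appear in the same kind of blanks/contexts end up close together:
    print(model.wv.similarity("out", "outside"))  # expected to be relatively high
    print(model.wv.similarity("out", "carrot"))   # expected to be much lower
    print(model.wv.most_similar("outside", topn=5))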

This is sort of the "intuition" behind word2vec. For a more theoretical explanation of what's going on, I'd suggest reading the original word2vec papers.

For paragraph vectors, the idea is the same as in w2v. Each paragraph can be represented by its words. Two models are presented in the paper:

  1. In a "Bag of Words" way (the PV-DBOW model), where one fixed-length paragraph vector is used to predict the paragraph's words.
  2. By adding a fixed-length paragraph token to the word contexts (the PV-DM model). By backpropagating the gradient, the model gets "a sense" of what's missing, bringing paragraphs with the same "missing" words/topic close together (both variants are sketched below).
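Both variants are available in gensim's Doc2Vec, where dm=0 selects PV-DBOW and dm=1 selects PV-DM; the toy documents and hyperparameters below are arbitrary illustrations:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy paragraphs, each tagged with an id.
    docs = [
        TaggedDocument(words=["the", "dog", "goes", "outside"], tags=["doc0"]),
        TaggedDocument(words=["the", "cat", "goes", "out"],     tags=["doc1"]),
        TaggedDocument(words=["we", "cook", "a", "carrot"],     tags=["doc2"]),
    ]

    # dm=0 -> PV-DBOW: the paragraph vector alone is trained to predict the paragraph's words.
    pv_dbow = Doc2Vec(docs, vector_size=20, dm=0, min_count=1, epochs=50)

    # dm=1 -> PV-DM: the paragraph vector is added to the word-context window.
    pv_dm = Doc2Vec(docs, vector_size=20, dm=1, min_count=1, epochs=50)

    print(pv_dm.dv["doc0"].shape)                       # fixed-length vector, here (20,)
    print(pv_dm.infer_vector(["a", "dog", "outside"]))  # vector for an unseen paragraph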

For a full understanding of how these vectors are built, you'll need to learn how neural nets are built and how the backpropagation algorithm works. (I'd suggest starting with this video and Andrew Ng's Coursera class.)

NB: Softmax is just a fancy way of saying classification; each word in the w2v algorithms is treated as a class. Hierarchical softmax and negative sampling are tricks to speed up the softmax and handle a large number of classes.
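To make "softmax = classification over the vocabulary" concrete, here is a numpy sketch; the scores are random stand-ins for the network's output layer:

    import numpy as np

    def softmax(scores):
        # Numerically stable softmax: turns raw scores into a probability
        # distribution over all classes (here: every word in the vocabulary).
        z = scores - scores.max()
        e = np.exp(z)
        return e / e.sum()

    vocab = ["out", "outside", "carrot", "tree", "wood", "dancing"]
    scores = np.random.randn(len(vocab))     # stand-in for the output layer's scores
    probs = softmax(scores)
    print(dict(zip(vocab, probs.round(3))))  # P(word | context) for each vocabulary word

    # With a real vocabulary of hundreds of thousands of words this normalization
    # is expensive, which is why hierarchical softmax / negative sampling are used
    # (in gensim these correspond to the `hs` and `negative` parameters of Word2Vec).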

That concludes this article on what the vector of a word in word2vec represents; hopefully the recommended answer above is helpful.
