This article describes how to speed up sentence processing with BERT in Transformers; the answer below may be a useful reference for anyone facing the same problem.

Problem description

I have this code:

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
Sentence_vectorList = []
for sent in x_train:

  # encode one sentence and add the batch dimension the model expects
  input_sentence = torch.tensor(tokenizer.encode(sent)).unsqueeze(0)
  out = model(input_sentence)
  embeddings_of_last_layer = out[0]             # [1, seq_len, 768]
  cls_embeddings = embeddings_of_last_layer[0]  # [seq_len, 768]

  cls_layer = cls_embeddings.detach().numpy()

  # average over the tokens to get one 768-dim vector per sentence
  sent_emd = np.average(cls_layer, axis=0)
  Sentence_vectorList.append(sent_emd)

The task is to take the sentence vectors, detach them into an [n x 768] array, and then save them as sent2vec. This process takes a lot of time. Is there a more efficient way to do it?

Recommended answer

You can get a small speedup by processing the sentences in batches. A batch size of 100 might be a reasonable choice. When processing sentences in batches, the model needs to know how long each sentence in the batch is; the tokenizer's batch_encode_plus takes care of that. Note that the method returns a dictionary, so you need to pass its output to the model as:

out = model(**input_sentences)

The variable out will contain the vectors for all sentences in the batch.
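
As a concrete illustration, here is a minimal sketch of the batched approach, assuming x_train is a list of strings and a batch size of 100; the mean pooling over tokens (using the attention mask to skip padding) mirrors the np.average step in the original loop and is only one reasonable way to turn token embeddings into sentence vectors:

import numpy as np
import torch

batch_size = 100  # a reasonable starting point; tune to your memory budget
sentence_vectors = []

with torch.no_grad():  # no gradients are needed for inference
    for i in range(0, len(x_train), batch_size):
        batch = x_train[i:i + batch_size]
        # batch_encode_plus pads the batch and returns a dict of tensors
        # (input_ids, attention_mask, ...) that the model expects
        encoded = tokenizer.batch_encode_plus(
            batch, padding=True, truncation=True, return_tensors='pt')
        out = model(**encoded)
        hidden = out[0]                                      # [batch, seq_len, 768]
        mask = encoded['attention_mask'].unsqueeze(-1).float()
        # average over real tokens only, ignoring the padding positions
        sent_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        sentence_vectors.append(sent_emb.cpu().numpy())

sent2vec = np.vstack(sentence_vectors)  # final array of shape [n, 768]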

The speedup from batching is relatively small on CPU but quite large on GPU. Transformers are large models and will be slow on CPU no matter what you do. If you can accept somewhat lower-quality vectors, you can try a smaller transformer such as DistilBERT.
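
A short sketch combining both suggestions, assuming a CUDA device is available and using the distilbert-base-uncased checkpoint (a smaller model that trades some vector quality for speed); the batching then works exactly as above:

import torch
from transformers import AutoTokenizer, AutoModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = 'distilbert-base-uncased'  # smaller, faster alternative to bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

encoded = tokenizer.batch_encode_plus(
    x_train[:100], padding=True, truncation=True, return_tensors='pt')
encoded = {k: v.to(device) for k, v in encoded.items()}  # move the batch to the GPU
with torch.no_grad():
    out = model(**encoded)  # out[0]: [batch, seq_len, 768]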

That concludes this article on speeding up sentence processing with BERT in Transformers; we hope the answer above is helpful.