This article looks at the add_tokens function of the Transformers PreTrainedTokenizer and may serve as a useful reference for anyone running into the same problem.

Problem description

Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens function.

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Add new tokens and grow the embedding matrix to match the new vocabulary size.
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))

I tried the above by adding previously absent words to the default vocabulary. However, keeping everything else constant, I noticed a decrease in the accuracy of the fine-tuned classifier that uses this updated tokenizer. I was able to replicate similar behavior even when only 10% of the previously absent words were added.

My questions

  1. Am I missing something?
  2. Instead of whole words, is the add_tokens function expecting ##-prefixed wordpiece tokens, for example '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such tokens?

Any help would be appreciated.

Thanks in advance.

Recommended answer

If you add tokens to the tokenizer, you do make it tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings for the new tokens are not trained, and the rest of the network has never seen them in context. You would need a lot of data to teach BERT to deal with the newly added words.
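As a minimal sketch of both effects (assuming the standard transformers API; 'brand_new_word' is just a hypothetical token): adding a token changes how the text is split, and the embedding row created by resize_token_embeddings is freshly initialized rather than pretrained.

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

print(tokenizer.tokenize('brand_new_word'))   # before: split into known, pretrained wordpieces

old_size = model.get_input_embeddings().weight.shape[0]
tokenizer.add_tokens(['brand_new_word'])      # hypothetical new token
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize('brand_new_word'))   # after: a single added token
# The appended row comes from the model's default weight initialization,
# so it carries no pretrained information until fine-tuned on enough data.
new_rows = model.get_input_embeddings().weight[old_size:]
print(new_rows.shape, new_rows.std().item())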

There are also ways to compute an embedding for a single new word so that adding it does not hurt BERT, as in this paper, but they seem pretty complicated and should not make much difference.

BERT uses a wordpiece-based vocabulary, so it should not really matter whether a word is present in the vocabulary as a single token or gets split into multiple wordpieces. The model has probably seen the split word during pre-training and will know what to do with it.
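For instance, a word without its own vocabulary entry is simply broken into wordpieces the model already knows (a quick illustration assuming bert-base-uncased; the example word is arbitrary).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# A rare word is split into known wordpieces rather than becoming [UNK].
pieces = tokenizer.tokenize('hyperparameterization')
print(pieces)
# Every piece maps to a real vocabulary id, so BERT can process the word as usual.
print(tokenizer.convert_tokens_to_ids(pieces))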

Regarding the ##-prefixed tokens: those are tokens that can only appear as the continuation of another wordpiece, i.e. as a suffix. For example, walrus gets split into ['wal', '##rus'], and you need both of those wordpieces to be in the vocabulary, but not ##wal or rus.
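You can check which of these pieces the vocabulary actually contains via get_vocab() (a small sketch; the exact split of walrus depends on the vocabulary shipped with the model).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('walrus'))   # the split discussed above

vocab = tokenizer.get_vocab()
for tok in ['wal', '##rus', '##wal', 'rus']:
    # ##-prefixed entries may only continue a word; bare entries may only start one.
    print(tok, tok in vocab)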

This concludes this article on the Transformers PreTrainedTokenizer add_tokens function. We hope the recommended answer is helpful.
