七難ハック
Hugging Face transformers tokenizer bug when using SentencePiece
Last updated: 12/19/2021

Hugging Face transformers can insert unwanted tokens when you use a SentencePiece-based tokenizer.

Issue

For example, suppose you use AlbertTokenizer. Load a SentencePiece model that you created in advance as follows.

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer(vocab_file="spiece.model")

If you tokenize a sentence with this tokenizer, you get the following result.

tokenizer.tokenize("こんにちは世界の人々")
# ['▁', 'こんにち', 'は', '世界', 'の人々']

So what happens if you replace part of the sentence with [MASK] and then tokenize it, as you would for a fill-mask task?

tokenizer.tokenize("こんにちは[MASK]の人々")
# ['▁', 'こんにち', 'は', '[MASK]', '▁', 'の人々']

Notice the difference: an extra token, the SentencePiece meta symbol "▁" (U+2581), appears after [MASK]. Because of this extra token, the masked position cannot be predicted correctly.

Why?

Why does it behave like this? The cause is the tokenize method of the PreTrainedTokenizer class.

This method first looks for special tokens in the text and splits the text around them.

"こんにちは[MASK]の人々"
↓
["こんにちは", "[MASK]", "の人々"]
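
To make the splitting step concrete, here is a minimal sketch in plain Python. It is not the library's actual implementation; the function name split_on_special_tokens and the regex approach are assumptions made purely for illustration.

```python
import re


def split_on_special_tokens(text, special_tokens):
    """Split text into segments, keeping each special token as its own segment.

    Hypothetical helper for illustration; not part of transformers.
    """
    if not special_tokens:
        return [text] if text else []
    # Escape the tokens so "[" and "]" match literally, longest token first.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # The capturing group makes re.split keep the special tokens in the output.
    return [segment for segment in re.split(pattern, text) if segment]


print(split_on_special_tokens("こんにちは[MASK]の人々", ["[MASK]", "[CLS]", "[SEP]"]))
# ['こんにちは', '[MASK]', 'の人々']
```

Each non-special segment produced this way is later handed to SentencePiece on its own, which is where the trouble starts.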

After that, each segment other than the special tokens is tokenized separately.
In effect, the method does the following:

result = []
result.extend(tokenizer.tokenize("こんにちは"))
result.append("[MASK]")
result.extend(tokenizer.tokenize("の人々"))

Do you see what is wrong with this?
The call tokenizer.tokenize("の人々") returns ['▁', 'の人々'].
Unless you specify an option, SentencePiece adds the meta symbol "▁" (U+2581) at the beginning of every input it receives, so the segment after the special token is treated as the start of a new sentence.

In other words, I don't want the text to be split at the special token. The special token is already in the vocabulary, and I just want SentencePiece to tokenize the sentence as a whole, the way it does for "こんにちは世界の人々", without dividing it in advance.

There is no way around this problem

There is currently no way to avoid this issue: you will run into it whenever your text contains a special token. The only way to get correct inference behavior is to remove the meta symbol that appears after the special token once tokenization is done.

tokens = tokenizer.tokenize("こんにちは[MASK]の人々")
print(tokens)
# ['▁', 'こんにち', 'は', '[MASK]', '▁', 'の人々']
# The unwanted token is at index 4.
tokens.pop(4)
print(tokens)
# ['▁', 'こんにち', 'は', '[MASK]', 'の人々']
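
Hard-coding the index only works for this one sentence. A slightly more general version might look like the sketch below. The helper name drop_meta_after_special is hypothetical, and the sketch assumes that the unwanted symbol shows up as a standalone "▁" token directly after a special token, as in the example above. If the input really contains a space after the special token, or if the vocabulary merges "▁" into the following piece, this sketch will not do the right thing.

```python
SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece meta symbol


def drop_meta_after_special(tokens, special_tokens):
    """Return a copy of tokens without a bare "▁" that directly follows a special token.

    Hypothetical helper for illustration; not part of transformers.
    """
    special = set(special_tokens)
    cleaned = []
    previous = None
    for token in tokens:
        # Skip the meta symbol only when the token just before it is special.
        if token == SPIECE_UNDERLINE and previous in special:
            previous = token
            continue
        cleaned.append(token)
        previous = token
    return cleaned


tokens = ["▁", "こんにち", "は", "[MASK]", "▁", "の人々"]
print(drop_meta_after_special(tokens, ["[MASK]"]))
# ['▁', 'こんにち', 'は', '[MASK]', 'の人々']
```

The leading "▁" at the start of the sentence is kept, because no special token precedes it.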

The impact is probably limited, since this rarely matters for tasks other than fill-mask. However, if you run fill-mask through the transformers pipeline with a model whose tokenizer is based on SentencePiece, the results will be worse than expected, and this is the cause.
I think the tokenize method of the PreTrainedTokenizer class needs an option that changes this processing when the tokenizer is based on SentencePiece. I'm not sure whether a GitHub issue proposing this already exists (there are too many GitHub issues to check), but if not, I'll consider creating a pull request myself.

What we can do for now is be aware of this issue.
If your model uses SentencePiece for its tokenizer and you get unexpected inference results, it helps to recognize right away that this may be the cause.

Have a good NLP life.