ireland-news-word-gap-predi.../regular_roberta_from_scratch/1_train_tokenizer.py

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
# File(s) containing the raw training text
paths = ['./train_in.csv']
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Train the BPE vocabulary; 50265 matches RoBERTa's vocabulary size,
# and a pair must occur at least twice to be merged
tokenizer.train(files=paths, vocab_size=50265, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
# save_model() expects the output directory to exist, so create it first;
# it writes vocab.json and merges.txt into that directory
Path("./tokenizer_model").mkdir(parents=True, exist_ok=True)
tokenizer.save_model("./tokenizer_model")
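
# A minimal usage sketch, assuming the Hugging Face transformers library is
# installed: the vocab.json and merges.txt saved above can be loaded back as a
# RoBERTa-compatible fast tokenizer.
#
#   from transformers import RobertaTokenizerFast
#   tok = RobertaTokenizerFast.from_pretrained("./tokenizer_model")
#   print(tok.tokenize("A sample sentence."))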