
I cannot download the BERT model directly because of a connection restriction (company privacy policy), so I downloaded the BertTokenizer source from https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py

and got my model's tokenizer vocabulary txt file from the entry listed there: "bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt"

But when I load the tokenizer, I get an error. My code:

    tokenizer = BertTokenizer.from_pretrained("My BERT MODEL DIRECTORY", do_lower_case=False)
    tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
    print(sentences[0])
    print(tokenized_texts[0])

Error message:

    'utf-8' codec can't decode bytes in position 7526-7527: invalid continuation byte

I tried adding an encoding argument ('utf-8', then 'cp949'), like this:

    tokenizer = BertTokenizer.from_pretrained("My BERT MODEL DIRECTORY", encoding='utf-8', do_lower_case=False)

but it doesn't work. Thank you in advance for your comments.

1 Answer

Your string(s) can't be decoded because they were truncated. Either handle the error manually:

    # .decode() exists only on bytes objects; in Python 3 a str is already decoded.
    print(sentences[0].decode('utf-8', 'replace'))  # Replace each invalid byte sequence with U+FFFD
    print(tokenized_texts[0].decode('utf-8', 'ignore'))  # Silently drop the invalid bytes
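To make the difference between the two handlers concrete, here is a minimal sketch. The byte string is made up for illustration (it is not taken from the vocabulary file), and `decode_with` is a hypothetical helper name.

```python
# Illustrative bytes only: "café" encoded as UTF-8, then cut off mid-character.
truncated = "café".encode("utf-8")[:-1]  # b'caf\xc3'


def decode_with(handler: str) -> str:
    """Decode the truncated bytes using the given error handler."""
    return truncated.decode("utf-8", errors=handler)


replaced = decode_with("replace")  # the dangling byte becomes U+FFFD
ignored = decode_with("ignore")    # the dangling byte is dropped

# The default handler, 'strict', raises instead of guessing.
try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced), repr(ignored), strict_failed)
```

Both handlers hide the problem rather than fix it: if the bytes are really in another encoding, the affected characters are lost either way.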

Or register a handler globally:

    import codecs
    codecs.register_error('strict', codecs.lookup_error('surrogateescape'))
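A less invasive variant, which is my own suggestion rather than part of the global approach above, is to pass `surrogateescape` per call. The sketch below uses made-up bytes and shows that the invalid bytes survive a round trip instead of being discarded.

```python
# Illustrative bytes only: 0xFF and 0xFE can never appear in valid UTF-8.
raw = b"abc\xff\xfedef"

# Each undecodable byte is mapped to a lone surrogate (U+DC80..U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the original bytes.
restored = raw if text is None else text.encode("utf-8", errors="surrogateescape")

print(repr(text), restored == raw)
```

Note that the resulting string contains lone surrogates, so encoding it with the default `strict` handler will fail.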

More info: https://docs.python.org/3/library/codecs.html
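Before suppressing the error, it may help to look at the bytes that trigger it. The following is only a sketch: `find_decode_error` is a hypothetical helper, and the path in the usage comment is a placeholder for wherever the downloaded vocabulary file lives.

```python
from pathlib import Path


def find_decode_error(path, encoding="utf-8"):
    """Return (start, end, context_bytes) for the first undecodable span, or None."""
    data = Path(path).read_bytes()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end are byte offsets into the file contents.
        context = data[max(err.start - 20, 0): err.end + 20]
        return err.start, err.end, context
    return None


# Usage (placeholder path, adjust to your own file):
# print(find_decode_error("vocab.txt"))
```

If the reported offsets point at text that looks fine in an editor, the file was probably saved in a different encoding (for example cp949) or altered during download, and re-downloading or re-saving it as UTF-8 is likely a better fix than changing the error handler.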
