Lately, I've been working on generating alt text for images with the BLIP model, specifically "blip-image-captioning-base" from Hugging Face. To generate alt text in Turkish with this model, I'm integrating a BERT tokenizer trained for Turkish: "dbmdz/bert-base-turkish-cased", also from Hugging Face. I manually add the [BOS] and [EOS] tokens, configure the model (e.g., model.config.decoder_start_token_id), resize the token embeddings with resize_token_embeddings, and then tie the weights with tie_weights. When I check the output after fine-tuning, the model generates Turkish captions successfully, but it never uses the [BOS] token and instead puts a completely different word at the beginning of the sentence. There are no ID conflicts in the vocabulary, and the model does recognize the [BOS] token.
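For context, here is a sketch of the setup described above (a configuration sketch, not a verified recipe — which config field BLIP's generate actually reads can vary by transformers version, so both candidates are set):

```python
from transformers import AutoTokenizer, BlipForConditionalGeneration

# Turkish BERT tokenizer; [BOS]/[EOS] are registered manually since BERT
# vocabularies don't ship with them.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
tokenizer.add_special_tokens({"bos_token": "[BOS]", "eos_token": "[EOS]"})

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
# Resize on the text decoder submodule so the new rows land in the decoder's
# embedding matrix, then re-tie input/output embeddings.
model.text_decoder.resize_token_embeddings(len(tokenizer))
model.tie_weights()

# Point the config at the new special tokens.
model.config.text_config.bos_token_id = tokenizer.bos_token_id
model.config.text_config.eos_token_id = tokenizer.eos_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```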
The words the model puts at the beginning of sentences are almost non-existent in the dataset, and the ones that do occur never appear sentence-initially. I haven't been able to solve this problem.
Have you done/are you doing this kind of integration? Do you have any suggestions?
To solve this, I manually added [BOS] and [EOS] to the reference captions and retrained. I also reset the [BOS] token's embedding weights and let the model relearn them, but neither helped.
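The label-framing step looks like this (a toy illustration — the IDs are made up; in a real run they come from the Turkish tokenizer):

```python
# Toy illustration of wrapping reference captions as [BOS] <caption> [EOS].
# BOS_ID/EOS_ID are made-up placeholders for the tokenizer's real IDs.
BOS_ID, EOS_ID = 1, 2

def build_labels(caption_ids):
    """Frame a tokenized caption so the target sequence is [BOS] <caption> [EOS]."""
    return [BOS_ID] + list(caption_ids) + [EOS_ID]

print(build_labels([7, 8, 9]))  # → [1, 7, 8, 9, 2]
```

One thing I wonder about: with next-token training on shifted labels, [BOS] is only ever an input, never a prediction target, so the model may have had no training signal to *emit* it.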
Expected result: [BOS] <alt_text> [EOS]
Actual result without forced_bos_token_id: 'geçiyoruz' <alt_text> [EOS]
Actual result with forced_bos_token_id: 'geçiyoruz' [BOS] <slightly_trimmed_alt_text> [EOS]
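For what it's worth, both behaviors are consistent with a decoder that never learned to emit [BOS] itself. A toy greedy decoder (the IDs and the stand-in "model" below are entirely made up for illustration — this is not the HF implementation, just the control flow it approximates) reproduces the pattern, including the trimmed caption when forcing:

```python
# Toy vocabulary; IDs are invented for the illustration.
BOS, EOS = 1, 2
VOCAB = {1: "[BOS]", 2: "[EOS]", 3: "geçiyoruz", 4: "kedi", 5: "koltukta"}

def toy_model_next(prefix):
    """Stand-in for the fine-tuned decoder: it never predicts [BOS], because
    during training [BOS] was only an input, never a target."""
    table = {
        (): 3,              # first step: the model prefers a content word
        (3,): 4,
        (3, 4): 5,
        (3, 4, 5): EOS,
        # after a forced [BOS] the prefix is off-distribution: a word is lost
        (3, BOS): 5,
        (3, BOS, 5): EOS,
    }
    return table.get(tuple(prefix), EOS)

def generate(forced_bos_token_id=None, max_len=6):
    out = []
    for step in range(max_len):
        tok = toy_model_next(out)
        # Forcing overwrites whatever the model wanted at that position.
        if forced_bos_token_id is not None and step == 1:
            tok = forced_bos_token_id
        out.append(tok)
        if tok == EOS:
            break
    return [VOCAB[t] for t in out]

print(generate())          # → ['geçiyoruz', 'kedi', 'koltukta', '[EOS]']
print(generate(BOS))       # → ['geçiyoruz', '[BOS]', 'koltukta', '[EOS]']
```

The forced run drops "kedi": forcing consumes a position the model wanted for a content word, which matches the "slightly trimmed" caption I observe.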