Recently, I came across the BERT model. I did some research and tried some implementations.
I wanted to tackle an NER task, so I chose the BertForSequenceClassification model provided by HuggingFace.
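For context, my setup looks roughly like the following; the checkpoint name, label count, and learning rate are placeholders rather than my exact values:

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained BERT weights plus a randomly initialized classification head.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased",  # placeholder checkpoint
    num_labels=9,       # placeholder number of label classes
)
model.to(device)

# Optimizer built over all model parameters.
optimizer = AdamW(model.parameters(), lr=2e-5)

# train_loader is a standard DataLoader yielding (input_ids, attention_mask, labels) batches.
```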
```python
for epoch in range(1, args.epochs + 1):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_loader):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update the parameters based on their gradients, the learning rate, etc.
        optimizer.step()
```

The main part of my fine-tuning loop is shown above.
I am curious to what extent the fine-tuning alters the model. Does it freeze the weights provided by the pre-trained model and only train the top classification layer, or does it also update the hidden layers of the pre-trained BERT model?
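To make the question concrete, here is a sketch of what I mean (not something my code actually does): if only the classification head were trained, I would expect that to be equivalent to explicitly freezing the encoder like this.

```python
# Hypothetical: explicitly freeze the pre-trained BERT encoder so that only
# the classification head on top is updated during fine-tuning.
for param in model.bert.parameters():
    param.requires_grad = False

# My training loop above never does this, and the optimizer was built from
# model.parameters(), so I assume every layer gets updated -- is that right?
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(len(trainable), "parameter tensors are trainable")
```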