fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size#44899

Open
harshaljanjani wants to merge 2 commits into huggingface:main from harshaljanjani:fix/perceiver-interpolate-pos-encoding
Conversation

@harshaljanjani
Contributor

@harshaljanjani harshaljanjani commented Mar 20, 2026

What does this PR do?

The following failing Perceiver use case was identified and fixed in this PR:

c6d2848 (🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models) refactored the interpolate_pos_encoding methods of all vision models for torch.jit.trace. The canonical pattern across the other vision models (e.g. modeling_vit.py, modeling_deit.py) is to pass the target (height, width) to nn.functional.interpolate, but the Perceiver diff passed the source grid dims, effectively making the interpolation a no-op. This PR fixes that!
→ I also checked whether other models have the exact same issue; they don't: they compute new_height = height // self.patch_size (the target patch grid) and pass that.
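To make the difference concrete, here is a small sketch of the two grid-size computations (plain Python with illustrative numbers; not the transformers code itself):

```python
import math

def vit_target_grid(height, width, patch_size):
    # ViT/DeiT pattern: divide the NEW image dims by patch_size,
    # giving the TARGET patch grid to interpolate to.
    return height // patch_size, width // patch_size

def perceiver_grid_from_num_positions(num_positions):
    # Perceiver pattern before the fix: sqrt(num_positions) recovers the
    # SOURCE grid the embeddings were stored on, not the target grid.
    side = int(math.sqrt(num_positions))
    return side, side

# ViT: 224x224 image, patch 16 -> target grid 14x14
print(vit_target_grid(224, 224, 16))               # (14, 14)

# Perceiver: embeddings stored over a 56x56 grid (3136 positions);
# interpolating back to (56, 56) is a no-op regardless of input size.
print(perceiver_grid_from_num_positions(56 * 56))  # (56, 56)
```

Since Perceiver patches with a 1x1 conv (spatial_downsample=1), the correct interpolation target is simply (height, width).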

Fixes #44898

Before the fix (feel free to cross-check; these errors are reproducible):

(screenshot 1)

After the fix (feel free to cross-check):

(screenshot 2)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you fix any necessary existing tests?
@harshaljanjani harshaljanjani marked this pull request as ready for review March 20, 2026 20:10
@github-actions github-actions bot requested a review from zucchini-nlp March 20, 2026 20:10
Member

@zucchini-nlp zucchini-nlp left a comment


Let's add a test if we don't yet have

```diff
position_embeddings = nn.functional.interpolate(
    position_embeddings,
-   size=(new_height, new_width),
+   size=(height, width),
```
Member


sounds reasonable. I am not super familiar with perceiver, do we not need to divide the height/width by patch size, so it matches with patched image features?

Contributor Author


Sorry, I could have made this clearer in the PR description! I think this also traces back to why it was missed: the prevalent ViT-family fix that worked for most models was applied here too, but Perceiver has no patch_size to divide by, so it's incorrect (Perceiver uses a conv1x1 with spatial_downsample=1).
→ ViT/DeiT: new_height = height // self.patch_size makes new_height the target grid size.
→ Perceiver: new_height = torch_int(num_positions**0.5) makes it the source grid size, so the same size=(new_height, new_width) pattern that's correct everywhere else becomes a no-op here.
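For illustration, here is a minimal standalone sketch of the fixed behavior (assumed shapes and names; not the actual modeling_perceiver.py code):

```python
import torch
from torch import nn

def interpolate_pos_encoding(position_embeddings, height, width):
    # Sketch only: position embeddings stored as (1, H0*W0, C) over a
    # square source grid. Since Perceiver patches with a 1x1 conv
    # (spatial_downsample=1), the target grid is the raw (height, width).
    num_positions = position_embeddings.shape[1]
    dim = position_embeddings.shape[-1]
    src = int(num_positions**0.5)  # source grid side, NOT the target
    pe = position_embeddings.reshape(1, src, src, dim).permute(0, 3, 1, 2)
    pe = nn.functional.interpolate(
        pe, size=(height, width), mode="bicubic", align_corners=False
    )
    return pe.permute(0, 2, 3, 1).reshape(1, height * width, dim)

pe = torch.randn(1, 56 * 56, 32)   # embeddings over a 56x56 source grid
out = interpolate_pos_encoding(pe, 64, 64)
print(out.shape)  # torch.Size([1, 4096, 32])
```

Passing size=(src, src) here, as the pre-fix code effectively did, would return the input unchanged for any image size.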

@harshaljanjani
Contributor Author

Let's add a test if we don't yet have

Thank you for your time @zucchini-nlp! I checked the test coverage: this behavior is covered by test_modeling_perceiver.py::PerceiverModelIntegrationTest::test_inference_interpolate_pos_encoding, which currently fails, and I verified that this change fixes it as well!
If I may, is there an update on https://github.com/huggingface/transformers/pull/44695 as well, at your convenience? :))
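As a minimal standalone version of what such a regression check asserts (plain PyTorch; not the actual test file), the key point is that interpolation must actually resize the grid rather than return it at the source size:

```python
import torch
from torch import nn

# Dummy positional-embedding grid: (batch, channels, H0, W0)
pe = torch.randn(1, 8, 7, 7)

out = nn.functional.interpolate(
    pe, size=(9, 9), mode="bicubic", align_corners=False
)
assert out.shape[-2:] == (9, 9)      # resized to the target grid
assert out.shape[-2:] != pe.shape[-2:]  # i.e. not a no-op at the source size
print("ok")
```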

@zucchini-nlp
Member

@harshaljanjani I see, thanks a lot for explaining! Do you mind adding a fast test in PerceiverModelTest with dummy weights?

@harshaljanjani
Contributor Author

Added the test!

Before the fix:

(screenshot)

After the fix:

(screenshot)
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: perceiver

