fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size#44899

Open
harshaljanjani wants to merge 2 commits into huggingface:main from harshaljanjani:fix/perceiver-interpolate-pos-encoding
Conversation

@harshaljanjani
Contributor

@harshaljanjani harshaljanjani commented Mar 20, 2026

What does this PR do?

The following failing Perceiver use case was identified and fixed in this PR:

c6d2848 (🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models) refactored the interpolate_pos_encoding methods of all vision models for torch.jit.trace. The canonical pattern across the other vision models (e.g. modeling_vit.py, modeling_deit.py) is to pass the target (height, width) to nn.functional.interpolate, but the Perceiver diff passed the source grid dims, effectively making the interpolation a no-op. This PR fixes that!
→ I also checked whether other models have the exact same issue; they don't: they compute new_height = height // self.patch_size (the target patch grid) and pass that.
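To make the difference concrete, here is a small sketch of the two grid-size computations (plain Python with illustrative numbers; not the transformers code itself):

```python
import math

def vit_target_grid(height, width, patch_size):
    # ViT/DeiT pattern: divide the NEW image dims by patch_size,
    # giving the TARGET patch grid to interpolate to.
    return height // patch_size, width // patch_size

def perceiver_grid_from_num_positions(num_positions):
    # Perceiver pattern before the fix: sqrt(num_positions) recovers the
    # SOURCE grid the embeddings were stored on, not the target grid.
    side = int(math.sqrt(num_positions))
    return side, side

# ViT: 224x224 image, patch 16 -> target grid 14x14
print(vit_target_grid(224, 224, 16))               # (14, 14)

# Perceiver: embeddings stored over a 56x56 grid (3136 positions);
# interpolating back to (56, 56) is a no-op regardless of input size.
print(perceiver_grid_from_num_positions(56 * 56))  # (56, 56)
```

Since Perceiver patches with a 1x1 conv (spatial_downsample=1), the correct interpolation target is simply (height, width).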

Fixes #44898

Before the fix (feel free to cross-check; these errors are reproducible):

(screenshot 1)

After the fix (feel free to cross-check):

(screenshot 2)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you fix any necessary existing tests?
@harshaljanjani harshaljanjani marked this pull request as ready for review March 20, 2026 20:10
@github-actions github-actions bot requested a review from zucchini-nlp March 20, 2026 20:10
Member

@zucchini-nlp zucchini-nlp left a comment


Let's add a test if we don't yet have

```diff
position_embeddings = nn.functional.interpolate(
    position_embeddings,
-   size=(new_height, new_width),
+   size=(height, width),
```
Member


sounds reasonable. I am not super familiar with perceiver, do we not need to divide the height/width by patch size, so it matches with patched image features?

Contributor Author


Sorry, I could have made this clearer in the PR description! I think this also traces back to why it was missed: the prevalent ViT-family fix that worked for most models was applied here too, but Perceiver has no patch_size to divide by, so it's incorrect (Perceiver uses a conv1x1 with spatial_downsample=1).
→ ViT/DeiT: new_height = height // self.patch_size makes new_height the target grid size.
→ Perceiver: new_height = torch_int(num_positions**0.5) makes it the source grid size, so the same size=(new_height, new_width) pattern that's correct everywhere else becomes a no-op here.
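For illustration, here is a minimal standalone sketch of the fixed behavior (assumed shapes and names; not the actual modeling_perceiver.py code):

```python
import torch
from torch import nn

def interpolate_pos_encoding(position_embeddings, height, width):
    # Sketch only: position embeddings stored as (1, H0*W0, C) over a
    # square source grid. Since Perceiver patches with a 1x1 conv
    # (spatial_downsample=1), the target grid is the raw (height, width).
    num_positions = position_embeddings.shape[1]
    dim = position_embeddings.shape[-1]
    src = int(num_positions**0.5)  # source grid side, NOT the target
    pe = position_embeddings.reshape(1, src, src, dim).permute(0, 3, 1, 2)
    pe = nn.functional.interpolate(
        pe, size=(height, width), mode="bicubic", align_corners=False
    )
    return pe.permute(0, 2, 3, 1).reshape(1, height * width, dim)

pe = torch.randn(1, 56 * 56, 32)   # embeddings over a 56x56 source grid
out = interpolate_pos_encoding(pe, 64, 64)
print(out.shape)  # torch.Size([1, 4096, 32])
```

Passing size=(src, src) here, as the pre-fix code effectively did, would return the input unchanged for any image size.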

@harshaljanjani
Contributor Author

Let's add a test if we don't yet have

Thank you for your time @zucchini-nlp! I checked the test coverage: this behavior is covered by test_modeling_perceiver.py::PerceiverModelIntegrationTest::test_inference_interpolate_pos_encoding, which currently fails, and I verified that this change fixes it as well!
If I may, is there an update on https://github.com/huggingface/transformers/pull/44695 as well, at your convenience? :))
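As a minimal standalone version of what such a regression check asserts (plain PyTorch; not the actual test file), the key point is that interpolation must actually resize the grid rather than return it at the source size:

```python
import torch
from torch import nn

# Dummy positional-embedding grid: (batch, channels, H0, W0)
pe = torch.randn(1, 8, 7, 7)

out = nn.functional.interpolate(
    pe, size=(9, 9), mode="bicubic", align_corners=False
)
assert out.shape[-2:] == (9, 9)      # resized to the target grid
assert out.shape[-2:] != pe.shape[-2:]  # i.e. not a no-op at the source size
print("ok")
```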

@zucchini-nlp
Member

@harshaljanjani I see, thanks a lot for explaining! Do you mind adding a fast test in PerceiverModelTest with dummy weights?

@harshaljanjani
Contributor Author

Added the test!

Before the fix:

(screenshot)

After the fix:

(screenshot)
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: perceiver

