[Cache] Native mamba & hybrid cache#44950

Open
Cyrilvallez wants to merge 44 commits into main from clean-mamba-cache

Conversation


@Cyrilvallez Cyrilvallez commented Mar 23, 2026

What does this PR do?

As per the title. This PR finally makes mamba layer caches first-class citizens and adds native support for them.

It supports the following layer combinations:

  • all mamba layers
  • alternating attention layer/mamba layer
  • layers that are BOTH mamba and attention (zamba models)

For this, it adds the following two layer classes:

  • MambaLayer
  • MambaAndAttentionLayer (combining both)
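To make the distinction concrete, here is a minimal sketch of what these layer caches look like conceptually. All field and constructor names below are assumptions for illustration, not the actual API introduced by this PR: the point is that a mamba layer holds fixed-size recurrent state, while a combined layer simply carries both kinds of state.

```python
class MambaLayer:
    """Hypothetical sketch: a mamba layer cache stores fixed-size recurrent
    states (conv state + SSM state) rather than a growing key/value history.
    Field names here are illustrative assumptions."""
    is_attention = False

    def __init__(self, batch_size, conv_dim, conv_kernel, state_size):
        # Stand-ins for tensors: only the shapes matter for this sketch.
        self.conv_state_shape = (batch_size, conv_dim, conv_kernel)
        self.ssm_state_shape = (batch_size, conv_dim, state_size)

class AttentionLayer:
    """Hypothetical key/value cache for one attention layer."""
    is_attention = True

    def __init__(self):
        self.seen_tokens = 0  # grows as tokens are cached

class MambaAndAttentionLayer(MambaLayer, AttentionLayer):
    """A layer that is BOTH mamba and attention (zamba-style models)
    simply carries both kinds of state."""
    is_attention = True

    def __init__(self, *args):
        MambaLayer.__init__(self, *args)
        AttentionLayer.__init__(self)
```

The key contrast: the mamba states never grow with sequence length, whereas the attention cache does, which is exactly why the two need distinct layer classes.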

Everything integrates smoothly with the existing cache machinery for hybrid attention/mamba architectures: functions such as get_seq_length and get_mask_sizes (used notably for mask creation) always look at attention layers.
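The dispatch rule described above can be sketched as follows. Class and attribute names here are assumptions for illustration, not the actual transformers API: a hybrid cache answers sequence-length queries from its attention layers only, because mamba layers keep fixed-size recurrent state with no notion of cached length.

```python
class MambaLayerStub:
    is_attention = False  # recurrent state only, no sequence length

class AttentionLayerStub:
    is_attention = True

    def __init__(self):
        self.seen_tokens = 0

    def update(self, num_new_tokens):
        self.seen_tokens += num_new_tokens

class HybridCache:
    """Sketch: seq-length queries skip mamba layers entirely."""
    def __init__(self, layers):
        self.layers = layers

    def get_seq_length(self):
        # Always consult an attention layer, never a mamba layer.
        for layer in self.layers:
            if layer.is_attention:
                return layer.seen_tokens
        return 0  # pure-mamba model: no attention layers to ask

cache = HybridCache([MambaLayerStub(), AttentionLayerStub()])
cache.layers[1].update(5)
print(cache.get_seq_length())  # prints 5
```

The same lookup rule would apply to mask-size queries, since masks are only meaningful for the attention layers.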

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: bamba, falcon_h1, falcon_mamba, granitemoehybrid, jamba, lfm2, lfm2_moe, mamba, mamba2, nemotron_h

@Cyrilvallez

run-slow: mamba2 zamba2 granitemoehybrid falcon_h1 lfm2 lfm2_moe qwen3_5 bamba mamba nemotron_h qwen3_next zamba jamba qwen3_5_moe falcon_mamba

@github-actions

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/bamba", "models/falcon_h1", "models/falcon_mamba", "models/granitemoehybrid", "models/jamba", "models/lfm2", "models/lfm2_moe", "models/mamba", "models/mamba2", "models/nemotron_h", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_next", "models/zamba", "models/zamba2"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | f5ffd694 | workflow commit (merge commit) |
| PR | fc27c37f | branch commit (from PR) |
| main | 28af8184 | base commit (on main) |

Model CI Report

14 new failed tests from this PR 😭

  • bamba:
    tests/models/bamba/test_modeling_bamba.py::BambaModelIntegrationTest::test_simple_batched_generate_with_padding (❌ ⟹ ❌)
    tests/models/bamba/test_modeling_bamba.py::BambaModelIntegrationTest::test_simple_generate (❌ ⟹ ❌)

  • falcon_h1:
    tests/models/falcon_h1/test_modeling_falcon_h1.py::FalconH1ModelIntegrationTest::test_falcon_h1_hard (❌ ⟹ ❌)

  • falcon_mamba:
    tests/models/falcon_mamba/test_modeling_falcon_mamba.py::FalconMambaIntegrationTests::test_generation_4bit (❌ ⟹ ❌)
    tests/models/falcon_mamba/test_modeling_falcon_mamba.py::FalconMambaIntegrationTests::test_generation_fp16 (❌ ⟹ ❌)
    tests/models/falcon_mamba/test_modeling_falcon_mamba.py::FalconMambaIntegrationTests::test_generation_torch_compile (❌ ⟹ ❌)

  • mamba:
    tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_compile_mamba_cache (✅ ⟹ ❌)

  • mamba2:
    tests/models/mamba2/test_modeling_mamba2.py::Mamba2IntegrationTest::test_batched_equivalence_with_cache (❌ ⟹ ❌)
    tests/models/mamba2/test_modeling_mamba2.py::Mamba2IntegrationTest::test_batched_equivalence_without_cache (❌ ⟹ ❌)
    tests/models/mamba2/test_modeling_mamba2.py::Mamba2IntegrationTest::test_simple_generate (❌ ⟹ ❌)

  • zamba:
    tests/models/zamba/test_modeling_zamba.py::ZambaModelTest::test_cpu_offload (✅ ⟹ ❌)
    tests/models/zamba/test_modeling_zamba.py::ZambaModelTest::test_disk_offload_bin (✅ ⟹ ❌)
    tests/models/zamba/test_modeling_zamba.py::ZambaModelTest::test_disk_offload_safetensors (✅ ⟹ ❌)

  • zamba2:
    tests/models/zamba2/test_modeling_zamba2.py::Zamba2ModelTest::test_cpu_offload (✅ ⟹ ❌)

