Model CI Report: ❌ 14 new failed tests from this PR
What does this PR do?
As per the title: this PR finally makes mamba layer caches first-class citizens and adds native support for them.
It supports the following layer combinations:
To do so, it adds the following two layer classes:
Everything integrates smoothly with the existing cache machinery in the case of hybrid attention/mamba architectures, i.e. functions such as
`get_seq_length` and `get_mask_sizes` (used notably for mask creation) will always look at attention layers.
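To illustrate the idea, here is a minimal, hypothetical sketch of why such queries must consult attention layers only: mamba layers keep a fixed-size recurrent state rather than a growing key/value history, so they have no meaningful sequence length. All class and method names below are illustrative, not the actual transformers API.

```python
class AttentionCacheLayer:
    """Grows along the sequence dimension as tokens are appended."""
    def __init__(self):
        self.seen_tokens = 0

    def update(self, num_new_tokens):
        self.seen_tokens += num_new_tokens

    def get_seq_length(self):
        return self.seen_tokens


class MambaCacheLayer:
    """Holds a fixed-size recurrent state; has no notion of sequence length."""
    def update(self, num_new_tokens):
        pass  # state is overwritten in place, nothing grows


class HybridCache:
    """Toy hybrid cache: delegates length queries to attention layers only."""
    def __init__(self, layers):
        self.layers = layers

    def update(self, num_new_tokens):
        for layer in self.layers:
            layer.update(num_new_tokens)

    def get_seq_length(self):
        # Only attention layers can answer this; mamba layers are skipped.
        for layer in self.layers:
            if isinstance(layer, AttentionCacheLayer):
                return layer.get_seq_length()
        return 0


cache = HybridCache([MambaCacheLayer(), AttentionCacheLayer()])
cache.update(5)
cache.update(3)
print(cache.get_seq_length())  # 8
```

The same dispatch pattern applies to mask-size queries: in a hybrid architecture, only the attention layers determine how large the attention mask must be.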