release state cache #4462

Merged
lvhan028 merged 1 commit into InternLM:main from CUHKSZzxy:release-state-cache
Mar 25, 2026

CUHKSZzxy (Collaborator) commented Mar 25, 2026

Test

Test script
import torch
import requests
from time import sleep

BASE_URL = 'http://0.0.0.0:23334'
api_key = 'sk-xxx'
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}


def log_memory(msg: str = '') -> None:
    # for rank in range(8):
    for rank in [6, 7]:
        free_mem, total_mem = torch.cuda.mem_get_info(rank)
        used_mem = (total_mem - free_mem) / 1024**3
        print(f'rank {rank}, {msg}, total used mem {used_mem:.2f} GB')


for i in range(3):
    print(f"=== iteration {i} ===")

    log_memory(f"before sleep {i}")
    # offloads weights and kv cache with level=2
    response = requests.post(f"{BASE_URL}/sleep", headers=headers, params=dict(tags=['weights', 'kv_cache'], level=2))
    assert response.status_code == 200, response.status_code
    log_memory(f"after sleep {i}")
    sleep(1)

    # wake up weights, the server is ready for update
    log_memory(f"before wakeup weight {i}")
    response = requests.post(f"{BASE_URL}/wakeup", headers=headers, params=dict(tags=['weights']))
    assert response.status_code == 200, response.status_code
    log_memory(f"after wakeup weight {i}")
    sleep(1)

    # wake up kv cache, the server is ready for update kv cache
    log_memory(f"before wakeup kv cache {i}")
    response = requests.post(f"{BASE_URL}/wakeup", headers=headers, params=dict(tags=['kv_cache']))
    assert response.status_code == 200, response.status_code
    log_memory(f"after wakeup kv cache {i}")
    sleep(1)

Compare

Tested with Qwen3.5-35B-A3B at TP=2. Per-GPU memory after a sleep followed by a full wakeup drops from 70.39 GB (before this PR) to 68.71 GB (after this PR).

Before PR
=== iteration 0 ===
rank 6, before sleep 0, total used mem 68.79 GB
rank 7, before sleep 0, total used mem 68.79 GB
rank 6, after sleep 0, total used mem 4.06 GB
rank 7, after sleep 0, total used mem 4.06 GB
rank 6, before wakeup weight 0, total used mem 4.06 GB
rank 7, before wakeup weight 0, total used mem 4.06 GB
rank 6, after wakeup weight 0, total used mem 37.81 GB
rank 7, after wakeup weight 0, total used mem 37.81 GB
rank 6, before wakeup kv cache 0, total used mem 37.81 GB
rank 7, before wakeup kv cache 0, total used mem 37.81 GB
rank 6, after wakeup kv cache 0, total used mem 70.39 GB
rank 7, after wakeup kv cache 0, total used mem 70.39 GB
=== iteration 1 ===
rank 6, before sleep 1, total used mem 70.39 GB
rank 7, before sleep 1, total used mem 70.39 GB
rank 6, after sleep 1, total used mem 4.06 GB
rank 7, after sleep 1, total used mem 4.06 GB
rank 6, before wakeup weight 1, total used mem 4.06 GB
rank 7, before wakeup weight 1, total used mem 4.06 GB
rank 6, after wakeup weight 1, total used mem 37.81 GB
rank 7, after wakeup weight 1, total used mem 37.81 GB
rank 6, before wakeup kv cache 1, total used mem 37.81 GB
rank 7, before wakeup kv cache 1, total used mem 37.81 GB
rank 6, after wakeup kv cache 1, total used mem 70.39 GB
rank 7, after wakeup kv cache 1, total used mem 70.39 GB
=== iteration 2 ===
rank 6, before sleep 2, total used mem 70.39 GB
rank 7, before sleep 2, total used mem 70.39 GB
rank 6, after sleep 2, total used mem 4.06 GB
rank 7, after sleep 2, total used mem 4.06 GB
rank 6, before wakeup weight 2, total used mem 4.06 GB
rank 7, before wakeup weight 2, total used mem 4.06 GB
rank 6, after wakeup weight 2, total used mem 37.81 GB
rank 7, after wakeup weight 2, total used mem 37.81 GB
rank 6, before wakeup kv cache 2, total used mem 37.81 GB
rank 7, before wakeup kv cache 2, total used mem 37.81 GB
rank 6, after wakeup kv cache 2, total used mem 70.39 GB
rank 7, after wakeup kv cache 2, total used mem 70.39 GB
After PR
=== iteration 0 ===
rank 6, before sleep 0, total used mem 68.79 GB
rank 7, before sleep 0, total used mem 68.79 GB
rank 6, after sleep 0, total used mem 2.37 GB
rank 7, after sleep 0, total used mem 2.37 GB
rank 6, before wakeup weight 0, total used mem 2.37 GB
rank 7, before wakeup weight 0, total used mem 2.37 GB
rank 6, after wakeup weight 0, total used mem 36.13 GB
rank 7, after wakeup weight 0, total used mem 36.13 GB
rank 6, before wakeup kv cache 0, total used mem 36.13 GB
rank 7, before wakeup kv cache 0, total used mem 36.13 GB
rank 6, after wakeup kv cache 0, total used mem 68.71 GB
rank 7, after wakeup kv cache 0, total used mem 68.71 GB
=== iteration 1 ===
rank 6, before sleep 1, total used mem 68.71 GB
rank 7, before sleep 1, total used mem 68.71 GB
rank 6, after sleep 1, total used mem 2.37 GB
rank 7, after sleep 1, total used mem 2.37 GB
rank 6, before wakeup weight 1, total used mem 2.37 GB
rank 7, before wakeup weight 1, total used mem 2.37 GB
rank 6, after wakeup weight 1, total used mem 36.13 GB
rank 7, after wakeup weight 1, total used mem 36.12 GB
rank 6, before wakeup kv cache 1, total used mem 36.13 GB
rank 7, before wakeup kv cache 1, total used mem 36.12 GB
rank 6, after wakeup kv cache 1, total used mem 68.71 GB
rank 7, after wakeup kv cache 1, total used mem 68.71 GB
=== iteration 2 ===
rank 6, before sleep 2, total used mem 68.71 GB
rank 7, before sleep 2, total used mem 68.71 GB
rank 6, after sleep 2, total used mem 2.37 GB
rank 7, after sleep 2, total used mem 2.37 GB
rank 6, before wakeup weight 2, total used mem 2.37 GB
rank 7, before wakeup weight 2, total used mem 2.37 GB
rank 6, after wakeup weight 2, total used mem 36.13 GB
rank 7, after wakeup weight 2, total used mem 36.12 GB
rank 6, before wakeup kv cache 2, total used mem 36.13 GB
rank 7, before wakeup kv cache 2, total used mem 36.12 GB
rank 6, after wakeup kv cache 2, total used mem 68.71 GB
rank 7, after wakeup kv cache 2, total used mem 68.71 GB
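
For a quick read of the two logs, the arithmetic can be sketched as below. The figures are the per-GPU values (in GB) reported above; the dictionary keys and the `delta` helper are illustrative names, not part of the test script.

```python
# Per-GPU memory (GB) copied from the "Before PR" and "After PR" logs.
before_pr = {'start': 68.79, 'asleep': 4.06, 'after_cycle': 70.39}
after_pr = {'start': 68.79, 'asleep': 2.37, 'after_cycle': 68.71}


def delta(a: float, b: float) -> float:
    """Difference a - b, rounded to the two decimals the logs report."""
    return round(a - b, 2)


# Change in used memory caused by one sleep/wakeup cycle.
growth_before = delta(before_pr['after_cycle'], before_pr['start'])
growth_after = delta(after_pr['after_cycle'], after_pr['start'])

# Memory saved by the PR while asleep and after a full wakeup.
asleep_saving = delta(before_pr['asleep'], after_pr['asleep'])
cycle_saving = delta(before_pr['after_cycle'], after_pr['after_cycle'])

print(f'growth per cycle before PR: {growth_before:+.2f} GB')
print(f'growth per cycle after PR:  {growth_after:+.2f} GB')
print(f'saved while asleep:         {asleep_saving:.2f} GB')
print(f'saved after full wakeup:    {cycle_saving:.2f} GB')
```

Read this way, the logs suggest that before the PR the first cycle left roughly 1.6 GB more allocated than at startup, and that about the same amount stayed resident while asleep. After the PR both gaps close.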
Copilot AI (Contributor) left a comment

Pull request overview

This PR ensures the “state cache” GPU allocations are released when the model agent is put to sleep or fully released, reducing residual GPU memory usage after sleep/wakeup cycles.

Changes:

  • Clear self.state_cache_engine during sleep() so state-cache buffers can be freed.
  • Clear self.state_cache_engine during release() for full teardown parity with cache_engine.
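
The pattern described in the two changes can be illustrated with a toy model. This is only a sketch: `Buffer`, `Engine`, `ToyAgent`, and `build_cache_engines` are hypothetical stand-ins, no GPU memory is involved, and the real `sleep()` and `release()` in the project do more than this (for example handling tags and weights). The only names taken from the PR are `cache_engine`, `state_cache_engine`, `sleep`, and `release`.

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for a GPU allocation (no real CUDA memory)."""

    def __init__(self, name: str) -> None:
        self.name = name


class Engine:
    """Hypothetical stand-in for an engine object that owns one buffer."""

    def __init__(self, name: str) -> None:
        self.buffer = Buffer(name)


class ToyAgent:
    """Toy model agent holding a KV cache engine and a state cache engine."""

    def __init__(self) -> None:
        self.build_cache_engines()

    def build_cache_engines(self) -> None:
        self.cache_engine = Engine('kv_cache')
        self.state_cache_engine = Engine('state_cache')

    def sleep(self) -> None:
        # Drop both references; keeping state_cache_engine alive here
        # would keep its buffer allocated for the whole sleep period.
        self.cache_engine = None
        self.state_cache_engine = None
        gc.collect()

    def wakeup(self) -> None:
        self.build_cache_engines()

    def release(self) -> None:
        # Full teardown clears the same two references as sleep().
        self.cache_engine = None
        self.state_cache_engine = None
        gc.collect()


agent = ToyAgent()
# A weak reference lets us observe the buffer without keeping it alive.
state_ref = weakref.ref(agent.state_cache_engine.buffer)
agent.sleep()
state_freed = state_ref() is None
agent.wakeup()
print('state cache buffer freed during sleep:', state_freed)
```

The weak reference only shows that nothing else holds the buffer once the attribute is cleared. In the real server the observable effect is the lower `total used mem` figure in the logs.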


lvhan028 merged commit 90245a3 into InternLM:main on Mar 25, 2026
8 of 9 checks passed
CUHKSZzxy deleted the release-state-cache branch on March 25, 2026 07:58

4 participants