01 · Prompt Cache Affinity in the Routing Layer: The Full Path from Registration to TTL Expiry

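This document walks the full path in order: registration → trigger conditions → cache key computation → routing hit → write on success → TTL expiry. After reading it you should be able to answer the following questions:

At which step of Router startup is PromptCachingDeploymentCheck registered?
What are the 4 hard constraints, and which one, if unmet, makes the whole mechanism degrade to a no-op?
What exactly is the truncation rule in extract_cacheable_prefix? Does the same prompt with a different user question map to the same key?
What does the cache value store, and how does it expire?
Does the write happen before or after the request? Why does it not check whether the response contains a cache_write?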

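1. Where it is registered: Router initialization

The entry point of the whole mechanism is the add_optional_pre_call_checks() function in litellm/router.py.

1.1 The config string passes a Pydantic whitelist

litellm/types/router.py:807-816 defines the legal strings for the optional_pre_call_checks field: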

OptionalPreCallChecks = List[
    Literal[
        "prompt_caching",
        "router_budget_limiting",
        "responses_api_deployment_check",
        "session_affinity",
        "deployment_affinity",
        "enforce_model_rate_limits",
    ]
]

When the YAML contains optional_pre_call_checks: ["prompt_caching"], that string is stored in the router.optional_pre_call_checks list.
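The whitelist behaviour can be illustrated with the standard library alone. The sketch below copies the Literal shown above into a local alias and checks configured names against it; `PreCallCheckName`, `ALLOWED_CHECKS` and `validate_checks` are hypothetical names for this illustration, not LiteLLM APIs, and real validation is done by Pydantic rather than by a helper like this.

```python
from typing import List, Literal, get_args

# Local copy of the Literal from the text, so the sketch is self-contained.
PreCallCheckName = Literal[
    "prompt_caching",
    "router_budget_limiting",
    "responses_api_deployment_check",
    "session_affinity",
    "deployment_affinity",
    "enforce_model_rate_limits",
]

ALLOWED_CHECKS = set(get_args(PreCallCheckName))


def validate_checks(configured: List[str]) -> List[str]:
    """Hypothetical helper: reject any name outside the whitelist."""
    unknown = [name for name in configured if name not in ALLOWED_CHECKS]
    if unknown:
        raise ValueError(f"unsupported optional_pre_call_checks: {unknown}")
    return configured
```

A misspelled entry such as "prompt_cache" would be rejected here, which is the practical reason to copy the string from the whitelist verbatim.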

1.2 Registered as a CustomLogger callback

litellm/router.py:1249-1274:

for pre_call_check in optional_pre_call_checks:
    _callback: Optional[CustomLogger] = None
    if pre_call_check in (
        "deployment_affinity",
        "responses_api_deployment_check",
        "session_affinity",
    ):
        continue                                              # these three take a separate path
    if pre_call_check == "prompt_caching":
        _callback = PromptCachingDeploymentCheck(cache=self.cache)
    elif pre_call_check == "router_budget_limiting":
        _callback = RouterBudgetLimiting(...)
    elif pre_call_check == "enforce_model_rate_limits":
        _callback = ModelRateLimitingCheck(dual_cache=self.cache)

    if _callback is None:
        continue

    if self.optional_callbacks is None:
        self.optional_callbacks = []
    self.optional_callbacks.append(_callback)
    litellm.logging_callback_manager.add_litellm_callback(_callback)

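→ PromptCachingDeploymentCheck is attached in two places at once:
1. self.optional_callbacks, which the router's own pre-call check chain iterates over
2. litellm.logging_callback_manager, which hooks it into the global logging callback chain so that async_log_success_event can fire after a successful request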

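1.3 The injected cache is the Router's own DualCache

Note that PromptCachingDeploymentCheck(cache=self.cache) passes the router's own DualCache instance. This means:

Router configured with Redis (redis_host etc.) → DualCache contains a RedisCache → the mapping is shared across Pods
Router without Redis → DualCache has only an InMemoryCache → valid only inside a single Pod process and lost on restart

⚠️ This is the root cause behind 02-config-reference.md §"Is Redis required?".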

1.4 Why the UI does not show it, and why GET `/router/settings` may not return this field

The field is written to the router, but the read path follows different logic:

The UI Router Settings page (RouterSettingsForm.tsx) renders from the static ROUTER_SETTINGS_FIELDS list, which has 22 fields (routing_strategy / num_retries / fallbacks / enable_pre_call_checks / enable_tag_filtering / disable_cooldowns / ...); optional_pre_call_checks is not among them.
The GET /router/settings endpoint (router_settings_endpoints.py:69-131):
uses ROUTER_SETTINGS_FIELDS as the schema
extracts the current value of each whitelisted field from the llm_router object + YAML config
merges with current_values.update(router_settings_from_config) → fields that exist on the router instance but are not whitelisted are not read out

Supporting code, router_settings_endpoints.py:106-120:

current_values = {}
if llm_router is not None:
    for field in router_fields:               # ← iterates only the ROUTER_SETTINGS_FIELDS whitelist
        if hasattr(llm_router, field.field_name):
            value = getattr(llm_router, field.field_name)
            current_values[field.field_name] = value

current_values.update(router_settings_from_config)  # ← non-whitelisted fields are not listed proactively

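Result:

Scenario	`/router/settings` returns
Field set in the `router_settings` section of the YAML	⚠️ May not be returned (depends on whether `update` merged in the original YAML dict)
Field set in `LiteLLM_Config.router_settings` in the DB	⚠️ Same as above
Field not in any configuration	Not returned

→ GET /router/settings cannot be used to prove the field is not in effect. Its response is driven by what the UI needs to render, not by the router's real state. Reliable ways to verify:

/get/config/callbacks to inspect the callback list (05-best-practices.md §5.3a)
Send real requests and inspect the usage field (05-best-practices.md §5.3c)
Check directly whether the Redis key was created (if you have shell access)

The write path is not affected by the whitelist: /config/update accepts arbitrary JSON and writes it to the DB, and YAML loading at startup also accepts arbitrary router_settings fields. The configuration does take effect; the UI just does not display it.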
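To make the read-path gap concrete, here is a minimal sketch that reproduces the whitelist loop with stand-in objects. The `router` namespace, the three-entry `WHITELIST` and `read_settings` are all hypothetical; the real endpoint works on `llm_router` and the full ROUTER_SETTINGS_FIELDS list.

```python
from types import SimpleNamespace

# Hypothetical stand-ins: a router object and a three-field whitelist.
router = SimpleNamespace(
    routing_strategy="simple-shuffle",
    num_retries=2,
    optional_pre_call_checks=["prompt_caching"],  # set on the router, but not whitelisted
)
WHITELIST = ["routing_strategy", "num_retries", "fallbacks"]


def read_settings(router_obj, whitelist):
    """Collect only whitelisted attributes that exist on the router object."""
    return {
        name: getattr(router_obj, name)
        for name in whitelist
        if hasattr(router_obj, name)
    }


current_values = read_settings(router, WHITELIST)
```

In this sketch `current_values` never contains optional_pre_call_checks even though the attribute is set on the object, which mirrors why an empty response says nothing about whether the check is active.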

2. Hard constraints that gate triggering

PromptCachingDeploymentCheck.async_filter_deployments is called during deployment selection for every request. It short-circuits layer by layer:

# litellm/router_utils/pre_call_checks/prompt_caching_deployment_check.py:23-49
async def async_filter_deployments(
    self, model, healthy_deployments, messages, request_kwargs=None, parent_otel_span=None,
) -> List[dict]:
    if messages is not None and is_prompt_caching_valid_prompt(
        messages=messages, model=model,
    ):                                                        # ← constraints 1+2
        prompt_cache = PromptCachingCache(cache=self.cache)
        model_id_dict = await prompt_cache.async_get_model_id(
            messages=cast(List[AllMessageValues], messages),
            tools=None,                                       # ← constraint 3: tools hardcoded to None
        )                                                     # constraint 4 lives inside async_get_model_id
        if model_id_dict is not None:
            model_id = model_id_dict["model_id"]
            for deployment in healthy_deployments:
                if deployment["model_info"]["id"] == model_id:
                    return [deployment]                       # ★ hit: return only this deployment

    return healthy_deployments                                # miss: return the list unchanged

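2.1 Constraint 1: messages is not None

messages is None → the if block is skipped and the original list is returned by the final return statement. Requests without messages, such as embedding and image generation, therefore never participate.

2.2 Constraint 2: token count ≥ MINIMUM_PROMPT_CACHE_TOKEN_COUNT (1024)

litellm/utils.py:8969-8996: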

def is_prompt_caching_valid_prompt(
    model: str,
    messages: Optional[List[AllMessageValues]],
    tools: Optional[List[ChatCompletionToolParam]] = None,
    custom_llm_provider: Optional[str] = None,
) -> bool:
    """OpenAI + Anthropic providers have a minimum token count of 1024 for prompt caching."""
    try:
        if messages is None and tools is None:
            return False
        if custom_llm_provider is not None and not model.startswith(custom_llm_provider):
            model = custom_llm_provider + "/" + model
        token_count = token_counter(
            messages=messages, tools=tools, model=model,
            use_default_image_token_count=True,
        )
        return token_count >= MINIMUM_PROMPT_CACHE_TOKEN_COUNT
    except Exception as e:
        verbose_logger.error(f"Error in is_prompt_caching_valid_prompt: {e}")
        return False

The threshold is defined in litellm/constants.py:252-254:

MINIMUM_PROMPT_CACHE_TOKEN_COUNT = int(
    os.getenv("MINIMUM_PROMPT_CACHE_TOKEN_COUNT", 1024)
)  # minimum number of tokens to cache a prompt by Anthropic

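→ It can be changed through the MINIMUM_PROMPT_CACHE_TOKEN_COUNT environment variable; there is no YAML equivalent.

On an exception the function conservatively returns False (the request does not participate in sticky routing, which is equivalent to skipping the check).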
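The threshold-plus-conservative-failure pattern can be sketched without LiteLLM. Below, `whitespace_token_count` is a deliberately crude, hypothetical stand-in for a real tokenizer, and `is_long_enough` is an illustrative name; only the environment variable name comes from the text above.

```python
import os
from typing import Callable, List, Optional

# Same env var as above; read once at import time, defaulting to 1024.
MIN_TOKENS = int(os.getenv("MINIMUM_PROMPT_CACHE_TOKEN_COUNT", 1024))


def whitespace_token_count(messages: List[dict]) -> int:
    """Hypothetical tokenizer stand-in: counts whitespace-separated words."""
    return sum(len(str(m.get("content", "")).split()) for m in messages)


def is_long_enough(
    messages: Optional[List[dict]],
    count_tokens: Callable[[List[dict]], int] = whitespace_token_count,
    minimum: int = MIN_TOKENS,
) -> bool:
    """True only when the prompt reaches the threshold; any error means False."""
    if messages is None:
        return False
    try:
        return count_tokens(messages) >= minimum
    except Exception:
        # Fail closed: a broken tokenizer must not pin traffic anywhere.
        return False
```

Returning False on failure keeps the request on the normal load-balancing path, which is the safer default for a routing hint.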

2.3 Constraint 3: tools is effectively hardcoded to None

Note that the call to async_get_model_id on line 41 always passes tools=None. PromptCachingCache.serialize_object can serialize tools, but this path never uses that ability.

Practical consequence:

# Request 1:
messages = [{"role": "system", "content": [{"type": "text", "text": "S", "cache_control": {"type": "ephemeral"}}]}]
tools = [tool_A, tool_B]

# Request 2:
messages = [{"role": "system", "content": [{"type": "text", "text": "S", "cache_control": {"type": "ephemeral"}}]}]
tools = [tool_C]   # ← completely different

# The router treats both requests as the same cache key and pins them to the same deployment.

If you worry that this defeats caching, the only options are to fold tools into the system text on the client side or to wait for LiteLLM to resolve the TODO (comment on line 97).
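The collision is easy to reproduce with a toy fingerprint. This is a sketch only: the `{"messages": ..., "tools": ...}` payload layout and the `fingerprint` helper are assumptions made for illustration and are not taken from the LiteLLM source.

```python
import hashlib
import json
from typing import List, Optional


def fingerprint(messages: List[dict], tools: Optional[List[dict]]) -> str:
    """Hash messages plus tools; the payload layout is an assumption of this sketch."""
    payload = {"messages": messages, "tools": tools}
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode()).hexdigest()


system = [{"role": "system", "content": [
    {"type": "text", "text": "S", "cache_control": {"type": "ephemeral"}}
]}]
tools_1 = [{"name": "tool_A"}, {"name": "tool_B"}]
tools_2 = [{"name": "tool_C"}]

# Router behaviour described above: tools are dropped before hashing.
same_key = fingerprint(system, None) == fingerprint(system, None)
# What would happen if tools were part of the fingerprint.
distinct_key = fingerprint(system, tools_1) != fingerprint(system, tools_2)
```

With tools dropped, both requests share one key; with tools included they would diverge, which is what the TODO is expected to enable.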

2.4 Constraint 4: messages must contain a `cache_control` block

This is the easiest hard constraint to overlook; it is buried inside get_prompt_caching_cache_key:

litellm/router_utils/prompt_caching_cache.py:140-173:

@staticmethod
def get_prompt_caching_cache_key(
    messages: Optional[List[AllMessageValues]],
    tools: Optional[List[ChatCompletionToolParam]],
) -> Optional[str]:
    if messages is None and tools is None:
        return None

    cacheable_messages = None
    if messages is not None:
        cacheable_messages = PromptCachingCache.extract_cacheable_prefix(messages)
        if not cacheable_messages:                     # ← empty list
            return None                                # ← returns None immediately
    ...

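When extract_cacheable_prefix finds no cache_control block it returns an empty list, so cache_key is None, so async_get_model_id returns None, and the whole mechanism becomes a no-op.

In other words, if the client does not explicitly mark messages with cache_control, optional_pre_call_checks: ["prompt_caching"] does nothing at all.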

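3. Cache key algorithm

3.1 extract_cacheable_prefix: where does it cut?

prompt_caching_cache.py:55-137:

Core rule: find the position of the last cache_control: {"type": "ephemeral"} marker across all messages. The cacheable prefix is everything from the start up to and including that position. All messages and blocks after it are discarded.

Two cache_control placements are supported:

Message level (when content is a string):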
```
{"role": "user", "content": "long text", "cache_control": {"type": "ephemeral"}}
```
Lines 83-92: checks whether message.get("cache_control") is ephemeral.

Content-block level (when content is a list):

{"role": "system", "content": [
  {"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}
]}

Lines 94-107: iterates over each content_block and checks its cache_control field.

Truncation logic (lines 113-137):

cacheable_prefix = []
for msg_idx, message in enumerate(messages):
    if msg_idx < last_cacheable_message_idx:
        cacheable_prefix.append(message)              # earlier messages: kept in full
    elif msg_idx == last_cacheable_message_idx:
        content = message.get("content")
        if isinstance(content, list) and last_cacheable_content_idx is not None:
            message_copy = cast(AllMessageValues, {
                **message,
                "content": content[: last_cacheable_content_idx + 1],  # keep blocks up to the marked one
            })
            cacheable_prefix.append(message_copy)
        else:
            cacheable_prefix.append(message)          # message-level cache_control: keep the whole message
    else:
        break                                         # later messages: dropped
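For readers who want to experiment with the rule end to end, the following is a self-contained re-implementation sketch that also covers the "find the last marker" step not shown in the excerpt. It works on plain dicts, and `_is_ephemeral` and `extract_prefix` are illustrative names, so treat it as an approximation of the described behaviour rather than the library code.

```python
from typing import List, Optional


def _is_ephemeral(obj: dict) -> bool:
    """True when obj carries cache_control: {"type": "ephemeral"}."""
    cc = obj.get("cache_control")
    return isinstance(cc, dict) and cc.get("type") == "ephemeral"


def extract_prefix(messages: List[dict]) -> List[dict]:
    """Keep everything up to and including the last ephemeral marker."""
    last_msg: Optional[int] = None
    last_block: Optional[int] = None
    for i, message in enumerate(messages):
        if _is_ephemeral(message):  # message-level marker
            last_msg, last_block = i, None
        content = message.get("content")
        if isinstance(content, list):
            for j, block in enumerate(content):
                if isinstance(block, dict) and _is_ephemeral(block):
                    last_msg, last_block = i, j  # block-level marker
    if last_msg is None:
        return []  # no marker anywhere: nothing is cacheable
    prefix = list(messages[:last_msg])
    final = messages[last_msg]
    if last_block is not None:
        # Truncate the marked message to the blocks up to the marker.
        final = {**final, "content": final["content"][: last_block + 1]}
    prefix.append(final)
    return prefix
```

Feeding it two requests that differ only after the marked block yields identical prefixes, which is the property the sticky key relies on.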

3.2 Worked examples

Example 1: typical system prompt caching

[
  {"role": "system", "content": [
    {"type": "text", "text": "<system prompt of about 5000 tokens>", "cache_control": {"type": "ephemeral"}}
  ]},
  {"role": "user", "content": "query A"}
]

Last cache_control block: msg_idx=0, content_idx=0
cacheable_prefix = content block 0 of message 0
A later request with the same system prompt but the user message changed to "query B" → identical cache key → pinned to the same deployment ✅

Example 2: multi-turn conversation caching

[
  {"role": "system", "content": [{"type": "text", "text": "S", "cache_control": {"type": "ephemeral"}}]},
  {"role": "user", "content": "Q1"},
  {"role": "assistant", "content": "A1"},
  {"role": "user", "content": [
    {"type": "text", "text": "Q2", "cache_control": {"type": "ephemeral"}}
  ]}
]

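Last cache_control: msg_idx=3, content_idx=0
cacheable_prefix = messages 0-2 in full + block 0 of message 3 (the Q2 text)
When the next turn moves on to Q3, the cache_key changes, because the last marked user message is part of the fingerprint
This is the typical "incremental multi-turn caching" pattern: every added turn creates a new sticky key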

Example 3: no cache_control marker

[
  {"role": "system", "content": "You are an assistant"},
  {"role": "user", "content": "Hello"}
]

extract_cacheable_prefix returns []
get_prompt_caching_cache_key returns None
The whole mechanism is a no-op and the request is load-balanced normally

3.3 serialize_object: serialization stability

prompt_caching_cache.py:36-53:

@staticmethod
def serialize_object(obj: Any) -> Any:
    if hasattr(obj, "dict"):
        return obj.dict()                              # Pydantic
    elif isinstance(obj, dict):
        return json.dumps(obj, sort_keys=True, separators=(",", ":"))  # sorted keys
    elif isinstance(obj, list):
        return [PromptCachingCache.serialize_object(item) for item in obj]
    elif isinstance(obj, (int, float, bool)):
        return obj
    return str(obj)                                    # fallback: convert to string

Key point: dicts go through sort_keys=True and separators=(",", ":"), so differences in whitespace or key order cannot change the hash. Even if the client builds the dict with a different key insertion order on two occasions, the serialized result is the same.
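The key-order claim can be checked directly with `json.dumps`. The two dicts below are illustrative inputs and `canonical` is a hypothetical helper that applies the same two arguments as the dict branch above.

```python
import json

block_a = {"type": "text", "text": "S", "cache_control": {"type": "ephemeral"}}
# Same fields, different insertion order.
block_b = {"cache_control": {"type": "ephemeral"}, "text": "S", "type": "text"}


def canonical(obj: dict) -> str:
    """Serialize with sorted keys and no extra whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


# Default dumps preserves insertion order, so the strings differ.
naive_equal = json.dumps(block_a) == json.dumps(block_b)
# Canonical form is identical for both dicts.
canonical_equal = canonical(block_a) == canonical(block_b)
```

Because `sort_keys` applies recursively, nested dicts such as cache_control are normalized as well.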

3.4 Final cache_key format

prompt_caching_cache.py:165-173:

data_to_hash_str = json.dumps(data_to_hash, sort_keys=True, separators=(",", ":"))
hashed_data = hashlib.sha256(data_to_hash_str.encode()).hexdigest()
return f"deployment:{hashed_data}:prompt_caching"

→ A key in Redis looks like this:

deployment:a1b2c3d4...64hex...:prompt_caching

KEYS deployment:*:prompt_caching lists every sticky mapping directly. On a busy production instance prefer SCAN with a MATCH pattern, because KEYS blocks the server while it walks the keyspace.

4. Routing hit flow

4.1 The `_run_pre_call_checks` call chain

litellm/router.py:6093-6140:

async def _run_pre_call_checks(
    self,
    model: str,
    healthy_deployments: List[Dict],
    messages: Optional[List[AllMessageValues]],
    parent_otel_span: Optional[Span],
    request_kwargs: Optional[dict] = None,
    logging_obj: Optional[LiteLLMLogging] = None,
):
    returned_healthy_deployments = healthy_deployments
    for _callback in litellm.callbacks:
        if isinstance(_callback, CustomLogger):
            try:
                returned_healthy_deployments = (
                    await _callback.async_filter_deployments(
                        model=model,
                        healthy_deployments=returned_healthy_deployments,
                        messages=messages,
                        request_kwargs=request_kwargs,
                        parent_otel_span=parent_otel_span,
                    )
                )
            except Exception as e:
                ## LOG FAILURE EVENT
                if logging_obj is not None:
                    asyncio.create_task(
                        logging_obj.async_failure_handler(
                            exception=e, traceback_exception=traceback.format_exc(),
                            end_time=time.time(),
                        )
                    )
                    threading.Thread(
                        target=logging_obj.failure_handler,
                        args=(e, traceback.format_exc()),
                    ).start()
                raise e
    return returned_healthy_deployments

→ Multiple callbacks run serially: each callback's output is the next callback's input. PromptCachingDeploymentCheck, DeploymentAffinityCheck, ModelRateLimitingCheck and the rest filter in turn.
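The serial hand-off can be modelled in a few lines of asyncio. The two filters below (`pin_to_b`, `drop_a`) and the `run_chain` driver are hypothetical stand-ins chosen for illustration; they only mimic the shape of the loop above, without error handling or logging.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Deployment = Dict[str, str]
Filter = Callable[[List[Deployment]], Awaitable[List[Deployment]]]


async def pin_to_b(deployments: List[Deployment]) -> List[Deployment]:
    """Hypothetical affinity filter: narrows to deployment 'b' when present."""
    hit = [d for d in deployments if d["id"] == "b"]
    return hit or deployments


async def drop_a(deployments: List[Deployment]) -> List[Deployment]:
    """Hypothetical limiter: removes deployment 'a'."""
    return [d for d in deployments if d["id"] != "a"]


async def run_chain(
    deployments: List[Deployment], filters: List[Filter]
) -> List[Deployment]:
    current = deployments
    for f in filters:  # each filter sees the previous filter's output
        current = await f(current)
    return current


candidates = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
result = asyncio.run(run_chain(candidates, [pin_to_b, drop_a]))
```

Once an affinity-style filter has narrowed the list to one element, every later filter can only keep or remove that element.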

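4.2 Forced routing on a hit

On a hit, async_filter_deployments returns the single-element list [matched_deployment]. The router's downstream routing strategy (latency-based / lowest-tpm etc.) receives only that one element, so it has nothing to choose from.

→ Prompt cache stickiness overrides any routing_strategy; on a hit it has the highest priority.

4.3 Miss: return the list unchanged

If async_get_model_id returns None (cache miss or TTL expired), or the cached model_id is not present in healthy_deployments, the original healthy_deployments is returned and the downstream routing_strategy works as usual.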
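The three outcomes (hit, miss, stale mapping) can be exercised with a small stand-alone function. `filter_by_cached_id` is an illustrative name, and the cached model_id is passed in directly instead of being looked up in a cache.

```python
from typing import List, Optional


def filter_by_cached_id(
    healthy_deployments: List[dict], cached_model_id: Optional[str]
) -> List[dict]:
    """Return the single matching deployment, or the untouched list otherwise."""
    if cached_model_id is not None:
        for deployment in healthy_deployments:
            if deployment["model_info"]["id"] == cached_model_id:
                return [deployment]
    return healthy_deployments


pool = [{"model_info": {"id": "x"}}, {"model_info": {"id": "y"}}]
hit = filter_by_cached_id(pool, "y")      # narrowed to one deployment
miss = filter_by_cached_id(pool, None)    # no mapping: list unchanged
stale = filter_by_cached_id(pool, "z")    # mapped deployment is not healthy: list unchanged
```

The stale case matters operationally: a mapping that points at a cooled-down deployment never blocks the request, it simply stops narrowing the candidate list.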

5. Write timing: a callback after success

5.1 Trigger point: async_log_success_event

prompt_caching_deployment_check.py:51-100:

async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
    standard_logging_object: Optional[StandardLoggingPayload] = kwargs.get(
        "standard_logging_object", None
    )
    if standard_logging_object is None:
        return

    call_type = standard_logging_object["call_type"]
    if (
        call_type != CallTypes.completion.value
        and call_type != CallTypes.acompletion.value
        and call_type != CallTypes.anthropic_messages.value
    ):
        verbose_logger.debug(...)
        return

    model = standard_logging_object["model"]
    messages = standard_logging_object["messages"]
    model_id = standard_logging_object["model_id"]

    if messages is None or not isinstance(messages, list):
        return
    if model_id is None:
        return

    if is_prompt_caching_valid_prompt(
        model=model,
        messages=cast(List[AllMessageValues], messages),
    ):
        cache = PromptCachingCache(cache=self.cache)
        await cache.async_add_model_id(
            model_id=model_id,
            messages=messages,
            tools=None,  # [TODO]: add tools once standard_logging_object supports it
        )

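5.2 Key design choice: look only at the request, not the response

The write logic checks request-side conditions only:
- call_type is allowed (completion / acompletion / anthropic_messages)
- messages is a list
- model_id is known
- token count >= 1024 (is_prompt_caching_valid_prompt)
- messages contains a cache_control block (implicit: async_add_model_id calls get_prompt_caching_cache_key internally; without cache_control the key is None and the write silently returns)

It does not look at cache_creation_input_tokens in the response.

This has two implications:

Upside: stickiness is established even when the upstream tells you nothing

Even if the upstream response carries no cache fields (for example, some relay services strip the usage field), LiteLLM still creates the mapping on the optimistic assumption that the prompt triggered a cache write. Later requests with the same prefix keep landing on this deployment, which gives the upstream a chance to build the cache.

Risk: you may stick to a relay that does not support caching at all

If your relay never forwards cache_control upstream (many cheap relays strip it silently), no cache write ever actually happened. LiteLLM nevertheless pins every request that "looks cacheable" to it. The result is uneven load and no cache benefit: a lose-lose.

→ See 05-best-practices.md §"Relay pitfalls" for details.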

5.3 call_type restrictions

Only these 3 call types trigger a write:

call_type	Source
`completion`	Synchronous `litellm.completion()`
`acompletion`	Asynchronous `litellm.acompletion()`, the main `/v1/chat/completions` entry point
`anthropic_messages`	`/v1/messages`, the Anthropic-native entry point

Not triggered: embedding / image / responses / batch / fine-tuning / speech and every other non-chat-completion call.
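Collected into one predicate, the request-side gate for the write looks roughly like the sketch below. `should_record_affinity` and its parameters are hypothetical; in particular the token count and the presence of cache_control are passed in as plain values here, whereas the real code derives them from the messages.

```python
from typing import Any, Optional

# The three call types listed above.
WRITABLE_CALL_TYPES = {"completion", "acompletion", "anthropic_messages"}


def should_record_affinity(
    call_type: str,
    messages: Any,
    model_id: Optional[str],
    token_count: int,
    has_cache_control: bool,
    minimum_tokens: int = 1024,
) -> bool:
    """Request-side gate only; nothing from the response is consulted."""
    if call_type not in WRITABLE_CALL_TYPES:
        return False
    if not isinstance(messages, list) or model_id is None:
        return False
    if token_count < minimum_tokens:
        return False
    return has_cache_control
```

Note that the signature has no response parameter at all, which is the design choice discussed in §5.2.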

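5.4 Where the standard_logging_object data comes from

model: the model name actually selected (may be a full name such as bedrock/...)
messages: the original messages (including cache_control markers)
model_id: the UUID of the deployment the request was actually routed to (i.e. model_info.id); this is the value written into the mapping

6. Storage layer: the two-tier DualCache

6.1 PromptCachingCache initialization

prompt_caching_cache.py:31-34: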

class PromptCachingCache:
    def __init__(self, cache: DualCache):
        self.cache = cache
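        self.in_memory_cache = InMemoryCache()           # note: this attribute is created but never used

⚠️ The self.in_memory_cache line appears redundant (a grep of the whole file shows no use of it). Storage goes entirely through self.cache (the DualCache passed in by the router).

6.2 DualCache: dual write, dual read

DualCache wraps an InMemoryCache + RedisCache internally: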

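# Pseudocode: a simplified model of the behavior, not LiteLLM's actual class
class DualCache:
    def __init__(self, in_memory, redis=None):
        self.in_memory = in_memory   # local in-process cache
        self.redis = redis           # None when Redis is not configured

    async def async_set_cache(self, key, value, ttl):
        self.in_memory.set(key, value, ttl)              # always write locally
        if self.redis is not None:
            await self.redis.async_set(key, value, ttl)  # and to Redis when present

    async def async_get_cache(self, key):
        v = self.in_memory.get(key)                      # local memory first
        if v is not None:
            return v
        if self.redis is not None:
            return await self.redis.async_get(key)       # fall back to Redis
        return None

→ Write: both Redis and local memory
→ Read: local memory first (fast), then Redis on a miss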

6.3 TTL hardcoded to 300s

prompt_caching_cache.py:189-192 and lines 208-212:

self.cache.set_cache(
    cache_key, PromptCachingCacheValue(model_id=model_id), ttl=300
)
# the async version does the same:
await self.cache.async_set_cache(
    cache_key,
    PromptCachingCacheValue(model_id=model_id),
    ttl=300,  # store for 5 minutes
)

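→ The TTL is hardcoded, with no YAML equivalent. Changing it requires editing the source.

Why 300 seconds? It matches the default 5-minute TTL of Anthropic's ephemeral cache: the upstream cache lives for 5 minutes by default, so the sticky mapping is only meaningful within the same window. Anthropic later added a 1-hour ephemeral TTL option, but the LiteLLM routing layer does not yet distinguish between the two.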
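How a fixed 300-second TTL interacts with repeated writes can be shown with a toy store and a fake clock. `TTLStore` is a hypothetical class written for this illustration; it is not DualCache, and the `now` list is just a mutable clock for the demo.

```python
from typing import Callable, Dict, Optional, Tuple


class TTLStore:
    """Tiny TTL map with an injectable clock (hypothetical, for illustration)."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float) -> None:
        # Overwriting a key replaces its expiry, i.e. renews the TTL.
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[1]:
            self._data.pop(key, None)
            return None
        return entry[0]


now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("k", "deploy-X", ttl=300)
now[0] = 200.0
store.set("k", "deploy-X", ttl=300)  # a later successful request renews the mapping
now[0] = 450.0
still_sticky = store.get("k")        # 450 < 200 + 300, so the mapping survives
now[0] = 500.0
expired = store.get("k")             # 500 >= 500, so the mapping is gone
```

Steady traffic on the same prefix therefore keeps the mapping alive indefinitely, while an idle gap of 300 seconds or more drops it.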

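6.4 Multi-Pod behavior

Scenario	With Redis	Without Redis
Pod A writes → Pod A reads	✅ local cache hit	✅ local cache hit
Pod A writes → Pod B reads	✅ shared via Redis	❌ Pod B knows nothing about it
Pod A writes → Pod A reads after restart	✅ shared via Redis	❌ memory wiped

→ Multi-Pod production deployments must configure Redis. Without it every Pod keeps its own mapping, so stickiness holds only when consecutive requests happen to land on the same Pod, and the upstream prompt cache hit rate drops accordingly.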
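The table can be replayed with a small simulation in which a plain dict stands in for Redis. `PodCache` is a hypothetical class for this sketch and ignores TTLs entirely; it only models which layer each read can see.

```python
from typing import Dict, Optional


class PodCache:
    """Local dict plus an optional shared dict standing in for Redis."""

    def __init__(self, shared: Optional[Dict[str, str]] = None):
        self.local: Dict[str, str] = {}
        self.shared = shared

    def set(self, key: str, value: str) -> None:
        self.local[key] = value
        if self.shared is not None:
            self.shared[key] = value

    def get(self, key: str) -> Optional[str]:
        if key in self.local:
            return self.local[key]
        if self.shared is not None:
            return self.shared.get(key)
        return None


# With a shared store: Pod B sees what Pod A wrote.
redis_like: Dict[str, str] = {}
pod_a, pod_b = PodCache(redis_like), PodCache(redis_like)
pod_a.set("k", "deploy-X")
shared_read = pod_b.get("k")

# Without a shared store: each Pod only knows its own writes.
solo_a, solo_b = PodCache(), PodCache()
solo_a.set("k", "deploy-X")
isolated_read = solo_b.get("k")

# A restarted Pod starts with empty local memory but can still read the shared store.
restarted_read = PodCache(redis_like).get("k")
```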

7. Comparison with other pre-call checks

Check	Class	Filter logic	TTL	Storage
`prompt_caching`	`PromptCachingDeploymentCheck`	messages fingerprint → 1 deployment	300s hardcoded	DualCache
`deployment_affinity`	`DeploymentAffinityCheck`	user_api_key or another fingerprint → deployment	configurable via `deployment_affinity_ttl_seconds`	DualCache
`session_affinity`	`DeploymentAffinityCheck`	session_id → deployment	same as above	DualCache
`responses_api_deployment_check`	`DeploymentAffinityCheck`	previous_response_id → deployment	same as above	DualCache
`enforce_model_rate_limits`	`ModelRateLimitingCheck`	model-level RPM/TPM rate limiting	N/A	DualCache (counters)
`router_budget_limiting`	`RouterBudgetLimiting`	provider/model-level budget limits	N/A	DualCache (spend counters)

→ The first 4 are all "affinity routing" checks that narrow healthy_deployments down to 1. Of these, deployment_affinity / session_affinity / responses_api_deployment_check share a single DeploymentAffinityCheck instance (router.py:1200-1244), while prompt_caching is a separate instance.

Serial execution order: the registration order of the litellm.callbacks list. If prompt_caching and deployment_affinity are both enabled and the former hits first, the choice is already locked in and the latter receives a single-element list.

8. Timeline of one complete request

T0   Client sends request → /v1/chat/completions
      body contains messages; the system block carries cache_control: {type: ephemeral}, ~5000 tokens

T1   Proxy auth completes → calls router.acompletion()

T2   router picks a deployment:
      - get healthy_deployments (N candidates)
      - enter the _run_pre_call_checks loop
      - PromptCachingDeploymentCheck.async_filter_deployments:
          * is_prompt_caching_valid_prompt: token_count=5000 >= 1024 → True
          * PromptCachingCache.async_get_model_id(messages, tools=None)
              - extract_cacheable_prefix → cuts out the system block
              - serialize + sha256 → cache_key = "deployment:abc123...:prompt_caching"
              - DualCache.async_get_cache(cache_key)
                  local memory lookup → miss
                  Redis lookup → hit! returns {"model_id": "deploy-uuid-X"}
          * scan healthy_deployments for model_info.id == "deploy-uuid-X"
          * found → return [matched_deployment]
      - later callbacks: receive a single-element list, so affinity-style filters have nothing left to narrow

T3   router sends the request upstream using the litellm_params of deploy-uuid-X
      (e.g. the Anthropic API; the cache_control marker is passed through)

T4   upstream returns 200; response.usage contains cache_read_input_tokens=4800

T5   logging chain fires:
      - response → StandardLoggingPayload, with model_id, messages, model
      - litellm.logging_callback_manager iterates over all callbacks
      - PromptCachingDeploymentCheck.async_log_success_event:
          * call_type == acompletion ✅
          * messages is a list ✅
          * model_id is known ✅
          * is_prompt_caching_valid_prompt → True
          * PromptCachingCache.async_add_model_id(model_id="deploy-uuid-X",
                                                  messages=..., tools=None)
              - produces the same cache_key
              - DualCache.async_set_cache(cache_key, {"model_id": "deploy-uuid-X"}, ttl=300)
                  local memory write
                  Redis write (overwrite, resets the TTL)

T6   response is returned to the client; the chain ends

T5+300s
      Redis key "deployment:abc123...:prompt_caching" expires naturally
      next request with the same prefix → cache miss → falls back to normal load balancing
      (as long as at least one request arrives within every 300s, the T5 write keeps renewing the TTL)

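9. Edge cases at a glance

Case	Behavior	Reference
messages is None	filter no-op	§2.1
messages has no cache_control	filter no-op	§2.4
token count < 1024	filter no-op	§2.2
Redis not configured	works within a single Pod process, breaks across Pods	§6.4
TTL expired	next request misses and is load-balanced normally; the mapping is rewritten after a new request succeeds	§6.3
Cached model_id not in healthy_deployments (in cooldown / taken offline)	filter returns the original list	§4.3
Different tools, same messages	treated as the same cache key (known issue)	§2.3
Streaming call	same as non-streaming (depends on call_type)	§5.3
deployment_affinity enabled at the same time	serial filtering; the first to hit locks the choice	§7
call_type is embedding	write is skipped; the filter is also skipped because messages=None	§5.3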

10. Code entry points

File	Key functions	Purpose
litellm/router_utils/pre_call_checks/prompt_caching_deployment_check.py	`async_filter_deployments` / `async_log_success_event`	route-time filtering + write after success
litellm/router_utils/prompt_caching_cache.py	`extract_cacheable_prefix` / `get_prompt_caching_cache_key` / `async_add_model_id` / `async_get_model_id`	fingerprint algorithm + storage wrapper
litellm/utils.py:8969	`is_prompt_caching_valid_prompt`	1024-token threshold check
litellm/constants.py:252	`MINIMUM_PROMPT_CACHE_TOKEN_COUNT`	threshold constant
litellm/router.py:1249-1274	`add_optional_pre_call_checks`	callback registration
litellm/router.py:6093-6140	`_run_pre_call_checks`	serial callback execution at routing time
litellm/types/router.py:807	`OptionalPreCallChecks` Literal	config whitelist