03 · Provider compatibility matrix: cache_control request and response field differences

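This document goes through each provider in turn and covers:
- how the client passes cache_control
- how LiteLLM converts it or passes it through
- what the cache fields are called in the upstream response
- how LiteLLM normalizes them into Usage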

→ Use it to decide whether your relay / Bedrock account / Vertex project can actually benefit from prompt caching.

0. Overview

Provider	Request-side cache_control form	Response-side cache fields	LiteLLM normalization
Anthropic native	`cache_control: {type: ephemeral, ttl?: "5m"|"1h"}` embedded in a message content block / system / tool	`usage.cache_creation_input_tokens` / `usage.cache_read_input_tokens` + `usage.cache_creation.{ephemeral_5m_input_tokens, ephemeral_1h_input_tokens}`	`Usage.cache_creation_input_tokens` / `cache_read_input_tokens` + `PromptTokensDetailsWrapper.cached_tokens` / `cache_creation_tokens`
Bedrock Converse	Automatically converted to a `cachePointBlock` (type `default`); ttl is supported only on Claude 4.5+	`usage.cacheReadInputTokens` / `usage.cacheWriteInputTokens`	Same as above
Bedrock Invoke (anthropic_claude3)	Native Anthropic format passed through, but ttl is stripped on models other than Claude 4.5+	Same as Anthropic native	Same as above
Vertex AI Anthropic	Inherits `AnthropicConfig`; native Anthropic format	Same as Anthropic native	Same as above
Azure AI Anthropic	Inherits `AnthropicConfig`; native Anthropic format	Same as Anthropic native	Same as above
OpenAI native	Automatic caching by prefix matching (no cache_control marker needed)	`usage.prompt_tokens_details.cached_tokens` (no cache_creation)	`PromptTokensDetailsWrapper.cached_tokens`
OpenAI-compatible relay	Depends on whether the relay passes cache_control through	Depends on whether the relay returns cache fields	Depends on the fields actually present in the response
Vertex Gemini	cached_content resource ID (a separate API); a different mechanism from cache_control	`usage.cached_content_token_count`	`PromptTokensDetailsWrapper.cached_tokens`

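→ The routing affinity mechanism discussed here (PromptCachingDeploymentCheck) only checks whether messages contain a cache_control field. In other words:

Anthropic family (native / Bedrock Converse / Bedrock Invoke / Vertex / Azure AI): a cache_control marker from the client is enough to trigger it
OpenAI native: the client does not need a cache_control marker, but as a result LiteLLM routing affinity never triggers (extract_cacheable_prefix finds no cache_control block → cache_key = None → routing is a no-op)
Gemini: the caching mechanism is entirely different, so this routing affinity mechanism does not apply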
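The gating described above can be sketched roughly as follows. This is a minimal illustration, not LiteLLM's actual implementation: `sketch_cacheable_prefix` and `sketch_cache_key` are hypothetical names, and hashing the JSON-serialized prefix with SHA-256 is an assumption made only to show why a request without any cache_control marker ends up with no cache key.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def has_cache_control(node: Any) -> bool:
    """True if a message or content block carries a cache_control marker."""
    return isinstance(node, dict) and node.get("cache_control") is not None


def sketch_cacheable_prefix(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the messages up to and including the last one marked with cache_control."""
    last_marked = -1
    for idx, message in enumerate(messages):
        content = message.get("content")
        in_blocks = isinstance(content, list) and any(has_cache_control(b) for b in content)
        if has_cache_control(message) or in_blocks:
            last_marked = idx
    return messages[: last_marked + 1]


def sketch_cache_key(messages: List[Dict[str, Any]]) -> Optional[str]:
    """Hash the cacheable prefix; None means routing affinity has nothing to key on."""
    prefix = sketch_cacheable_prefix(messages)
    if not prefix:
        return None
    payload = json.dumps(prefix, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


marked_system = {
    "role": "system",
    "content": [{"type": "text", "text": "stable prompt", "cache_control": {"type": "ephemeral"}}],
}
key_a = sketch_cache_key([marked_system, {"role": "user", "content": "question 1"}])
key_b = sketch_cache_key([marked_system, {"role": "user", "content": "question 2"}])
key_none = sketch_cache_key([{"role": "user", "content": "no marker here"}])
```

In this sketch `key_a` and `key_b` are equal because the user turn sits after the marked prefix, while `key_none` is None, which corresponds to the no-op case for OpenAI-style requests.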

1. Anthropic native API

1.1 Request side: three places cache_control can go

Position 1: system message text block

{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": "<long system prompt>",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}

Handling code: litellm/llms/anthropic/chat/transformation.py:1097-1100:

if "cache_control" in system_message_block:
    anthropic_system_message_content["cache_control"] = (
        system_message_block["cache_control"]
    )

Position 2: user/assistant message content block

{
  "role": "user",
  "content": [
    {"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}},
    {"type": "image", "source": {...}}
  ]
}

Position 3: tool definition

{
  "tools": [
    {
      "type": "function",
      "function": {"name": "...", "description": "...", "parameters": {...}},
      "cache_control": {"type": "ephemeral"}
    }
  ]
}

Handling code: litellm/llms/anthropic/chat/transformation.py:495-512:

_cache_control = tool.get("cache_control", None)
_cache_control_function = tool.get("function", {}).get("cache_control", None)
if returned_tool is not None:
    tool_type = returned_tool.get("type", "")
    if tool_type not in (
        "tool_search_tool_regex_20251119",
        "tool_search_tool_bm25_20251119",
        ):                                                # ← tool_search tools do not support cache_control
        if _cache_control is not None:
            returned_tool["cache_control"] = _cache_control
        elif _cache_control_function is not None and isinstance(_cache_control_function, dict):
            returned_tool["cache_control"] = ChatCompletionCachedContent(**_cache_control_function)

⚠️ tool_search_tool_* tools do not support cache_control; the field is silently stripped.

1.2 The ttl field of cache_control (Claude 4.5+ only)

{"cache_control": {"type": "ephemeral", "ttl": "5m"}}    // default
{"cache_control": {"type": "ephemeral", "ttl": "1h"}}    // 1-hour cache, more expensive

Recognized only by Claude 4.5+ models
Billed at cache_creation_input_token_cost (5m writes) and cache_creation_input_token_cost_above_1hr (1h writes) respectively
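As a worked illustration of the two-tier billing above, the sketch below prices 5-minute and 1-hour cache writes separately. The function name `cache_write_cost` and both per-token prices are hypothetical and chosen only to make the arithmetic concrete; real prices come from the model's pricing entry.

```python
def cache_write_cost(tokens_5m: int, tokens_1h: int, cost_5m: float, cost_1h: float) -> float:
    """Cost of cache writes, pricing each TTL tier at its own per-token rate."""
    return tokens_5m * cost_5m + tokens_1h * cost_1h


# Hypothetical per-token prices, for illustration only.
COST_5M = 3.75e-06  # stands in for cache_creation_input_token_cost
COST_1H = 6.00e-06  # stands in for cache_creation_input_token_cost_above_1hr

# 1000 tokens written with ttl "5m" plus 2000 tokens written with ttl "1h".
total_cost = cache_write_cost(tokens_5m=1000, tokens_1h=2000, cost_5m=COST_5M, cost_1h=COST_1H)
print(f"cache write cost: {total_cost:.6f}")
```

With these made-up prices the 1h tier dominates the total (0.012 out of 0.01575), which is why the per-tier breakdown in the response matters for accurate billing.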

1.3 Beta header (deprecated; no need to add it manually)

litellm/llms/anthropic/chat/transformation.py:337-343:

def get_cache_control_headers(self) -> dict:
    # Anthropic no longer requires the prompt-caching beta header
    # Prompt caching now works automatically when cache_control is used in messages
    return {
        "anthropic-version": "2023-06-01",
    }

→ Historically the anthropic-beta: prompt-caching-2024-07-31 header was required; it no longer is. LiteLLM does not add this header.

1.4 Response parsing

litellm/llms/anthropic/chat/transformation.py:1539-1602:

cache_creation_input_tokens: int = 0
cache_read_input_tokens: int = 0
cache_creation_token_details: Optional[CacheCreationTokenDetails] = None

if "cache_creation_input_tokens" in _usage and _usage["cache_creation_input_tokens"] is not None:
    cache_creation_input_tokens = _usage["cache_creation_input_tokens"]
    prompt_tokens += cache_creation_input_tokens             # ★ added to prompt_tokens
if "cache_read_input_tokens" in _usage and _usage["cache_read_input_tokens"] is not None:
    cache_read_input_tokens = _usage["cache_read_input_tokens"]
    prompt_tokens += cache_read_input_tokens                 # ★ added to prompt_tokens

if "cache_creation" in _usage and _usage["cache_creation"] is not None:
    cache_creation_token_details = CacheCreationTokenDetails(
        ephemeral_5m_input_tokens=_usage["cache_creation"].get("ephemeral_5m_input_tokens"),
        ephemeral_1h_input_tokens=_usage["cache_creation"].get("ephemeral_1h_input_tokens"),
    )

prompt_tokens_details = PromptTokensDetailsWrapper(
    cached_tokens=cache_read_input_tokens,
    cache_creation_tokens=cache_creation_input_tokens,
    cache_creation_token_details=cache_creation_token_details,
)

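→ Key point: the input_tokens returned by the Anthropic upstream excludes the cached portion (it counts only the new tokens that missed the cache). LiteLLM adds the cache tokens back into prompt_tokens to preserve the "total input tokens" semantics.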
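The accumulation above is easiest to see with numbers. The following is a simplified sketch of the same arithmetic on plain dicts; `normalize_anthropic_usage` is a hypothetical helper, not a LiteLLM function, and the token counts are made up.

```python
from typing import Any, Dict


def normalize_anthropic_usage(usage: Dict[str, Any]) -> Dict[str, int]:
    """Fold cache-write and cache-read tokens back into a total prompt_tokens count."""
    base = usage.get("input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    return {
        "prompt_tokens": base + created + read,  # total input, cached or not
        "cache_creation_input_tokens": created,
        "cache_read_input_tokens": read,
    }


# Upstream reports only 200 uncached input tokens; 4800 more were read from cache.
upstream_usage = {
    "input_tokens": 200,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 4800,
    "output_tokens": 150,
}
normalized = normalize_anthropic_usage(upstream_usage)
print(normalized["prompt_tokens"])
```

Here the normalized prompt_tokens is 200 + 0 + 4800 = 5000, so downstream consumers that expect OpenAI-style totals see the full input size.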

1.5 Field mapping table

Upstream field	LiteLLM Usage field	LiteLLM PromptTokensDetails field
`usage.input_tokens`	Base of `prompt_tokens` (cache tokens are added afterwards)	-
`usage.cache_creation_input_tokens`	`Usage.cache_creation_input_tokens`	`cache_creation_tokens`
`usage.cache_read_input_tokens`	`Usage.cache_read_input_tokens`	`cached_tokens`
`usage.cache_creation.ephemeral_5m_input_tokens`	-	`cache_creation_token_details.ephemeral_5m_input_tokens`
`usage.cache_creation.ephemeral_1h_input_tokens`	-	`cache_creation_token_details.ephemeral_1h_input_tokens`

2. Bedrock Converse API

2.1 Request side: cache_control → cachePointBlock

Anthropic's cache_control takes a completely different form in the Bedrock Converse API, where it is called cachePoint. LiteLLM converts it automatically:

litellm/llms/bedrock/chat/converse_transformation.py:1110-1122:

cache_control = message_block.get("cache_control", None)
if cache_control is None:
    return None

cache_point = CachePointBlock(type="default")
if isinstance(cache_control, dict) and "ttl" in cache_control:
    ttl = cache_control["ttl"]
    if ttl in ["5m", "1h"] and model is not None:
        if is_claude_4_5_on_bedrock(model):
            cache_point["ttl"] = ttl

→ The client keeps writing cache_control: {type: ephemeral}, and LiteLLM converts it into a {"cachePoint": {"type": "default"}} block when sending to Bedrock.
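To make the shape of that conversion concrete, here is a minimal sketch on plain dicts. `to_converse_blocks` and its `ttl_supported` flag are hypothetical (the flag stands in for the is_claude_4_5_on_bedrock(model) check), and emitting the cachePoint as a separate block right after the text block is an assumption about the Converse content layout rather than a copy of LiteLLM's code.

```python
from typing import Any, Dict, List


def to_converse_blocks(block: Dict[str, Any], ttl_supported: bool = False) -> List[Dict[str, Any]]:
    """Turn one Anthropic-style text block into Converse-style content blocks."""
    blocks: List[Dict[str, Any]] = [{"text": block["text"]}]
    cache_control = block.get("cache_control")
    if cache_control is None:
        return blocks  # nothing to cache, no cachePoint emitted

    cache_point: Dict[str, Any] = {"type": "default"}
    ttl = cache_control.get("ttl") if isinstance(cache_control, dict) else None
    # ttl is forwarded only for known values and only when the model supports it.
    if ttl in ("5m", "1h") and ttl_supported:
        cache_point["ttl"] = ttl
    blocks.append({"cachePoint": cache_point})
    return blocks


source_block = {
    "type": "text",
    "text": "long system prompt",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
without_ttl = to_converse_blocks(source_block)                   # ttl dropped
with_ttl = to_converse_blocks(source_block, ttl_supported=True)  # ttl kept
```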

2.2 ttl restrictions

ttl: "5m" / ttl: "1h": passed through only for Claude 4.5+ models
The ttl field is silently dropped for other models
is_claude_4_5_on_bedrock() decides based on the model version

2.3 The beta header is filtered on the Converse path

litellm/llms/bedrock/chat/converse_transformation.py:86-90:

UNSUPPORTED_BEDROCK_CONVERSE_BETA_PATTERNS = [
    "advanced-tool-use",
    "prompt-caching",           # ← this beta should not appear in Converse calls
    "compact-2026-01-12",
]

→ If the client mistakenly sends an anthropic-beta: prompt-caching header, LiteLLM filters it out on the Converse path to avoid an upstream error. Prompt caching itself still works, through the cachePoint block.

2.4 Response parsing

def _transform_usage(self, usage: ConverseTokenUsageBlock) -> Usage:
    input_tokens = usage["inputTokens"]
    output_tokens = usage["outputTokens"]
    total_tokens = usage["totalTokens"]
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0

    if "cacheReadInputTokens" in usage:
        cache_read_input_tokens = usage["cacheReadInputTokens"]
        input_tokens += cache_read_input_tokens          # added to input_tokens
    if "cacheWriteInputTokens" in usage:
        cache_creation_input_tokens = usage["cacheWriteInputTokens"]
        input_tokens += cache_creation_input_tokens      # added to input_tokens
    ...

→ Bedrock Converse uses camelCase names: cacheReadInputTokens / cacheWriteInputTokens. Note that it is cacheWrite, not cacheCreation.

2.5 Field mapping table

Bedrock Converse upstream field	LiteLLM Usage field
`usage.inputTokens`	Base of `prompt_tokens` (cache tokens are added to it)
`usage.cacheReadInputTokens`	`Usage.cache_read_input_tokens` + `PromptTokensDetails.cached_tokens`
`usage.cacheWriteInputTokens`	`Usage.cache_creation_input_tokens`

⚠️ Bedrock Converse does not return the ephemeral_5m_input_tokens / ephemeral_1h_input_tokens breakdown → billing can only use the base price cache_creation_input_token_cost and cannot apply the 1hr tier price.

3. Bedrock Invoke API (anthropic_claude3)

3.1 Request side: native Anthropic format preserved

The Invoke path calls bedrock-runtime/InvokeModel directly. The body is Anthropic-format JSON, and the cache_control field is preserved.

3.2 ttl handling

litellm/llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py:117-150:

_remove_ttl_from_cache_control() removes every ttl field by default. The exception: on Claude 4.5+, ttl: "5m" or ttl: "1h" is kept.
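The rule above can be approximated with a small recursive walk over the request body. This is only a sketch under stated assumptions: `strip_ttl` is a hypothetical helper, `keep_ttl` stands in for the Claude 4.5+ model check, and the real _remove_ttl_from_cache_control() may traverse the body differently.

```python
import copy
from typing import Any


def strip_ttl(body: Any, keep_ttl: bool) -> Any:
    """Return a copy of body with cache_control.ttl removed unless it is allowed."""
    result = copy.deepcopy(body)  # never mutate the caller's request

    def visit(node: Any) -> None:
        if isinstance(node, dict):
            cache_control = node.get("cache_control")
            if isinstance(cache_control, dict) and "ttl" in cache_control:
                allowed = keep_ttl and cache_control["ttl"] in ("5m", "1h")
                if not allowed:
                    del cache_control["ttl"]
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(result)
    return result


request_body = {
    "system": [{"type": "text", "text": "s", "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "q", "cache_control": {"type": "ephemeral", "ttl": "5m"}},
        ]},
    ],
}
older_model_body = strip_ttl(request_body, keep_ttl=False)  # every ttl removed
newer_model_body = strip_ttl(request_body, keep_ttl=True)   # valid ttl values kept
```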

3.3 Response parsing

Responses go through the AnthropicConfig._transform_response_for_modes path, exactly as for Anthropic native.

4. Vertex AI Anthropic (partner model)

litellm/llms/vertex_ai/vertex_ai_partner_models/anthropic/transformation.py:

class VertexAIAnthropicConfig(AnthropicConfig):
    """Reference: https://docs.anthropic.com/claude/reference/claude-on-vertex-ai"""
    ...

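Inherits from AnthropicConfig, so the request / response transformation logic is inherited in full
The only differences are authentication (Google OAuth) and the endpoint
cache_control behaves exactly as it does with Anthropic native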

5. Azure AI Anthropic

litellm/llms/azure_ai/anthropic/transformation.py:

class AzureAIAnthropicConfig(AnthropicConfig):
    ...
    def validate_environment(self, ...):
        ...
        if is_cache_control_set(messages=messages):
            ...

Inherits AnthropicConfig
Authentication uses Azure's api-key or Authorization: Bearer (not x-api-key)
cache_control behaves the same as with Anthropic native

6. OpenAI native

6.1 No `cache_control` field

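OpenAI's prompt cache is automatic and based on request prefix matching: the prompt must be at least 1024 tokens long and the cached prefix must match exactly. The client does not need to pass any marker.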

6.2 The response only has `cached_tokens`

{
  "usage": {
    "prompt_tokens": 5000,
    "completion_tokens": 200,
    "prompt_tokens_details": {
      "cached_tokens": 4800
    }
  }
}

→ There is no cache_creation_input_tokens concept (OpenAI does not distinguish "creating the cache" from "reading the cache", and cache writes are free).
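Since the only signal in an OpenAI-style response is cached_tokens, a quick way to read it is as a share of the prompt. The sketch below uses a hypothetical helper, `cached_fraction`, applied to the example usage payload shown above.

```python
from typing import Any, Dict


def cached_fraction(usage: Dict[str, Any]) -> float:
    """Share of prompt tokens served from cache, based on an OpenAI-style usage dict."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


openai_usage = {
    "prompt_tokens": 5000,
    "completion_tokens": 200,
    "prompt_tokens_details": {"cached_tokens": 4800},
}
hit_ratio = cached_fraction(openai_usage)
print(f"cached share of prompt: {hit_ratio:.2%}")
```

For the example payload, 4800 of the 5000 prompt tokens were cache hits; a response without prompt_tokens_details simply yields 0.0.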

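6.3 LiteLLM routing affinity does not work for OpenAI

The key problem: LiteLLM's PromptCachingDeploymentCheck only checks whether messages contain a cache_control block. When a client calls an OpenAI model there is no such field → extract_cacheable_prefix returns nothing → cache_key is None → routing is a no-op.

Workarounds:

If your OpenAI-compatible client layer can be modified, you can add cache_control to messages by hand. OpenAI does not recognize the field, but LiteLLM will trigger routing affinity, and OpenAI still decides on caching with its own prefix algorithm: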

{"role": "system", "content": [
  {"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}
]}

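⚠️ This requires your relay / the OpenAI API to tolerate the cache_control field (either filtered out by drop_params or ignored).

Use deployment_affinity instead: affinity is based on user_api_key or session_id and does not depend on message content.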
```
router_settings:
  optional_pre_call_checks: ["deployment_affinity"]
```
This suits the "consecutive requests from the same user" scenario, but it cannot provide affinity keyed on prompt content.

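7. OpenAI-compatible relays

7.1 Three things to confirm

7.1.1 Does the relay pass `cache_control` through to the upstream?

Many relays validate strictly against the OpenAI schema and strip the cache_control field. In that case, even if LiteLLM routing affinity is established, no cache write ever happens upstream.

How to verify: capture an actual request body that the relay sends for a Claude model (the one the relay sends to the Anthropic upstream, not the one LiteLLM sends) and check whether cache_control is still there.

7.1.2 Does the relay return `cache_creation_input_tokens` / `cache_read_input_tokens`?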

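Relays commonly behave in one of three ways:

A. Full pass-through: response.usage carries the native Anthropic fields → LiteLLM parses them normally
B. Converted to the OpenAI shape: only prompt_tokens_details.cached_tokens is returned (no cache_creation)
C. Swallowed entirely: response.usage only has prompt_tokens / completion_tokens → cache information is lost

→ Under behaviour C, LiteLLM routing affinity still works (it looks only at the request, not the response), but billing falls back to full price (cache_creation is charged at the prompt_tokens rate).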
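One way to tell the three behaviours apart is to inspect which keys the raw usage payload contains. The classifier below is a minimal sketch on plain dicts; `classify_relay_usage` is a hypothetical helper and the sample payloads are invented.

```python
from typing import Any, Dict


def classify_relay_usage(usage: Dict[str, Any]) -> str:
    """Map a raw usage payload to relay behaviour A, B or C as described above."""
    if "cache_creation_input_tokens" in usage or "cache_read_input_tokens" in usage:
        return "A"  # native Anthropic cache fields survived
    details = usage.get("prompt_tokens_details") or {}
    if "cached_tokens" in details:
        return "B"  # only the OpenAI-shaped cached_tokens is left
    return "C"      # no cache information at all


full_passthrough = {"prompt_tokens": 5000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4800}
openai_shaped = {"prompt_tokens": 5000, "prompt_tokens_details": {"cached_tokens": 4800}}
swallowed = {"prompt_tokens": 5000, "completion_tokens": 200}

behaviours = [classify_relay_usage(u) for u in (full_passthrough, openai_shaped, swallowed)]
print(behaviours)
```

Checking key presence rather than value is a deliberate choice in this sketch: a relay that passes the fields through will return them with value 0 on a cache miss, which should still count as behaviour A.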

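7.1.3 Do multiple relays really have independent cache pools?

Anthropic's prompt cache is isolated per API key / organization. If two relays use the same Anthropic API key behind the scenes, they share one cache, and routing affinity between them is pointless.

Relays usually use their own keys, though, so the default assumption is independent pools.

7.2 Client call forms

If your relay speaks the native Anthropic protocol (/v1/messages):

LiteLLM litellm_params.model: anthropic/claude-sonnet-4-5-20250929
The client writes cache_control in messages
The full Anthropic transformation is applied

If it speaks the OpenAI-compatible protocol (/v1/chat/completions):

LiteLLM litellm_params.model: openai/claude-sonnet-4-5-20250929
The client may write cache_control in messages (LiteLLM does not strip it and sends it to the relay unchanged)
Whether the relay recognizes it or passes it through depends on the relay's implementation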

7.3 Verification script

import os

# Set the log level before importing litellm, in case it is read at import time.
os.environ["LITELLM_LOG"] = "DEBUG"

import litellm

resp = litellm.completion(
    model="openai/claude-sonnet-4-5",
    api_base="https://your-relay.example.com/v1",
    api_key="sk-...",
    messages=[
        {"role": "system", "content": [
            {"type": "text", "text": "<long system prompt, more than 1024 tokens>",
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content": "query"}
    ],
)

print("usage:", resp.usage)
print("cache_creation_input_tokens:", getattr(resp.usage, "cache_creation_input_tokens", None))
print("cache_read_input_tokens:", getattr(resp.usage, "cache_read_input_tokens", None))
print("cached_tokens:",
      getattr(resp.usage.prompt_tokens_details, "cached_tokens", None)
      if resp.usage.prompt_tokens_details else None)

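Verdict:

First run: cache_creation_input_tokens > 0 and cache_read_input_tokens == 0 → cache write succeeded
Second run (within 1 minute, same system prompt): cache_read_input_tokens > 0 → cache read hit
Both runs return 0 or None → the relay is not passing cache_control through; enabling LiteLLM routing affinity is not recommended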
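The verdict rules above can be written down as a small function so that the check is repeatable. This is a sketch only: `judge_two_runs` and its return strings are hypothetical, and it operates on plain dicts holding the two counters printed by the script.

```python
from typing import Any, Dict


def judge_two_runs(first: Dict[str, Any], second: Dict[str, Any]) -> str:
    """Apply the two-run verdict rules to the usage counters of each run."""

    def count(usage: Dict[str, Any], key: str) -> int:
        return int(usage.get(key) or 0)  # treat missing or None as 0

    wrote = count(first, "cache_creation_input_tokens") > 0
    read = count(second, "cache_read_input_tokens") > 0
    if read:
        return "cache read hit"
    if wrote:
        return "cache write only"
    return "no cache signal"


run_1 = {"cache_creation_input_tokens": 5000, "cache_read_input_tokens": 0}
run_2 = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5000}
verdict = judge_two_runs(run_1, run_2)
print(verdict)
```

A "no cache signal" result corresponds to the last rule above: either the relay strips cache_control or it swallows the cache fields in the response, and the two cases need the request-capture check from 7.1.1 to distinguish.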

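8. Vertex Gemini (this mechanism does not apply)

Gemini's prompt cache is implemented through a separate Cached Content resource:

The client first calls cachedContents.create() to create the cache resource, which returns a cache name
Later requests reference it with cached_content: "cachedContents/xxx"

→ This is a completely different API shape from the cache_control field. LiteLLM's PromptCachingDeploymentCheck sees no cache_control marker, so routing affinity does not take part.

The response carries usage.cached_content_token_count, which is mapped to PromptTokensDetails.cached_tokens.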

9. The is_cache_control_set helper

litellm/llms/anthropic/common_utils.py (search for def is_cache_control_set)

Purpose: detect whether cache_control appears anywhere in the messages list. Each provider's validate_environment uses it to decide whether to add the beta header / enable the related logic.

It is independent of PromptCachingCache.extract_cacheable_prefix:
- is_cache_control_set: a boolean detector
- extract_cacheable_prefix: extracts the cacheable prefix
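For intuition, a boolean detector of this kind might look like the sketch below. `sketch_is_cache_control_set` is a hypothetical stand-in written for this document, not the body of the real is_cache_control_set, and checking both message level and content-block level is an assumption.

```python
from typing import Any, Dict, List


def sketch_is_cache_control_set(messages: List[Dict[str, Any]]) -> bool:
    """Return True as soon as any message or content block carries cache_control."""
    for message in messages:
        if message.get("cache_control") is not None:
            return True
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("cache_control") is not None:
                    return True
    return False


marked = [
    {"role": "system", "content": [
        {"type": "text", "text": "stable prompt", "cache_control": {"type": "ephemeral"}},
    ]},
    {"role": "user", "content": "question"},
]
unmarked = [{"role": "user", "content": "question"}]

print(sketch_is_cache_control_set(marked), sketch_is_cache_control_set(unmarked))
```

Unlike a prefix extractor, this only answers yes or no and stops at the first marker, which is all a header or feature toggle needs.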

10. Cheat sheet: the safest way for clients to write it

10.1 Safest form (compatible with every Anthropic-family provider)

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "<stable system prompt, more than 1024 tokens>",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    {"role": "user", "content": "dynamic query content"}
]

✅ Anthropic native
✅ Bedrock Converse (automatically converted to cachePoint)
✅ Bedrock Invoke
✅ Vertex AI Anthropic
✅ Azure AI Anthropic
⚠️ OpenAI-compatible relay: depends on the relay

10.2 Don't write it like this

# ❌ cache_control placed at the wrong level
{"role": "system", "content": "...", "cache_control": {"type": "ephemeral"}}
# This "message-level cache_control" is recognized by LiteLLM's PromptCachingCache (extract_cacheable_prefix, lines 83-92),
# but it is not necessarily recognized once sent to the Anthropic upstream; put it inside a content block instead

# ❌ ttl given, but the model is not Claude 4.5+
{"cache_control": {"type": "ephemeral", "ttl": "1h"}}    # ttl is stripped on older models

# ❌ cache_control added to a tool_search tool
{"type": "tool_search_tool_bm25_20251119", "cache_control": {...}}    # silently stripped

10.3 Put variable content last

# ✅ Recommended: the variable part comes after cache_control
messages = [
    {"role": "system", "content": [
        {"type": "text", "text": "<stable system prompt>", "cache_control": {"type": "ephemeral"}}
    ]},
    {"role": "user", "content": "<user question>"},   # changes here do not affect the routing cache_key
]

# ⚠️ Counter-example: cache_control on the user message
messages = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": [
        {"type": "text", "text": "<user question>", "cache_control": {"type": "ephemeral"}}
    ]},
]
# The user question changes every time → cache_key changes every time → routing affinity is effectively never established