04 · Where the error shows up: client, logs, monitoring, PaaS

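This page answers one question: after a request fails, where do its traces end up, and what do the fields look like in each place?

After reading it you should be able to answer:

Does the JSON the client receives match the strings in the proxy log? How do the fields map to each other?
How do you filter "5xx failures" out of the standard_logging_payload records in S3?
In Prometheus, which is more reliable: litellm_deployment_failure_responses{exception_status="502"} or exception_class="BadGatewayError"?
A PaaS alert reports a "Level3 exception". How do you find the request behind it?

1. Where the traces of one failure go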

flowchart LR
    Excep["LiteLLM exception object"]

    subgraph "Two copies sent to the client"
        Client["Client HTTP response<br/>status_code + JSON body"]
        Headers["Response headers<br/>x-litellm-* metadata"]
    end

    subgraph "Two copies inside the proxy process"
        Stdout["proxy stdout<br/>verbose_proxy_logger"]
        Sentry["Sentry / OTEL span<br/>(optional)"]
    end

    subgraph "Two copies written to storage"
        SLP["standard_logging_payload<br/>error_information field"]
        SpendLog["DB table LiteLLM_SpendLogs<br/>same payload as spend_logs"]
    end

    subgraph "Two kinds of monitoring metrics"
        PromFail["Prometheus failure counters<br/>litellm_proxy_failed_requests_metric<br/>litellm_llm_api_failed_requests_metric<br/>litellm_deployment_failure_responses"]
        PromCooldown["Prometheus health state<br/>litellm_deployment_cooled_down<br/>litellm_deployment_state"]
    end

    subgraph "Visible downstream"
        S3["S3 raw log<br/>(with the s3_v2 callback)"]
        PaaS["PaaS alert<br/>Level3 + keyword"]
    end

    Excep --> Client
    Excep --> Headers
    Excep --> Stdout
    Excep --> Sentry
    Excep --> SLP
    SLP --> SpendLog
    SLP --> S3
    Excep --> PromFail
    Excep --> PromCooldown
    Stdout -. alert rules .-> PaaS

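Key facts:

1. The client, the logs, and monitoring are three independent pipelines. One failure can appear in all three, but the write time and field details differ. The client-side JSON error.code comes from ProxyException.code; the logs and the SLP (standard_logging_payload) use exception.status_code; Prometheus prefers status_code and falls back to code (prometheus.py:1833-1840).
2. status_code can disagree between places. The RouterErrors "No deployments available" case in 03 §8 is a 429 for the client, but the original exception in the SLP / logs may be an internal 500.
3. S3 is not a guaranteed path: it exists only when the s3 callback is configured. Proxy stdout is always there; the SLP usually lands in the DB; Prometheus is present in most prod setups.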

2. What the client sees: the HTTP response

2.1 status_code and the JSON structure

proxy_server.py:1051-1062 is the only place where a ProxyException is turned into an HTTP response:

@app.exception_handler(ProxyException)
async def openai_exception_handler(request: Request, exc: ProxyException):
    headers = exc.headers
    error_dict = exc.to_dict()
    return JSONResponse(
        status_code=(
            int(exc.code) if exc.code else status.HTTP_500_INTERNAL_SERVER_ERROR
        ),
        content={"error": error_dict},
        headers=headers,
    )

Client JSON template:

{
  "error": {
    "message": "litellm.RateLimitError: VertexAIException - 429 Quota exceeded for ...\nModel: gemini-1.5-pro\nAPI Base: `https://us-central1-aiplatform.googleapis.com/...` LiteLLM Retried: 2 times",
    "type": "throttling_error",
    "param": "None",
    "code": "429"
  }
}

Where status_code comes from (later steps overwrite earlier ones):

1. The LiteLLM exception's .status_code (for example RateLimitError(429)).
2. ProxyException is constructed with code=getattr(e, "status_code", 500) (at the various raise sites in proxy_server.py).
3. ProxyException applies special rewrites (_types.py:3221-3227):
   - "No healthy deployment available" / "No deployments available" → forced to 429
   - "Not allowed to access model due to tags configuration" → forced to 401
4. openai_exception_handler uses int(exc.code) and falls back to 500 when it is empty.
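The chain above can be restated as a small function. This is only a sketch for reasoning about which status the client will see: `resolve_http_status` and `FORCED_CODES` are hypothetical names, and the substring matching is an assumption about how the rewrite in `_types.py` detects these messages.

```python
from typing import Optional

# Message substrings that, per step 3 of the chain, force a specific status code.
FORCED_CODES = {
    "No healthy deployment available": 429,
    "No deployments available": 429,
    "Not allowed to access model due to tags configuration": 401,
}


def resolve_http_status(message: str, exc_status_code: Optional[int]) -> int:
    """Hypothetical restatement of steps 1-4; this is not LiteLLM's code."""
    # Steps 1-2: use the exception's status_code, defaulting to 500.
    code = str(exc_status_code if exc_status_code is not None else 500)
    # Step 3: message-based rewrites overwrite whatever came before.
    for needle, forced in FORCED_CODES.items():
        if needle in message:
            code = str(forced)
    # Step 4: the handler casts the string back to int, 500 if empty.
    return int(code) if code else 500


print(resolve_http_status("No deployments available for selected model", 500))
```

The point of the sketch is the ordering: a router error that started life as a 500 still reaches the client as a 429 because step 3 runs after step 2.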

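⚠️ code is a string: _types.py:3210 sets self.code = str(code). The JSON carries "code": "429" (a string), so clients must convert it to int before any numeric comparison. This is the compatibility issue raised upstream in LiteLLM #4834.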
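On the client side, the conversion might look like the following sketch. `parse_proxy_error` is a hypothetical helper, and the sample body is a shortened version of the JSON template in this section.

```python
import json


def parse_proxy_error(body: str) -> dict:
    """Extract the error object and normalize `code` to an int (or None)."""
    err = json.loads(body)["error"]
    try:
        code = int(err.get("code"))
    except (TypeError, ValueError):
        # Missing or non-numeric code: leave it to the caller to decide.
        code = None
    return {
        "code": code,
        "type": err.get("type"),
        "message": err.get("message", ""),
    }


sample = (
    '{"error": {"message": "litellm.RateLimitError: ...", '
    '"type": "throttling_error", "param": "None", "code": "429"}}'
)
parsed = parse_proxy_error(sample)
if parsed["code"] is not None and parsed["code"] >= 429:
    print("numeric comparison is safe after conversion")
```

Comparing the raw string `"429"` with an int raises a TypeError in Python, so it is worth normalizing once at the edge.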

2.2 How the `message` field is assembled

message is not simply the "original error"; it is assembled from several segments:

litellm.<XxxError>: <ProviderName>Exception - <error_str>
\nModel: <model_name>
\nAPI Base: `<api_base>`
\nMessages: <truncated_messages>
\nDeployment: <deployment_dict>
\nmodel_group: `<model_group_name>`
\n\nReceived Model Group=<model_group>\nAvailable Model Group Fallbacks=<fallbacks>
 LiteLLM Retried: N times, LiteLLM Max Retries: M

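The assembly logic is spread across:

- the exception class __init__, which adds the "litellm.RateLimitError: ..." prefix (exceptions.py:337)
- exception_type, which adds extra_information (exception_mapping_utils.py:285-325)
- the exception class __str__, which adds the retry information (exceptions.py:360-374)
- async_function_with_fallbacks, which adds the fallback hint (router.py:5021-5022)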

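⚠️ The message the client receives may contain sensitive information: the API Base segment exposes the deployment's real URL (including vertex_project / region), and the Messages segment exposes the first few characters of the user input. If the proxy serves external users, consider truncating it with a guardrail / pre_call_hook.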
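A redaction step could be sketched as two regular expressions over the assembled message. `redact_error_message` is a hypothetical helper, and the patterns assume the segment layout shown in the template (API Base wrapped in backticks, Messages running to the end of its line). Where to call it (a guardrail, a hook, or a gateway in front of the proxy) is left open here.

```python
import re

# Assumed segment shapes, taken from the message template in this section.
_API_BASE = re.compile(r"(API Base: )`[^`]*`")
_MESSAGES = re.compile(r"(Messages: )[^\n]*")


def redact_error_message(message: str) -> str:
    """Mask the deployment URL and the echoed user input."""
    message = _API_BASE.sub(r"\1`<redacted>`", message)
    message = _MESSAGES.sub(r"\1<redacted>", message)
    return message


raw = (
    "litellm.RateLimitError: VertexAIException - 429\n"
    "Model: gemini-1.5-pro\n"
    "API Base: `https://example.invalid/v1`\n"
    "Messages: hello wor"
)
print(redact_error_message(raw))
```

The rest of the message (exception class, model name, retry suffix) is left intact so it can still be matched against the proxy logs.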

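2.3 Response headers: `x-litellm-*` metadata

LiteLLM adds a set of x-litellm-* markers to the response headers, and failure responses carry them too. Common ones:

Header	Meaning	Example
`x-litellm-attempted-retries`	Number of retries made for this request	`2`
`x-litellm-max-retries`	Configured maximum retries	`3`
`x-litellm-model-id`	UUID of the deployment actually hit	`25a8bd00-b1ba-...`
`x-litellm-model-group`	model_group name	`gpt-4o`
`x-litellm-fallback-attempt`	Current depth in the fallback chain	`1`
`Retry-After`	Backoff in seconds on an upstream 429 (passed through)	`60`

✅ Ops diagnostic tip: when the client reports a 429, check the x-litellm-attempted-retries header to tell a real upstream 429 (retry > 0) from No deployments available (retry = 0, because no request was ever sent).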
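The tip can be turned into a small triage helper. `classify_429` and its return labels are hypothetical; the only input taken from this page is the `x-litellm-attempted-retries` header and the retry > 0 heuristic.

```python
from typing import Mapping


def classify_429(headers: Mapping[str, str]) -> str:
    """Rough triage for a 429 response, following the heuristic in the tip."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        retries = int(lowered.get("x-litellm-attempted-retries", "0"))
    except ValueError:
        retries = 0
    if retries > 0:
        # Requests were actually sent upstream and retried.
        return "upstream_rate_limit"
    # No retries recorded: check the cooldown list before blaming upstream.
    return "check_cooldown_list"


print(classify_429({"X-LiteLLM-Attempted-Retries": "2"}))
```

This is a heuristic, not a proof: a 429 with zero retries can also come from a configuration with retries disabled.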

3. Proxy stdout logs: `verbose_proxy_logger`

3.1 grep keywords

Log signature	grep template	Source
A LiteLLM exception was raised	`litellm\.\w\+Error:`	exception `__str__`
Router is attempting a fallback	`Trying to fallback b/w models`	router.py:4911
No fallback model_group	`No fallback model group found`	router.py:5018
The fallback itself failed	`Error occurred while trying to do fallbacks`	router.py:5041
Cooldown triggered	`cool_down_deployment` / look at `litellm.router_utils.cooldown_handlers`	docs/cooldown/01-mechanism.md
ContextWindowExceededError with no context_window_fallbacks	`Got 'ContextWindowExceededError'. No context_window_fallback set`	router.py:4961
ContentPolicyViolationError with no content_policy_fallbacks	`Got 'ContentPolicyViolationError'. No content_policy_fallback set`	router.py:4996
Upstream 502/503	`litellm\.BadGatewayError\|litellm\.ServiceUnavailableError`	02-provider-mapping.md
Upstream really down but mapped to APIConnectionError	`litellm\.APIConnectionError`	02 §5 global fallback
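The same keywords can drive a small log classifier when grep is not convenient, for example inside a log-shipping script. This is a sketch: `classify_log_line` and the label names are hypothetical, and the patterns are the table's grep templates rewritten in Python regex syntax.

```python
import re
from typing import Optional

# Ordered from most specific to most generic; the first match wins.
PATTERNS = [
    ("fallback_failed", re.compile(r"Error occurred while trying to do fallbacks")),
    ("no_fallback_group", re.compile(r"No fallback model group found")),
    ("fallback_attempt", re.compile(r"Trying to fallback b/w models")),
    ("upstream_502_503", re.compile(r"litellm\.(BadGatewayError|ServiceUnavailableError)")),
    ("api_connection_error", re.compile(r"litellm\.APIConnectionError")),
    ("litellm_exception", re.compile(r"litellm\.\w+Error:")),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return the first matching label, or None for unrelated lines."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


print(classify_log_line("litellm.BadGatewayError: BadGatewayError: VertexAi - 502 Bad Gateway"))
```

Ordering matters: the generic `litellm\.\w+Error:` pattern also matches BadGatewayError lines, so it is placed last.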

3.2 verbose_proxy_logger line formats (by level)

Level	When it is logged	Typical line
`error()`	Last log before the exception is raised	`litellm.proxy.proxy_server.completion(): Exception occured - <repr(e)>`
`exception()`	Same as error, plus a traceback	`litellm.proxy.proxy_server.<endpoint>(): Exception occured - <str(e)>` followed by the traceback
`warning()`	Does not affect traffic but needs attention	`Failure callback failed - <str(e)>` and similar
`debug()`	Detailed decision path (off by default)	`Retrying request with num_retries: N` / `Trying to fallback`
`info()`	Main-line state	`Trying to fallback b/w models`

⚠️ To enable debug: set the env var LITELLM_LOG=DEBUG (see litellm/_logging.py) or start with --detailed_debug. Do not leave it on in prod: it prints request and response bodies in bulk.

3.3 Log samples

Sample A: upstream 429, retry succeeds

Trying to fallback b/w models
Retrying request with num_retries: 3
[INFO] Retried successfully on attempt 1

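→ The client sees 200, but the SLP records attempted_retries=1, and the first failed attempt has already been counted in Prometheus: litellm_llm_api_failed_requests_metric goes up by 1. For the exception_status="429" breakdown, use litellm_deployment_failure_responses, which is the metric that §5.1 lists with that label.

Sample B: upstream 502, every fallback fails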

litellm.BadGatewayError: BadGatewayError: VertexAi - 502 Bad Gateway ...
Trying to fallback b/w models
No fallback model group found for original model_group=gemini-1.5-pro. Fallbacks=None
litellm.proxy.proxy_server.completion(): Exception occured - <repr>

→ The client sees 502. In the SLP, error_class="BadGatewayError" and error_code="502".

Sample C: No deployments available (everything in cooldown)

No deployments available for selected model, Try again in 5 seconds. Passed model=gemini-1.5-pro. pre-call-checks=False, cooldown_list=[<deployment_id values of the 25 deployments in cooldown>]

→ The client sees 429 (rewritten by ProxyException, §2.1). Do not go hunting for an upstream rate limit as soon as you see a 429: first run redis-cli KEYS 'deployment:*:cooldown' and check whether the cooldown list has blown up.

4. `StandardLoggingPayload.error_information`: used by DB / S3 / callbacks

4.1 Field definitions

types/utils.py:2590-2596:

class StandardLoggingPayloadErrorInformation(TypedDict, total=False):
    error_code: Optional[str]
    error_class: Optional[str]
    llm_provider: Optional[str]
    traceback: Optional[str]
    error_message: Optional[str]

The fill rules live in get_error_information at litellm_logging.py:4921-4960:

Field	How the value is chosen	Notes
`error_code`	Prefers `exception.code` (used by ProxyException, a string); otherwise `exception.status_code` (used by LiteLLM exceptions, converted to a string); empty string if neither exists	`BadGatewayError` → `"502"`; `APIConnectionError` → `"500"` (hard-coded)
`error_class`	`exception.__class__.__name__`	`"BadGatewayError"` / `"APIConnectionError"`
`llm_provider`	`exception.llm_provider`	Present on all LiteLLM exceptions; upstream SDK exceptions may lack it
`traceback`	`traceback.format_tb(__traceback__)[:100]`	First 100 lines by default; adjustable via env `MAXIMUM_TRACEBACK_LINES_TO_LOG`
`error_message`	`str(exception)`	Includes the `LiteLLM Retried: N times` suffix
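To make the fill rules concrete, here is a sketch that applies them to a stand-in exception. It is not the body of `get_error_information`: `build_error_information` and `FakeBadGatewayError` are hypothetical, the empty-string default for `llm_provider` is an assumption, and the slice is applied to the entries returned by `traceback.format_tb` (one entry per frame).

```python
import traceback


def build_error_information(exc: BaseException, max_tb_entries: int = 100) -> dict:
    """Apply the table's rules to any exception object (illustrative only)."""
    # error_code: prefer .code, then .status_code, else empty string.
    code = getattr(exc, "code", None)
    if code is None:
        code = getattr(exc, "status_code", None)
    error_code = "" if code is None else str(code)
    # traceback: empty when the exception was never raised.
    tb = exc.__traceback__
    tb_text = "".join(traceback.format_tb(tb)[:max_tb_entries]) if tb else ""
    return {
        "error_code": error_code,
        "error_class": exc.__class__.__name__,
        "llm_provider": getattr(exc, "llm_provider", ""),  # assumed default
        "traceback": tb_text,
        "error_message": str(exc),
    }


class FakeBadGatewayError(Exception):
    """Stand-in with the two attributes the rules read."""
    status_code = 502
    llm_provider = "vertex_ai"


try:
    raise FakeBadGatewayError("502 Bad Gateway")
except FakeBadGatewayError as caught:
    info = build_error_information(caught)

print(info["error_code"], info["error_class"])
```

An exception that was constructed but never raised has no `__traceback__`, which is one way a failure record ends up with an empty traceback field.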

4.2 Where it sits in the overall SLP

types/utils.py:2770-2771:

class StandardLoggingPayload(TypedDict, total=False):
    ...
    error_str: Optional[str]
    error_information: Optional[StandardLoggingPayloadErrorInformation]
    ...

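The full SLP also contains status: Literal["success", "failure"], call_type, api_base, model, model_id, model_group, response_cost, messages, response, metadata, cache_hit, request_tags, and several dozen other fields.

To decide whether a request failed, test status == "failure". Do not test whether error_information is empty: it is None for successful requests, but a failed request can also have empty fields inside error_information (for example traceback="" when the exception object has no __traceback__).

4.3 S3 raw log JSON example

If the s3 callback is configured (callbacks=["s3_v2"]), each failed request writes a full SLP JSON to S3. Key fields: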

{
  "status": "failure",
  "call_type": "acompletion",
  "model": "gemini-1.5-pro-002",
  "model_id": "25a8bd00-b1ba-489d-92f6-d878bbe46bae",
  "model_group": "gemini-1.5-pro",
  "api_base": "https://us-central1-aiplatform.googleapis.com/...",
  "startTime": 1716900000.123,
  "endTime": 1716900003.456,
  "messages": [...],
  "response": null,
  "error_str": "litellm.BadGatewayError: ...",
  "error_information": {
    "error_code": "502",
    "error_class": "BadGatewayError",
    "llm_provider": "vertex_ai",
    "traceback": "  File \"litellm/router.py\", line ...\n",
    "error_message": "litellm.BadGatewayError: BadGatewayError: VertexAi - 502 ... LiteLLM Retried: 2 times"
  },
  "metadata": {
    "user_api_key_alias": "...",
    "user_api_key_team_alias": "...",
    "requester_ip_address": "...",
    "user_agent": "..."
  },
  "request_tags": [...]
}
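Filtering "5xx failures" out of such records can be sketched as follows. The helper names are hypothetical, and the sketch assumes each record has already been downloaded and is available as one JSON string; how the objects are laid out in the bucket is not covered here.

```python
import json
from typing import Iterable, Iterator


def is_5xx_failure(payload: dict) -> bool:
    """True when status is "failure" and error_code is a 5xx string."""
    if payload.get("status") != "failure":
        return False
    # error_information may be None or have empty fields.
    info = payload.get("error_information") or {}
    code = str(info.get("error_code") or "")
    return len(code) == 3 and code.isdigit() and code[0] == "5"


def iter_5xx_failures(json_docs: Iterable[str]) -> Iterator[dict]:
    """Yield only the payloads that are 5xx failures."""
    for doc in json_docs:
        payload = json.loads(doc)
        if is_5xx_failure(payload):
            yield payload


docs = [
    '{"status": "success", "error_information": null}',
    '{"status": "failure", "error_information": {"error_code": "429"}}',
    '{"status": "failure", "error_information": {"error_code": "502", "error_class": "BadGatewayError"}}',
    '{"status": "failure", "error_information": {"error_code": ""}}',
]
matches = list(iter_5xx_failures(docs))
print(len(matches))
```

Because error_code is a string, the check is done on the string itself rather than by casting, which also keeps empty codes out of the result.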

4.4 The SLP in the DB: `LiteLLM_SpendLogs`

In the prisma schema, columns such as LiteLLM_SpendLogs.messages / LiteLLM_SpendLogs.response store a flattened form of the SLP. SQL to query failed requests:

SELECT
    request_id,
    model,
    api_base,
    standard_logging_payload->'error_information'->>'error_class' AS error_class,
    standard_logging_payload->'error_information'->>'error_code' AS error_code,
    standard_logging_payload->>'status' AS status,
    "startTime"
FROM "LiteLLM_SpendLogs"
WHERE
    standard_logging_payload->>'status' = 'failure'
    -- "startTime" is quoted: PostgreSQL folds unquoted identifiers to lowercase
    AND "startTime" > NOW() - INTERVAL '1 hour'
ORDER BY "startTime" DESC;

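⚠️ Not every schema version has the standard_logging_payload field: older versions may use other fields (metadata->>error and the like). Check the schema.prisma of your LiteLLM version.

5. Prometheus metrics

LiteLLM's built-in prometheus.py callback (integrations/prometheus.py) exposes 30+ metrics. This section lists only the error-related ones.

5.1 Failure counters (4 core metrics)

Metric	Source lines	Trigger	Key labels	Use
`litellm_proxy_failed_requests_metric`	83, 1634	The client ultimately did not get a success response (including failures that persist after fallback)	`exception_status`, `exception_class`, `litellm_model_name`, `requested_model`, `hashed_api_key`, `team_alias`, `tags`	Failures that actually hurt the business
`litellm_llm_api_failed_requests_metric`	398, 1409	Any failed request to the upstream (including ones rescued by retry)	`model`, `hashed_api_key`, `team_alias`, `model_id`, `user`	The real upstream failure rate, whether or not retry rescued it
`litellm_deployment_failure_responses`	353, 1879	A single deployment failed	`litellm_model_name`, `model_id`, `api_base`, `api_provider`, `exception_status`, `exception_class`	Per-deployment failure analysis; dashboards
`litellm_deployment_total_requests`	361, 1887	Every request sent to a deployment (including successes)	Same as above	Failure rate = failure / total

⚠️ How the three failure metrics relate, ordered from farthest from the client to nearest: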

litellm_llm_api_failed_requests_metric    (counts every upstream failure)
        ↓
litellm_deployment_failure_responses      (the same, per deployment)
        ↓
litellm_proxy_failed_requests_metric      (the client ultimately got no success)

If llm_api_failed >> proxy_failed, retry / fallback rescued most failures, which is healthy. If the two are close, retry / fallback is not rescuing them, and the configuration may be at fault.
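The comparison can be reduced to one number. The sketch below takes two per-second rates (for example the results of rate(...[5m]) queries, fetched however you normally query Prometheus) and returns the fraction of upstream failures that did not reach the client. `rescued_fraction` is a hypothetical name, and clamping at 0 is a choice made here: failures that never reached the upstream, such as No deployments available, count toward the proxy metric only and can push the raw ratio above 1.

```python
from typing import Optional


def rescued_fraction(llm_api_failed_rate: float, proxy_failed_rate: float) -> Optional[float]:
    """Share of upstream failures absorbed by retry / fallback, in [0, 1]."""
    if llm_api_failed_rate <= 0:
        # No upstream failures in the window: the ratio is undefined.
        return None
    raw = 1.0 - proxy_failed_rate / llm_api_failed_rate
    return max(0.0, raw)


# 10 upstream failures/s, 1 client-visible failure/s: most were rescued.
print(rescued_fraction(10.0, 1.0))
# Nearly equal rates: retry / fallback is rescuing almost nothing.
print(rescued_fraction(10.0, 9.5))
```

A value near 1 matches the "healthy" case described above; a value near 0 is the signal to review the retry and fallback configuration.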

5.2 Health state (2 metrics)

Metric	Trigger	Labels	Meaning
`litellm_deployment_cooled_down`	When a cooldown is written	`litellm_model_name`, `model_id`, `exception_status`	Cumulative cooldown count (counter). Source: docs/cooldown/01-mechanism.md §8
`litellm_deployment_state`	Heartbeat + cooldown	Same	gauge: 0=healthy / 1=partial outage / 2=outage

5.3 Fallback counters (2 metrics)

Metric	Trigger	Labels	Use
`litellm_deployment_successful_fallbacks`	The fallback chain rescued a failure	`requested_model`, `fallback_model`, `exception_status`, `exception_class`	Shows that fallback really helps: the more you configure, the more it pays off
`litellm_deployment_failed_fallbacks`	The fallback chain failed too	Same as above	A rise here means your fallback model_group is down as well

5.4 Recommended alert rules

# 1. Real upstream failure rate > 5% (per deployment, 5 min sliding window)
- alert: LiteLLM_UpstreamFailureRate
  expr: |
    sum(rate(litellm_deployment_failure_responses[5m])) by (litellm_model_name, model_id)
    /
    sum(rate(litellm_deployment_total_requests[5m])) by (litellm_model_name, model_id)
    > 0.05
  for: 5m
  labels:
    severity: L3

# 2. Fallback cannot rescue the request (business impact)
- alert: LiteLLM_ClientFailureRate
  expr: |
    sum(rate(litellm_proxy_failed_requests_metric[5m])) by (requested_model)
    > 1
  for: 3m

# 3. Cooldown unusually active (some deployment keeps failing)
- alert: LiteLLM_DeploymentFlapping
  expr: |
    rate(litellm_deployment_cooled_down[5m]) > 0.5
  for: 5m

# 4. APIConnectionError rising: upstream unreachable, or a mapping bug (see 02-provider-mapping)
- alert: LiteLLM_APIConnectionError_Surge
  expr: |
    sum(rate(litellm_proxy_failed_requests_metric{exception_class="APIConnectionError"}[5m])) > 0.5
  for: 5m
  annotations:
    runbook: docs/errors/02-provider-mapping.md#5-全局兜底ensure-generic-errors-always-return-apiconnectionerror

5.5 `exception_status` vs `exception_class`: which is more reliable

prometheus.py:1833-1840:

exception_status = str(getattr(exception, "status_code", None))
if exception_status == "None" or not exception_status:
    code = getattr(exception, "code", None)
    if code is not None:
        exception_status = str(code)

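⚠️ exception_status can be the string "None": this label value appears when the exception has neither a status_code attribute nor a code. Prometheus will not reject it, but it looks ugly on a dashboard.

→ exception_class is more stable: it is always the exception class name (__class__.__name__) and never the string "None". The status_code dimension is more intuitive for ops, though. Use both: exception_class to identify the exception type, then exception_status to confirm the HTTP status.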
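The "None" label is easy to reproduce by wrapping the quoted snippet in a function and feeding it three stand-in exceptions. `derive_exception_status` and the three classes are hypothetical names used only for this demonstration; the function body is the logic quoted from prometheus.py in this section.

```python
def derive_exception_status(exception: BaseException) -> str:
    """Same logic as the quoted snippet, wrapped for experimentation."""
    exception_status = str(getattr(exception, "status_code", None))
    if exception_status == "None" or not exception_status:
        code = getattr(exception, "code", None)
        if code is not None:
            exception_status = str(code)
    return exception_status


class WithStatusCode(Exception):
    status_code = 502


class WithCodeOnly(Exception):
    code = "429"


class Bare(Exception):
    pass


for exc in (WithStatusCode(), WithCodeOnly(), Bare()):
    # exception_class is always a usable label; exception_status may not be.
    print(exc.__class__.__name__, derive_exception_status(exc))
```

The third case yields the literal string "None", which is why dashboards that group by exception_status benefit from also grouping by exception_class.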

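6. PaaS alert fields

In prod, the LiteLLM proxy is usually wired into the internal PaaS log-alerting platform. Alert rules typically match keywords in proxy stdout. Common ones:

Alert keyword	Corresponding log	What to do
`litellm.BadGatewayError`	Upstream 502	See the quick reference in 03-router-behavior.md §9 and confirm whether cooldown kicked in
`litellm.APIConnectionError`	Upstream unreachable, or a mapping pitfall	See 02 §5 to decide whether some provider branch is unhandled
`No deployments available`	Everything in cooldown	Check the number of redis cooldown keys and the health of each deployment
`Error occurred while trying to do fallbacks`	The fallback chain blew up too	Check whether the `Available Model Group Fallbacks=` configuration is wrong
`Got 'ContextWindowExceededError'. No context_window_fallback set`	User exceeded the token limit and no fallback is configured	Have the business side count tokens / configure `context_window_fallbacks`
`Got 'ContentPolicyViolationError'. No content_policy_fallback set`	Content was blocked by moderation	Decide whether to configure `content_policy_fallbacks` (mind the safety semantics)

⚠️ Do not alert on BadRequestError / UnprocessableEntityError / PermissionDeniedError: these are user errors, frequent and unrelated to system health. They will drown out the real alerts.

7. Summary table: how the three views map to each other

For one and the same failure, the fields seen in each place correspond as follows:

Concept	Client JSON	Proxy log	SLP / S3	Prometheus label
HTTP status	`error.code` (string)	`:status_code=` or embedded in the message	`error_information.error_code` (string)	`exception_status`
Exception class name	Half-implied by `error.type`	message prefix `litellm.XxxError:`	`error_information.error_class`	`exception_class`
Upstream provider	Assembled into the message	message contains `<Provider>Exception -`	`error_information.llm_provider`	`api_provider`
Model name (as requested by the user)	Not given directly	message contains `Model: xxx`	`model_group`	`requested_model`
Actual deployment id	response header `x-litellm-model-id`	message contains `Deployment: ...`	`model_id`	`model_id`
Retry count	message suffix `LiteLLM Retried: N times` + header `x-litellm-attempted-retries`	Same message suffix	No top-level field (it is in `metadata.attempted_retries`)	No dedicated label
Fallback depth	Already the post-fallback result; look for the `fallback` hint in the message	`Trying to fallback b/w models`	`metadata.fallback_attempts`	`fallback_model`
Stack trace	❌ Not sent to the client	`verbose_proxy_logger.exception` prints the traceback	`error_information.traceback` (first 100 lines)	❌ None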

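8. One-minute self-check: how to locate an error

1. What status_code did the client report?
   - 401 but you are sure the key is valid → possibly blocked by tag routing (03 §8.2)
   - 429 but the upstream should not be rate limiting → possibly No deployments available (03 §8.1)
   - 500 but the upstream really returned 502/504 → possibly an unhandled provider branch (02 §4, inconsistencies)
2. Check the x-litellm-* response headers to confirm whether this is a post-retry result.
3. grep proxy stdout / container logs for <exception_class>: and read the context (it usually contains the deployment_id and model_group).
4. In Prometheus, check:
   - whether litellm_proxy_failed_requests_metric{exception_class="XxxError"} has risen
   - litellm_deployment_failure_responses{model_id="<id>"} to see which deployment it is
   - litellm_deployment_cooled_down{model_id="<id>"} to see whether it keeps cooling down (meaning the upstream is really down)
5. If the SLP DB is configured: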

SELECT * FROM "LiteLLM_SpendLogs"
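WHERE request_id = '<ID from the x-litellm-request-id header>';

This returns the traceback and the original messages (if they were not redacted).

Next steps

Working backwards from a symptom ("saw X" / "metric Y went up") to where to look → 05-troubleshooting-by-symptom.md
Router behavior details → 03-router-behavior.md
Exception class definitions → 01-exception-catalog.md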