03 - Post-call: spend accumulation path
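After a request succeeds (streaming included), LiteLLM runs the following steps, in order, inside async_success_handler:
- Compute cost: response_cost_calculator() converts the response usage into USD.
- Write to the DB: _PROXY_track_cost_callback calls db_spend_update_writer.update_database(), which adds the cost to the spend column of the relevant PG tables.
- Refresh the cache: update_cache() writes the new spend back to DualCache, so the next pre-call check can see it.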
If any link in this chain computes the cost as 0, the final spend does not grow, and the budget limit stops working.
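The effect of a zero cost on the whole chain can be sketched with a toy model. This is a minimal illustration, not LiteLLM code: `ToyProxy`, `pre_call_allowed`, `post_call` and `run` are hypothetical names, and the "DB" and "cache" are plain Python values.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProxy:
    """Hypothetical stand-in for the proxy state; not LiteLLM's real classes."""
    max_budget: float
    db_spend: float = 0.0                      # plays the role of the PG spend column
    cache: dict = field(default_factory=dict)  # plays the role of DualCache

    def pre_call_allowed(self, key: str) -> bool:
        # The pre-call check only sees the spend held in the cache.
        return self.cache.get(key, 0.0) < self.max_budget

    def post_call(self, key: str, response_cost: float) -> None:
        self.db_spend += response_cost   # step 2: accumulate in the DB
        self.cache[key] = self.db_spend  # step 3: refresh the cache


def run(costs, max_budget=1.0, key="token-hash"):
    """Serve requests until the budget check rejects one; return (served, spend)."""
    proxy = ToyProxy(max_budget=max_budget)
    served = 0
    for cost in costs:
        if not proxy.pre_call_allowed(key):
            break
        served += 1
        proxy.post_call(key, cost)
    return served, proxy.db_spend
```

In this sketch, ten requests costing 0.5 each against a budget of 1.0 stop after two, while ten requests whose cost was computed as 0.0 are all served and the recorded spend stays at zero.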
Step 1: Compute cost
Entry point (shared by streaming and non-streaming): litellm/litellm_core_utils/litellm_logging.py:1474
# litellm_logging.py:1474-1479
response_cost = litellm.response_cost_calculator(**response_cost_calculator_kwargs)
→ litellm/cost_calculator.py:1528 response_cost_calculator()
→ litellm/cost_calculator.py:954 completion_cost()
→ litellm/cost_calculator.py:247 cost_per_token()
→ litellm/litellm_core_utils/llm_cost_calc/utils.py:580 generic_cost_per_token()
generic_cost_per_token classifies every token type internally (cache_hit / cache_creation / text / audio / image / ...) and looks up the price table:
# llm_cost_calc/utils.py:645-662
(
    prompt_base_cost,
    completion_base_cost,
    cache_creation_cost,
    cache_creation_cost_above_1hr,
    cache_read_cost,
) = _get_token_base_cost(model_info=model_info, usage=usage, ...)
prompt_cost = _calculate_input_cost(
    prompt_tokens_details=prompt_tokens_details,
    model_info=model_info,
    prompt_base_cost=prompt_base_cost,
    cache_read_cost=cache_read_cost,
    ...
)
Price lookups go through _get_cost_per_unit(), which is the "silent 0" trap of the whole chain (see 04-cache-pricing-trap.md):
# llm_cost_calc/utils.py:318-357
def _get_cost_per_unit(model_info, cost_key, default_value=0.0):
    cost_per_unit = model_info.get(cost_key)
    if isinstance(cost_per_unit, float):
        return cost_per_unit
    ...
    return default_value  # ← returns 0.0 when the field is missing, without raising
The final response_cost is written to both kwargs["standard_logging_object"]["response_cost"] and kwargs["response_cost"].
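To make the silent-zero behavior concrete, here is a minimal sketch modelled on the excerpt above. It is not the library's implementation: `get_cost_per_unit` and `prompt_cost` are hypothetical helpers, the `input_cost_per_token` key is assumed to hold the base input price, and the prices are illustrative.

```python
def get_cost_per_unit(model_info: dict, cost_key: str, default_value: float = 0.0) -> float:
    """Toy lookup: a missing or non-numeric price silently yields the default."""
    value = model_info.get(cost_key)
    if isinstance(value, (int, float)):
        return float(value)
    return default_value


def prompt_cost(model_info: dict, prompt_tokens: int, cached_tokens: int) -> float:
    """Bill uncached tokens at the base price and cached tokens at the cache-read price."""
    base = get_cost_per_unit(model_info, "input_cost_per_token")  # assumed key name
    cache_read = get_cost_per_unit(model_info, "cache_read_input_token_cost")
    return (prompt_tokens - cached_tokens) * base + cached_tokens * cache_read


# Illustrative prices only; the second entry lacks the cache-read price.
priced = {"input_cost_per_token": 3e-6, "cache_read_input_token_cost": 3e-7}
unpriced = {"input_cost_per_token": 3e-6}

full = prompt_cost(priced, prompt_tokens=100_000, cached_tokens=90_000)
low = prompt_cost(unpriced, prompt_tokens=100_000, cached_tokens=90_000)
```

With these illustrative numbers, the entry without a cache-read price bills only the 10,000 uncached tokens, so the computed cost is roughly half of the correctly priced one, and no error is raised along the way.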
Step 2: Write to the DB
Entry point: litellm/proxy/hooks/proxy_track_cost_callback.py:123 _PROXY_track_cost_callback()
# proxy_track_cost_callback.py:151-202
sl_object = kwargs.get("standard_logging_object", None)
response_cost = (
    sl_object.get("response_cost", None) if sl_object else kwargs.get("response_cost", None)
)
...
if response_cost is not None:
    if kwargs.get("cache_hit", False) is True:
        response_cost = 0.0  # whole response served from LiteLLM's own cache (not a prompt cache hit)
    ...
    if _should_track_cost_callback(...):
        await proxy_logging_obj.db_spend_update_writer.update_database(
            token=user_api_key,
            response_cost=response_cost,
            user_id=user_id, end_user_id=..., team_id=..., org_id=...,
            kwargs=kwargs, completion_response=completion_response,
            start_time=..., end_time=...,
        )
        asyncio.create_task(update_cache(token=..., response_cost=response_cost, ...))
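Note that the two "cache hits" here are entirely different concepts:
- kwargs["cache_hit"] = a hit in LiteLLM's own in-process / Redis response cache (the whole response is reused); on a hit, response_cost=0
- usage.prompt_tokens_details.cached_tokens = a hit in the upstream provider's prompt cache (part of the prompt tokens), billed at cache_read_input_token_cost
The bug discussed in this series of documents is the latter (a missing prompt cache price).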
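The distinction can be sketched as two separate checks. This is an illustration under assumptions: `billed_cost` and `cached_prompt_tokens` are hypothetical helpers, and `usage` is modelled as a plain dict rather than LiteLLM's usage object.

```python
def billed_cost(kwargs: dict, computed_cost: float) -> float:
    """Response-cache hit: the whole response is reused, so nothing is billed."""
    if kwargs.get("cache_hit", False) is True:
        return 0.0
    return computed_cost


def cached_prompt_tokens(usage: dict) -> int:
    """Provider prompt-cache hit: count of prompt tokens billed at the cache-read price."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)
```

A request can report a large `cached_tokens` value while `kwargs["cache_hit"]` is false; in that case the cost is still computed normally, and its accuracy depends on the cache-read price being configured.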
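Internally, db_spend_update_writer.update_database() runs UPDATE ... SET spend = spend + :response_cost against several tables, including LiteLLM_VerificationToken / LiteLLM_UserTable / LiteLLM_TeamTable / LiteLLM_SpendLogs.
If response_cost = 0, the UPDATE still runs, but spend does not change. LiteLLM_SpendLogs still gets a detail row with spend=0, which makes the problem hard to spot in a later audit, because the request was in fact recorded.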
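The pattern can be reproduced with an in-memory SQLite database. This is a sketch only: the `api_keys` and `spend_logs` tables and the `record_spend` helper are hypothetical simplifications, not LiteLLM's schema or writer.

```python
import sqlite3


def record_spend(conn: sqlite3.Connection, token: str, response_cost: float) -> None:
    # Accumulate on the key row, then append a detail row for the request.
    conn.execute("UPDATE api_keys SET spend = spend + ? WHERE token = ?", (response_cost, token))
    conn.execute("INSERT INTO spend_logs (token, spend) VALUES (?, ?)", (token, response_cost))


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE api_keys (token TEXT PRIMARY KEY, spend REAL NOT NULL DEFAULT 0)")
    conn.execute("CREATE TABLE spend_logs (token TEXT, spend REAL)")
    conn.execute("INSERT INTO api_keys (token) VALUES ('k1')")
    for _ in range(3):
        record_spend(conn, "k1", 0.0)  # three requests whose cost was computed as 0
    key_spend = conn.execute("SELECT spend FROM api_keys WHERE token = 'k1'").fetchone()[0]
    zero_rows = conn.execute("SELECT COUNT(*) FROM spend_logs WHERE spend = 0").fetchone()[0]
    conn.close()
    return key_spend, zero_rows
```

In this toy setup every request leaves a log row, yet the accumulated spend never moves. Counting zero-spend rows for requests that did reach a model is one possible audit query, assuming such rows are not expected for other reasons such as response-cache hits.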
Step 3: Refresh DualCache
update_cache() writes the new spend to two keys in DualCache:
- {token_hash} ← key-level spend
- {user_id}_user_api_key_user_id ← user-level spend
The next pre-call check (02-pre-call-flow.md) can then read the new value.
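Note: this runs as an asyncio.create_task, so it does not block the current request. Under short bursts of high concurrency, the cached spend may lag by a few seconds.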
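The lag follows from how asyncio.create_task works: the task is only scheduled, and it does not run until the current coroutine yields. A minimal sketch, in which this `update_cache` is a hypothetical stand-in and the cache is a plain dict:

```python
import asyncio


async def update_cache(cache: dict, key: str, new_spend: float) -> None:
    await asyncio.sleep(0)  # yield once, standing in for real I/O latency
    cache[key] = new_spend


async def demo():
    cache = {"token-hash": 0.0}
    task = asyncio.create_task(update_cache(cache, "token-hash", 1.5))
    seen_immediately = cache["token-hash"]  # the task has not run yet
    await task                              # let the background update finish
    seen_after = cache["token-hash"]
    return seen_immediately, seen_after


result = asyncio.run(demo())
```

Any pre-call check that runs between scheduling and completion still sees the old spend, which is why a burst of concurrent requests can briefly overshoot the budget.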
Full sequence diagram
sequenceDiagram
participant L as litellm_logging
participant CC as cost_calculator
participant U as llm_cost_calc/utils
participant TC as _PROXY_track_cost_callback
participant DB as db_spend_update_writer
participant PG as PostgreSQL
participant DC as DualCache
L->>CC: response_cost_calculator(usage, model)
CC->>U: generic_cost_per_token(usage, model_info)
U->>U: _get_cost_per_unit(model_info, "cache_read_input_token_cost")
Note over U: field missing → returns 0.0
U-->>CC: prompt_cost (underestimated), completion_cost
CC-->>L: response_cost (underestimated)
L->>TC: success_handler triggers _PROXY_track_cost_callback
TC->>DB: update_database(response_cost)
DB->>PG: UPDATE spend = spend + response_cost
Note over PG: response_cost too small → spend grows slowly
TC->>DC: update_cache(response_cost)
Note over DC: the spend read by the next request is still far below max_budget
Comparison with TPM
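| Item | spend accumulation | TPM accumulation |
|---|---|---|
| When it accumulates | success_handler | success_handler |
| What is accumulated | response_cost (USD) | usage.total_tokens − cached_tokens |
| Affected by cache price configuration? | Yes (fatal) | No |
| Affected by the cache_hit_tokens field? | Yes (the larger the value, the more the cost is underestimated) | Yes (V3 actively excludes them) |
| Where the data lands | Multiple PG tables + DualCache | Redis counter |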
This is why "TPM/RPM limits hold, but max_budget does not" is a strong signal when diagnosing this class of bug.