
03 - Post-call: spend accumulation path

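After a request succeeds (streaming included), LiteLLM does the following, in order, inside async_success_handler:

  1. Compute cost: response_cost_calculator() converts the response usage into USD
  2. Write to DB: _PROXY_track_cost_callback calls db_spend_update_writer.update_database(), which adds the cost to spend in the relevant PG tables
  3. Refresh cache: update_cache() writes the new spend back to DualCache so the next pre-call check can see it

If any link in this chain computes the cost as 0, spend never grows → the limit stops being enforced.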
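The effect of a zero cost on the whole loop can be sketched with a toy model. This is not LiteLLM code: `ToyLedger` and `run` are hypothetical names, and the pre-call rule (reject once cached spend reaches `max_budget`) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyLedger:
    """Hypothetical stand-in for the PG spend column plus its DualCache copy."""
    max_budget: float
    db_spend: float = 0.0
    cached_spend: float = 0.0

    def post_call(self, response_cost: float) -> None:
        # Step 2: accumulate the cost into the database value.
        self.db_spend += response_cost
        # Step 3: refresh the cached value that pre-call checks read.
        self.cached_spend = self.db_spend

    def pre_call_allows(self) -> bool:
        # Assumed pre-call rule: reject once cached spend reaches the budget.
        return self.cached_spend < self.max_budget


def run(costs: list[float], max_budget: float) -> int:
    """Return how many requests pass the pre-call check."""
    ledger = ToyLedger(max_budget=max_budget)
    served = 0
    for cost in costs:  # each cost stands for the result of Step 1
        if not ledger.pre_call_allows():
            break
        ledger.post_call(cost)
        served += 1
    return served


correct = run([0.25] * 10, max_budget=1.0)  # stops once spend reaches the budget
broken = run([0.0] * 10, max_budget=1.0)    # cost is always 0, so nothing is rejected
print(correct, broken)
```

With correct costs the toy budget stops traffic after four requests; with a cost of 0 all ten requests are served.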


Step 1: Compute cost

Entry point (shared by streaming and non-streaming): litellm/litellm_core_utils/litellm_logging.py:1474

# litellm_logging.py:1474-1479
response_cost = litellm.response_cost_calculator(**response_cost_calculator_kwargs)

Call chain:

litellm/cost_calculator.py:1528 response_cost_calculator()
→ litellm/cost_calculator.py:954 completion_cost()
→ litellm/cost_calculator.py:247 cost_per_token()
→ litellm/litellm_core_utils/llm_cost_calc/utils.py:580 generic_cost_per_token()

generic_cost_per_token classifies every token type internally (cache_hit / cache_creation / text / audio / image / ...) and looks each one up in the price table:

# llm_cost_calc/utils.py:645-662
(
    prompt_base_cost,
    completion_base_cost,
    cache_creation_cost,
    cache_creation_cost_above_1hr,
    cache_read_cost,
) = _get_token_base_cost(model_info=model_info, usage=usage, ...)

prompt_cost = _calculate_input_cost(
    prompt_tokens_details=prompt_tokens_details,
    model_info=model_info,
    prompt_base_cost=prompt_base_cost,
    cache_read_cost=cache_read_cost,
    ...
)
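To make the token split concrete, here is a minimal sketch of how a prompt cost decomposes once cached tokens are separated from regular text tokens. It is not LiteLLM's implementation: `toy_prompt_cost` is a hypothetical helper, the prices are made up, and only two buckets (text and cache-read) are modeled.

```python
def toy_prompt_cost(
    prompt_tokens: int,
    cached_tokens: int,
    prompt_base_cost: float,
    cache_read_cost: float,
) -> float:
    """Bill non-cached prompt tokens at the base price and cached ones at the cache-read price."""
    text_tokens = prompt_tokens - cached_tokens
    return text_tokens * prompt_base_cost + cached_tokens * cache_read_cost


# Hypothetical prices in USD per token.
BASE = 3e-6
CACHE_READ = 3e-7

full = toy_prompt_cost(10_000, 8_000, BASE, CACHE_READ)
no_cache_price = toy_prompt_cost(10_000, 8_000, BASE, 0.0)
print(f"{full:.4f} vs {no_cache_price:.4f}")
```

When the cache-read price is 0.0, the 8,000 cached tokens contribute nothing, so only the 2,000 text tokens are billed.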

Price lookups go through _get_cost_per_unit(). This is the "silent 0" trap of the whole chain (see 04-cache-pricing-trap.md):

# llm_cost_calc/utils.py:318-357
def _get_cost_per_unit(model_info, cost_key, default_value=0.0):
    cost_per_unit = model_info.get(cost_key)
    if isinstance(cost_per_unit, float): return cost_per_unit
    ...
    return default_value   # ← returns 0.0 when the field is missing; no error is raised
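One way to surface the trap is a guard that flags token buckets with a non-zero count and no configured price. The sketch below is an assumption-laden illustration, not part of LiteLLM: `lenient_cost_per_unit`, `missing_price_keys`, and `PRICE_KEY_FOR_BUCKET` are hypothetical, and `model_info` is treated as a plain dict.

```python
from typing import Any, Mapping


def lenient_cost_per_unit(
    model_info: Mapping[str, Any], cost_key: str, default_value: float = 0.0
) -> float:
    """Imitate the fallback quoted above: a missing key yields the default."""
    value = model_info.get(cost_key)
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return default_value


# Hypothetical mapping from a usage bucket to the price key that bills it.
PRICE_KEY_FOR_BUCKET = {
    "text_tokens": "input_cost_per_token",
    "cached_tokens": "cache_read_input_token_cost",
}


def missing_price_keys(
    model_info: Mapping[str, Any], token_counts: Mapping[str, int]
) -> list[str]:
    """Return price keys needed by non-empty token buckets but absent from model_info."""
    missing = []
    for bucket, count in token_counts.items():
        key = PRICE_KEY_FOR_BUCKET.get(bucket)
        if key is not None and count > 0 and model_info.get(key) is None:
            missing.append(key)
    return missing


model_info = {"input_cost_per_token": 3e-6}  # cache-read price not configured
counts = {"text_tokens": 2_000, "cached_tokens": 8_000}

silent = lenient_cost_per_unit(model_info, "cache_read_input_token_cost")
flagged = missing_price_keys(model_info, counts)
print(silent, flagged)
```

The lenient lookup quietly returns 0.0, while the guard reports the missing key so it can be logged or alerted on.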

Finally, response_cost is written to kwargs["standard_logging_object"]["response_cost"] and kwargs["response_cost"].


Step 2: Write to DB

Entry point: litellm/proxy/hooks/proxy_track_cost_callback.py:123 _PROXY_track_cost_callback()

# proxy_track_cost_callback.py:151-202
sl_object = kwargs.get("standard_logging_object", None)
response_cost = (
    sl_object.get("response_cost", None) if sl_object else kwargs.get("response_cost", None)
)
...
if response_cost is not None:
    if kwargs.get("cache_hit", False) is True:
        response_cost = 0.0       # the whole response was served from LiteLLM's internal cache (not a prompt cache)
    ...
    if _should_track_cost_callback(...):
        await proxy_logging_obj.db_spend_update_writer.update_database(
            token=user_api_key,
            response_cost=response_cost,
            user_id=user_id, end_user_id=..., team_id=..., org_id=...,
            kwargs=kwargs, completion_response=completion_response,
            start_time=..., end_time=...,
        )
        asyncio.create_task(update_cache(token=..., response_cost=response_cost, ...))

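Note that the two kinds of "cache hit" are entirely different concepts:

  • kwargs["cache_hit"] = a hit in LiteLLM's own in-process / Redis response cache (the whole response is reused); on a hit, response_cost=0
  • usage.prompt_tokens_details.cached_tokens = a hit in the upstream provider's prompt cache (part of the prompt tokens), billed at cache_read_input_token_cost

The bug discussed in this document series is the latter (a missing prompt cache price).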
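The distinction can be expressed as a small classifier. This is only a sketch: `cache_kind` is a hypothetical helper, and both `kwargs` and `usage` are assumed to be plain dicts here, which may not match the objects LiteLLM passes around.

```python
def cache_kind(kwargs: dict, usage: dict) -> str:
    """Classify a request as 'response_cache', 'prompt_cache', or 'none'."""
    if kwargs.get("cache_hit", False) is True:
        # Whole response reused by LiteLLM's own cache: cost is forced to 0.0.
        return "response_cache"
    details = usage.get("prompt_tokens_details") or {}
    if (details.get("cached_tokens") or 0) > 0:
        # Provider-side prompt cache: billed at cache_read_input_token_cost.
        return "prompt_cache"
    return "none"


print(cache_kind({"cache_hit": True}, {}))
print(cache_kind({}, {"prompt_tokens_details": {"cached_tokens": 8000}}))
print(cache_kind({}, {"prompt_tokens_details": None}))
```

Only the second case depends on the price table, which is why a missing cache-read price affects it and not the first.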

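Internally, db_spend_update_writer.update_database() runs UPDATE ... SET spend = spend + :response_cost against several tables, including LiteLLM_VerificationToken, LiteLLM_UserTable, LiteLLM_TeamTable, and LiteLLM_SpendLogs.

If response_cost = 0, the UPDATE still runs, but spend does not change. LiteLLM_SpendLogs is left with a detail row where spend=0, which is hard to catch in a later audit because the request was in fact recorded.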
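The accumulate-and-log behaviour, and one possible audit query for it, can be sketched with an in-memory SQLite database. The table and column names (`toy_key`, `toy_spend_log`) are hypothetical stand-ins rather than the real LiteLLM schema, and the real writer may batch its updates differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_key (token TEXT PRIMARY KEY, spend REAL NOT NULL DEFAULT 0);
    CREATE TABLE toy_spend_log (request_id TEXT, token TEXT, spend REAL, total_tokens INTEGER);
    INSERT INTO toy_key (token) VALUES ('hashed-key');
    """
)


def record(request_id: str, token: str, response_cost: float, total_tokens: int) -> None:
    # Accumulate spend; with response_cost == 0 the statement runs but changes nothing.
    conn.execute(
        "UPDATE toy_key SET spend = spend + :c WHERE token = :t",
        {"c": response_cost, "t": token},
    )
    # The per-request row is written regardless of the cost value.
    conn.execute(
        "INSERT INTO toy_spend_log VALUES (?, ?, ?, ?)",
        (request_id, token, response_cost, total_tokens),
    )


record("req-1", "hashed-key", 0.25, 1_200)
record("req-2", "hashed-key", 0.0, 9_500)  # cost silently computed as 0

total_spend = conn.execute("SELECT spend FROM toy_key").fetchone()[0]
# Audit idea: rows that consumed tokens yet cost nothing deserve a second look.
suspicious = conn.execute(
    "SELECT request_id FROM toy_spend_log WHERE spend = 0 AND total_tokens > 0"
).fetchall()
print(total_spend, suspicious)
```

In real data, a zero-spend row with tokens can also be a legitimate response-cache hit, so such a query is a starting point for investigation rather than proof of a pricing gap.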


Step 3: Refresh DualCache

asyncio.create_task(
    update_cache(token=..., user_id=..., response_cost=response_cost, ...)
)

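update_cache() writes the new spend to two keys in DualCache:

  • {token_hash} ← key-level spend
  • {user_id}_user_api_key_user_id ← user-level spend

The next pre-call check (02-pre-call-flow.md) can then read the new value immediately.

Note: this runs via asyncio.create_task, so it does not block the current request. Under short bursts of high concurrency, the cached value may lag by up to a few seconds.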
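The source of that lag can be illustrated with a tiny asyncio program. It is a sketch under stated assumptions: `toy_update_cache` and the `cache` dict are hypothetical, and the 10 ms sleep merely stands in for cache I/O latency.

```python
import asyncio

cache = {"spend": 0.0}


async def toy_update_cache(response_cost: float) -> None:
    await asyncio.sleep(0.01)  # stand-in for cache I/O latency
    cache["spend"] += response_cost


async def main() -> tuple[float, float]:
    # Fire-and-forget style: the task is scheduled but has not run yet.
    task = asyncio.create_task(toy_update_cache(0.25))
    seen_immediately = cache["spend"]  # a concurrent pre-call check would read this
    await task                         # awaited here only so the sketch can observe the result
    seen_after = cache["spend"]
    return seen_immediately, seen_after


seen_immediately, seen_after = asyncio.run(main())
print(seen_immediately, seen_after)
```

A reader that runs between scheduling and completion still sees the old spend, which is why a burst of concurrent requests can briefly overshoot a budget.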


Full sequence diagram

sequenceDiagram
    participant L as litellm_logging
    participant CC as cost_calculator
    participant U as llm_cost_calc/utils
    participant TC as _PROXY_track_cost_callback
    participant DB as db_spend_update_writer
    participant PG as PostgreSQL
    participant DC as DualCache

    L->>CC: response_cost_calculator(usage, model)
    CC->>U: generic_cost_per_token(usage, model_info)
    U->>U: _get_cost_per_unit(model_info, "cache_read_input_token_cost")
    Note over U: field missing → returns 0.0
    U-->>CC: prompt_cost (underestimated), completion_cost
    CC-->>L: response_cost (underestimated)
    L->>TC: success_handler triggers _PROXY_track_cost_callback
    TC->>DB: update_database(response_cost)
    DB->>PG: UPDATE spend = spend + response_cost
    Note over PG: response_cost too small → spend grows slowly
    TC->>DC: update_cache(response_cost)
    Note over DC: spend read by the next request is still far below max_budget

Comparison with TPM

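| | spend accumulation | TPM accumulation |
|---|---|---|
| When it accumulates | success_handler | success_handler |
| What it accumulates | response_cost (USD) | usage.total_tokens − cached_tokens |
| Affected by cache price configuration? | Yes (fatal) | No |
| Affected by the cache_hit_tokens field? | Yes (the larger the value, the more the cost is underestimated) | Yes (V3 deliberately excludes them) |
| Where the data lands | Multiple PG tables + DualCache | Redis counters |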

This is why "TPM/RPM limits hold, but max_budget does not" is a strong signal when diagnosing this class of bug.