05: Issue #11616 and PR #14730 in Detail

Someone on GitHub ran into the same problem you did, and it has already been fixed.

1. Issue #11616

Title: [Feature]: adjustable provider reconnection attempt count
Author: HarmonClaw
Created: 2026-04-17
Status: Closed (fixed by PR #14730)
Link: https://github.com/NousResearch/hermes-agent/issues/11616

Problem description

HarmonClaw hit almost exactly the same situation as you:

[2026/4/17 20:12] ⚠️ No response from provider for 180s (model: qwen3-coder-480b-a35b-instruct)
[2026/4/17 20:13] ⏳ Retrying in 2.0s (attempt 1/3)...
[2026/4/17 20:16] ⚠️ No response from provider for 180s
[2026/4/17 20:17] ⏳ Retrying in 4.5s (attempt 2/3)...
[2026/4/17 20:19] ⏳ Still working... (10 min elapsed — iteration 1/120)
[2026/4/17 20:20] ⚠️ No response from provider for 180s
[2026/4/17 20:24] ⚠️ No response from provider for 180s
[2026/4/17 20:25] ⏳ Retrying in 2.2s (attempt 1/3)...
... (this went on for more than 30 minutes)

In his own words:

Problem or Use Case
unstable provider cause agent reconnect again and again, that is a waste of time. any other fellas run into same issue as i do?

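(In other words: an unstable provider makes the agent reconnect over and over, which wastes a lot of time, and he asks whether anyone else has seen the same thing.)

The proposed solution

HarmonClaw suggested adding a setting to config.yaml that lets users control the maximum number of retries:

Option 1: add it to the agent configuration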

agent:
  max_turns: 120
  max_retry_before_failover: 1    # ← new: fail over after 1 retry
  gateway_timeout: 3600
  restart_drain_timeout: 60

Option 2: add it to the provider configuration

custom_providers:
  - name: nvidia
    base_url: https://integrate...com/v1
    api_key: ${NDA_API_KEY}
    models:
      - id: z-ai/glm4.7
        max_retry_before_failover: 1  # ← configured per model

Feature classification

Type: Performance / reliability
Scope: Small (single file, < 50 lines changed)
Contribution: the author said he was willing to write the PR himself

2. PR #14730

Title: feat(agent): make API retry count configurable via agent.api_max_retries
Author: teknium1 (Contributor)
Merged: 2026-04-23
Status: Merged into the main branch
Link: https://github.com/NousResearch/hermes-agent/pull/14730

Summary

Closes #11616 — expose the hardcoded API retry count as agent.api_max_retries in config.yaml so users with fallback providers can fail over on flaky primaries instead of burning ~3 × provider_timeout on the same stall.

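(In other words, the retry count that used to be hardcoded is now read from agent.api_max_retries in config.yaml, so users who have fallback providers can switch away from a flaky primary quickly instead of spending roughly 3 × provider_timeout on the same stall.)

The reporter's scenario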

qwen-coder hit 3 × 180s provider-silence reconnect loops back-to-back — ~9 minutes of dead time before retry budget exhausted. With api_max_retries: 1, the retry loop surfaces the error on the first failure, giving the user's fallback chain (or the error-handler path) a fast handoff.

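(That is, three consecutive 180-second silent-provider timeouts, 3 × 180 s = 540 s or about 9 minutes, passed before the retry budget ran out. With api_max_retries: 1 the error surfaces on the first failure, so the fallback chain or the error-handling path can take over quickly.)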
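
The dead-time figure in the quote is plain arithmetic. The following is a minimal sketch of that calculation, assuming every attempt runs into the full provider timeout and ignoring the few seconds of backoff between attempts. The function name is made up for illustration and is not part of Hermes.

```python
def worst_case_stall_seconds(attempts: int, provider_timeout_s: float) -> float:
    """Upper bound on time spent waiting on a silent provider.

    Assumes each attempt waits for the full provider timeout and
    ignores the short backoff pauses between attempts.
    """
    return attempts * provider_timeout_s


# Reporter's case: 3 attempts, 180 s of silence each (540 s, i.e. 9 minutes).
default_stall = worst_case_stall_seconds(3, 180)
# With api_max_retries: 1 the loop gives up after one attempt (180 s, i.e. 3 minutes).
fast_fail_stall = worst_case_stall_seconds(1, 180)

print(default_stall / 60, fast_fail_stall / 60)
```

The saving scales with the provider timeout: the longer a provider is allowed to stay silent, the more each extra attempt costs.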

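Which files changed?

File	Change
`hermes_cli/config.py`	Adds `agent.api_max_retries` (default 3) with an explanatory comment
`run_agent.py`	Reads the value into `self._api_max_retries` in `AIAgent.__init__`; replaces the hardcoded `max_retries = 3` in the retry loop; clamps values < 1 to 1; falls back to the default for non-integer values
`cli-config.yaml.example`	Adds an example setting
`hermes_cli/tips.py`	Adds a tip line so the option is discoverable
`tests/run_agent/test_api_max_retries_config.py`	4 test cases (default = 3, custom value applied, clamp to 1, invalid value falls back)

Scope of the change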

This wraps the Hermes-level retry loop ONLY — the OpenAI SDK's own low-level retries (max_retries=2 default) still run beneath this for transient network errors.

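(In other words, only the Hermes-level retry loop is affected. The OpenAI SDK's own low-level retries, 2 by default, still run underneath it to handle transient network errors.)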
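
To make the interaction between the retry budget and the fallback chain concrete, here is a minimal sketch of an application-level retry loop with failover. It is not the code in run_agent.py: `ProviderError`, `call_with_failover`, and the two toy providers are hypothetical names, and backoff delays and the SDK's own low-level retries are left out.

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails or times out."""


def call_with_failover(
    providers: Sequence[Callable[[], str]],
    api_max_retries: int = 3,
) -> str:
    """Try each provider up to api_max_retries times before moving to the next."""
    last_error: Optional[Exception] = None
    for call_provider in providers:
        for _attempt in range(api_max_retries):
            try:
                return call_provider()
            except ProviderError as exc:
                last_error = exc  # remember the failure, then retry or fail over
    raise ProviderError("all providers failed") from last_error


calls = {"primary": 0}


def flaky_primary() -> str:
    """Toy primary provider that always fails and counts its invocations."""
    calls["primary"] += 1
    raise ProviderError("no response from provider")


def fallback() -> str:
    """Toy fallback provider that always succeeds."""
    return "answer from fallback"


print(call_with_failover([flaky_primary, fallback], api_max_retries=1))
print(calls["primary"])
```

With a budget of 1 the flaky primary is called once before the fallback answers. With the default of 3 it would be called three times first, and each of those calls could run into the full provider timeout.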

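Behavior changes

Setting	Before	After
Not set (default)	`max_retries = 3`	`max_retries = 3` (unchanged)
`api_max_retries: 1`	Ignored	Retry loop runs only once; fails fast
`api_max_retries: 5`	Ignored	Retry loop runs up to 5 times
`api_max_retries: 0`	Ignored	Clamped to 1 (prevents the degenerate zero-attempt case)
`api_max_retries: "xyz"`	Ignored	Falls back to 3 without crashing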
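
The "After" column can be reproduced with a small normalization helper. This is only a sketch of the documented behavior. `resolve_api_max_retries` is a hypothetical name, the actual code in run_agent.py may be structured differently, and accepting numeric strings such as "5" is an assumption of this sketch.

```python
DEFAULT_API_MAX_RETRIES = 3


def resolve_api_max_retries(raw: object, default: int = DEFAULT_API_MAX_RETRIES) -> int:
    """Normalize a user-supplied retry count (hypothetical helper)."""
    if raw is None:
        return default  # setting absent: keep the old behavior
    try:
        value = int(raw)
    except (TypeError, ValueError, OverflowError):
        return default  # e.g. "xyz": fall back instead of crashing
    return max(1, value)  # 0 or negative: clamp to a single attempt


for raw in (None, 1, 5, 0, "xyz"):
    print(repr(raw), "->", resolve_api_max_retries(raw))
```

Clamping to 1 rather than rejecting the value keeps a misconfigured agent usable: it still makes one attempt instead of making none.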

Tests

tests/run_agent/test_api_max_retries_config.py: 4/4 passing:

1. Default value = 3
2. Custom value is applied
3. Values < 1 are clamped to 1
4. Invalid values fall back to the default

3. What this fix means for you

Status of your fork

Your fork (Setsuna-Yukirin/hermes-agent) does not have this fix yet.

Current state:

- Latest commit: 88b6eb9a chore(release): map Nan93 in AUTHOR_MAP
- run_agent.py:9293 still has max_retries = 3 (hardcoded)
- DEFAULT_CONFIG["agent"] in hermes_cli/config.py has no api_max_retries
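
One rough way to re-check a checkout after updating is to look for the new key in the config module. The helper below is an illustrative sketch, not a Hermes tool: the function name is made up, and a plain substring search is only a heuristic.

```python
from pathlib import Path


def has_api_max_retries(repo_root: str) -> bool:
    """Return True if hermes_cli/config.py under repo_root mentions api_max_retries."""
    config_py = Path(repo_root) / "hermes_cli" / "config.py"
    if not config_py.is_file():
        return False  # not a hermes-agent checkout, or unexpected layout
    text = config_py.read_text(encoding="utf-8", errors="replace")
    return "api_max_retries" in text


if __name__ == "__main__":
    # Run from the root of the checkout you want to inspect.
    print(has_api_max_retries("."))
```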

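Upstream status

Upstream (NousResearch/hermes-agent) already has this fix.

PR #14730 was merged on 2026-04-23
Your issue occurred on 2026-04-25
Had you been running the latest upstream at that point, you could have set agent.api_max_retries and avoided the long retry loop (the default is still 3, so the setting has to be lowered explicitly)

How do you get the fix?

Option 1: merge the latest upstream code (recommended)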

cd ~/.hermes/hermes-agent/hermes-agent

# Add the upstream remote (skip if it already exists)
git remote add upstream https://github.com/NousResearch/hermes-agent.git

# Fetch upstream changes
git fetch upstream

# Merge into the current branch
git merge upstream/main

# Only if the merge stopped on conflicts: resolve them, then commit
git commit -am "merge upstream: include api_max_retries fix"

Option 2: cherry-pick only this PR's commit

# The commit hash for PR #14730 is 19a6771
git cherry-pick 19a6771ee9fc23f17fb012ae828f40124b9daf78

Option 3: manual edit (temporary workaround)

Edit run_agent.py:9293:

# Before
max_retries = 3

# After
max_retries = 1  # fail fast and go straight to the fallback

Configuration example

After getting the fix, add the following to ~/.hermes/config.yaml:

agent:
  api_max_retries: 1  # one attempt only, then hand off to the fallback

fallback_providers:
  - provider: openai
    model: gpt-4o-mini

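With this configuration:

- The first DeepSeek failure triggers an immediate switch to OpenAI
- No more 15 minutes lost to futile retries

4. Related issues and PRs

Other related issues

Issue #12013: Open; possibly another retry/timeout problem related to this one

5. Summary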

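Item	Details
Problem	An unstable provider causes endless reconnect loops and wastes a lot of time
Reported by	HarmonClaw (Issue #11616, 2026-04-17)
Fixed by	teknium1 (PR #14730, 2026-04-23)
Fix	Adds the `agent.api_max_retries` setting
Your status	The fork does not have this fix yet
Recommendation	Merge the latest upstream code and set `api_max_retries: 1`; as a stopgap, hand-edit `max_retries` in run_agent.py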

PR	Status	Notes
#14730	✅ Merged	Adds the `api_max_retries` setting
#15274	✅ Merged	Fixes Feishu fenced code block parsing (2026-04-25)