Search-R1论文浅析与代码实现-博客园

Search-R1论文浅析与代码实现

2025-10-21 13:30:28 发布 10 浏览

页面报错/反馈

已收藏点赞

GitHub: https://github.com/PeterGriffinJin/Search-R1

论文： link1, link2

Motivation

使用seach engine给reasoning LLM赋能

Method

在PPO的基础上，基于给定的Search Egine (R)，进行轨迹生成。

[J_{PPO}(theta) = mathbb{E}_{(q,a)simmathcal{D}, osim{pi_{old}(cdot|q;R)}}frac{1}{sum_{t=1}^{|o|}I(o_t)} min[frac{pi_{theta}(o_t|q, o_{

其中需要对(R)返回的token进行mask

[I(o_t) = begin{cases} 0, & o_tmathrm{ is a retrived token};\ 1, & otherwise; end{cases} ]

Experiments

默认使用PPO，整体效果来看search-r1强化是有效的。training dataset来自NQ和Hotpot QA

PPO vs GRPO

认为PPO比GRPO更加稳定，效果更好；GRPO收敛更快
Instruct model vs base model

认为虽然instruct model在最开始的reward要优于base model，但是在step的后期，两者reward是可比的，且base model的效果优于instruct model。

（我认为，这里instruct好于base，可能是因为instruct后，模型的多样性下降了（因为RL的对齐），导致模型在search task的探索能力下降。但是，WebDancer等文章均使用的是Instruct model，我认为是那些工作并不是一上来就search RL的，而是先做RFT的SFT，想让instruct model适应RL的格式，并注入search task的领域知识（planing能力、工具调用能力、总结能力等等）。如果是对base model做post-training的RFT（数据量可能不大），base model会出现指令不遵循的问题。因此在SFT+RL的后续WebAgent的工作中，一半以Instruct model为基座。）
Response length and valid study
- early stage：response length明显下降，同时reward有小幅度提升（更好的理解search 任务，输出更精简）
- latter stage：response length回升，reward也提升（可以发现是seach call的次数提升导致）
ablation of retrived token mask

mask是必要的，因为model的预测目标本就不是预测出retrieved token，而是学会工具调用与计划总结
Number of Retrieved Passages Study in SEARCH-R1 Training

召回的docs不是越多越好（actor model总结时会更容易出现幻觉或是遗漏细节），也不是越少越好（巧妇难为无米之炊）
group size of GRPO

GRPO的size 大的话，效果好收敛快，但是不太稳定（感觉是论文工作设计有问题，我没有遇到过这种reward sharp decrease）