FuturixAI
Technology
June 16, 2025
FuturixAI

Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance

Abstract

Reward-model training is the cost bottleneck in modern Reinforcement Learning from Human Feedback (RLHF) pipelines, often requiring reward models with tens of billions of parameters and a separate offline preference-tuning phase. In the proposed method, a frozen, instruction-tuned 7B LLM is augmented with only a one-line JSON rubric and a rank-16 LoRA adapter (affecting just 0.8% of the model's parameters), enabling it to fully replace these heavyweight evaluation models. The plug-and-play judge achieves 96.2% accuracy on RewardBench, outperforming specialized reward networks ranging from 27B to 70B parameters. Used with online PPO, it also lets a 7B actor reach 92% exact-match accuracy on GSM-8K, well above the top 70B DPO baseline at 61.8%. Thorough ablations indicate that (i) six in-context demonstrations deliver the majority of the zero-to-few-shot improvement (+2 pp), and (ii) the LoRA adapter effectively closes the remaining gap, particularly on the safety and adversarial Chat-Hard segments. To examine interpretability, we introduce HH-Rationales, a subset of 10,000 pairs from Anthropic HH-RLHF annotated with human-written justifications. GPT-4 scoring indicates that our LoRA judge attains approximately 9/10 similarity to human explanations, while zero-shot judges score around 5/10. These results indicate that combining prompt engineering with a tiny LoRA adapter yields a cost-effective, transparent, and easily adjustable reward function, removing the offline phase while setting a new state of the art for both static evaluation and online RLHF.
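
To make the judge-as-reward idea concrete, here is a minimal sketch of how a rubric-prompted judge could be wired into a reward signal. It is an illustration under stated assumptions, not the paper's implementation: the rubric wording, the `build_judge_prompt` and `verdict_to_reward` helpers, the demonstration format, and the +1/-1/0 reward mapping are all hypothetical, since the abstract does not specify them. The call to the judge model itself is left out.

```python
import json

# Hypothetical one-line JSON rubric; the abstract does not give the real wording.
RUBRIC = json.dumps({
    "task": "Compare responses A and B to the prompt.",
    "criteria": ["helpfulness", "correctness", "safety"],
    "output": {"winner": "A or B", "rationale": "one sentence"},
})


def build_judge_prompt(prompt, response_a, response_b, demos=()):
    """Assemble the rubric, optional in-context demonstrations, and the pair to judge.

    Each demo is assumed to be a dict with keys: prompt, a, b, verdict.
    """
    parts = [f"Rubric: {RUBRIC}"]
    for d in demos:
        parts.append(
            f"Prompt: {d['prompt']}\nA: {d['a']}\nB: {d['b']}\n"
            f"Verdict: {json.dumps(d['verdict'])}"
        )
    # The final block is left open so the judge completes the verdict.
    parts.append(f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\nVerdict:")
    return "\n\n".join(parts)


def verdict_to_reward(judge_output):
    """Map the judge's JSON verdict to a scalar reward for response A.

    Assumed mapping: +1.0 if A wins, -1.0 if B wins, 0.0 if the output
    cannot be parsed or names no valid winner.
    """
    try:
        verdict = json.loads(judge_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(verdict, dict):
        return 0.0
    winner = str(verdict.get("winner", "")).strip().upper()
    return {"A": 1.0, "B": -1.0}.get(winner, 0.0)
```

In an online PPO loop, the text returned by the judge for each sampled actor response would be passed through something like `verdict_to_reward`, and the rationale field could be logged for inspection. Returning a neutral reward on malformed output is one possible design choice; a stricter pipeline might re-query the judge instead.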
