Can your AI agent actually learn from its mistakes or just keep repeating them?

Agent skills, the instructions and guidelines that govern how AI models behave when solving problems, exist in an awkward middle ground. They’re either hand-crafted once and frozen, generated fresh each time without learning, or loosely self-revised without any real feedback mechanism. None of these approaches behaves like actual optimization.

Compare this to how we train neural networks. With weights, we have a clear loss signal, bounded update steps, validation gates, and reproducible improvement. We can inspect learning curves. We can measure generalization. We know whether we’re making progress or just fitting noise. With skills, we’ve been winging it. Someone writes a prompt, maybe tweaks it based on a few examples, and ships it. If it doesn’t work well enough, the process starts again, but there’s no systematic way to improve.

As agents become more capable and are deployed at scale, the skill becomes the bottleneck. A frozen model can’t improve its behavior without retraining, which is expensive. Self-revision is unreliable, and hand-crafting doesn’t scale. We need a way to improve skills the way we improve models: systematically, with reproducible results, and with validation gates that prevent chasing noise.

Treating skills as trainable parameters

The core insight behind a new SkillOpt paper is simple: a skill document is just external state that modifies how a model behaves. It’s not fundamentally different from internal weights, except it lives outside the model and can be edited without retraining. What if we treated it exactly like a neural network parameter, just in text space instead of number space?

The key move is to freeze the model completely and optimize only the skill document itself. This is backward from how we usually think about improving agents: fine-tune the model, scale it up, use a better architecture. But it’s actually more aligned with how optimization works in practice. The model becomes a fixed function.
The skill becomes the variable we’re training.

Once skills are framed as parameters, we can apply real optimization techniques. We get reproducibility. We get validation gates that prevent accepting false improvements. We get learning curves that show actual progress instead of random wandering. The skill becomes a learnable object, no different in principle from a neural network weight.

How SkillOpt optimizes skills systematically

The machinery works like a very disciplined form of skill editing. Run the target model many times with the current skill, collecting successes and failures. Feed those rollouts to a separate optimizer model, asking it to identify what went wrong and propose targeted edits. The optimizer suggests changes: add this guideline, remove that constraint, replace vague language with specific examples. But each proposed edit gets tested on held-out validation data first. If it improves the validation score, keep it. If not, reject it. Only confirmed improvements stick.

This validation gating is the crucial difference from self-revision. You’re not letting the main agent tinker with its own skill unsupervised. Instead, there’s a referee (validation data) and a thoughtful editor (the optimizer model) checking every change before it lands.

The full pipeline cycles across epochs:

Rollout and collection starts each epoch. Run the target model many times with the current skill, recording trajectories, successes, and failures.

Optimizer reflection comes next. The optimizer model analyzes the rollout batch, identifying patterns in what succeeded and what failed. It then proposes bounded edits to the skill document. The edits are constrained: add, delete, or replace single statements rather than rewriting the document wholesale. A textual learning-rate budget caps how much the skill can change per epoch, keeping updates stable and preventing wild swings.

Validation gating tests each proposed edit on held-out validation data.
An edit is accepted only if it strictly improves the validation score. Rejected edits go into a buffer so the optimizer doesn’t propose the same failing changes repeatedly.

Meta-updates and scheduling across epochs keep optimization stable and avoid overfitting to individual rollouts. The system uses slow updates and epoch-wise adjustments inspired by meta-learning.

A subtle but important detail: the optimized skill is just text. At inference time, you pass it to the model. No extra models running, no additional latency overhead. The entire optimization happens offline.

Figure: The target model executes tasks with a current skill, the optimizer model analyzes trajectories and proposes bounded edits, and a validation gate accepts only edits that improve held-out performance.

Figure: Each epoch, the frozen target model executes rollouts with the current skill, the optimizer model reflects on successes and failures, proposes bounded edits, merges candidates, and only accepts edits that improve validation performance.

Evidence of improvement across diverse models and benchmarks
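To make the epoch loop described above concrete, here is a minimal Python sketch. It is not the paper's implementation. Every name in it (`Edit`, `apply_edit`, `optimize_skill`, and the `toy_*` helpers) is hypothetical, the target and optimizer models are replaced by deterministic toy functions, the learning-rate budget is simplified to a cap on accepted edits per epoch, and the meta-update step is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

Skill = Tuple[str, ...]        # a skill document as an ordered tuple of statements
Rollout = Tuple[str, bool]     # (task, succeeded?) for one run of the frozen model


@dataclass(frozen=True)
class Edit:
    """One bounded edit: add, delete, or replace a single statement."""
    op: str                    # "add", "delete", or "replace"
    old: Optional[str]         # statement to delete/replace (None for "add")
    new: Optional[str]         # statement to add/insert (None for "delete")


def apply_edit(skill: Skill, edit: Edit) -> Skill:
    """Return a new skill; an edit whose target is missing is a no-op."""
    statements = list(skill)
    if edit.op == "add":
        if edit.new not in statements:
            statements.append(edit.new)
    elif edit.old in statements:
        i = statements.index(edit.old)
        if edit.op == "delete":
            del statements[i]
        elif edit.op == "replace":
            statements[i] = edit.new
    return tuple(statements)


def optimize_skill(
    skill: Skill,
    run_rollouts: Callable[[Skill], List[Rollout]],
    propose: Callable[[Skill, List[Rollout], Set[Edit]], List[Edit]],
    validate: Callable[[Skill], float],
    epochs: int = 3,
    edit_budget: int = 1,      # assumed stand-in for the textual learning rate
) -> Tuple[Skill, List[float]]:
    rejected: Set[Edit] = set()          # buffer of edits that failed the gate
    best = validate(skill)
    history = [best]                     # validation score per epoch
    for _ in range(epochs):
        rollouts = run_rollouts(skill)   # frozen model, current skill
        accepted = 0
        for edit in propose(skill, rollouts, rejected):
            if accepted >= edit_budget:
                break
            candidate = apply_edit(skill, edit)
            score = validate(candidate)
            if score > best:             # gate: strict improvement only
                skill, best = candidate, score
                accepted += 1
            else:
                rejected.add(edit)
        history.append(best)
    return skill, history


# --- Toy stand-ins for the two models (assumptions, not real LLM calls) ---
TRAIN_TASKS = ["units", "units", "edge-cases"]
VAL_TASKS = ["units", "edge-cases", "units", "edge-cases"]


def toy_success(skill: Skill, task: str) -> bool:
    # The toy "target model" succeeds only if the skill carries the right hint.
    return f"check {task}" in skill


def toy_rollouts(skill: Skill) -> List[Rollout]:
    return [(task, toy_success(skill, task)) for task in TRAIN_TASKS]


def toy_validate(skill: Skill) -> float:
    return sum(toy_success(skill, t) for t in VAL_TASKS) / len(VAL_TASKS)


def toy_propose(skill: Skill, rollouts: List[Rollout], rejected: Set[Edit]) -> List[Edit]:
    # The toy "optimizer" adds a hint per failed task, plus one unhelpful deletion.
    edits = [Edit("add", None, f"check {task}") for task, ok in rollouts if not ok]
    edits.append(Edit("delete", "be helpful", None))
    unique: List[Edit] = []
    for edit in edits:
        if edit not in rejected and edit not in unique:
            unique.append(edit)
    return unique


final_skill, history = optimize_skill(
    ("be helpful",), toy_rollouts, toy_propose, toy_validate, epochs=3, edit_budget=1
)
print(final_skill)
print(history)
```

With a budget of one accepted edit per epoch, the toy run adds one hint in each of the first two epochs. In the third, the deletion is tried, fails to strictly improve the validation score, and lands in the rejected buffer, so the learning curve flattens rather than drifting.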
