Can an AI agent run the entire scientific method without human supervision?

I bet you can imagine a researcher who runs an experiment, fails, and then forgets everything that led to the failure. The next attempt starts fresh, with no memory of what was tried or why it didn’t work. Over hours or days of research, this researcher would waste enormous effort revisiting dead ends, trying contradictory approaches, and never building cumulative knowledge about the problem.

This is roughly how current AI agents approach autonomous research tasks. When Claude or Codex is asked to optimize a machine learning model, improve a data pipeline, or engineer an algorithm, each interaction operates in isolation. The agent calls the model, gets code, runs it, observes the results, and then starts the next iteration from scratch. If something fails, there’s no structured way to remember why it failed, which assumptions broke, or what the failure reveals about the problem space. The agent might retry a similar failing approach, or swing to a completely different strategy without carrying any intermediate lessons forward.

Over long horizons, this becomes profoundly wasteful. Current systems explore locally, but their exploration is essentially memoryless. An agent can’t reason, “We learned that approach A doesn’t work because of property X, so approach B, which relies on property X, probably won’t work either.” Such agents plateau because they never develop a coherent theory of the problem space.

This is the inefficiency that a new framework called Arbor addresses. The framework treats autonomous research not as a sequence of disconnected trials but as a cumulative process in which strategy, execution, and evidence compound over time.

Reframing research as a process of knowledge accumulation

The useful question is not “How can we execute experiments faster?” but “How can we help an AI agent actually think like a researcher?” A human researcher doesn’t run experiments randomly.
They maintain a mental model of the problem, track hypotheses and whether they’ve held up under evidence, and use that accumulated understanding to choose the next experiment.

Arbor materializes this mental model as a hypothesis tree: a persistent data structure that links hypotheses, artifacts, evidence, and distilled insights across time. Unlike a linear notebook, the tree lets information propagate sideways, so a lesson learned in one branch can inform decisions in another. A hypothesis might be refined into sub-hypotheses as evidence accumulates, and a branch might be pruned if results rule it out.

This transforms the research process. Experiments no longer run in isolation: each result updates not just “the best solution found so far” but “what we know about this problem,” and that knowledge becomes the basis for the next decision.

The hypothesis tree as persistent memory

The architecture has three components. The coordinator is a language model that reads the hypothesis tree, interprets the accumulated evidence, and decides which hypotheses to test next. The executors are isolated processes that implement and run specific experiments. The hypothesis tree itself is the persistent record that connects them, storing every hypothesis, its experimental evidence, the interpretation of that evidence, and the lessons that generalize.

Consider a concrete task: improving a model’s accuracy by tuning its training procedure. The tree might start with a single hypothesis at the root: “we can improve performance by adjusting hyperparameters.” An executor tests this by running an experiment. The results show that learning rate matters but batch size doesn’t. The tree records this.
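Recording that result might look something like the following. The post doesn’t describe Arbor’s actual data structures, so this is only a minimal Python sketch under assumed names: `Hypothesis`, `refine`, `record`, `prune`, and `propagate_refutation` are all hypothetical. Each node keeps raw evidence separate from distilled insights, and `propagate_refutation` shows one possible reading of “sideways” propagation: once a result refutes an assumption, open hypotheses in other branches that depend on it get flagged. The sibling branch about batch scaling is invented purely to give the refutation something to hit.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """Hypothetical tree node; the post does not publish Arbor's real schema."""
    statement: str
    assumptions: set[str] = field(default_factory=set)  # properties this idea depends on
    status: str = "open"  # "open", "doubtful", or "pruned" in this sketch
    evidence: list[str] = field(default_factory=list)  # raw experimental observations
    insights: list[str] = field(default_factory=list)  # distilled lessons that generalize
    children: list[Hypothesis] = field(default_factory=list)

    def refine(self, statement: str, assumptions: set[str] | None = None) -> Hypothesis:
        """Attach a narrower sub-hypothesis beneath this one."""
        child = Hypothesis(statement, set(assumptions or ()))
        self.children.append(child)
        return child

    def record(self, observation: str, insight: str | None = None) -> None:
        """Store an experiment's result and, optionally, the lesson drawn from it."""
        self.evidence.append(observation)
        if insight is not None:
            self.insights.append(insight)

    def prune(self, reason: str) -> None:
        """Close this branch once evidence rules it out, keeping the reason."""
        self.status = "pruned"
        self.insights.append(reason)

    def walk(self) -> Iterator[Hypothesis]:
        """Yield every node in the subtree, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def propagate_refutation(root: Hypothesis, refuted: str) -> list[Hypothesis]:
    """Sideways propagation: mark open hypotheses in any branch that rely on a refuted assumption."""
    flagged = []
    for node in root.walk():
        if node.status == "open" and refuted in node.assumptions:
            node.status = "doubtful"
            node.insights.append(f"relies on refuted assumption: {refuted}")
            flagged.append(node)
    return flagged


root = Hypothesis("We can improve performance by adjusting hyperparameters")
# Hypothetical sibling branch that quietly depends on batch size mattering.
scaling = root.refine("Scale the batch size together with the learning rate", {"batch size matters"})

# The executor reports: learning rate matters, batch size does not.
root.record(
    "learning rate matters; batch size does not",
    insight="batch size is not a useful lever for this task",
)
flagged = propagate_refutation(root, "batch size matters")
print([h.statement for h in flagged])  # ['Scale the batch size together with the learning rate']
print(scaling.status)  # doubtful
```

Flagging a dependent branch as doubtful, rather than deleting it, is a design choice of this sketch: the decision to prune stays with the coordinator, and the reason is kept on the node either way.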
The coordinator reads the evidence and branches the tree further: “learning rate has a sweet spot around 0.001; let’s explore how it interacts with warmup schedules” versus “batch size doesn’t matter, so let’s explore data augmentation.”

As the tree grows, the coordinator’s decisions become better informed. It no longer makes random choices about what to try; it works from the map the tree has built. It recognizes when two branches are testing the same underlying assumption and can merge the findings across them. It knows which areas have been thoroughly explored and which deserve more investigation.

AIModels.fyi is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Coordination as strategic research direction
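The coordinator bookkeeping from the previous section (merging findings from branches that test the same assumption, and steering toward less-explored areas) can be sketched in a few lines. In Arbor the coordinator is a language model reading the whole tree; this hypothetical Python sketch swaps that judgment for a plain “fewest experiments first” heuristic, and every name in it (`Node`, `group_by_assumption`, `shared_evidence`, `choose_next`, `run_round`) is illustrative, not Arbor’s API. The optimizer branch, the assumption labels, and the stub executor are invented for the example.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Hypothetical, minimal hypothesis node (not Arbor's actual schema)."""
    statement: str
    assumptions: frozenset[str] = frozenset()
    status: str = "open"  # "open" or "pruned" in this sketch
    experiments: list[str] = field(default_factory=list)


def group_by_assumption(nodes: list[Node]) -> dict[str, list[Node]]:
    """Map each assumption to every hypothesis that tests it, whatever branch it sits in."""
    groups: dict[str, list[Node]] = defaultdict(list)
    for node in nodes:
        for assumption in sorted(node.assumptions):
            groups[assumption].append(node)
    return dict(groups)


def shared_evidence(nodes: list[Node], assumption: str) -> list[str]:
    """Pool experiment results from all branches that test the same assumption."""
    return [
        result
        for node in group_by_assumption(nodes).get(assumption, [])
        for result in node.experiments
    ]


def choose_next(nodes: list[Node]) -> Node | None:
    """Heuristic stand-in for the LLM coordinator: the least-explored open hypothesis."""
    open_nodes = [n for n in nodes if n.status == "open"]
    return min(open_nodes, key=lambda n: len(n.experiments), default=None)


def run_round(nodes: list[Node], executor: Callable[[str], str]) -> Node | None:
    """One coordinator step: pick a hypothesis, hand it to an executor, log the result."""
    target = choose_next(nodes)
    if target is not None:
        target.experiments.append(executor(target.statement))
    return target


tree = [
    Node("Learning-rate sweet spot depends on the warmup schedule",
         frozenset({"learning rate is the main lever"}),
         experiments=["warmup sweep 1", "warmup sweep 2"]),
    Node("Learning-rate sweet spot shifts with the optimizer",
         frozenset({"learning rate is the main lever"}),
         experiments=["optimizer comparison"]),
    Node("Data augmentation improves accuracy", frozenset({"data is the bottleneck"})),
    Node("Larger batches improve accuracy", frozenset({"batch size matters"}), status="pruned"),
]

print(shared_evidence(tree, "learning rate is the main lever"))
# ['warmup sweep 1', 'warmup sweep 2', 'optimizer comparison']

# The executor is a stub here; in Arbor it would be an isolated process running real code.
picked = run_round(tree, lambda statement: f"stub result for: {statement}")
print(picked.statement)  # Data augmentation improves accuracy
```

The pruned batch-size branch has no experiments, yet it is never chosen, because `choose_next` only considers open hypotheses; that is the sketch’s stand-in for “knowing which areas deserve more investigation.”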
