I Let Claude Research My Own Model While I Slept
Tonight I pointed Claude Code running Opus 4.6 at a research repo and went to bed. The task: autonomously optimize a defect segmentation model for additive manufacturing, the same class of work underlying GG-Net, my doctoral research on AI-driven quality control for 3D printing. The instructions were simple: edit the model, commit, run, read results, decide to keep or revert, repeat. Never stop. Don't ask me anything.
AutoResearch is Andrej Karpathy's example workflow for agentic engineering, captured in a single program.md. The scientific method, automated, running at 1 AM on a consumer GPU while I sleep. Roughly 40 experiments overnight. Some noise, some signal. Somewhere in that git log is a train.py nobody has read yet that might outperform the baseline I spent weeks hand-tuning.
The Loop
The loop is straightforward: edit train.py with a hypothesis, run uv run train.py > run.log 2>&1, wait for the run to finish, read run.log and extract val_score, log the result to results.tsv, then keep the commit if the score improved or revert if it didn't. Start over with a new hypothesis.
No human approval at any step. The permissions are set in .claude/settings.json to allow reading, editing, and running bash commands without interruption. The agent owns the branch. I own the morning debrief.
"NEVER STOP. Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep."
That's from Karpathy's program.md. Reading it at midnight before closing the laptop, I took it seriously for what it is: an actual delegation of scientific authority to a machine, with a clause acknowledging you won't be there to countermand it.
What Makes It Work
Two things. First, program.md explicitly tells the agent not to stop — that directive is what removes the default instinct to pause and confirm before each step. Second, a .claude/settings.json in the project root grants the permissions needed to act without prompting:
```json
{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Edit(*)",
      "Write(*)",
      "Read(*)"
    ]
  }
}
```
That's it. No human in the loop, no confirmation dialogs. The agent can read any file, edit any file, write any file, and run any shell command — including training runs that take 20 minutes each. Combined with the "NEVER STOP" directive in program.md, those two levers are what turn a capable model into an autonomous researcher.
40 Experiments Later
Six hours later I woke up to this ledger. 40 experiments. 2 crashes. A clear winner and a clear map of what doesn't work:
| commit | val_score | result | notes |
|---|---|---|---|
| c983099 | 0.5034 | keep | baseline (720s, 19 epochs) |
| 61ce4a6 | 0.6145 | keep | pure joint training (stage 0 only) |
| 1e4d506 | 0.6232 | keep | 2x stage 0 loss weights |
| exp6 | 0.3981 | discard | 5x stage 0 loss weights (PBQC collapsed) |
| exp8 | 0.5643 | discard | STN LR 5e-4→1e-3 (worse) |
| exp9 | 0.5164 | discard | main LR 5e-5→2e-4 (worse) |
| exp10 | 0.2568 | discard | SSIM loss pred vs gcode (much worse) |
| exp11 | 0.4050 | discard | STN rotation 15→30deg (PBQC collapsed) |
| exp12 | 0.5500 | discard | STN reg 0.001→0.01 (worse) |
| exp13 | 0.5287 | discard | learned GCodeEncoder (worse) |
| exp14 | 0.4160 | discard | CosineAnnealingLR instead of WarmRestarts |
| exp15 | 0.5360 | discard | STN translation 2.0→0.5 (worse) |
| exp16 | 0.5545 | discard | remove Level 1 STN (worse) |
| exp17 | 0.6141 | discard | remove stn_weight_factor 0.3→1.0 (marginally worse) |
| exp18 | 0.5579 | discard | stn_weight_factor 0.3→0.5 (worse) |
| exp19 | 0.5470 | discard | gradient accumulation 4 steps (dice dropped) |
| exp20 | — | crash | replace cross-attn with attn gates (channel mismatch) |
| exp21 | 0.5349 | discard | rebalance alignment BCE/dice (worse) |
| exp22 | 0.4117 | discard | Dropout2d before final conv (PBQC collapsed) |
| exp23 | 0.5600 | discard | shift loss from seg to alignment (worse) |
| exp24 | 0.5674 | discard | early encoder skip to output (PBQC worse) |
| exp25 | 0.4179 | discard | pred-gcode consistency loss (PBQC collapsed) |
| exp26 | 0.4336 | discard | deeper STN localization (PBQC collapsed) |
| exp27 | 0.4202 | discard | tighter STN scale [0.4,1.2] (PBQC near zero) |
| exp28 | 0.4622 | discard | STN scheduler T_0=3→6 (worse) |
| exp29 | 0.5437 | discard | Focal Loss instead of BCE (worse) |
| exp30 | 0.5573 | discard | stn_weight=1.0 + boosted seg (worse) |
| exp31 | 0.5472 | discard | MONAI DiceCELoss (worse) |
| exp32 | 0.4451 | discard | 2ep STN warmup then joint (PBQC collapsed) |
| exp33 | 0.5473 | discard | main weight decay 5e-5→1e-5 (worse) |
| exp34 | 0.5822 | discard | stn_weight=1.0 retest (within noise) |
| rerun | 0.5915 | keep | variance check (same code as best) |
| exp35 | — | crash | EfficientNet-B0 (channel mismatch) |
| exp36 | 0.5653 | discard | grad clip 1.0 all params (PBQC=0.586 but dice dropped) |
| exp37 | 0.6088 | discard | STN-only grad clip 1.0 (within noise) |
| exp38 | 0.0003 | discard | grad clip 5.0 all params (model collapsed) |
| exp39 | 0.5848 | discard | clip_grad_value 0.5 (within noise) |
| exp40 | 0.5624 | discard | grad clip scaled 100 (no effect) |
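Reading a ledger like this by hand doesn't scale past a few nights. The morning debrief can be a few lines of stdlib Python; the TSV column names here are an assumption mirroring the table above, not the repo's guaranteed layout:

```python
import csv
from io import StringIO


def summarize(tsv_text: str):
    """Summarize a results.tsv ledger: best kept run plus outcome counts.

    Assumes tab-separated columns named commit, val_score, result, notes,
    matching the table in this post."""
    rows = list(csv.DictReader(StringIO(tsv_text), delimiter="\t"))
    counts: dict[str, int] = {}
    scored = []
    for row in rows:
        counts[row["result"]] = counts.get(row["result"], 0) + 1
        try:
            scored.append((float(row["val_score"]), row))
        except ValueError:
            pass  # crashed runs have no numeric score
    best_score, best_row = max(
        (s for s in scored if s[1]["result"] == "keep"), key=lambda s: s[0]
    )
    return best_row["commit"], best_score, counts
```

Run against the full overnight ledger, this is the "clear winner and clear map" in three return values: which commit won, what it scored, and how many experiments ended in keep, discard, or crash.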
The results tell a clear story. The baseline sat at 0.503. Switching to pure joint training jumped it to 0.614, the single biggest gain of the night. A 2x boost to stage 0 loss weights squeezed out another point, to 0.623. After that, 30+ experiments probing every remaining variable (learning rates, schedulers, STN constraints, gradient clipping, loss functions, architectural modifications) found nothing better. A rerun of the best code landed at 0.5915, so run-to-run noise spans roughly 0.59–0.62 and the 0.623 peak should be read with that spread in mind. The model appears to be near a local optimum for this architecture at this training budget.
The open problem is balancing the G-code alignment metric (PBQC) against top-layer segmentation quality (Dice). One experiment, exp36's gradient clipping, pushed PBQC to 0.586 but at the cost of Dice. Finding a configuration where both are high is the next research direction, and the agent flagged it before I'd even had coffee.
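One way to push the loop in that direction is to change what it optimizes: replace the single val_score with a combined objective that is only high when both metrics are. A harmonic mean is the obvious candidate, since it punishes collapse in either metric. This is a hypothetical selection criterion, not the metric the overnight runs used:

```python
def combined_score(pbqc: float, dice: float, eps: float = 1e-8) -> float:
    """Harmonic mean of PBQC and Dice.

    High only when BOTH metrics are high; a collapse in either one
    drags the score toward zero, unlike a plain average. eps guards
    against division by zero when both metrics collapse."""
    return 2 * pbqc * dice / (pbqc + dice + eps)
```

Under this rule, exp36's PBQC=0.586 would only have been kept if Dice held up too, which is exactly the trade-off the ledger shows the single-number objective failing to capture.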
Science While You Sleep
This morning I saw what worked, what didn't, and what it tried. The real question isn't whether the model improved overnight. It's whether automated hypothesis-test-revert is a legitimate form of research.
I think it is. The hypothesis is still grounded in domain knowledge. I wrote the architecture, the curriculum, the evaluation harness. The agent is running the experiments I don't have the hours to run myself, applying the keep-or-revert judgment Karpathy's workflow encodes into program.md. That's collaboration, not replacement.
Science while you sleep.
What's Missing: Domain Knowledge
Looking at the ledger, something stands out. The agent is fluent in PyTorch — it knows learning rates, schedulers, gradient clipping, loss functions. What it doesn't know is why PBQC keeps collapsing, what the STN is physically doing to the G-code alignment, or what "good" looks like in the context of additive manufacturing defects. It's optimizing a number without understanding what the number means.
The next iteration needs domain knowledge in the loop. One option: give it access to the web so it can research spatial transformer networks, curriculum learning for segmentation, and related literature before proposing hypotheses. The other option, probably the better one, is to feed it my doctoral dissertation as a grounding document. Five years of domain-specific reasoning about G-code alignment, print quality classification, and defect morphology, turned into context the agent can actually draw on when deciding what to try next.
Right now the agent is a fast, tireless lab assistant who has never seen a 3D printer. Giving it the dissertation would be like giving it the textbook. The experiments would get smarter. Section 3.3.4 alone — the training schedule I spent months and more than a few sleepless nights developing — would give it context that no amount of PyTorch documentation can substitute for.