I Let Claude Research My Own Model While I Slept
Tonight I pointed Claude Code running Opus 4.6 at a research repo and went to bed. The task: autonomously optimize a defect segmentation model for additive manufacturing, the same class of work underlying GG-Net, my doctoral research on AI-driven quality control for 3D printing. The instructions were simple: edit the model, commit, run, read results, decide to keep or revert, repeat. Never stop. Don't ask me anything.
AutoResearch is Andrej Karpathy's example workflow for agentic engineering, captured in a single program.md. The scientific method, automated, running at 1 AM on a consumer GPU while I sleep. Roughly 40 experiments overnight. Some noise, some signal. Somewhere in that git log is a train.py nobody has read yet that might outperform the baseline I spent weeks hand-tuning.
The Loop
The loop is straightforward: edit train.py with a hypothesis, run uv run train.py > run.log 2>&1, wait for the run to finish, read run.log and extract val_score, log the result to results.tsv, then keep the commit if the score improved or revert if it didn't. Start over with a new hypothesis.
No human approval at any step. The permissions are set in .claude/settings.json to allow reading, editing, and running bash commands without interruption. The agent owns the branch. I own the morning debrief.
"NEVER STOP. Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep."
That's from Karpathy's program.md. Reading it at midnight before closing the laptop, I took it seriously for what it is: an actual delegation of scientific authority to a machine, with a clause acknowledging you won't be there to countermand it.
What Makes It Work
Two things. First, program.md explicitly tells the agent not to stop — that directive is what removes the default instinct to pause and confirm before each step. Second, a .claude/settings.json in the project root grants the permissions needed to act without prompting:
```json
{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Edit(*)",
      "Write(*)",
      "Read(*)"
    ]
  }
}
```
That's it. No human in the loop, no confirmation dialogs. The agent can read any file, edit any file, write any file, and run any shell command — including training runs that take 20 minutes each. Combined with the "NEVER STOP" directive in program.md, those two levers are what turn a capable model into an autonomous researcher.
40 Experiments Later
Six hours later I woke up to this ledger. 40 experiments. 2 crashes. A clear winner and a clear map of what doesn't work:
| commit | val_score | result | notes |
|---|---|---|---|
| c983099 | 0.5034 | keep | baseline (720s, 19 epochs) |
| 61ce4a6 | 0.6145 | keep | pure joint training (stage 0 only) |
| 1e4d506 | 0.6232 | keep | 2x stage 0 loss weights |
| exp6 | 0.3981 | discard | 5x stage 0 loss weights (PBQC collapsed) |
| exp8 | 0.5643 | discard | STN LR 5e-4→1e-3 (worse) |
| exp9 | 0.5164 | discard | main LR 5e-5→2e-4 (worse) |
| exp10 | 0.2568 | discard | SSIM loss pred vs gcode (much worse) |
| exp11 | 0.4050 | discard | STN rotation 15→30deg (PBQC collapsed) |
| exp12 | 0.5500 | discard | STN reg 0.001→0.01 (worse) |
| exp13 | 0.5287 | discard | learned GCodeEncoder (worse) |
| exp14 | 0.4160 | discard | CosineAnnealingLR instead of WarmRestarts |
| exp15 | 0.5360 | discard | STN translation 2.0→0.5 (worse) |
| exp16 | 0.5545 | discard | remove Level 1 STN (worse) |
| exp17 | 0.6141 | discard | remove stn_weight_factor 0.3→1.0 (marginally worse) |
| exp18 | 0.5579 | discard | stn_weight_factor 0.3→0.5 (worse) |
| exp19 | 0.5470 | discard | gradient accumulation 4 steps (dice dropped) |
| exp20 | — | crash | replace cross-attn with attn gates (channel mismatch) |
| exp21 | 0.5349 | discard | rebalance alignment BCE/dice (worse) |
| exp22 | 0.4117 | discard | Dropout2d before final conv (PBQC collapsed) |
| exp23 | 0.5600 | discard | shift loss from seg to alignment (worse) |
| exp24 | 0.5674 | discard | early encoder skip to output (PBQC worse) |
| exp25 | 0.4179 | discard | pred-gcode consistency loss (PBQC collapsed) |
| exp26 | 0.4336 | discard | deeper STN localization (PBQC collapsed) |
| exp27 | 0.4202 | discard | tighter STN scale [0.4,1.2] (PBQC near zero) |
| exp28 | 0.4622 | discard | STN scheduler T_0=3→6 (worse) |
| exp29 | 0.5437 | discard | Focal Loss instead of BCE (worse) |
| exp30 | 0.5573 | discard | stn_weight=1.0 + boosted seg (worse) |
| exp31 | 0.5472 | discard | MONAI DiceCELoss (worse) |
| exp32 | 0.4451 | discard | 2ep STN warmup then joint (PBQC collapsed) |
| exp33 | 0.5473 | discard | main weight decay 5e-5→1e-5 (worse) |
| exp34 | 0.5822 | discard | stn_weight=1.0 retest (within noise) |
| rerun | 0.5915 | keep | variance check (same code as best) |
| exp35 | — | crash | EfficientNet-B0 (channel mismatch) |
| exp36 | 0.5653 | discard | grad clip 1.0 all params (PBQC=0.586 but dice dropped) |
| exp37 | 0.6088 | discard | STN-only grad clip 1.0 (within noise) |
| exp38 | 0.0003 | discard | grad clip 5.0 all params (model collapsed) |
| exp39 | 0.5848 | discard | clip_grad_value 0.5 (within noise) |
| exp40 | 0.5624 | discard | grad clip scaled 100 (no effect) |
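Reading a ledger like this by hand doesn't scale past a few nights. The morning debrief can be a few lines of stdlib Python; the TSV column names here are an assumption mirroring the table above, not the repo's guaranteed layout:

```python
import csv
from io import StringIO


def summarize(tsv_text: str):
    """Summarize a results.tsv ledger: best kept run plus outcome counts.

    Assumes tab-separated columns named commit, val_score, result, notes,
    matching the table in this post."""
    rows = list(csv.DictReader(StringIO(tsv_text), delimiter="\t"))
    counts: dict[str, int] = {}
    scored = []
    for row in rows:
        counts[row["result"]] = counts.get(row["result"], 0) + 1
        try:
            scored.append((float(row["val_score"]), row))
        except ValueError:
            pass  # crashed runs have no numeric score
    best_score, best_row = max(
        (s for s in scored if s[1]["result"] == "keep"), key=lambda s: s[0]
    )
    return best_row["commit"], best_score, counts
```

Run against the full overnight ledger, this is the "clear winner and clear map" in three return values: which commit won, what it scored, and how many experiments ended in keep, discard, or crash.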
The results tell a clear story. The baseline sat at 0.503. Switching to pure joint training jumped it to 0.614, the single biggest gain of the night. A 2x boost to stage 0 loss weights squeezed out another point, to 0.623. After that, 30+ experiments probing every remaining variable (learning rates, schedulers, STN constraints, gradient clipping, loss functions, architectural modifications) found nothing better. A rerun of the best code landed at 0.5915, so run-to-run noise spans roughly 0.59–0.62 and the 0.623 peak should be read with that spread in mind. The model appears to be near a local optimum for this architecture at this training budget.
The open problem is balancing the G-code alignment metric (PBQC) against top-layer segmentation quality (Dice). One experiment, exp36's gradient clipping, pushed PBQC to 0.586 but at the cost of Dice. Finding a configuration where both are high is the next research direction, and the agent flagged it before I'd even had coffee.
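One way to push the loop in that direction is to change what it optimizes: replace the single val_score with a combined objective that is only high when both metrics are. A harmonic mean is the obvious candidate, since it punishes collapse in either metric. This is a hypothetical selection criterion, not the metric the overnight runs used:

```python
def combined_score(pbqc: float, dice: float, eps: float = 1e-8) -> float:
    """Harmonic mean of PBQC and Dice.

    High only when BOTH metrics are high; a collapse in either one
    drags the score toward zero, unlike a plain average. eps guards
    against division by zero when both metrics collapse."""
    return 2 * pbqc * dice / (pbqc + dice + eps)
```

Under this rule, exp36's PBQC=0.586 would only have been kept if Dice held up too, which is exactly the trade-off the ledger shows the single-number objective failing to capture.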
Science While You Sleep
This morning I saw what worked, what didn't, and what it tried. The real question isn't whether the model improved overnight. It's whether automated hypothesis-test-revert is a legitimate form of research.
I think it is. The hypothesis is still grounded in domain knowledge. I wrote the architecture, the curriculum, the evaluation harness. The agent is running the experiments I don't have the hours to run myself, applying the keep-or-revert judgment Karpathy's workflow encodes into program.md. That's collaboration, not replacement.
Science while you sleep.
What's Missing: Domain Knowledge
Looking at the ledger, something stands out. The agent is fluent in PyTorch — it knows learning rates, schedulers, gradient clipping, loss functions. What it doesn't know is why PBQC keeps collapsing, what the STN is physically doing to the G-code alignment, or what "good" looks like in the context of additive manufacturing defects. It's optimizing a number without understanding what the number means.
The next iteration needs domain knowledge in the loop. One option: give it access to the web so it can research spatial transformer networks, curriculum learning for segmentation, and related literature before proposing hypotheses. The other option, probably the better one, is to feed it my doctoral dissertation as a grounding document. Five years of domain-specific reasoning about G-code alignment, print quality classification, and defect morphology, turned into context the agent can actually draw on when deciding what to try next.
Right now the agent is a fast, tireless lab assistant who has never seen a 3D printer. Giving it the dissertation would be like giving it the textbook. The experiments would get smarter. Section 3.3.4 alone — the training schedule I spent months and more than a few sleepless nights developing — would give it context that no amount of PyTorch documentation can substitute for.