Casey Tran
← Blog · 4 min read

I Let Claude Research My Own Model While I Slept

AutoResearch · Claude Opus 4.6 · Agentic Engineering · ML · RTX 3060

Tonight I pointed Claude Code running Opus 4.6 at a research repo and went to bed. The task: autonomously optimize a defect segmentation model for additive manufacturing, the same class of work underlying GG-Net, my doctoral research on AI-driven quality control for 3D printing. The instructions were simple: edit the model, commit, run, read results, decide to keep or revert, repeat. Never stop. Don't ask me anything.

AutoResearch is Andrej Karpathy's example workflow for agentic engineering, captured in a single program.md. The scientific method, automated, running at 1 AM on a consumer GPU while I sleep. Roughly 40 experiments overnight. Some noise, some signal. Somewhere in that git log is a train.py nobody has read yet that might outperform the baseline I spent weeks hand-tuning.

The Loop

The loop is straightforward: edit train.py with a hypothesis, run uv run train.py > run.log 2>&1, wait for the run to complete, read run.log and extract val_score, log to results.tsv, repeat. Keep the commit if the score improved, revert if it didn't. Start over with a new hypothesis.

No human approval at any step. The permissions are set in .claude/settings.json to allow reading, editing, and running bash commands without interruption. The agent owns the branch. I own the morning debrief.

"NEVER STOP. Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep."

That's from Karpathy's program.md. Reading it at midnight before closing the laptop, it felt like something worth taking seriously. An actual delegation of scientific authority to a machine, with a clause acknowledging you won't be there to countermand it.

What Makes It Work

Two things. First, program.md explicitly tells the agent not to stop — that directive is what removes the default instinct to pause and confirm before each step. Second, a .claude/settings.json in the project root grants the permissions needed to act without prompting:

{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Edit(*)",
      "Write(*)",
      "Read(*)"
    ]
  }
}

That's it. No human in the loop, no confirmation dialogs. The agent can read any file, edit any file, write any file, and run any shell command — including training runs that take 20 minutes each. Combined with the "NEVER STOP" directive in program.md, those two levers are what turn a capable model into an autonomous researcher.

40 Experiments Later

Six hours later I woke up to this ledger. 40 experiments. 2 crashes. A clear winner and a clear map of what doesn't work:

| commit | val_score | result | notes |
|---|---|---|---|
| c983099 | 0.5034 | keep | baseline (720s, 19 epochs) |
| 61ce4a6 | 0.6145 | keep | pure joint training (stage 0 only) |
| 1e4d506 | 0.6232 | keep | 2x stage 0 loss weights |
| exp6 | 0.3981 | discard | 5x stage 0 loss weights (PBQC collapsed) |
| exp8 | 0.5643 | discard | STN LR 5e-4→1e-3 (worse) |
| exp9 | 0.5164 | discard | main LR 5e-5→2e-4 (worse) |
| exp10 | 0.2568 | discard | SSIM loss pred vs gcode (much worse) |
| exp11 | 0.4050 | discard | STN rotation 15→30deg (PBQC collapsed) |
| exp12 | 0.5500 | discard | STN reg 0.001→0.01 (worse) |
| exp13 | 0.5287 | discard | learned GCodeEncoder (worse) |
| exp14 | 0.4160 | discard | CosineAnnealingLR instead of WarmRestarts |
| exp15 | 0.5360 | discard | STN translation 2.0→0.5 (worse) |
| exp16 | 0.5545 | discard | remove Level 1 STN (worse) |
| exp17 | 0.6141 | discard | stn_weight_factor 0.3→1.0 (marginally worse) |
| exp18 | 0.5579 | discard | stn_weight_factor 0.3→0.5 (worse) |
| exp19 | 0.5470 | discard | gradient accumulation 4 steps (dice dropped) |
| exp20 | crash | — | replace cross-attn with attn gates (channel mismatch) |
| exp21 | 0.5349 | discard | rebalance alignment BCE/dice (worse) |
| exp22 | 0.4117 | discard | Dropout2d before final conv (PBQC collapsed) |
| exp23 | 0.5600 | discard | shift loss from seg to alignment (worse) |
| exp24 | 0.5674 | discard | early encoder skip to output (PBQC worse) |
| exp25 | 0.4179 | discard | pred-gcode consistency loss (PBQC collapsed) |
| exp26 | 0.4336 | discard | deeper STN localization (PBQC collapsed) |
| exp27 | 0.4202 | discard | tighter STN scale [0.4,1.2] (PBQC near zero) |
| exp28 | 0.4622 | discard | STN scheduler T_0=3→6 (worse) |
| exp29 | 0.5437 | discard | Focal Loss instead of BCE (worse) |
| exp30 | 0.5573 | discard | stn_weight=1.0 + boosted seg (worse) |
| exp31 | 0.5472 | discard | MONAI DiceCELoss (worse) |
| exp32 | 0.4451 | discard | 2ep STN warmup then joint (PBQC collapsed) |
| exp33 | 0.5473 | discard | main weight decay 5e-5→1e-5 (worse) |
| exp34 | 0.5822 | discard | stn_weight=1.0 retest (within noise) |
| rerun | 0.5915 | keep | variance check (same code as best) |
| exp35 | crash | — | EfficientNet-B0 (channel mismatch) |
| exp36 | 0.5653 | discard | grad clip 1.0 all params (PBQC=0.586 but dice dropped) |
| exp37 | 0.6088 | discard | STN-only grad clip 1.0 (within noise) |
| exp38 | 0.0003 | discard | grad clip 5.0 all params (model collapsed) |
| exp39 | 0.5848 | discard | clip_grad_value 0.5 (within noise) |
| exp40 | 0.5624 | discard | grad clip scaled 100 (no effect) |

The results tell a clear story. The baseline sat at 0.503. Switching to pure joint training jumped it to 0.614 — the single biggest gain of the night. A 2x boost to stage 0 loss weights squeezed out another point to 0.623. After that, 30+ experiments probing every remaining variable — learning rates, schedulers, STN constraints, gradient clipping, loss functions, architectural modifications — found nothing better. A variance rerun of the best configuration landed at 0.5915, so the true score likely sits somewhere in the ~0.59–0.62 band once run-to-run noise is accounted for. The model appears near a local optimum for this architecture at this training budget.
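The morning triage itself is scriptable. A small sketch of summarizing the ledger, assuming rows shaped like the table above (commit, val_score, result, notes) as parsed from results.tsv:

```python
from collections import Counter


def summarize(ledger: list[tuple[str, str, str, str]]) -> dict:
    """Tally keep/discard/crash outcomes and find the best-scoring commit.

    Each row is (commit, val_score, result, notes); crashed runs carry
    'crash' in the val_score column and are excluded from the ranking.
    """
    counts = Counter(row[2] for row in ledger)
    scored = [(float(row[1]), row[0]) for row in ledger if row[1] != "crash"]
    best_score, best_commit = max(scored)
    return {"counts": dict(counts), "best_commit": best_commit, "best_score": best_score}
```

Running this over the full ledger reproduces the headline numbers: 1e4d506 on top at 0.6232, with everything after it discarded or crashed.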

The open problem is balancing one metric for G-code alignment (PBQC) with another for top-layer segmentation quality (Dice). One experiment (exp36, gradient clipping) got PBQC to 0.586 but at the expense of Dice. Finding a configuration that keeps both high is the next research direction — and the agent flagged it before I'd even had coffee.
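One hypothetical way to encode that direction into the loop is to select on a combined metric rather than a single score. The harmonic mean is a natural candidate (my suggestion for illustration, not part of the original harness) because it rewards only configurations where both PBQC and Dice are high:

```python
def combined_score(pbqc: float, dice: float) -> float:
    """Harmonic mean of PBQC and Dice: high only when BOTH are high.

    A collapsed metric drags the combined score toward zero, unlike an
    arithmetic mean, which one strong metric can prop up on its own.
    """
    if pbqc <= 0 or dice <= 0:
        return 0.0
    return 2 * pbqc * dice / (pbqc + dice)
```

Under this criterion, an exp36-style run (strong PBQC, dropped Dice) scores well below a balanced run, so the keep-or-revert judgment would stop rewarding trades of one metric for the other.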

Science While You Sleep

This morning I saw what worked, what didn't, and what it tried. The real question isn't whether the model improved overnight. It's whether automated hypothesis-test-revert is a legitimate form of research.

I think it is. The hypothesis is still grounded in domain knowledge. I wrote the architecture, the curriculum, the evaluation harness. The agent is running the experiments I don't have the hours to run myself, applying the keep-or-revert judgment Karpathy's workflow encodes into program.md. That's collaboration, not replacement.

Science while you sleep.

What's Missing: Domain Knowledge

Looking at the ledger, something stands out. The agent is fluent in PyTorch — it knows learning rates, schedulers, gradient clipping, loss functions. What it doesn't know is why PBQC keeps collapsing, what the STN is physically doing to the G-code alignment, or what "good" looks like in the context of additive manufacturing defects. It's optimizing a number without understanding what the number means.

The next iteration needs domain knowledge in the loop. One option: give it access to the web so it can do its own research on spatial transformer networks, curriculum learning for segmentation, and related literature before proposing hypotheses. The other option — and probably the better one — is to feed it my thesis dissertation as a grounding document. Five years of domain-specific reasoning about G-code alignment, print quality classification, and defect morphology, turned into context the agent can actually draw on when deciding what to try next.

Right now the agent is a fast, tireless lab assistant who has never seen a 3D printer. Giving it the dissertation would be like giving it the textbook. The experiments would get smarter. Section 3.3.4 alone — the training schedule I spent months and more than a few sleepless nights developing — would give it context that no amount of PyTorch documentation can substitute for.

Casey · Huntsville, AL · AutoResearch · RTX 3060 · Claude Opus 4.6