Two weeks of fine-tuning VLAs in sim: what worked, what broke

I spent a couple of weeks fine-tuning VLAs to pick colored books off a shelf in MuJoCo, one 5070 Ti, no cloud. SmolVLA on the simple scene peaked at 60% canonical. Shuffle the books at eval and the same model dropped to 17.5%. Replaying the eval RNG, the model was 14/26 when the target landed in its canonical slot and 0/54 when it didn’t. It had memorized which slot was which. The textbook fix made it worse on both. The harder scene went 0/10.

The setup

The simple scene I used for most of this is small: MuJoCo Franka Panda, a single-shelf rack with 4 colored books (red, green, blue, yellow at fixed y-positions), an open carton in front. A hand-tuned damped-least-squares IK pick-and-place state machine generates demonstrations at 20 Hz with two cameras (base + wrist) at 224×224 RGB. Per-episode randomization on book positions (±5 mm), carton (±20 mm), base camera (±30 mm xyz), light intensity (±20%), and 12 instruction templates like “pick the red book and place it in the box”.
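A minimal sketch of the per-episode randomization described above. The function and field names are hypothetical, uniform sampling is an assumption (the post gives only the ranges), and the jitter axes for books and carton are guesses. Only one of the 12 instruction templates is quoted in the text, so the rest are left out rather than invented.

```python
import random
from typing import Dict, List

COLORS = ["red", "green", "blue", "yellow"]
# Only this template is quoted in the post; the other 11 are not reproduced here.
TEMPLATES = ["pick the {color} book and place it in the box"]

def sample_episode_config(rng: random.Random) -> Dict:
    """Draw one episode's randomization. Ranges are from the post; uniform sampling is assumed."""
    def jitter(half_range_m: float, n: int) -> List[float]:
        return [rng.uniform(-half_range_m, half_range_m) for _ in range(n)]

    color = rng.choice(COLORS)
    return {
        "target_color": color,
        "book_offsets_m": {c: jitter(0.005, 1)[0] for c in COLORS},  # +/-5 mm per book (axis assumed)
        "carton_offset_m": jitter(0.020, 2),                         # +/-20 mm (x, y assumed)
        "base_cam_offset_m": jitter(0.030, 3),                       # +/-30 mm in x, y, z
        "light_scale": 1.0 + rng.uniform(-0.20, 0.20),               # +/-20% intensity
        "instruction": rng.choice(TEMPLATES).format(color=color),
    }

cfg = sample_episode_config(random.Random(0))
print(cfg["instruction"], cfg["light_scale"])
```

Passing the `random.Random` instance in explicitly keeps each episode's draw reproducible from a seed.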

Each demo is a per-tick stream of (state, action, per-camera RGB, instruction string) aligned to the 20 Hz step, packed into the LeRobot dataset format. The scripted controller is what the policy learns from. It runs the actual pick-and-place, and the policy imitates. Demos are filtered to successes only (~62% of attempts succeed at ±15 mm jitter).
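One way to picture the per-tick record and the success filter. This is a sketch with hypothetical class and field names. It is not the LeRobot dataset schema, only the shape of the data the paragraph describes.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Tick:
    state: List[float]        # proprioceptive state at this tick
    action: List[float]       # joint targets commanded at this tick
    images: Dict[str, bytes]  # camera name -> encoded 224x224 RGB frame
    instruction: str          # same string for every tick of the episode

@dataclass
class Episode:
    ticks: List[Tick]
    success: bool             # did the scripted controller finish the pick-and-place?

    @property
    def duration_s(self) -> float:
        return len(self.ticks) / 20.0  # 20 Hz control step

def keep_successes(episodes: List[Episode]) -> List[Episode]:
    """Drop failed scripted attempts so the policy only imitates successful trajectories."""
    return [ep for ep in episodes if ep.success]
```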

I wanted a single policy that maps (image, prompt) to joint targets.

Phase 1: ACT plateaus at 37.5%

I started with ACT (Action Chunking Transformer, ~52M params). ACT has no language pathway, so a single ACT on all 4 colors should fail by construction. It did. 0/40. The model picks something, just not the correct color.

The standard workaround is per-color specialists. Train four separate ACT models, parse the color from the prompt at eval, dispatch to the right one. This worked, sort of:

ACT setup             demos / color   success
ACT v1 specialists    50              37.5%
ACT v2 specialists    140             37.5%

Nearly tripling the demos per specialist bought me nothing. ACT had hit a ceiling on this task, and the ceiling turned out not to be data-bound. Reasonable result, well-known story.

The takeaway from phase 1 wasn’t “ACT is bad.” It was “I want one model that reads the prompt instead of four routed by string-matching.”
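For reference, the string-matching router for the four specialists can be sketched in a few lines. The names are hypothetical and the policies are stand-ins. In the real setup each entry would be a trained ACT model.

```python
import re
from typing import Callable, Dict, Optional

COLORS = ("red", "green", "blue", "yellow")

def parse_color(prompt: str) -> Optional[str]:
    """Return the first known color word in the prompt, or None."""
    for word in re.findall(r"[a-z]+", prompt.lower()):
        if word in COLORS:
            return word
    return None

def dispatch(prompt: str, specialists: Dict[str, Callable]) -> Callable:
    """Pick the per-color policy for this prompt; fail loudly if no color is found."""
    color = parse_color(prompt)
    if color is None:
        raise ValueError(f"no known color in prompt: {prompt!r}")
    return specialists[color]

# Stand-in policies that just report which specialist was chosen.
specialists = {c: (lambda obs, c=c: f"act_{c}") for c in COLORS}
policy = dispatch("pick the red book and place it in the box", specialists)
print(policy(None))  # act_red
```

All of the language understanding lives in `parse_color`. None of it is learned, which is the limitation the takeaway above points at.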

Phase 2: SmolVLA gets language to work

I swapped in SmolVLA-450M. It’s a 100M-param trainable action expert with a flow-matching head, bolted onto a frozen SigLIP + SmolLM2 backbone. Frozen vision is locked at the recipe level. The training set was the same 524-episode dataset I’d been using for ACT, just with the per-episode instruction string passed through.

First run, 10K steps, batch 4, ~0.42 epochs: 25%. Worse than ACT. The dataset was barely getting touched.

Second run, 50K steps (~2 epochs from the same 10K checkpoint): 50%. Better than the ACT plateau by 12.5 percentage points, and on harder eval distributions (12 prompt templates, 8 of which weren’t in the training set) it still scored 40%.
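The epoch figures can be sanity-checked with a little arithmetic. The dataset's frame count is not stated in the post, so the sketch below backs it out from the first run's "10K steps, batch 4, ~0.42 epochs". It also assumes the second run kept batch size 4. Both are inferences, not reported values.

```python
def epochs(steps: int, batch_size: int, dataset_frames: float) -> float:
    """Epochs = samples drawn / samples in the dataset."""
    return steps * batch_size / dataset_frames

# Inferred, not stated in the post: back out the dataset size from the first run.
frames = 10_000 * 4 / 0.42          # about 95,000 frames
frames_per_episode = frames / 524   # about 182 ticks, roughly 9 s at 20 Hz

print(round(frames), round(frames_per_episode))
print(round(epochs(50_000, 4, frames), 2))  # 2.1, consistent with "~2 epochs"
```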

The language head was real, not just memorized-template lookup. Eval successes included phrasings that never appeared in training.

The one clean failure mode was “the X book goes in the box”: 0/8 across all colors. The non-imperative grammar (book as syntactic subject) broke the language pathway entirely. Two other prompt variants showed the same pattern: “drop the blue book INTO the box” and “pick the yellow book TO DROP to the box” both got the model to hesitate and abort the trajectory. The failing prompts all break the canonical verb + the + color + book template in some way (INTO instead of in, pick X to drop instead of pick X or drop X, book as subject instead of object). The language head latches onto surface patterns, not abstract verb-object structure. I noted that as a known issue and moved on.

One more thing about this checkpoint. It could only pick one book at a time. Asked to pick all four colors in a row in the same scene, starting each pick from wherever the arm happened to be after the last one, it scored 1/4. Only the first pick worked. The rest timed out.

The model had no recovery behavior for the post-first-pick state. It had only seen episodes that start from the home pose with all four books still on the rack. I fixed this in Phase 3 with a trick that turned out to be embarrassingly important.

Phase 3: the dataset rebuild that peaked at 60%

Inspired by the 40-50% range, I rebuilt the dataset. 3 cameras (added an overhead view), 12 prompt templates baked in, 600 quality-scored demonstrations (the scripted controller occasionally produces partial successes, and I filtered those out). Same model architecture and recipe.

A quick look at the v3 training pipeline. The lighting, camera, and prompt variation all carry over to v3-final.

This clip is actually from an earlier iteration of the v3 dataset where I’d shuffled the books at training time (random color↔slot permutation per episode). That run never broke 15%. v3-final retrained on a non-shuffled version of the same scene, which is what hit 60%. The shuffle question is what comes back to bite me two sections from now.

The v3-final run peaked at 60% on an 80-episode eval at step 22,000. The 32K and beyond checkpoints regressed slightly. 22K was the global maximum.

The clip I’m still proud of came from this checkpoint. In an interactive demo with the arm auto-resetting to its home pose between picks, the model handled all four colors back-to-back in a single scene, recovering between picks, picking the next book from a rack state (3 books left, then 2, then 1) it never saw at training:

blue     OK   76 ticks   "pick up the blue book"
yellow   OK   99 ticks   "pick up the yellow book"
green    OK   67 ticks   "move the green book now"
red      OK   96 ticks   "drop the red book in the box"

The trick is the arm-reset between picks. Each prompt starts in the in-distribution start state SmolVLA was trained on. Every training episode begins with mj_resetDataKeyframe(model, data, 0). The model has literally never seen a non-home arm pose at t=0. After pick 1, the arm is wherever the policy left it. Carton-adjacent, gripper closed, joints in a configuration the policy has zero training data for. Pick 2 starts from there. The action expert is being asked to generate trajectories from a qpos distribution that lies entirely outside its training support. Without the reset, you regress to 1/4. With it, 4/4 falls out for free. The model has no recovery behavior for the post-first-pick distribution. It only ever needed one start, because nobody trained it for any other.
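The reset logic itself is tiny. Below is a sketch with hypothetical names and a toy "policy" that only succeeds when started from the home pose. It stands in for the real behavior and is not a simulation of it.

```python
from typing import Callable, List, Tuple

def run_sequence(
    prompts: List[str],
    run_pick: Callable[[str], bool],
    reset_arm: Callable[[], None],
    reset_between_picks: bool = True,
) -> List[Tuple[str, bool]]:
    """Run picks back-to-back, optionally returning the arm to its home pose before each one."""
    results = []
    for i, prompt in enumerate(prompts):
        if i == 0 or reset_between_picks:
            reset_arm()  # back to the only start state the policy was trained on
        results.append((prompt, run_pick(prompt)))
    return results

# Toy stand-ins (hypothetical): success depends only on starting from the home pose.
state = {"arm_at_home": False}

def reset_arm() -> None:
    state["arm_at_home"] = True

def run_pick(prompt: str) -> bool:
    ok = state["arm_at_home"]
    state["arm_at_home"] = False  # the arm ends up near the carton after any attempt
    return ok

prompts = ["pick up the blue book", "pick up the yellow book",
           "move the green book now", "drop the red book in the box"]
with_reset = run_sequence(prompts, run_pick, reset_arm, reset_between_picks=True)
without_reset = run_sequence(prompts, run_pick, reset_arm, reset_between_picks=False)
print(sum(ok for _, ok in with_reset), sum(ok for _, ok in without_reset))  # 4 1
```

In MuJoCo, `reset_arm` would need to restore only the arm's joints. Calling `mj_resetDataKeyframe(model, data, 0)` mid-sequence would also restore every other pose stored in the keyframe, picked books included if they are free bodies.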

Without that reset trick, I would have called this a partial failure. With it, it’s the only clip from the project I’d actually show someone.

The break: a different test, same model, half the success

So far, books had always sat in the same shelf-slot order at training AND at eval. Red was always on the far left, yellow on the far right, blue and green in between. The model saw the same color↔slot mapping every episode.

I flipped on position shuffle at eval. Random color↔slot permutation per episode. Same model, same prompts, only the visual scene changed.

60% → 17.5%. A clean 3.4× collapse.

I didn’t know whether 17.5% meant the model was guessing or something cleaner. To find out, I needed to know episode-by-episode where the target color landed under the shuffle. Eval was already deterministic (cuDNN seeded, RNG seed 42), so I could replay it and recover each episode’s slot permutation.

Cross-tabulating each episode’s recovered slot permutation against its OK/FAIL outcome gave a binary result:

                          OK    FAIL
target at canonical slot:  14     12      → 53.8% (≈ in-distribution)
target at non-canonical:    0     54      → 0.0%   ← never. not once.

Zero of 54 episodes succeeded when the target color was anywhere other than its training slot. The 17.5% headline decomposes exactly as 0.325 × 0.538 + 0.675 × 0.000 ≈ 0.175. The model had never used vision to find the color. It memorized “red prompt → reach for slot 0, blue prompt → slot 2, …” and used the vision tower only for arm-trajectory localization, not for object identification.
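The cross-tab and the decomposition can be reproduced from the counts in the table. This is a sketch under stated assumptions: the records below are synthetic stand-ins built to match those counts, not the real eval log, and the helper names are hypothetical. Recovering the real permutations depends on replaying the eval's own RNG calls, which is not shown.

```python
from collections import Counter
from typing import Iterable, List, Tuple

CANONICAL = ["red", "green", "blue", "yellow"]  # slots 0..3, the training-time order

def target_is_canonical(target: str, slots: List[str]) -> bool:
    """True if the shuffled rack still has the target color in its training slot."""
    return slots.index(target) == CANONICAL.index(target)

def cross_tab(episodes: Iterable[Tuple[str, List[str], bool]]) -> Counter:
    """Count (canonical?, success?) over (target, slot order, success) records."""
    return Counter((target_is_canonical(t, s), ok) for t, s, ok in episodes)

# Synthetic records matching the table's counts (not the real eval log).
shuffled = ["blue", "red", "green", "yellow"]  # red displaced from slot 0
episodes = (
    [("red", CANONICAL, True)] * 14
    + [("red", CANONICAL, False)] * 12
    + [("red", shuffled, False)] * 54
)
tab = cross_tab(episodes)
n = sum(tab.values())
n_canon = tab[(True, True)] + tab[(True, False)]
p_canon = n_canon / n                          # 26/80
p_ok_given_canon = tab[(True, True)] / n_canon  # 14/26
overall = p_canon * p_ok_given_canon + (1 - p_canon) * 0.0
print(round(p_canon, 3), round(p_ok_given_canon, 3), round(overall, 3))  # 0.325 0.538 0.175
```

The product collapses to 14/80, which is why the headline number decomposes with nothing left over.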

[Figure: same memorized lookup, two scenes. Canonical eval (training distribution): “pick the red book” → slot 0, and slot 0 holds red, 53.8% (14/26). Shuffled eval (eval-only distribution shift): “pick the red book” → slot 0 again, but slot 0 holds blue, 0% (0/54).]
Same memorized lookup, both scenes. Canonical eval, lands on red 53.8% of the time. Shuffled, lands on whatever sits at slot 0.

Here’s what that looks like. The prompt is “blue”, the actually-blue book has been shuffled to a different slot, and the canonical-blue slot now holds the red book. Watch where the arm goes:

It descends on the canonical-blue slot, grabs the red book that’s sitting there, drops it in the carton. Fails. Loops. Does it again.

This sharpens the Robust Skills, Brittle Grounding paper (Emukpere et al.). They reported a ~2% non-canonical “instruction-correct” floor under continuous Cartesian jitter, attributable to accidental near-canonical matches. With discrete permutations, the floor is cleanly 0%. This is shortcut learning: when two solutions fit the data and one is simpler, optimizers find the simpler one.

The failed mitigation

The literature-recommended fix for this kind of grounding collapse is counterfactual data augmentation, synthesizing demonstrations where the color↔slot mapping varies so the model can’t shortcut on slot-position.

I collected 1500 additional shuffled scripted demonstrations and retrained SmolVLA on the combined 600 canonical + 1500 shuffled dataset.
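A sketch of how the shuffled episodes' slot orders could be drawn, and what the combined mix looks like. A uniform random permutation per episode is an assumption, and the names are hypothetical.

```python
import random
from typing import List

CANONICAL = ["red", "green", "blue", "yellow"]  # training-time slot order, slots 0..3

def sample_slot_order(rng: random.Random) -> List[str]:
    """Draw a uniformly random color-to-slot permutation for one episode."""
    order = CANONICAL[:]
    rng.shuffle(order)
    return order

rng = random.Random(0)
orders = [sample_slot_order(rng) for _ in range(1500)]

# Under a uniform shuffle a given color still lands in its canonical slot about 1/4 of the time.
red_at_home = sum(o[0] == "red" for o in orders) / len(orders)
shuffled_share = 1500 / (600 + 1500)  # fraction of the combined dataset that is shuffled
print(round(red_at_home, 2), round(shuffled_share, 3))
```

So roughly 71% of the combined dataset contradicts the fixed color-to-slot lookup in most episodes, which is the pressure the augmentation was supposed to apply.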

eval condition   baseline   + aug
canonical        60.0%      8.8%
shuffled         17.5%      5.0%

Worse on both. Lose-lose.

Two things happened.

The frozen vision encoder cannot absorb the new visual diversity. SigLIP’s features for “red book at slot 0” and “red book at slot 2” are essentially the same vector to the action expert. Adding shuffled episodes doesn’t change what the encoder produces. It only changes what the expert tries to map those near-identical features to. The shuffled-data signal that’s supposed to break the position shortcut is invisible to the part of the network that would need to learn from it.

The 100M-parameter action expert only has so much capacity. The shuffled data forces it to abandon the memorized lookup, because that lookup no longer fits the data. But the frozen encoder hands the expert the same indiscriminate features as before, so there’s nothing to learn grounding from. The expert forgets the shortcut and gets nothing in return.

I spent a couple of days trying to rule out boring explanations. bf16 vs fp32 (matched the v2 recipe exactly), state schema (15-D vs 8-D, retrained from scratch), insufficient training (loss plateaued at 22K and per-checkpoint eval bounced 5-15% with no monotonic trend). None moved the number. The model didn’t break. It just got systematically worse at both distributions because the added data made the language→object pathway harder to learn, at this model size and trainable budget.

I independently corroborated the Brittle Grounding paper’s most important practical claim. Scaling demonstrations does not close this gap at the SmolVLA-LoRA budget. The architecture has a structural ceiling here.

Phase 4: a harder scene, 0/10

If the simple scene was hitting a structural ceiling, the obvious move was to either grow the model or simplify the task. I did neither. I made the task harder. Bigger rack (2 rows × 6 columns instead of 1×4), 4-cell carton instead of one carton bin, side-grasp instead of vacuum gripper, multi-cue prompts like “pick the red book in the bottom row, leftmost, and put it in the front-left cell.”

publisher_v4 scene, side camera (256×256, as the policy sees it)

240 demos covering 10 of 12 (book, cell) pairs (one pair held out for compositional eval), with the same scripted DLS-IK controller. Standard SmolVLA recipe:

bs                = 16
freeze_vision     = True   # frozen SigLIP
train_expert_only = True
chunk_size        = 50
lr_peak           = 1e-4
precision         = bf16

20K steps. Training loss dropped 0.604 → 0.056. Looked textbook.

Eval: 0/10.

I tried a warm-restart from the 20K checkpoint with a fresh optimizer and scheduler, so the LR would re-ramp back to peak and (in theory) knock the policy out of whatever local basin it had settled into. I evaluated every 5K steps. All four checkpoints (10K, 15K, 20K, 25K) scored zero successes over five or more eval episodes each. The decision gate triggered. I killed the run.
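What "re-ramp back to peak" means in terms of the schedule, as a sketch. The peak of 1e-4 is from the recipe above. The linear-warmup-plus-cosine shape, the warmup length, and the decay horizon are assumptions for illustration, not the recipe's actual scheduler settings.

```python
import math

def lr_at(step: int, lr_peak: float = 1e-4, warmup_steps: int = 1_000,
          total_steps: int = 20_000, lr_min: float = 0.0) -> float:
    """Linear warmup to lr_peak, then cosine decay to lr_min (shape and step counts assumed)."""
    if step < warmup_steps:
        return lr_peak * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return lr_min + 0.5 * (lr_peak - lr_min) * (1.0 + math.cos(math.pi * progress))

# End of the first run: the LR has decayed to (almost) nothing.
print(lr_at(20_000))
# Warm restart: weights come from the checkpoint, but the scheduler's step counter
# starts from 0 again, so the LR climbs back to the peak.
print(lr_at(0), lr_at(1_000))
```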

The diagnostic I should have run on day one was to take a recorded training episode, feed its exact prompt, frames, and proprioceptive state back through the trained policy, and compare the predicted action chunk to the recorded ground-truth chunk. The model had seen this exact sequence hundreds of times.

max action error at t=0   : 0.12 rad
max action error at t=150 : 1.30 rad

The first time I saw those numbers I assumed I’d broken something in the eval harness. I re-ran the diagnostic on three different recorded episodes. Same shape every time.

At t=0 the policy is roughly correct (all 240 demos start from the same arm pose, so the average is also roughly correct). By t=150 it’s just predicting the dataset’s average trajectory, regardless of input. The per-demo signal is gone. The model couldn’t reproduce its own training data. On the training set. With the literal training inputs.
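A minimal sketch of the train-set replay check. It assumes the policy is wrapped as a function from one recorded observation to a predicted action chunk, and that "max action error" means the largest absolute per-joint difference over the chunk. Both are assumptions, and every name here is hypothetical. The toy data only shows the function running.

```python
from typing import Callable, Dict, List, Sequence

Chunk = List[List[float]]  # [chunk_step][joint], in radians

def max_action_error(pred: Chunk, recorded: Chunk) -> float:
    """Largest absolute per-joint difference between a predicted and a recorded chunk."""
    return max(abs(p - r) for ps, rs in zip(pred, recorded) for p, r in zip(ps, rs))

def replay_diagnostic(predict_chunk: Callable[[dict], Chunk], episode: Sequence[dict],
                      ticks: Sequence[int], chunk_size: int = 50) -> Dict[int, float]:
    """Feed recorded observations back in and compare with the recorded actions that followed."""
    errors = {}
    for t in ticks:
        recorded = [frame["action"] for frame in episode[t:t + chunk_size]]
        pred = predict_chunk(episode[t]["observation"])[:len(recorded)]
        errors[t] = max_action_error(pred, recorded)
    return errors

# Toy check (hypothetical data): the recorded demo moves twice as fast as the prediction.
episode = [{"observation": {"t": t}, "action": [0.002 * t]} for t in range(200)]
policy = lambda obs: [[0.001 * (obs["t"] + k)] for k in range(50)]
print(replay_diagnostic(policy, episode, ticks=[0, 150]))
```

A policy that has actually fit its training data should keep this error small at every tick of a training episode, not just at t=0.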

When this happens, you can stop debugging the eval distribution. The conditional structure between input and action never made it into the weights. Render quality, RNG, prompt templates, eval thresholds. All ruled out in one pass by running the model on its own training tape.

Three suspects

In rough order of how much I blame them.

1. The policy probably never learned to discriminate colors from frozen SigLIP’s features. This is the one I’d bet on.

In v1 the books always sat in the same shelf-slot order at training time. Red on the far left, yellow on the far right, blue and green in between. The simplest thing the model could learn was “red prompt → reach for slot 0, blue prompt → slot 2, …”, not “recognize red book in the camera image”. The shuffle eval that dropped v1 to 0/54 on non-canonical positions basically proved that’s what happened. The 60% on canonical eval wasn’t color discrimination, it was position memorization.

In v4 books also sit in fixed positions during training, so the same shortcut should work in principle. But v4’s prompts are denser. Multi-cue spatial descriptions, 170 unique strings versus v1’s 12 simple templates, and only ~1.4 demos per unique string. The model can’t memorize 170 different per-prompt position lookups with that little supervision. And frozen SigLIP, processing each book at roughly one 16-pixel patch, doesn’t give the action expert enough discriminative visual signal to fall back on. SigLIP was trained on natural image-text pairs from the web. Saturated solid-color rectangles 10 pixels wide aren’t anywhere in that distribution.

About “unfreezing SigLIP”. That’s not really a knob you turn inside the SmolVLA recipe. The paper bakes it in (Section 4.3, no ablation), the LeRobot config defaults to frozen, every public fine-tune I’ve read keeps it that way. So if frozen vision is the problem, the answer isn’t “I should have flipped a switch”. It’s “I should have used a different model”. π0.5 and MolmoAct2 both qualify.

2. chunk_size=50 lets flow-matching cheat. A 50-step chunk at 20 Hz is 2.5 s of action. The flow-matching loss is satisfied in expectation by predicting the mean trajectory across all conditional modes that share a near-identical start state. With chunk=10 the model has to commit to a specific direction much sooner. With chunk=50 the early steps of “go toward the shelf” cover most of the loss landscape and the discriminating motion gets averaged out.

3. ~1.4 demos per unique prompt. 12 books × 4 cells minus 2 known-failing combinations is 46 valid (book, cell) pairs. Multiply by 9 verb pairs (pick/grab/take × place/put/drop) and you get 414 possible prompt strings. The dataset actually contains 170 unique strings, so 240 demos / 170 unique ≈ 1.4 demos per string. Thin even if (1) and (2) weren’t true.
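The prompt-coverage arithmetic in suspect 3, spelled out. All counts are taken from the text. Only the variable names are new.

```python
from itertools import product

books, cells, known_failing = 12, 4, 2
valid_pairs = books * cells - known_failing            # 46

pick_verbs = ["pick", "grab", "take"]
place_verbs = ["place", "put", "drop"]
verb_pairs = list(product(pick_verbs, place_verbs))    # 9

possible_prompts = valid_pairs * len(verb_pairs)       # 414
demos, unique_strings = 240, 170
print(valid_pairs, len(verb_pairs), possible_prompts)  # 46 9 414
print(round(demos / unique_strings, 2))                # 1.41
print(round(unique_strings / possible_prompts, 2))     # 0.41
```

So the dataset touches about 41% of the possible prompt strings, and sees each of those fewer than one and a half times on average.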

The obvious next step that didn’t fit on the GPU

The natural progression from SmolVLA is π0.5 (openpi). It has a trainable vision encoder, which is the recipe-level fix for suspect 1. I tried it. The pilot OOM’d at 16 GB after rematerialization, needing 5.37 GB more than the 5070 Ti has. Even openpi’s smallest LoRA variant (gemma_2b_lora + gemma_300m_lora) is too large for this hardware. The training config is set up and ready. It just needs a different machine.

What I’m taking forward

Phase 3 (the simple-scene SmolVLA fine-tune with the 4-color sequential demo) was the high-water mark for this entire project. v4 wasn’t a recipe mistake. The recipe is the recipe. v4 was the wrong task for this recipe. SmolVLA + frozen vision at 240 demos was great for 4-color desk picking and ran out of room when I asked it to do 6 colors across 12 shelf positions and 4 carton cells. The deployable-target hypothesis (SmolVLA-class as the robok8s deployable for a publishing-house pick-and-pack demo) still stands. It just means I need either a model with more visual headroom or a task that fits inside SmolVLA’s headroom.

Concretely, the v5 plan is to rent an A100 or 4090 for a weekend and try π0.5 (the natural progression) or MolmoAct2 (smaller, different architecture, also has a trainable vision encoder). A different vision pipeline is where I think the grounding gap actually lives, but I haven’t tested that, and it’s worth saying out loud that I’m guessing.

What to take away

Fine-tuning only moves the action expert, not the VLM. Gradient signal reshapes ~100M params on top of features that come from a frozen encoder and never change. If those features don’t separate “red book” from “blue book” at the resolution you need, more data doesn’t fix it.

Position memorization is the predictable shortcut, not a bug. When color and slot are correlated at training time, there are two solutions that fit the data. Read the prompt and look at the scene, or read the prompt and reach for the slot. The second one bypasses vision entirely. Gradient descent finds the simpler one.

And if a fine-tune fails, run train-set replay first, not last. If the model can’t reproduce its own training actions from its own training inputs, you can stop debugging the eval distribution. I learned that the expensive way.
