https://github.com/openai/parameter-golf
UPDATE 16.04.2026 Parameter-Golf OpenAI Challenge:
Think about it—who hasn’t given up?
Yep, that’s right, Vanessa—who gets absolutely nothing out of this challenge except expenses, but whose curiosity and love of exploration, of figuring things out, unfortunately kills ALL sense of reason. So, what am I doing now? After 155 Python scripts and patches, I’m now in the process of training my own tokenizer (30GB).
And since I’m not one to settle for less: one tokenizer? Haha. No, three :-). One is already done, two are still in training.
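For anyone curious what “training a tokenizer” means mechanically: byte-pair encoding (BPE) repeatedly merges the most frequent adjacent symbol pair in the corpus. Below is a toy sketch of that merge loop. Real training would stream the 30 GB corpus through an optimized library; the corpus, merge count, and function name here are purely illustrative, not taken from my actual scripts.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: learn merge rules from a list of words."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
# First two merges on this toy corpus: ('l','o'), then ('lo','w')
```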

Update 08.04.2026 Parameter-Golf OpenAI Challenge:
It is with a heavy heart that I have decided to quit. Just last night, after 8 hours of pods (200 euros), I managed to get the BPB score down to 1.0899 at 14.6 MB using the Hadamard rotation and recurrence (virtual layers; thanks to Claude for coding the patch for this).
But I wasn’t able to bring anything truly innovative to the table. The recurrence model was already performing well (1st place on the leaderboard with 1.08), and my additional implementation of Hadamard and Frequency-Token_and_Weights didn’t provide any decisive advantage.
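For readers unfamiliar with the trick: a Hadamard rotation is an orthonormal transform that spreads outlier weight values evenly across coordinates before quantization, and it can be inverted exactly afterwards. This is a minimal NumPy sketch of the idea, not the actual patch; the recursive construction and matrix sizes are illustrative.

```python
import numpy as np

def hadamard(n):
    """Build an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])   # Sylvester construction
    return H / np.sqrt(n)                  # now H @ H.T == I

n = 8
H = hadamard(n)
W = np.random.randn(n, n)   # stand-in for a weight matrix

W_rot = H @ W        # rotate before quantization: outliers get smeared out
W_back = H.T @ W_rot  # the rotation is exactly invertible
```

Because the transform is orthonormal, the rotation costs nothing in model quality; only the quantization error in the rotated basis matters.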
Instead, I discovered pull requests on the OpenAI board for scores of 1.07 or lower, using C++ scripts to directly manipulate the GPU and the training process. I can’t compete with that.
I have to say that I’m very sad—unreasonably sad?
So, to everyone on the leaderboard, I bow to you. Yours, Vanessa
Update 07.04.2026 Parameter-Golf OpenAI Challenge:
After a total of 142 experiments, I have to admit that I’m stuck; I can’t seem to improve the score. My ray of hope, the “Sandwich Layer,” already beats the score on a single GPU, coming in at 14.5 MB, but with the required 8 GPUs I end up at about 30 MB and a BPB below 0.x. The requirement is 16 MB, though, and every combination of Sandwich and compression (12 variations in total) has been unsuccessful: 17 MB was the smallest compressed size, and that came with a BPB of 1.1.
What should I do? Give up or keep going?
| Technique | Status | Details |
|---|---|---|
| LeakyReLU² (Squared) | ✅ | `F.leaky_relu(…, negative_slope=0.5)` + `.square()` |
| Weight Tying | ✅ | `tie_embeddings = True` |
| GQA (Grouped Query Attention) | ✅ | 8 query heads, 4 KV heads |
| RoPE (Rotary Position Embeddings) | ✅ | `rope_base=10000, rope_dims=16` |
| EMA/SWA | ✅ | Stochastic Weight Averaging active |
| XSA | ✅ | Last 4 layers |
| GPTQ-lite | ✅ | Implemented |
| Frequency-Weighted Quantization | ✅ | Top 100 tokens → int8, rest → int6 |
| Muon Optimizer | ✅ | Parallel Muon for matrix weights |
| AdamW | ✅ | For embeddings and scalars |
| Small Vocab | ✅ | vocab_size=1024 (instead of 50k) |
| MSE Quantization Search | ✅ | Per-row grid search for the optimal clip |
| No LayerNorm | ✅ | RMSNorm instead |
| Almost No Bias | ✅ | Only 2 small gates have bias=True |
| Sandwich Layer | ✅ | |
| Prune & Sandwich | ✅ | |
| Muon-Step4 | ✅ | |
| Grokfast | ✅ | |
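The “Frequency-Weighted Quantization” row above can be sketched roughly like this. To be clear, this is my hedged illustration of the general idea (frequent tokens get more bits), not the repo’s actual code; the function name, the symmetric scheme, and the handling of scales are all assumptions.

```python
import numpy as np

def freq_weighted_quantize(emb, token_freq, top_k=100):
    """Quantize embedding rows: the top_k most frequent tokens get
    int8-style precision, the rest int6 (illustrative sketch)."""
    order = np.argsort(token_freq)[::-1]   # tokens, most frequent first
    quantized = np.empty_like(emb)
    for rank, tok in enumerate(order):
        bits = 8 if rank < top_k else 6
        levels = 2 ** (bits - 1) - 1        # symmetric integer range
        scale = np.abs(emb[tok]).max() / levels
        if scale == 0:
            scale = 1.0                     # all-zero row: any scale works
        q = np.round(emb[tok] / scale).clip(-levels, levels)
        quantized[tok] = q * scale          # store dequantized for inspection
    return quantized

emb = np.random.default_rng(0).standard_normal((10, 4))
q = freq_weighted_quantize(emb, token_freq=np.arange(10), top_k=3)
```

The payoff is that the rows you look up most often carry the least quantization error, while rare rows are squeezed harder to save bytes.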
UPDATE 01.04.2026 on OpenAI’s Parameter Golf Challenge 16MB LM
Another setback—or should I call it an insight?
- The idea was: Paired batches and Muon-Turbo:
-> On the simple pod (1x H100 GPU): A HUGE SUCCESS, significantly better than the baseline.
-> On the pod with 8x H100 GPUs: ABSOLUTELY NO EFFECT.
DISMAYING INSIGHT:
Paired batches and Muon-Turbo only beat the baseline by a few percent, and only in short runs (approx. 690 steps): the single-GPU baseline at 690 steps had a BPB of 2.217, and with the changes it was 2.14.
At 7,000 steps the benefit disappears entirely. Previously BPB = 1.11; with the changes, the BPB is a disappointing 1.14.
Update March 30, 2026 on OpenAI’s Parameter Golf Challenge 16MB LM
My current score across 6 seeds, at under 16 MB, is val_BPB: 1.120
The current leader, as of this date, has: 1.110
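Since the whole leaderboard revolves around this number: bits-per-byte is just cross-entropy loss rescaled from nats per token to bits per byte of validation text. A minimal sketch of the standard definition (the challenge’s actual evaluation script may differ in details):

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert a summed cross-entropy loss (in nats) over a
    validation set into bits-per-byte."""
    return total_loss_nats / (total_bytes * math.log(2))

# Sanity check: a model that assigns probability 1/2 to each of
# 1000 bytes has a total loss of 1000 * ln(2) nats,
# which is exactly 1.0 bits per byte.
bpb = bits_per_byte(1000 * math.log(2), 1000)
```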
I’m actually a little sad and frustrated. I tried to beat my own val_BPB score and, of course, to do better than the first-place entry.
After 10 hours and several experiments, I didn’t succeed; often the results (for cost reasons, on a single H100 GPU rather than 8 GPUs) were only marginally better, and sometimes even worse.
I really tried everything. Including tiering, clipping, Hessian, XSA layer adjustment… and much more. I removed TTT again because it wasn’t clear whether it was allowed or not, but that actually made my score worse rather than better.
I think that, as an individual, I’ve reached the limits of my knowledge—and perhaps even the limits of my abilities—and it depresses me that I don’t have anyone to tinker with this alongside me.
But I don’t want to give up. I want to outperform the current SOTA!
And what also makes me a little sad (though it’s understandable) is that OpenAI never reviewed my PR.
I DID IT
I stumbled upon OpenAI’s “Parameter Golf Challenge” quite by accident.
And yeah, I’m a bit of a megalomaniac… so I decided to give it a shot.
After all, I have a rough idea of how nuclear power plants work (something to do with nuclear fission and such), so of course I can build a tiny, handbag-sized mini-power plant 😉 -> yes, that’s exactly how it feels, and that’s exactly how competent I am at it: ZERO percent!
Current status:
Current phase: BPB 1.4657, model size 7.3 MB, trained on just 1 GPU instead of 8 -> so the model didn’t finish training, and compression then pushed the BPB up to a poor 2.1.
But I’m learning… and next time I’ll use 8 GPUs. And keep pursuing my idea…
UPDATE: 26.03.2026: I did it!
I’m so proud of myself right now. I started working on this project on March 25, 2026, at 10 p.m. (until midnight) and continued on March 26, 2026 (from 9:30 p.m. to 10:30 p.m.), and TODAY, on the FIRST RUN, it had a BPB of 1.12 and a file size of 15.8 MB.
I can’t believe it :-). The best from the leaderboard: 1.119 | 1.122 | 1.124 -> 1.123 (mine, in virtual third place)

And this is the leaderboard:

[leaderboard screenshot]