Discussion
Language Modeling with Limited Data, Infinite Compute
suddenlybananas: Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.
sdpmas: hey, it's Samip (behind the Slowrun repo). yeah that's a fair point, we will mention them in the blog. but there are a couple of major differences: 1. our emphasis is on using more compute to get better data efficiency. this matters because there are lots of hacky changes that will get you lower loss but don't hold up against general methods that leverage a lot of compute. and you can already see how this emphasis on compute leads to different methods than BabyLM! 2. our motivation for the repo has nothing to do with how much data a child sees, and our dataset is not tailored towards that either. it's simple pretraining on a random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding those.
archermarks: Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset, i.e. leaning toward memorization instead of generalization? Obviously you hold out a validation set, but since you're meta-optimizing the model by its performance on that validation set, you're still at risk of over-fitting.
riajain2525: Super cool!
kseniamorph: Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
STARGA: The data-limited regime is where most of the interesting engineering happens. When you have infinite data, you can paper over bad architecture choices with more tokens. When data is fixed, every design decision — tokenizer vocabulary, attention pattern, positional encoding, regularization — has measurable impact on sample efficiency.

The ensemble approach is worth examining closely. In low-data regimes, model diversity matters more than individual model quality. If your 8 models converge to similar representations (which happens with identical architectures and similar init), the ensemble gain is minimal. The interesting question is whether architectural diversity (different attention patterns, different FFN ratios) gives better ensemble coverage than just different random seeds.

The aggressive regularization finding aligns with what we see in other domains. When your dataset is small, the model's capacity-to-data ratio is the dominant variable. Dropout, weight decay, and data augmentation are doing more work than the optimizer or learning rate schedule.
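To make the capacity-to-data point concrete, here's a rough sketch of the knobs in question (values are purely illustrative, not the Slowrun repo's actual settings):

    from dataclasses import dataclass

    @dataclass
    class LowDataConfig:
        # capacity-to-data knobs: these dominate when the token budget is fixed
        dropout: float = 0.3        # much higher than the ~0.0-0.1 typical for large-data pretraining
        weight_decay: float = 0.5   # aggressive decoupled weight decay
        epochs: int = 40            # many passes over the same tokens, reshuffled each epoch
        # knobs that matter comparatively less once data is fixed
        lr: float = 3e-4
        warmup_steps: int = 200

    print(LowDataConfig())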
refulgentis: This looks awesome!!! I’m curious on the ensemble: does it mean “train 8 different models and pick the best one”? That’s what my mind jumps to, but that also seems wrong, because I assume we could just keep increasing the number of different models you train to get a win.
bee_rider:
> Directions we think are wide open
> Second-order optimizers and natural gradient methods

Do second order optimizers help improve data efficiency? I assumed they'd help you get to the same minimum faster (but this is way outside my wheelhouse).
whimsicalism: really no shame in comments like these?
devinplatt: It seems like best etiquette would be to have a username with "bot" in it and include something in the post explicitly indicating it's a bot (e.g. a signature). This isn't even a new problem where a good cultural solution hasn't been figured out yet. Reddit has had bot etiquette for years.
linolevan: There was this very interesting paper out of Stanford this last September about pretraining under the unlimited compute but limited data paradigm [0]. Pretty much exactly the same thing but with ~200M training tokens instead.

[0] https://www.alphaxiv.org/abs/2509.14786
sdpmas: yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.
_0ffh: I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
sdpmas: diffusion is promising, but it's still an open question how data-efficient it is compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.
sdpmas: yes! typically the optimizer that trains faster also gets better data efficiency. it may not be absolutely true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
sdpmas: no, ensembling means: train 8 models, and during inference average the logits of all 8 models to make a prediction.
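a minimal sketch of the inference side, with toy stand-in models (not the actual repo code):

    import torch, torch.nn as nn

    @torch.no_grad()
    def ensemble_logits(models, input_ids):
        # each member maps token ids -> next-token logits over a shared vocab
        logits = torch.stack([m(input_ids) for m in models], dim=0)  # (n_models, batch, seq, vocab)
        return logits.mean(dim=0)  # average, then argmax/sample as usual

    # toy stand-ins for 8 independently trained models
    vocab, d = 50257, 64
    models = [nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab)) for _ in range(8)]
    next_token = ensemble_logits(models, torch.randint(0, vocab, (1, 16)))[:, -1].argmax(-1)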
_0ffh: Yes, it could go either way of course. Still, just for reference, here's the paper I remembered: https://arxiv.org/pdf/2507.15857
sdpmas: thanks, here's another one: https://arxiv.org/abs/2511.03276
alyxya: Fundamentally I don't believe second-order methods get better data efficiency by themselves, but changes to the optimizer can, because the convergence behavior changes. ML theory lags behind the results in practice.
shubhamintech: The ensemble diversity point is underrated. Most teams pick one architecture and ship it, so the finding that architectural variation beats random seeds is interesting but hard to act on in practice. The more useful takeaway: low-data regimes expose every bad design decision you normally paper over with more tokens. It's basically a forcing function for understanding what actually drives model quality vs. what's just scale noise.
vladf: That still looks like a “converge faster” paper.

https://arxiv.org/abs/2006.10732

The above provides a nuanced theoretical view. GD's inductive bias is probably better unless your model is misspecified.
jiggawatts: That doesn't seem all that different to a MoE architecture.
londons_explore: I think there will be good headway in using the part-trained model to generate more training data for itself: setting itself tasks, completing those tasks with many different approaches, evaluating which solution is best (using the same LLM as judge), and then differentially training on the best solutions vs the worst ones. The challenge is that such an approach almost certainly requires a model with RLHF post-training, but this needs to happen in the pre-training phase. With infinite compute, though, that isn't an issue - you simply do the post-training many times.
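Roughly the loop I have in mind, where every function is a placeholder stub standing in for the part-trained model, a sampler, and the LLM judge:

    import random

    # placeholder stubs -- a real version would call the part-trained model itself
    def propose_task(model):       return "some self-generated task prompt"
    def attempt(model, task):      return f"candidate solution {random.random():.3f}"
    def judge(model, task, sol):   return random.random()   # same LLM as judge
    def train_preference(model, task, best, worst): pass    # differential update on best vs worst

    def self_improvement_round(model, n_tasks=4, n_attempts=8):
        for _ in range(n_tasks):
            task = propose_task(model)
            sols = sorted((attempt(model, task) for _ in range(n_attempts)),
                          key=lambda s: judge(model, task, s))
            train_preference(model, task, best=sols[-1], worst=sols[0])

    self_improvement_round(model=None)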
rcarmo: This feels like optimizing for local minima, but more verbosely. Even the epoch shuffling doesn’t seem like it would get them out of that pitfall.
jbergqvist: Very interesting benchmark, excited to see what comes out of this. Considering humans are enormously more sample-efficient than today's models, it seems clear there's a lot of room to close that gap. The fact that they hit 5.5x in the first week with relatively straightforward changes suggests we're nowhere near the ceiling for data efficiency.
yorwba: It's the opposite of a MoE architecture in many ways. MoE splits every individual feed-forward layer into many tiny subnetworks, only a small number of which contribute to the layer output, and they get trained together to complement each other.Ensembling makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the output.Reducing computation vs. increasing it; operating at per-layer granularity vs. whole model; specialization vs. redundancy.
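To make the contrast concrete, a toy sparse-MoE feed-forward layer looks something like this (purely illustrative, not any particular implementation): each token is routed to a couple of small experts, so only a fraction of the layer's parameters fire per token, whereas an ensemble runs every full model and averages the outputs.

    import torch, torch.nn as nn

    class MoEFFN(nn.Module):
        def __init__(self, d_model=64, d_hidden=128, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):  # x: (tokens, d_model)
            weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    y = MoEFFN()(torch.randn(10, 64))  # only ~k/n_experts of the FFN parameters touch each token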
sdpmas: absolutely!
Mumps: I feel like you really need to mention BabyLM. For example you have:
> Directions we think are wide open ... Curriculum learning
BabyLM and its offshoots published a pretty convincing body of work on exactly that (which suggests it's not particularly relevant to LM training).

As I read your page, I really felt like the brevity-thoroughness tradeoff went the wrong way.
easygenes: This is very much in line with what I found fascinating about optimizing microgpt for speed (0). Or rather, what I was able to do with it after doing so. It's so small and so fast to train, you can really dig deep into the optimization landscape. I've spent all my free time this past week digging into it.0: https://entrpi.github.io/eemicrogpt/ (The writeup is from a few days ago, and I'm still running experiments before I do a big rewrite. Slowrun is good food for thought.)