Discussion
flask-gearQwen3.5 Fine-tuning Guide
antirez: Fine tuning is a story that is nice to tell but that with modern LLMs makes less and less sense. Modern LLMs are so powerful that they are able to few shot learn complicated things, so a strong prompt and augmenting the generation (given the massive context window of Qwen3.5, too) is usually the best option available. There are models for which fine tuning is great, like image models: there with LoRa you can get good results in many ways. And LLMs of the past, too: it made sense for certain use cases. But now, why? LLMs are already released after seeing (after pre-training) massive amount of datasets for SFT and then RL. Removing the censorship is much more efficiently done with other techniques. So I have a strong feeling that fine tuning will be every day less relevant, and already is quite irrelevant. This, again, in the specific case of LLMs. For other foundational models fine tuning still makes sense and is useful (images, text to speech, ...).
clueless: What are some sample real world cases folks are using to fine tune their own small/medium models?
danielhanchen: Oh I wrote up a post on X on this exact question! https://x.com/danielhanchen/status/1979389893165060345?s=201. Cursor used online RL to get +28% approval rate: https://cursor.com/blog/tab-rl2. Vercel used RFT for their AutoFix model for V0: https://vercel.com/blog/v0-composite-model-family3. Perplexity's Sonar for Deep Research Reasoning I think was a finetuned model: https://docs.perplexity.ai/docs/getting-started/overview4. Doordash uses LoRA, QLoRA for a "Generalized Attribute Extraction model" https://careersatdoordash.com/blog/unleashing-the-power-of-l...5. NASA flood water detection https://earthdata.nasa.gov/news/nasa-ibm- openly-release-geospatial-ai-foundation-model-nasa-earth-observation-data66. Online RL for robotics - imagine you teaching a robot in the future via some mini finetuning7. OpenAI's RFT page has more: https://developers.openai.com/api/docs/guides/rft-use-cases8. For larger models - https://www.mercor.com/blog/expert-data-drives-model-perform...
ranger_danger: where it makes sense IMO is when you need it to know about a large amount of information that's not already in the model, such as a company knowledgebase, code repositories or a trove of specialized legal documents... in that case it's not realistic to try to stuff the context window every time with that information, especially if you're trying to make a responsive chat bot.
dotancohen: Wouldn't a RAG make more sense for this use case?
antirez: With the current context windows and the ability those models did RL to work as agents, it's much faster and reliable for them to use tools and find the information before replying. Much better, no hallucinations problems (or a lot less), no fine tuning needed when information changes. I believe it is exactly in this case that fine tuning is no longer useful, and even in the past worked at very different degrees of quality.
danielhanchen: These are fair points considering LLMs are getting smarter and better every week - but to be fair the biggest benefits of finetuning / RL are still not yet realized:1. If we have robots at home, they need some sort of efficient continual learning, which could be on the go finetuning / RL via some small LoRA - this will need to do multimodal finetuning with sparse reward signals - one could also imagine all data is aggregated to one central processing center after anonymization, and training a larger model with more data + RL like that2. Agreed images, audio, video etc is what still LoRA does well - the guide at https://unsloth.ai/docs/models/qwen3.5/fine-tune is actually a vision + text finetuning guide, so you can finetune the vision layers on your own use case3. Distillation and model routing is going to be more efficient for all - ie locally smallish models with LoRA for continuous finetuning can be used, but complex tasks can be offloaded to a large LLM in the cloud.
prettyblocks: I think the biggest case for fine tuning is probably that you can take small models, fine tune them for applications that require structured output, and then run cheap inference at scale. "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.
butILoveLife: This is literally what I'm waiting for. I want a ~8B model that works well with OpenClaw.
canyon289: I work on Gemma and Gemini models I want to echo Daniel's point here. Small finetuned models have their place even with larger general purpose models.For example last year with Daniel/Unsloth's help we released a tiny specialized model that can get equivalent to Gemini level purpose specifically for FC. For folks that need efficient limited purpose models small models like this can fit a specific need.https://blog.google/innovation-and-ai/technology/developers-...Especially on device. https://developers.googleblog.com/on-device-function-calling...It's the same with chips, we have general purpose CPUs but we still have specialized silicon for tasks that are smaller, more power efficient, cheaper, and because they're single purpose it simplifies and derisks certain designs.And I have to add, if you want to learn about finetuning models efficiently the Unsloth guides are at the top of my list. They're practical, have all the technical details, and most importantly Daniel and the others are working around the clock to keep it up to date in what is an incredibly fast moving space of models and hardware. I am continually astounded by their work.
KronisLV: > But now, why?Because these models are good in general but their Latvian output is half-drivel, like the roots of the words are usually the right ones, but not the rest.That, and EuroLLM is really slow to release new models that would be similarly good off the shelf.
Me1000: Wouldn’t it be better to use a grammar in the token sampler? Tuning is fine, but doesn’t guarantee a syntactical correct structured output. But if the sampler is grammar aware it could.
prettyblocks: I don't think you will get that anytime soon because for a model to work well with something like openclaw it needs a massive context window.
butILoveLife: but but but but unified memory! (jk, I don't actually believe in Apple marketing words)There might be future optimizations. Like, have your small model do COT to find where to look for memory that is relevant.
joefourier: Fine-tuning still makes sense for cost/latency-sensitive applications. Massive context windows drastically slow down generation, and modern models' performance and instruction following ability relies heavily on a reasoning step that can consume orders of magnitude more tokens than the actual response (depending on the application), while a fine-tuned model can skip/significantly reduce that step.Using the large model to generate synthetic data offline with the techniques you mentioned, then fine-tuning the small model on it, is an underrated technique.
azath92: Only to prompt thought on this exact question, im interested in answers:I just ran a benchmark against haiku of a very simple document classification task that at the moment we farm out to haiku in parallel. very naive same prompt system via same api AWS bedrock, and can see that the a few of the 4b models are pretty good match, and could be easily run locally or just for cheap via a hosted provider. The "how much data and how much improvement" is a question i dont have a good intuition for anymore. I dont even have an order of magnitude guess on those two axis.Heres raw numbers to spark discussion:| Model | DocType% | Year% | Subject% | In $/MTok ||---------------|----------|-------|----------|-----------|| llama-70b -----| 83 | 98 | 96 | $0.72 || gpt-oss-20b --| 83 | 97 | 92 | $0.07 || ministral-14b -| 84 | 100 | 90 | $0.20 || gemma-4b ----| 75 | 93 | 91 | $0.04 || glm-flash-30b -| 83 | 93 | 90 | $0.07 || llama-1b ------| 47 | 90 | 58 | $0.10 |percents are doc type (categorical), year, and subject name match against haiku. just uses the first 4 pages.in the old world where these were my own in house models, id be interested in seeing if i could uplift those nubmers with traingin, but i haven't done that with the new LLMs in a while. keen to get even a finger to the air if possible.Can easily generate tens of thousands of examples.Might try myself, but always keen for an opinion._edit for table formatting_
airstrike: [delayed]
airstrike: [delayed]
airstrike: [delayed]
piyh: Qwen 9B doesn't?
butILoveLife: Nothing is really usable outside Opus.I've tried too. Wasted a few days trying out even high end paid models.
bravura: For me, trying to fine-tune a model to write "best day" prose I would accept over 80% of the time.You are correct if we are talking about knowledge.However it is bad at hyper-idiosyncratic, gritty style transfer.I first noticed the issue when asking claude code to draft email responses. The choice of register was off. ("Register in writing refers to the level of formality and tone chosen to suit a specific audience, purpose, and context.")I decided to talk all my HN comments and rewrite them in various bad LLM prose, and see if I could use DSPy to optimize a prompt using in-context-learning (ICL, I give it 10 examples of my HN comments) and the results were abysmal. RHLF fine-tuned frontier LLMs have a deep seated aversion to the target stylistic distribution of my comments.I tried fine-tuning qwen3, llama, and gemma models. Instruct models are already so tuned that they could not be tuned. This is using several hunded comments as gold targets and 5 different LLM degradations per gold as the input.
krasikra: Fine-tuned Qwen models run surprisingly well on NVIDIA Jetson hardware. We've deployed several 7B variants for edge AI tasks where latency matters more than raw accuracy – think industrial inspection, retail analytics where you can't rely on cloud connectivity. The key is LoRA fine-tuning keeps the model small enough to fit in unified memory while still hitting production-grade inference speeds. Biggest surprise was power efficiency; a Jetson Orin can run continuous inference at under 15W while a cloud round-trip burns way more energy at scale.
andai: Very interesting. Could you give examples of industrial tasks where lower accuracy is acceptable?
andai: | Model | DocType% | Year% | Subject% | In $/MTok | |----------------|----|-----|----|-------| | llama-70b -----| 83 | 98 | 96 | $0.72 | | gpt-oss-20b ---| 83 | 97 | 92 | $0.07 | | ministral-14b -| 84 | 100 | 90 | $0.20 | | gemma-4b ------| 75 | 93 | 91 | $0.04 | | glm-flash-30b -| 83 | 93 | 90 | $0.07 | | llama-1b ------| 47 | 90 | 58 | $0.10 |
embedding-shape: > where latency matters more than raw accuracy – think industrial inspectionHuh? Why would industrial inspection, in particular, benefit from lower latency in exchange for accuracy? Sounds a bit backwards, but maybe I'm missing something obvious.
someotherperson: At a very high level, think fruit sorting[0] where the conveyor belt doesn't stop rolling and you need to rapidly respond, and all the way through to monitoring for things like defects in silicon wafers and root causing it. Some of these issues aren't problematic on their own, but you can aggregate data over time to see if a particular machine, material or process within a factory is degrading over time. This might not be throughout the entire factory but isolated to a particular batch of material or a particular subsection within it. This is not a hypothetical example: this is an active use case.[0] https://www.youtube.com/watch?v=vxff_CnvPek
embedding-shape: But why would I want to results to be done faster but less reliable, vs slower and more reliable? Feels like the sort of thing you'd favor accuracy over speed, otherwise you're just degrading the quality control?
bigyabai: The high-nines of fruit organization are usually not worth running a 400 billion parameter model to catch the last 3 fruit.
0cf8612b2e1e: Local, offline system you control is worth a lot. Introducing an external dependency guarantees you will have downtime outside of your control.
throwaway6977: I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs. It needs to be as tiny as possible so it doesn't take resources away from the running game.
yw3410: How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?
sorenjan: But that's not something you'd use an LLM for. There have been computer vision systems sorting bad peas for more than a decade[0], of course there are plenty of use cases for very fast inspection systems. But when would you use an LLM for anything like that?[0] https://www.youtube.com/watch?v=eLDxXPziztw
_the_inflator: I agree.Also for certain use cases there are constraints like embedded hardware systems with no internet access. These LLMs have to be trained to specialize for clearly defined use cases under hardware constraints.Frontier LLMs also are rarely function in isolation instead are orchestrating a system of special units aka subsystems and agents.While costs and effort are one thing, being able to downsize these monster LLMs through finetuning itself in the first place is extremly valuable.
embedding-shape: Right, but that doesn't answer why you'd need a fast 7b LLM rather than a slightly less fast 14b LLM.
0xbadcafebee: [delayed]
0xbadcafebee: You would use a VLM (vision language model). The model analyzes the image and outputs text, along with general context, that can drive intelligent decisions. https://tryolabs.com/blog/llms-leveraging-computer-vision