Discussion
Introducing Cohere Transcribe: a new state-of-the-art in open-source speech recognition
geooff_: I can't say enough nice things about Cohere's services. I migrated over to their embedding model a few months ago for clip-style embeddings and it's been fantastic. It has the most crisp, steady P50 of any external service I've used in a long time.
dinakernel: My worry is that ASR will end up like OCR. If large multimodal AI systems are good enough (latency-wise), the advantage of domain understanding eats the other technologies alive. In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal models have lets them work out what the document actually meant: "this is going to be the order ID, because in the million invoices I have seen before, the order ID is normally below the order date", and so on. My worry is that the same thing is going to happen in ASR.
gruez: > Limitations > Timestamps/Speaker diarization. The model does not feature either of these. What a shame. Is WhisperX still the best choice if you want timestamps/diarization?
akreal: WhisperX is not a model but a software package built around Whisper and some other models, including diarization and alignment ones. Something similar will be built around the Cohere Transcribe model, maybe even just an integration into WhisperX itself.
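To make the "package built around a model" point concrete, here is a minimal sketch of one glue step such a wrapper needs: taking word timings (which would have to come from a separate alignment model, since Cohere Transcribe does not emit timestamps) and speaker turns (from a separate diarization model), and labeling each word with the speaker whose turn overlaps it most. The names `Word`, `Turn`, and `assign_speakers` are hypothetical and are not WhisperX's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds, assumed to come from an alignment model
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds, assumed to come from a diarization model
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it most.

    Words that overlap no turn at all get None as their speaker.
    """
    labeled = []
    for w in words:
        best_speaker = None
        best_overlap = 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_overlap = o
                best_speaker = t.speaker
        labeled.append((w.text, best_speaker))
    return labeled
```

The ASR model only has to supply the text. Everything that has to do with time and speakers lives in the wrapper, which is why the underlying model could in principle be swapped out.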
topazas: How hard could it be to train it on other European language(s)?
gunalx: If you have to ask, you don't really need the answer. Finding or creating training code doesn't seem too difficult. So you'd need a pretty decent amount of high-quality training data, probably many hours of it, plus a few hours of high-end data center GPU compute and many iterations to get it right.
bartman: Even in the commercial space, there’s a lack of production-grade ASR APIs that support diarization and word-level timestamps. My experiences with Google’s Chirp have been horrendous, with it sometimes skipping sections of speech entirely, hallucinating speech where the audio contains noise, and producing unreliable word-level timestamps. And all of this is even with their new audio prefiltering feature. AWS works slightly better, but also has trouble keeping word-level timestamps in sync. Whisper is nice but hallucinates regularly. OpenAI’s new transcription models deliver accurate output but do not support word-level timestamps… A lot of this could be worked around by sending the resulting transcripts through a few layers of post-processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.
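For a sense of what "a few layers of post-processing" can look like, here is a rough sketch of two such layers. It assumes the transcript arrives as a list of `(text, start, end)` tuples in spoken order, which is an illustrative format and not what any of the APIs above return. The function names and the `max_repeats` threshold are hypothetical, and treating long runs of the same word as hallucination is only a heuristic.

```python
from typing import List, Tuple

TimedWord = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def repair_timestamps(words: List[TimedWord]) -> List[TimedWord]:
    """Force word timings to be monotonic and non-negative in length.

    A word may not start before the previous word ended,
    and may not end before it starts.
    """
    repaired = []
    prev_end = 0.0
    for text, start, end in words:
        start = max(start, prev_end)
        end = max(end, start)
        repaired.append((text, start, end))
        prev_end = end
    return repaired


def collapse_repeats(words: List[TimedWord], max_repeats: int = 2) -> List[TimedWord]:
    """Drop words beyond max_repeats in a run of identical consecutive words.

    Heuristic only: genuine repetition in speech gets trimmed too.
    """
    kept: List[TimedWord] = []
    run = 0
    for word in words:
        if kept and word[0].lower() == kept[-1][0].lower():
            run += 1
        else:
            run = 1
        if run <= max_repeats:
            kept.append(word)
    return kept
```

Neither layer fixes skipped speech, and both can hide real problems in the upstream output, which is rather the point of the complaint above.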
teach: Dumb question, but if this is "open source" is there source code somewhere? Or does that term mean something different in the world of models that must be trained to be useful?
Doman: Files can be downloaded here: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026/... And someone has already converted it to ONNX format: https://huggingface.co/eschmidbauer/cohere-transcribe-03-202... - so it can be run on CPU instead of GPU.
harvey9: It includes several European languages.
stronglikedan: hence "other" lol
stronglikedan: I presume it means the model itself.