Discussion
celestialcheese: mix of promptfoo and ad-hoc Python scripts, with langfuse observability. Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.
kelseyfrog: Automated benchmarking. We were lucky enough to have PMs create a set of questions; we did a round of generation and labeled each response with a pass/fail annotation. From there we bootstrapped an AI-as-a-judge and approximately replicated the results. Then we plug in new models and change prompts and pipelines while still being able to approximate the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings. We're able to confidently make changes without accidentally breaking something else. Overall a win, but it can get costly if the iteration count is high.
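A minimal sketch of the calibration step described above, not kelseyfrog's actual setup: check how often a judge agrees with the human pass/fail labels, then reuse that judge to score a new model or prompt. All names here (`LabeledExample`, `judge_agreement`, `pass_rate`, `toy_judge`) are hypothetical, and the judge is a trivial stand-in where a real pipeline would call an LLM.

```python
from typing import Callable, List, NamedTuple

class LabeledExample(NamedTuple):
    question: str
    response: str
    human_pass: bool  # pass/fail annotation from the human labeling round

# A judge takes (question, response) and returns True for "pass".
Judge = Callable[[str, str], bool]

def judge_agreement(examples: List[LabeledExample], judge: Judge) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    matches = sum(judge(ex.question, ex.response) == ex.human_pass for ex in examples)
    return matches / len(examples)

def pass_rate(questions: List[str], generate: Callable[[str], str], judge: Judge) -> float:
    """Approximate pass rate of a new model/prompt/pipeline, as scored by the judge."""
    if not questions:
        raise ValueError("need at least one question")
    verdicts = [judge(q, generate(q)) for q in questions]
    return sum(verdicts) / len(verdicts)

# Assumption: a toy judge that passes any non-empty response.
# In practice this would wrap an LLM call with a grading prompt.
def toy_judge(question: str, response: str) -> bool:
    return bool(response.strip())

examples = [
    LabeledExample("What is 2+2?", "4", True),
    LabeledExample("Capital of France?", "", False),
    LabeledExample("Name a prime.", "9", False),  # judge will wrongly pass this one
]

agreement = judge_agreement(examples, toy_judge)
new_model_rate = pass_rate([ex.question for ex in examples], lambda q: "some answer", toy_judge)
```

If agreement with the human labels is too low, the judge prompt gets revised before it is trusted on new runs; the cost concern comes from `pass_rate` issuing one judge call per question per iteration.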
dkoy: Curious who’s used OpenAI Evals
raviisoccupied: I have been working on a web app called Beval - Simple evaluations for your AI product, which is meant to be a 'lay person' introduction to evals. In my day to day as a Product Manager on a team that ships AI products, I often found myself wanting to do 'quick and dirty' LLM-based evaluation on conversation transcripts and traces. I found myself blocked by 'Gemini in Google Sheets': it was too slow and cumbersome, and it didn't handle eval changes well. And because I was exploring, it wasn't helpful to try to set up something more robust with the team. To fix the problem I eventually learned to call the OpenAI API in Python, along with more sophisticated approaches like some listed here, but I really felt that I wanted a 'product' to help me and potentially help others. You can check it out at https://www.beval.space. Full disclosure: this is vibe coded and still a work in progress.
georgemcbay: In my experience the models are all so close in performance, and the deltas between them at different tasks change so often as each new model version is released, that spending significant effort to evaluate which is best in class for your task is kind of a fool's errand. Better to spend the time getting used to all of them and setting up systems that let you easily switch between them.
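One way to read "systems that let you easily switch between them" is a thin registry that hides each model behind the same call signature, so swapping models is a one-string change. This is only a sketch under that assumption: the model names and the `register`/`complete` helpers are hypothetical, and the backends are stubs where real provider SDK calls would go.

```python
from typing import Callable, Dict

# Every backend has the same shape: prompt in, completion text out.
ModelFn = Callable[[str], str]
_REGISTRY: Dict[str, ModelFn] = {}

def register(name: str) -> Callable[[ModelFn], ModelFn]:
    """Decorator that files a backend under a model name."""
    def decorator(fn: ModelFn) -> ModelFn:
        _REGISTRY[name] = fn
        return fn
    return decorator

@register("model-a")
def model_a(prompt: str) -> str:
    # Stub: a real backend would call one provider's SDK here.
    return f"[model-a] {prompt}"

@register("model-b")
def model_b(prompt: str) -> str:
    # Stub: a different provider behind the same signature.
    return f"[model-b] {prompt.upper()}"

def complete(prompt: str, model: str = "model-a") -> str:
    """Route a prompt to the named backend; callers never import a provider directly."""
    try:
        backend = _REGISTRY[model]
    except KeyError:
        raise ValueError(f"unknown model {model!r}; known: {sorted(_REGISTRY)}") from None
    return backend(prompt)

reply_a = complete("hello")
reply_b = complete("hello", model="model-b")
```

Because application code only ever calls `complete`, trying a newly released model means registering one more function and changing the `model` argument or a config value.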