Discussion
Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning
jeffbee: Showing the murder dog reading a gauge using $$$ worth of model time is kinda not an amazing demo. We already know how to read gauges with machine vision. We also know how to order digital gauges out of industrial catalogs for under $50.
gallerdude: I’ve been thinking about AI robotics lately… if internally at the labs they have a GPT-2 or GPT-3 “equivalent” for robotics, you can’t really release that. If a robot unloading your dishwasher breaks one of your dishes even once, that’s a massive failure. So there might be awesome progress behind the scenes, just not ready for the general public.
monkeydust: I ended up watching Bicentennial Man (1999) with Robin Williams over the weekend. If you haven't seen it, I thought it was a good and timely thing to watch, and it's kid friendly. Without giving away the plot, the scene where it was unloading the dishwasher... take my money!
skybrian: Pointing a camera at a pressure gauge and recording a graph is something that I would have found useful and have thought about writing. Does software like that exist that’s available to consumers?
gunalx: Look into OpenCV.
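To make the suggestion concrete: once a CV pipeline (e.g. OpenCV circle and line detection) has located the dial center and the needle tip, turning that into a logged value is just angle arithmetic. A minimal sketch of that last step, with a hypothetical convention of angles measured clockwise from 12 o'clock (the function name, arguments, and scale limits are illustrative, not from any particular library):

```python
import math

def gauge_reading(center, needle_tip, min_angle, max_angle, min_value, max_value):
    """Map a detected needle position to a gauge value.

    Assumes the dial center and needle tip pixel coordinates were already
    found by computer vision; this is only the angle-to-value mapping.
    Angles are in degrees, clockwise from the 12 o'clock position.
    """
    dx = needle_tip[0] - center[0]
    dy = needle_tip[1] - center[1]
    # Image y grows downward, so atan2(dx, -dy) gives a clockwise
    # angle measured from straight up.
    angle = math.degrees(math.atan2(dx, -dy)) % 360
    frac = (angle - min_angle) / (max_angle - min_angle)
    return min_value + frac * (max_value - min_value)

# Needle pointing at 3 o'clock (90 degrees) on a 0-100 gauge whose
# scale sweeps from 45 to 315 degrees.
print(gauge_reading((100, 100), (150, 100), 45, 315, 0, 100))
```

Sample the camera on a timer, append each reading with a timestamp, and the "record a graph" part is a plot over that log.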
sho_hn: It does all start to feel like we'd get fairly close to being able to convincingly emulate a lot of human, or at least animal, behavior on top of the existing generative stack by using brain-like orchestration patterns ... if only inference was fast enough to do much more of it.

The gauge-reading example here is great, but in reality, of course, having the system synthesize that Python script, run the CV tasks, come back with the answer, etc. is currently quite slow.

Once things go much faster, you can also start to use image generation to have models extrapolate possible futures from photos they take, then describe those futures back to themselves and make decisions based on that, loops like this. I think the assumption is that our brains do similar things unconsciously, before the results are integrated into our conscious conception of mind.

I'm really curious what things we could build if we had 100x or 1000x inference throughput.
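The orchestration pattern described here is essentially an observe → infer → act loop where the model call dominates wall-clock time. A toy sketch under that assumption, with every component stubbed out (none of these names come from any real robotics API):

```python
def run_agent_loop(observe, infer, act, goal, max_steps=10):
    """Minimal observe -> infer -> act loop.

    `infer` stands in for the slow model call (script synthesis,
    future extrapolation, etc.); in a real system it is the
    throughput bottleneck the comment is about.
    """
    history = []
    for _ in range(max_steps):
        obs = observe()
        decision = infer(goal, obs, history)
        history.append((obs, decision))
        if decision == "done":
            break
        act(decision)
    return history

# Toy stubs: "pump" until the observed pressure reaches the goal.
state = {"pressure": 0}
observe = lambda: state["pressure"]
infer = lambda goal, obs, hist: "done" if obs >= goal else "pump"

def act(decision):
    if decision == "pump":
        state["pressure"] += 10

steps = run_agent_loop(observe, infer, act, goal=30)
print(len(steps))
```

With 100x faster inference you could run many such loops concurrently, or nest them (one loop imagining futures, another acting on the best one), which is what the "extrapolate and describe back" idea amounts to.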
Kostic: Taalas showed that you could make LLMs faster by turning them into ASICs and get 10k+ token/s generation. It's a matter of time now.
readams: I think that where this gets interesting is when you can just drop these robotic systems into an environment that wasn't necessarily set up specifically to handle them. The $50 for your gauge isn't really the cost: it's engineering time to go through the whole environment and set it up so that the robotic system can deal with each of the specific tasks, each of which will require some bespoke setup.
spwa4: It's called "VLA" (vision-language-action) models: https://huggingface.co/models?pipeline_tag=robotics

VLA models essentially take a webcam screenshot + some text (think "put the red block in the right box") and output motor control instructions to achieve that.

Note: "Gemini Robotics-ER" is not a VLA, though Gemini does have a VLA model too: "Gemini Robotics".

A demo: https://www.youtube.com/watch?v=DeBLc2D6bvg
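Conceptually, the VLA contract described above is just (image, instruction) → action. A stub sketch of that interface, not Gemini's actual API; the class and function names, the 7-joint action shape, and the fixed return value are all illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    """One motor command: joint deltas plus gripper state (illustrative)."""
    joint_deltas: List[float]
    gripper_closed: bool

def vla_policy(image: List[List[int]], instruction: str) -> Action:
    """Stand-in for a VLA model's interface: pixels + text in, action out.

    A real VLA runs a vision-language backbone here; this stub just
    returns a fixed action to show the input/output contract.
    """
    assert instruction, "a language instruction is required"
    return Action(joint_deltas=[0.0] * 7, gripper_closed=False)

act = vla_policy([[0] * 64] * 64, "put the red block in the right box")
print(len(act.joint_deltas))
```

In practice the loop runs this policy at some control frequency, feeding each new camera frame back in, which is how the text instruction gets grounded in continuous motion.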