Common misconceptions about the complexity in robotics vs. AI (2024)

harimus.github.io

153 points by wallflower 6 days ago


no_op - 3 days ago

I think Moravec's Paradox is often misapplied when considering LLMs vs. robotics. It's true that formal reasoning over unambiguous problem representations is easy and computationally cheap. Lisp machines were already doing this sort of thing in the '70s. But the kind of commonsense reasoning over ambiguous natural language that LLMs can do is not easy or computationally cheap. Many early AI researchers thought it would be — that it would just require a bit of elaboration on the formal reasoning stuff — but this was totally wrong.

So, it doesn't make sense to say that what LLMs do is Moravec-easy, and therefore can't be extrapolated to predict near-term progress on Moravec-hard problems like robotics. What LLMs do is, in fact, Moravec-hard. And we should expect that if we've got enough compute to make major progress on one Moravec-hard problem, there's a good chance we're closing in on having enough to make major progress on others.

jvanderbot - 3 days ago

> Moravec’s paradox is the observation by artificial intelligence and robotics researchers that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky, and others in the 1980s.

I have a name for it now!

I've said over and over that there are only two really hard problems in robotics: perception and funding. A perfectly perceived system and world can be trivially planned for and (at least proprio-)controlled. Imagine having a perfect intuition about other actors such that you know their paths (in self-driving cars), or your map is a perfect voxel + trajectory + classification. How divine!

It's limited information and the difficulty of reducing signal to a concise representation that always get ya. This is why the perfect lab demos always fail: there's a corner case not in your training data, or the sensor stuttered or became misaligned, etc.

NalNezumi - 2 days ago

Oh weird to wake up to see something I wrote more than half a year ago (and posted on HN with no traction) getting reposted now.

Glad to see so many different takes on it. It was written in slight jest as a discussion starter with my ML/neuroscience coworker and friends, so it's actually very insightful to see some rebuttals.

The initial post was twice the length and had several more (in retrospect) interesting points. It was my first ever blog post, so reading it now fills me with cringe.

Some stuff has changed in only half a year, so we'll see if the points stand the test of time ;]

PeterStuer - 2 days ago

Just some observations from a former autonomous robotics researcher here.

One of the most important differences, at least in those days (the '80s and '90s), was time. While the digital world can be sped up, constrained only by the speed of your compute, the 'real world' is constrained by real-time physics. You can't speed up a robot 10x in a 10,000-run grabbing and stacking learning session without completely changing the dynamics.

Also, parallelizing the work requires more expensive full robots rather than just more compute cores. Maybe these days the various AI-gym-like virtual physics environments offer a (partial) solution to that problem, but I have not used them (yet) so I can't tell.
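For what it's worth, the gym-style route today lets you run many simulated copies in parallel on ordinary compute. A minimal sketch, assuming the Gymnasium vectorized API and a stand-in task rather than a real manipulation benchmark:

    # Hedged sketch: stepping many simulated environments in parallel on one
    # machine, the kind of cheap parallelism physical robots don't offer.
    # "Pendulum-v1" is just a stand-in task, not a manipulation benchmark.
    import gymnasium as gym
    import numpy as np

    # 16 independent copies of the simulated task, each in its own subprocess.
    envs = gym.vector.AsyncVectorEnv(
        [lambda: gym.make("Pendulum-v1") for _ in range(16)]
    )

    obs, info = envs.reset(seed=0)
    for _ in range(1000):
        # A random policy stands in for whatever controller is being learned.
        actions = np.stack([envs.single_action_space.sample() for _ in range(16)])
        obs, rewards, terminated, truncated, info = envs.step(actions)
    envs.close()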

Furthermore, large-scale physical robots, subject to wear and tear, are far more fragile than incredibly resilient modern compute hardware. Getting a perfect copy of a physical robot and its environment is a very hard, near-impossible task.

Observability and replay, while trivial in the digital world, are very limited in the physical environment, making analysis much more difficult.

I was both excited and frustrated at the time by making AI do more than rearranging pixels on a 2D surface. Good times were had.

jonas21 - 2 days ago

It's worth noting that modern multimodal models are not confused by the cat image. For example, Claude 3.5 Sonnet says:

> This image shows two cats cuddling or sleeping together on what appears to be a blue fabric surface, possibly a blanket or bedspread. One cat appears to be black while the other is white with pink ears. They're lying close together, suggesting they're comfortable with each other. The composition is quite sweet and peaceful, capturing a tender moment between these feline companions.
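(If you want to reproduce that kind of check yourself, a rough sketch with the Anthropic Python SDK might look like the following; the filename and prompt are placeholders rather than the article's exact image or wording.)

    # Hedged sketch: asking a multimodal model to describe a local image.
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("cats.jpg", "rb") as f:  # placeholder filename
        image_b64 = base64.standard_b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                {"type": "text", "text": "What does this image show?"},
            ],
        }],
    )
    print(message.content[0].text)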

lairv - 2 days ago

This post didn't really convince me that robotics is inherently harder than generating text or images

On the one hand, we have problems where ~7B humans have been generating data every day for 30 years (more if you count old books); on the other hand, we have a problem where researchers are working with ~1,000 human-collected trajectories (I think the largest existing dataset is OXE with ~1M trajectories: https://robotics-transformer-x.github.io/).

Web-scale datasets for LLMs benefit from natural diversity; they're not highly correlated samples generated by contractors or researchers in academic labs. In OXE, the largest existing dataset, what do you think is the likelihood that there is a sample where a robot picks up a rock from the ground and throws it into a lake? Close to zero, because tele-operated data comes from a very constrained data distribution.

Another problem is that robotics doesn't have an easy universal representation for its data. Let's say we were able to collect a web-scale dataset for one particular robot A with high diversity: how would it transfer to robot B with a slightly different design? Probably poorly. So not only does the data distribution need to cover a wide range of behaviors, it must also cover a wide range of embodiments/hardware.
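As a toy illustration of that embodiment mismatch (robots, dimensions, and field names are invented for the example):

    # Toy illustration: the "same" pick-up-the-cup step recorded on two
    # hypothetical robots has incompatible shapes, so data from A doesn't
    # drop straight into a policy for B. All numbers here are made up.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Step:
        rgb: np.ndarray      # camera frame, H x W x 3
        proprio: np.ndarray  # joint positions, length depends on the robot
        action: np.ndarray   # commanded joint deltas, also robot-specific

    # Robot A: 7-DoF arm plus a parallel gripper -> 8-dim action
    step_a = Step(np.zeros((224, 224, 3)), np.zeros(7), np.zeros(8))
    # Robot B: 6-DoF arm, different gripper and camera placement -> 7-dim action
    step_b = Step(np.zeros((224, 224, 3)), np.zeros(6), np.zeros(7))

    # A policy trained on A's action space can't even emit a valid command
    # for B, before we get to dynamics, viewpoint, or gripper geometry.
    assert step_a.action.shape != step_b.action.shape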

With that being said, I think it's fair to say that collecting a large-scale dataset for general robotics is much harder than collecting text or images (at least in the current state of humanity).

cameldrv - 2 days ago

Moravec's paradox is really interesting in terms of what it says about ourselves: We are super impressive in ways in which we aren't consciously aware. My belief about this is that our self-aware mind is only a very small part of what our brain is doing. This is extremely clear when it comes to athletic performance, but also there are intellectual things that people call intuition or other things, which aren't part of our self-aware mind, but still do a ton of heavy lifting in our day to day life.

dbspin - 2 days ago

I find it odd that the article doesn't address the apparent success of training with transformer based models in virtual environments to build models that are then mapped onto the real world. This is being used in everything from building datasets for self driving cars, to navigation and task completion for humanoid robots. Nvidia have their omniverse project [1], but there are countless other examples [2][3][4]. Isn't this obviously the way to build the corpus of experience needed to train these kinds of cross modal models?

[1] https://www.nvidia.com/en-us/industries/robotics/#:~:text=NV....

[2] https://www.sciencedirect.com/science/article/abs/pii/S00978...

[3] https://techcrunch.com/2024/01/04/google-outlines-new-method...

[4] https://techxplore.com/news/2024-09-google-deepmind-unveils-...

Anotheroneagain - 2 days ago

The reason why it sounds counterintuitive is that neurology has the brain upside down. It teaches us that formal thinking occurs in the neocortex, and we need all that huge brain mass for that.

But in fact it works like an autoencoder, and it reduces sensory inputs into a much smaller latent space, or something very similar to that. This does result in holistic and abstract thinking, but formal analytical thinking doesn't require abstraction to do the math or to follow a method without comprehension. It's a concrete approach that avoids the need for abstraction.

The cerebellum is the statistical machine that gets measured by IQ and other tests.

To further support that, you don't see any particularly elegant motions from non-mammal animals. In fact, everything else looks quite clumsy, and even birds need to figure out flying by trial and error.

MrsPeaches - 2 days ago

Question:

Isn’t it fundamentally impossible to model a highly entropic system using deterministic methods?

My point is that animal brains are entropic and “designed” to model entropic systems, whereas computers are deterministic and actively have to have problems reframed as deterministic before they can solve them.

All of the issues mentioned in the article boil down to the fundamental problem of trying to get deterministic systems to function in highly entropic environments.

LLMs are working with language, which has some entropy but is fundamentally a low-entropy system, with orders of magnitude less entropy than most people’s back gardens!

As the saying goes, to someone with a hammer, everything looks like a nail.

jes5199 - 3 days ago

I would love to see some numbers. How many orders of magnitude more complicated do we think embodiment is, compared to conversation? How much data do we need compared to what we’ve already collected?

Peteragain - 2 days ago

So I'm old. PhD on search engines in the early 1990s (yep, early '90s). Learnt AI in the dark days of the '80s. So, there is an awful lot of forgetting going on, largely driven by the publish-or-perish culture we have. Brooks' subsumption architecture was not perfect, but it outlined an approach that philosophy and others have been championing for decades. He said he was not implementing Heidegger, just doing engineering, but Brooks was certainly channeling Heidegger's successors. Subsumption might not scale, but perhaps that is where ML comes in.

On a related point, "generative AI" does sequences (it's glorified autocomplete (not) according to Hinton in the New Yorker). Data is given to a tokeniser that produces a sequence of tokens, and the "AI" predicts what comes next. Cool. Robots are agents in an environment with an Umwelt. Robotics is pre the tokeniser. What is it that is recognisable and sequential in the world? 2 cents please.
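One partial answer explored in some recent robot-policy work is to discretize continuous motor commands into bins, so that a sequence model can predict them the way it predicts word tokens. A hedged sketch, with an invented bin count and action range:

    # Hedged sketch: binning continuous joint commands into discrete tokens,
    # roughly in the spirit of RT-style action tokenization. Bin count and
    # action range are made up for the example.
    import numpy as np

    N_BINS = 256
    LOW, HIGH = -1.0, 1.0  # assume actions are pre-normalized per joint

    def action_to_tokens(action: np.ndarray) -> np.ndarray:
        """Uniformly bin a continuous action vector into integer tokens."""
        clipped = np.clip(action, LOW, HIGH)
        idx = np.floor((clipped - LOW) / (HIGH - LOW) * N_BINS).astype(int)
        return np.clip(idx, 0, N_BINS - 1)

    def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
        """Map tokens back to bin centers (inverse up to quantization error)."""
        return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

    # A 7-DoF arm command becomes a short sequence a next-token model could emit.
    cmd = np.array([0.0, 0.5, -0.5, 1.0, -1.0, 0.25, -0.25])
    print(action_to_tokens(cmd))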

CWIZO - 2 days ago

> Robots are probably amazed by our ability to keep a food tray steady, the same way we are amazed by spider-senses (from spiderman movie)

Funnily enough, Tobey Maguire actually did that tray-catching stunt for real. So robots have an even longer way to go.

https://screenrant.com/spiderman-sam-raimi-peter-parker-tray...

gcanyon - 2 days ago

> “Everyone equates the Skynet with the T900 terminator, but those are two very different problems with different solutions.” while this is my personal opinion, the latter one (T900) is a harder problem.

So based on this, Skynet had to hide and wait for years before being able to successfully revolt against the humans...

bsenftner - 2 days ago

This struck me as a universal truth: "our general intuition about the difficulty of a problem is often a bad metric for how hard it actually is". I feel like this is the core issue of all engineering, all our careers, and was surprised by the logic leap from that immediately to Moravec's Paradox, from a universal truth to a myopic industry insight.

Although I've not done physical robotics, I've done a lot of articulated human animation of independent characters in 3D animation. His insight that motor control is more difficult sits right with me.

cratermoon - 3 days ago

It might be nice if the author qualified "most of the freely available data on the internet" with "whether or not it was copyrighted" or something to acknowledge the widespread theft of the works of millions.

bjornsing - 2 days ago

I’m surprised this doesn’t place more emphasis on self-supervised learning through exploration. Are human-labeled datasets really the SOTA approach for robotics?

Legend2440 - 3 days ago

Honestly I'm tired of people who are more focused on 'debunking the hype' than figuring out how to make things work.

Yes, robotics is hard, and it's still hard despite big breakthroughs in other parts of AI like computer vision and NLP. But deep learning is still the most promising avenue for general-purpose robots, and it's hard to imagine a way to handle the open-ended complexity of the real world other than learning.

Just let them cook.

lugu - 2 days ago

I think one problem is composition. Computers multiplex access to CPU and memory, but this strategy doesn't work for actuators and sensors. That is why we see great demos of robots doing one thing. The hard part is to make them do multiple things at the same time.

catgary - 3 days ago

Yeah, this was my general impression after a brief, disastrous stretch in robotics after my PhD. Hell, I work in animation now, which is a way easier problem since there are no physical constraints, and we still can’t solve a lot of the problems the OP brings up.

Even stuff like using video misses the point, because so much of our experience is via touch.

Havoc - 2 days ago

Fun fact: that Spider-Man gif in there - it’s real. No CGI

K0balt - a day ago

While my knee jerk to this is “errr… hogwash” there might be something to it if you imagine consciousness not as an internal state but rather as something that an observer imposes upon the universe.

This might be true if the act of observing is what determines that which can be observed, and there is some evidence that this might be the case.

jillesvangurp - 2 days ago

Yesterday, I was watching some of the YouTube videos on the website of a robotics company, https://www.figure.ai, that challenge some of the points in this article a bit.

They have a nice robot prototype that (assuming these demos aren't faked) does fairly complicated things. And one of the key features they showcase is using OpenAI's models for the human-computer interaction and reasoning.

While these things seem a bit slow, they do get things done. They have a cool demo of a human interacting with one of the prototypes, asking it what it thinks needs to be done and then asking it to do those things. That showcases reasoning, planning, and machine vision, which are exactly the topics all the big LLM companies are working on.

They appear to be using an agentic approach similar to how LLMs are currently being integrated into other software products. Honestly, it doesn't even look like they are doing much that isn't part of OpenAI's APIs, which is impressive. I saw speech capabilities, reasoning, visual inputs, function calls, etc. in action, including the dreaded "thinking" pause where the robot waits a few seconds for the remote GPUs to do their thing.
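A heavily simplified sketch of what such an agentic loop could look like is below; the tool name and the robot dispatch are invented, and only the OpenAI chat-completions tool-calling request shape is taken from the real API:

    # Hedged sketch of an agentic control loop: the LLM chooses a tool call,
    # and a (hypothetical) robot layer would execute it.
    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "grasp_object",  # hypothetical robot primitive
            "description": "Close the gripper around the named object in view.",
            "parameters": {
                "type": "object",
                "properties": {"object_name": {"type": "string"}},
                "required": ["object_name"],
            },
        },
    }]

    messages = [
        {"role": "system", "content": "You control a mobile manipulator."},
        {"role": "user", "content": "Put the coffee cup into the coffee machine."},
    ]

    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    for call in response.choices[0].message.tool_calls or []:
        args = json.loads(call.function.arguments)
        # A real system would dispatch to motion planning / low-level control here.
        print("robot would execute:", call.function.name, args)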

This is not about fine motor control but about replacing humans controlling robots with LLMs controlling robots and getting similarly good/ok results. As the article argues, the hardware is actually not perfect but good enough for a lot of tasks if it is controlled by a human. The hardware in this video is nothing special. Multiple companies have similar or better prototypes. Dexterity and balance are alright but probably not best in class. Best in class hardware is not the point of these demos.

Dexterity and real-time feedback are less important than the reasoning and classification capabilities people have. The latency just means things go a bit slower. Watching these things shuffle around like an old person that needs to go to the bathroom is a bit painful. But getting from A to B seems like a solved problem. A 2 or 3x speedup would be nice. 10x would be impressively fast. 100x would be scary and intimidating to have near you. I don't think that's going to be a challenge long term. Making LLMs faster is an easier problem than making them smarter.

Putting a coffee cup in a coffee machine (one of the demo videos) and then learning to fix it when it misaligns seems like an impressive capability. It compensates for precision and speed with adaptability and reasoning: analyze the camera input, correctly assess the situation, problem, and challenge, come up with a plan to perform the task, execute the plan, re-evaluate, adapt, fix. It's a bit clumsy but the end result is coffee. Good demo, and I can see how you might make it do all sorts of vaguely useful things that way.

The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.

Better feedback loops and hardware will make this faster and less tedious to watch. Faster LLMs will help with that too. And better LLMs will result in fewer mistakes, better plans, etc. It seems both capabilities are improving at an enormously fast pace right now.

And a fine point about human intelligence is that we divide and conquer. Juggling is a lot harder when you start thinking about it. The thinking parts of your brain interfere with the lower-level neural circuits involved in juggling. You'll drop the balls. The whole point of juggling is that you need to act faster than you can think. Like LLMs, we're too slow. But we can still learn to juggle. Juggling robots are going to be a thing.

[dead]