This piece is my attempt to collect my thoughts on the current landscape in AI and Robotics, how this informs our approach at mimic in our quest towards solving general-purpose robotic dexterity, and what I think the future of the field will look like. Are you excited about this as well? We are hiring.
Preface
It was mid-2021 and I had just begun my PhD at ETH Zurich when I came across a blog post titled “The Scaling Hypothesis”. I had just made the decision to move away from my previous theory-minded research on Bayesian Optimization and dive straight into deep learning, guided by a vague belief that “deep learning had started to finally work”. All around me, it seemed that we had already found the right primitives for general-purpose intelligence, even if those primitives were really just simple, “dumb” regular neural networks. “The Scaling Hypothesis”, however, suggested something more radical: there was a specific way of identifying the types of approaches to AI that eventually work, the state of the art in AI was going to improve dramatically very soon, and it was all going to happen in a predictable manner. And I definitely wanted to join in the fun. So that’s what I did.
What is the “Scaling Hypothesis” about? I would describe it as a kind of strategic awareness for what methods are bound to succeed in AI research. Mainly, it consists of these three points:
- General-purpose methods that leverage large scale computation and data are ultimately the most effective, and in the long run always beat ad-hoc, bespoke methods engineered for specific tasks with lots of baked-in human knowledge (Rich Sutton’s Bitter Lesson).
- “Generative AI” models such as (at the time of the post) GPT-3 are exactly the right type of scalable method: they consist of very simple neural network architectures, with almost no hint of special human expert-derived features, and can easily be scaled up to very large parameter counts and trained on extremely large datasets. Despite all of the extreme inefficiencies of a model like GPT-3, from low data quality to unnatural word tokenization, it was its scalability that made it relevant and a generational achievement.
- Plotting the “scaling laws” of a model like GPT-3, it’s possible to predict the performance of future models as they increase in parameter and dataset size. The scaling hypothesis is thus not just a post-hoc rationalization: it can be used predictively to inform technological development timelines and investment decisions on the road towards transformative general-purpose AI. And, in the years after the “Scaling Hypothesis” blog post, it has been used for exactly that, mostly with great success 1.
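To make this last point concrete, here is a minimal sketch of how such an extrapolation works: fit a power law to a handful of (compute, loss) points in log-log space, then read off the predicted loss at a larger compute budget. All numbers here are made up for the example and do not come from any real model.

```python
import math

# Hypothetical (compute in FLOPs, validation loss) points; illustrative only.
observed = [(1e18, 4.0), (1e19, 3.2), (1e20, 2.56), (1e21, 2.048)]

# Fit log L = log a - b * log C with ordinary least squares.
xs = [math.log(c) for c, _ in observed]
ys = [math.log(l) for _, l in observed]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
log_a = my + b * mx

def predicted_loss(compute):
    """Extrapolate the fitted power law L = a * C^(-b) to a larger budget."""
    return math.exp(log_a - b * math.log(compute))

print(predicted_loss(1e22))  # extrapolated loss at 10x the largest observed run
```

The striking empirical claim of scaling laws is that this naive log-log fit keeps holding over many orders of magnitude, which is what makes it usable for planning rather than just curve-fitting in hindsight.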
If you’re at all into AI in 2024, all of the above is very old news. “Scaling laws” have become an extremely widely used and misused term. But in 2021, the contents of the “Scaling Hypothesis” were absolutely not considered obvious. On the contrary, it was a position often ridiculed within academic and practitioner circles. Why, then, am I reiterating this “old” story? My thesis is that the field of robotics has not yet digested the implications of the Bitter Lesson, and this can give a massive edge to anyone who starts doing so now.
Scalability in Robotics and mimic
If you look at the state of the art of robotics research today (never mind industry!), you will quickly notice that most of it is still dominated by extremely ad-hoc methods attempting to solve one individual task with high enough precision and robustness. This is expected, as robotics is notoriously hard, even before you start factoring in AI: hardware has a difficult learning curve, reproducible experiments are almost impossible, upfront costs are not insignificant. In this environment, getting anything to work at all is already an achievement, and once you get it to work, the next step has to be a focus on reliability for it to be at all commercially viable. You are essentially forced by your circumstances to think small and have no ambition. For a person working at the bleeding edge of AI and accustomed to the fast pace of the field, watching everything around you constantly fall apart and stop working for a million reasons that have nothing to do with your actual AI model is demotivating.
Insofar as deep learning is involved in “actually existing” robotics, you will mostly see it used as an individual component of a complex system: for state estimation, object detection, or at most locomotion. Think of this overall robotics solution as a system akin to pre-LLM voice assistants like Alexa: millions of lines of bespoke code implementing use-case logic (“do a Google search”, “listen to music”), with deep learning models inserted in between the cracks to improve specific sub-components of the system. Such a system was the only way to get from zero to one and begin delivering something workable to customers for a very complex use case. Deviate slightly from the modal use case, and suddenly nothing works again.
The other way of doing things, as you might now expect, involves simple methods that scale well with computation and data. The reason these methods are not used initially is that, without sufficient computation and data, they just don’t work at all. But, if you believe the Bitter Lesson, in the long run they always win. For conversational assistants (and disembodied agents), this “other” way is that of Large Language Models. In robotics, I am betting on end-to-end deep learning models of embodied robotic behavior (you can call them “Foundation AI Models for Robotics”, “Large Behavior Models”, or any other marketing term you like). Instead of a compartmentalized solution involving individual components for path planning, collision avoidance, object detection, state estimation and application logic, think of a single end-to-end model mapping RGB images and proprioceptive readings to robot actions. Suddenly, learning a large number of different manipulation tasks becomes a data scaling problem as opposed to a series of ad-hoc engineering problems. This is the approach we are taking at mimic.
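To make the contrast with the modular pipeline concrete, here is a toy sketch of the interface an end-to-end model exposes: one function from raw observations (image features plus proprioception) to an action vector, with no hand-written planner, detector or state estimator in between. The “network” is just a random linear map, and all names and dimensions are made up for illustration.

```python
import random

class EndToEndPolicy:
    """Toy stand-in for a single model mapping observations to actions.

    A real policy would be a large neural network trained on demonstration
    data; a random linear map suffices here to illustrate the interface.
    """

    def __init__(self, obs_dim, action_dim, seed=0):
        rng = random.Random(seed)
        # One weight row per action dimension.
        self.w = [[rng.gauss(0, 0.01) for _ in range(obs_dim)]
                  for _ in range(action_dim)]

    def act(self, rgb_features, proprio):
        # Concatenate all observation streams into one vector and map it
        # directly to an action: no intermediate symbolic representation.
        obs = list(rgb_features) + list(proprio)
        return [sum(wi * oi for wi, oi in zip(row, obs)) for row in self.w]

policy = EndToEndPolicy(obs_dim=8, action_dim=3)
action = policy.act(rgb_features=[0.1] * 6, proprio=[0.0, 0.5])
print(len(action))  # one action vector, e.g. 3 joint targets
```

The point of the interface is that adding a new manipulation task changes the training data, not the code: nothing task-specific lives in the control loop.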
mimic’s tagline is: Scalable AI models for universal robotic manipulation. In more detail: We build dexterous robotic hands that fit seamlessly into human workplaces, and we plan to provide a “Foundation AI Model” suite allowing our robots to perform general-purpose skills, from simple industrial and logistics pick-and-place, to kitting, food preparation, and even service and care. We specifically focus on manipulation because it’s the key to enabling widespread economic impact from robotics, regardless of what the rest of the robotics system looks like (whether a static station, a wheeled robot, a quadruped or a humanoid).
The reason I co-founded mimic is that I see an immense opportunity within robotics to build an AI company that looks at robotic dexterity as a problem of data and compute scaling, solvable with simple, scalable methods and the smart use of large datasets. I not only think this is a great opportunity, but I also think that we’re in an extremely favorable place and time to make it work, despite humble beginnings.
Around the world, the most forward-thinking people in robotics are already reasoning in these terms. In 2022, Eric Jang asked ”How can we make robotics more like generative modeling?” Google has been training end-to-end robot models for a while too, using their existing Vision-Language Models as backbones (see RT-2). Breakaways from within Google have raised tens of millions for exactly this type of robot model (Physical Intelligence), and humanoid robotics companies with an end-to-end approach to autonomy (1X, Figure) have raised hundreds of millions of dollars.
Still, we are in the early days: the state of the art in terms of compute and data scale for these types of robot models is basically five years behind LLMs, to the point that even small start-ups can compete at the level of the base model and demonstrate their worth for further investment 2.
A Concrete Plan for the Future
Having decided to bet on scalability, what should the next few years look like? In my view, the road to truly general-purpose robots that will have a transformative impact on real GDP will have to go through these steps:
Multi-Task Imitation Learning with Robot Tele-operation Data
As stated above, what we are ultimately aiming to do is to train a single language-conditioned end-to-end model mapping RGB images and proprioceptive readings to robot actions. The native data format for such a model is “robot tele-operation data”: simply deploy lots of robots, hire robot operators, and obtain large quantities of robot trajectory and camera feed data on which to train a model with imitation learning. This is essentially what the other humanoid robotics companies taking an end-to-end approach to autonomy are now doing, including us. Recent algorithmic advances such as Diffusion Policy have made this viable; the question is: is it enough? Looking at the data scale required for training Large Language Models today, it seems uneconomical to rely on expensive manual data collection with physical robots for much longer.
Large Scale Human Data Pre-Training
Once one has built a working proof of concept for imitation learning with robot tele-operation data, it is my opinion that the greatest gains in robot capabilities, especially for a lean start-up, are to be obtained by a smart approach to efficient and scalable data collection. Stanford’s Universal Manipulation Interface is a great example of such an efficiency gain, as it allows one to cheaply collect robot data with a portable gripper without needing to deploy robots. The logical conclusion of this idea of maximizing data collection efficiency is to forgo the robot affordances altogether: simply train on egocentric human video data, which is attainable at large scale today. mimic’s choice of a highly dexterous humanoid hand for our robot combines particularly well with this idea, as the “embodiment gap” between human hands and our robot hand is much smaller than between human hands and two-finger gripper robots.
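At its core, the imitation learning recipe behind both of these steps is behavior cloning: regress actions on observations over a demonstration dataset, whether that dataset comes from tele-operation or from retargeted human video. Here is a minimal sketch on synthetic data, with a linear “policy” and a made-up “expert” standing in for a real network and real demonstrations:

```python
import random

# Toy behavior cloning: fit a linear policy to (observation, action) pairs,
# standing in for (camera features + proprioception, demonstrated commands).
rng = random.Random(0)
true_w = [0.5, -0.3, 0.8]  # hypothetical "expert" mapping for synthetic data
dataset = []
for _ in range(200):
    obs = [rng.uniform(-1, 1) for _ in range(3)]
    action = sum(w * o for w, o in zip(true_w, obs))
    dataset.append((obs, action))

w = [0.0, 0.0, 0.0]
lr = 0.1
for epoch in range(50):
    for obs, action in dataset:
        pred = sum(wi * oi for wi, oi in zip(w, obs))
        err = pred - action
        # Gradient step on the squared imitation loss (pred - action)^2.
        w = [wi - lr * 2 * err * oi for wi, oi in zip(w, obs)]

print([round(wi, 2) for wi in w])  # recovers the expert mapping
```

The recipe is deliberately simple; everything interesting lives in the model capacity and, above all, in where the (obs, action) pairs come from, which is exactly why data collection efficiency dominates.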
I consider the two above steps crucial in the short term for developing general-purpose robot AI, and foundational to everything else. However, there are some additional directions that I think will be interesting to explore in the medium term:
World Models for Evaluations
A huge problem with imitation learning operating directly on real-world robot data is that model validation scores cannot be computed on the fly during training. Unlike Language Models, to obtain an actually valuable evaluation score for this type of model you need to deploy it on a physical robot in a human-curated physical environment. One way to solve this problem is to learn a simulation of the real-world environment conditioned on robot actions, also called a “world model”. As an example of this, 1X recently posted about their efforts in training a world model for their EVE robot, specifically aimed at automating evaluations. There remains a question: how reliable will such world models be, and how can we be sure that evaluation quality will be uncorrelated with model errors, given that they will most likely be trained on the same data used to train the robot policies themselves?
Vision-Language Models as Reward Models for the Real World
World Models can allow us to perform simulated model evaluations. However, whether we do evaluations with World Models or in the real world, to reach full automation we need to be able to autonomously assign success scores (or “rewards”) to these evaluations, as currently all evaluations for general-purpose robot tasks have to be performed by human raters. The obvious answer to this problem is to use Vision-Language Models (VLMs) as Reward Models for real-world robotic tasks. I have worked on this before within a purely RL-from-scratch context. I believe that fine-tuning existing VLMs on robot data and human preference ratings can already lead to high-quality real-world robot action reward models today.
Reinforcement Learning Fine-Tuning
If scaling up imitation learning alone cannot achieve the last few nines of reliability (the 99.99..% success rates some real-world robotics tasks demand), I expect that the Robot World Models and VLM-based Reward Models discussed above can come together to power a resurgence of Reinforcement Learning for robotics. As opposed to using RL to train sim-to-real robot policies “from scratch”, RL would be employed as a “cherry on top” of large-scale robot AI models trained with imitation learning, similarly to how RLHF is used to shape the behavior of Large Language Models and make them safer and more useful for downstream applications. Further ahead, I can imagine RL-based optimization being used to make robots attain superhuman performance on general-purpose manual labor tasks, akin to how OpenAI o1 uses RL with Chain-of-Thought to bootstrap reasoning capabilities.
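As a toy illustration of the “cherry on top” idea, the sketch below nudges a tiny discrete policy with REINFORCE against a stand-in reward model. Everything here is made up for the example: the initial logits play the role of an imitation-pretrained policy, and the reward function plays the role of a learned (e.g. VLM-based) success scorer.

```python
import math
import random

rng = random.Random(0)
logits = [0.5, 0.4, 0.1]  # imagine these came from imitation pre-training

def reward_model(action):
    # Stand-in for a learned reward model scoring task success.
    return 1.0 if action == 2 else 0.0

def sample(logits):
    """Sample an action from the softmax over the logits."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i, probs
    return len(probs) - 1, probs

lr = 0.1
for step in range(2000):
    a, probs = sample(logits)
    r = reward_model(a)
    # REINFORCE: raise the log-prob of the sampled action in proportion
    # to the reward it received.
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * r * grad

print(max(range(3), key=lambda i: logits[i]))  # policy now prefers action 2
```

The pretrained logits matter: RL here only reshapes an already-sensible action distribution, which is the sense in which it is a fine-tuning step rather than learning from scratch.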
Conclusion
Going back to the original point: I hope I was able to paint a convincing picture of how the field of robotics has not yet fully embraced the implications of the Bitter Lesson and the Scaling Hypothesis, and of what the path for a robotics company that wishes to embrace such a philosophy will look like. By focusing on end-to-end deep learning models and scalable data collection methods, we at mimic are positioning ourselves at the forefront of this paradigm shift.
Want to join us? We are hiring.
Three years later, in 2024, we finally got to a point at which it seems that “dumb” scaling-up of LLMs has hit diminishing returns, mostly due to power generation bottlenecks, and the hyperscaler AI labs are trying out new ideas such as Chain-of-Thought RL training to make the most efficient use of compute at both training time and inference time. But to even recognize this as a development, one had to first adopt the Scaling Hypothesis framing. Three years ago, we were simply operating in a different paradigm. ↩︎
In some cases, one needs to fine-tune large-scale VLMs that had to be pre-trained at a scale beyond the grasp of start-ups. However: 1. There are open-source VLMs available to anyone for this purpose. 2. The compute and data involved in the fine-tuning step for the current state of the art in robotics are still accessible to anyone with a small amount of seed funding. ↩︎