A list of borrowed opinions
Generative Models and RL
The emerging “generative” paradigm in AI is a big deal. Large-scale models that are trained autoregressively (or with diffusion) on large-scale datasets seem to encode useful abstractions one can bootstrap agency from. On the other hand, stuff like pure Reinforcement Learning from scratch seems way too sample-inefficient to ever scale to complex behaviors in the real world. This is why I am doing behavior cloning instead of sim-to-real RL in my current robotics work.
Rich Sutton produced both the best idea ever (The Bitter Lesson) and the worst idea ever (“Reward is Enough”) in deep learning. The Bitter Lesson is all about choosing approaches that scale well! Good luck with “reward is enough”, waiting until the heat death of the universe to learn anything!
It is fashionable in academia to say that GPT-style transformers trained with next token prediction “don’t really understand”, “cannot reason”, are stochastic parrots and so on. My friend Daniel Paleka argues it’s not that simple.
Robotics
It seems pretty clear that the learning architectures and recipes for solving autonomous robotics are already here; the only thing missing is scale. We have “universal” techniques for learning representations, abstractions and behaviors. Ignore reward optimization, ignore super-human performance, ignore AGI. Want to build a robot to do a task? Just collect lots of data from humans completing that task. This last part is the only thing that is still hard, due to the economics of real-world hardware.
Moravec’s Paradox is not really that relevant in the era of deep learning. Specifically, Moravec’s Paradox describes the fact that computational systems are good at things that humans find hard (e.g. arithmetic and chess), and bad at things that humans find easy (e.g. motor skills). In the GOFAI era, this was relevant. But early in the advent of deep learning, we already saw computers achieve extremely high performance on tasks that humans find easy, such as object classification. For a human, recognizing images is extremely easy: it happens instantaneously and subconsciously. Now it is considered easy for computers as well, provided the right architecture and algorithms are used. Deep learning already solved Moravec’s Paradox. The only bottleneck is what we can get scalable data for: easy for images and text in our internet era, still hard for motor skills. A “data lottery ticket hypothesis”.
Given the above, the specifics of the architecture don’t really matter. Whether we use RL as a cherry on top of world models or rely fully on behavior cloning doesn’t matter. What matters is that we train large-scale neural networks that encode suitable representations for performing embodied actions. Which method lets us scale toward this goal fastest is a purely practical question.
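At its core, behavior cloning is just supervised learning on demonstrations. A toy sketch of the idea, with a linear policy and synthetic “expert” data standing in for a large network and real teleoperation logs (all names and numbers here are illustrative, not from my actual setup):

```python
import numpy as np

# Hypothetical demonstration data: (state, action) pairs from an "expert".
# Here the expert is a known linear policy plus noise, purely for illustration.
rng = np.random.default_rng(0)
W_expert = rng.normal(size=(4, 2))              # 4-dim states, 2-dim actions
states = rng.normal(size=(200, 4))              # 200 recorded states
actions = states @ W_expert + 0.01 * rng.normal(size=(200, 2))

# Behavior cloning = regress actions on states. A least-squares linear fit
# stands in for gradient-training a large neural network on the same loss.
W_cloned, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The cloned policy closely imitates the expert on unseen states.
test_states = rng.normal(size=(10, 4))
error = np.abs(test_states @ W_cloned - test_states @ W_expert).max()
print(error)
```

No reward function anywhere: the entire difficulty moves into collecting the `(state, action)` dataset, which is exactly the economics point above.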
Anti-Theory
Theory is supposed to explain reality in an elegant, low-complexity Occam-razor kind of way. “Classical” Learning Theory, when applied to Deep Learning, fails the first step of explaining phenomena that actually occur in practice.
I am skeptical that Double Descent is at all relevant to deep neural network generalization. If you look at GPT-3, its parameter count (175B) is lower than the number of tokens it was trained on (300B). Never mind the Chinchilla scaling laws arguing for training the same models on even more data.
I appreciate the attempt at moving beyond classical theory. Still, double descent is commonly explained with linear regression as a starting example. You’re telling me it’s not just classical learning theory in a trench coat? I expect that whatever theory of deep learning we need, it cannot possibly be the kind of theory that starts from bounds on linear regression. It would probably have something to do with absolute model size and emergent abilities instead.
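The linear-regression version of double descent is easy to reproduce, which is part of why it features in so many explanations. A minimal sketch with minimum-norm least squares (all constants are illustrative choices, not from any paper): test error peaks when the feature count hits the sample count, then falls again.

```python
import numpy as np

# Double descent with minimum-norm linear regression: vary how many of the
# D features the model sees, with the training set size fixed at n_train.
rng = np.random.default_rng(0)
D, n_train, n_test, trials = 100, 20, 500, 30
errors = {p: 0.0 for p in (5, 20, 100)}  # p = number of features used

for _ in range(trials):
    w = rng.normal(size=D)
    w /= np.linalg.norm(w)                      # unit-norm true weights
    X = rng.normal(size=(n_train, D))
    y = X @ w + 0.1 * rng.normal(size=n_train)  # noisy labels
    Xt = rng.normal(size=(n_test, D))
    yt = Xt @ w
    for p in errors:
        beta = np.linalg.pinv(X[:, :p]) @ y     # minimum-norm least squares
        errors[p] += np.mean((Xt[:, :p] @ beta - yt) ** 2) / trials

# Error spikes at the interpolation threshold (p == n_train) and drops
# again in the overparameterized regime.
print({p: round(e, 2) for p, e in errors.items()})
```

Note the argument above: GPT-3 sits on the *under*-parameterized side of this picture (fewer parameters than tokens), so the overparameterized regime the second descent describes may simply not be where large language models live.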
Diffusion models come with beautiful and complicated theory about how they approximate a score function and perform stochastic gradient Langevin dynamics. Except that they can be made to work even if the noise is not i.i.d. Gaussian, or without noise at all. The design space of diffusion models can be explored independently of theoretical niceness – a story as old as time. I believe that diffusion models ultimately work because they are a scalable method for giving neural networks adaptive computation time. Also, because they are not adversarial.
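For reference, the theory being gestured at is the standard score-based formulation (textbook material, nothing specific to this post): a network $s_\theta$ is trained to approximate the score of the data distribution, and samples are drawn by iterating a Langevin update.

```latex
% The network approximates the score of the data distribution:
s_\theta(x) \approx \nabla_x \log p(x)

% Langevin sampling with step size \epsilon and Gaussian noise z_t:
x_{t+1} = x_t + \frac{\epsilon}{2}\, s_\theta(x_t) + \sqrt{\epsilon}\, z_t,
\qquad z_t \sim \mathcal{N}(0, I)
```

The i.i.d. Gaussian assumption on $z_t$ is load-bearing in this derivation, which is exactly why models that still work without it undercut the theory.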
Adversarial Learning
I am happy that due to the seeming demise of GANs many of us do not have to deal with adversarial optimization anymore. I suspect it to be a fundamentally harsher setting with respect to safety and interpretability concerns than the autoencoding/autoregressive one. I also dislike it ideologically.
Various links – Deep Learning
- Ferenc Huszar’s inference.vc blog is great for explanations and explorations of theoretical topics in ML
- Sam Bowman: Eight Things to Know about Large Language Models
Various links – Everything Else
- Money Stuff by Matt Levine
The best financial newsletter in the world. Very smart and fun.