Jaime Sevilla on Trends in Machine Learning
Jaime Sevilla is the Director of Epoch, a team of researchers investigating and forecasting the development of advanced AI. This is his second time on the podcast.
Over the next few episodes, we will be exploring the potential for catastrophe caused by advanced artificial intelligence. Why? First, you might think that AI is likely to become transformatively powerful within our lifetimes. Second, you might think that such transformative AI could result in catastrophe unless we’re very careful about how it gets implemented.
This episode is about understanding the first of those two claims.
In particular, we talk about three big contributors to progress in machine learning:
- Compute (how many computations you can perform, how quickly, and for how much money)
- Data (how much raw data you have to train on, and how high-quality it is)
- Algorithmic efficiency (for a given amount of compute and data, how cleverly can you train a system on that data and produce useful, reliable results?)
Fin spoke with Jaime about investigating progress in all three factors. In particular, they discuss:
- We’ve seen a crazy amount of progress in AI capabilities in the last few months, with AI models like ChatGPT, GPT-4, Microsoft’s Bing chatbot, and more. How should we think about that progress continuing into the future?
- How has the amount of compute used to train AI models been changing over time?
- What about algorithmic efficiency?
- How expensive are the training runs for big ML models like GPT-3 and GPT-4?
- Will we soon run out of data to keep making progress in training big ML models?
- How many words is a typical human trained on, compared to LLMs like ChatGPT?
- What will become of AI-generated art?
Further reading
- Epoch’s website
- Jaime’s Twitter
- Jaime’s AI-generated art project
- Epoch reports and blog posts that we discuss in the interview
Transcript
Introduction
Fin 00:06
Hey, you’re listening to Hear This Idea. So for the next few months, Luca and I are making artificial intelligence a big focus of the podcast. Now, what’s the general motivation for focusing on AI if you’re looking for opportunities to have a big positive impact? Well, you might believe two things. First, you might think that AI is likely to become transformatively powerful within our lifetimes. That means maybe it could cause a transition comparable to something like the agricultural or industrial revolution. And second, you might think that transformative AI could result in catastrophe unless we’re very careful about how it gets implemented. This episode is about giving some background on the first of those two claims. Specifically, Luca and I wanted to hear more about trends in AI so far and how we might begin to extrapolate them forward.
Fin 00:56
So we reached out to Epoch, which is a group of researchers investigating and forecasting the development of advanced AI. I’ve been hugely impressed with Epoch’s output so far. In case you need convincing, I can say just this week, their report on compute trends was the first citation in Google’s recent announcement about its new AI service called Bard. Now, in particular, I spoke with Jaime, who is now Epoch’s director. You might remember Jaime all the way back from episode 13, where we talked about cultural persistence and quantum computing. But in this conversation, Jaime gave just a really nice overview of Epoch’s results so far.
Fin 01:34
We went over big picture trends in the amount of compute used in top ML models, whether it’s soon going to be hard to find more training data, trends in algorithmic efficiency and GPU costs, and how performance has scaled historically with those different inputs. And we also chatted about how much text data humans are trained on and AI art. Now, apologies for the spotty audio in this one. We actually recorded this in Mexico City, and I managed to only get one mic set up. So you’ll have to bear with my phone’s mic on my questions, I’m afraid. Also, just a note that if you’d like to skip to sections which stand out, then you can use the chapter markers. Okay. Without further ado, here’s the episode. All right, Jaime Sevilla, you are our first ever repeat guest.
Fin 02:21
So special thanks for coming on for the second time.
Jaime 02:23
Thank you for inviting me.
Fin 02:25
Cool. So we are going to talk about trends in machine learning, but first question we ask everyone, is there a particular problem that you’re currently stuck on?
Jaime 02:37
So a problem that I’m currently trying to answer is how we can combine traditional macroeconomic models of growth with models of AI timelines, to try to understand better what those tell us about how the economy is going to be progressively automated in the next few decades.
Fin 02:54
Last time we spoke, you were actually... where were you working when we spoke last time?
Jaime 03:00
So I was working on technological forecasting.
What does Epoch do?
Fin 03:02
Okay. All right. Nice. And now we’re talking about forecasting ML specifically, at an organization called Epoch. You’re the director of Epoch. Can you say a bit about what Epoch does and why it exists?
Jaime 03:18
So Epoch is a research organization that is trying to figure out what is going to happen with artificial intelligence. Currently, we live in an exciting era. I will say we have seen lots of advances in artificial intelligence, in text generation, in image generation, in protein generation, like many other areas that are really interesting. And this raises an important question about what are going to be the long-term consequences of these advances for society. We want to give an empirical footing to the inquiry of this question. We’re trying to gather data about artificial intelligence in the last few decades and build some models that allow us to extrapolate these trends into the future.
Fin 04:07
Got it. Okay. So in my head, I guess there are organizations which work on AI alignment, making sure AI is safe when it becomes powerful enough to be very dangerous, potentially. I guess there are also orgs which work on questions around governing powerful AI. Epoch is focused on forecasting how AI could play out from now to when it becomes especially transformative. Is that roughly right? Yeah, that’s cool. Right. And then quickly, what’s the story behind Epoch? How did it come to be?
Jaime 04:38
So, it came a long way. While I was doing my PhD, I had many side projects, different small investigations into different aspects that had to do with my bigger program about technological forecasting, and one of them had to do with AI. We were trying to gather data on milestone machine learning systems and trying to study how these machine learning systems have become bigger over time, like how they were consuming more compute over time. This effort started growing. We got a few publications out there. People started recognizing our labor. Other people wanted to join and contribute to the effort. And then later I received an offer to help the Open Philanthropy Project with developing their own models of artificial intelligence, which I did.
Jaime 05:34
While doing that, they also provided me with some funding so I could pursue these side projects that I was involved in.
Fin 05:45
Okay, nice. All right, so I guess most of the episode we’re going to be talking about investigating trends in machine learning. I guess a question I should just ask at the beginning though is what are we talking about when we’re talking about machine learning? What is this ML thing? Yes.
Jaime 06:02
So machine learning is a field of research that is trying to develop programs that automatically adapt to new tasks and are able to perform these really complicated functions without the programmers having to explicitly program them to do so.
What causes progress in machine learning?
Fin 06:24
Maybe one question is: presumably it’s useful to break down the big question of how we think about progress in machine learning into questions about different factors which combine to give you an answer. So the question is, yeah, what are those factors? What are these different elements that add up to progress in ML?
Jaime 06:48
So in the last few decades in machine learning, we have learned about a few surprising regularities about how certain inputs to machine learning systems strongly determine, or are strongly correlated with, the performance that these systems are able to achieve. The two chief components that go into this recipe are the amount of compute that is used to train these machine learning models, the amount of operations that are used to train them, and second, the amount of data that goes into these models and the quality of this data. A third component that is related to this too is how our understanding of how to combine compute and data evolves over time, which is what I will call algorithmic improvements. So these will be the three components that I would point to.
Fin 07:38
Got it. So compute is something like how many operations you can do; data, how many ones and zeros you can actually work with and learn from; and then algorithms is like, what kind of clever new rules can we come up with to do things with those ones and zeros, with all that compute? Maybe one question is how do these things fit together to give you an answer about progress in machine learning? So maybe they add together, maybe they multiply, maybe one is a bottleneck and then it becomes unblocked. I guess historically, what has been the story of how these three factors have kind of combined?
Jaime 08:17
So the exact way in which they combine? It’s a bit complicated, but I’m going to give a bit of a simplification. Roughly, historically it has turned out that the two main drivers of progress have been the scaling of compute and these algorithmic improvements. And I will say that they are like 50/50. This is a figure that we derived from a recent investigation that my colleagues Tamay Besiroglu and Ege Erdil conducted.
Fin 08:46
You didn’t mention data there. Was that just because there’s always been enough data? There hasn’t really been a bottleneck until recently.
Jaime 08:52
Very much so. Up until this point, data has been collected basically by people in their free time building really giant data sets. It hasn’t been a bottleneck in the same way that the other two have been.
Trends in compute used for ML
Fin 09:05
Got it. Okay, so let’s talk about compute trends first, and specifically this paper that you and some colleagues published late last year. It’s called Compute Trends Across Three Eras of Machine Learning. First question is just: what did you set out to learn through that paper? What questions were you trying to answer?
Jaime 09:25
So it was more of an exploratory paper, in the sense that we didn’t have any pre-specified questions in our minds, but rather we had this general notion that compute is a really important factor for determining progress in machine learning. And we wanted to give people a better understanding of what the trend had been historically, and also try to inform how it might go on in the future. One thing that we were also particularly interested in understanding is any discontinuities that we could observe in the trends, because those would point to changes in the field that would help us better understand how researchers’ perception of machine learning has changed in the last few decades.
Fin 10:09
Got it. And I guess at a high level you were trying to figure out the amount of compute used by a bunch of different ML projects or models over time, starting from the very first models and then you’re trying to figure out what kinds of trends they paint. Yeah, maybe it’s a silly question, but how do you figure out how much compute a particular model uses?
Jaime 10:32
That’s an excellent question. We had to develop our own methodology to try to understand this better. There was some preliminary work by OpenAI in an AI and Compute report that they released on their website, where they already outlined two methods to try to estimate the amount of compute that goes into a model. We closely examined those methods and concluded that they are basically sound, and in practice that is what we ended up using. There are two main ways in which you determine compute. One of them is you just look at the hardware that was used. Usually those papers report the kind of hardware that was used to train the models and the amount of time that this hardware was run for.
Jaime 11:15
So you can just do the naive calculation of saying: this hardware is so powerful, multiplied by the training time, and that’s roughly the amount of compute.
Fin 11:24
Got it.
Jaime 11:25
And then the second one is you just count manually the amount of operations in the model, where you just look at the architecture. You figure out, all right, how many operations it takes to process one data point, and then you multiply that by the amount of data points that the machine learning model has seen. Okay, roughly.
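To make the two methods concrete, here is a minimal sketch. The numbers, the 30% utilization rate, and the 6·N·D rule of thumb for dense transformer-style models are my own illustrative assumptions, not figures from Epoch’s paper:

```python
# Two rough ways to estimate training compute, as described above.

def compute_from_hardware(peak_flops_per_s, n_chips, training_days, utilization=0.3):
    """Method 1: hardware peak performance x number of chips x training time.
    `utilization` is an assumed fraction of peak throughput actually achieved."""
    seconds = training_days * 24 * 3600
    return peak_flops_per_s * n_chips * seconds * utilization

def compute_from_operation_count(n_parameters, n_training_tokens):
    """Method 2: count operations per data point and multiply by the number of
    data points. For dense transformer-style models, ~6 FLOP per parameter per
    token (forward plus backward pass) is a commonly used rule of thumb."""
    return 6 * n_parameters * n_training_tokens

# Hypothetical model: 1,000 chips at 100 TFLOP/s peak for 60 days,
# versus 100B parameters trained on 1T tokens.
print(f"{compute_from_hardware(1e14, 1000, 60):.2e} FLOP")      # ~1.6e23
print(f"{compute_from_operation_count(100e9, 1e12):.2e} FLOP")  # ~6.0e23
```

In a real estimate the two should land in the same ballpark; the gap here just reflects the made-up inputs.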
Fin 11:45
Got it. And can you use both methods for the same model, and do you get a roughly similar answer, just to kind of check that it’s going to work?
Jaime 11:52
You can. And in fact, for a subset of our data, that is exactly what we did, so that we could get certainty that the methods that we were using were concordant.
Fin 12:02
Awesome. Okay, so that was the methodology for this paper. Natural question is what did you find? You’re looking at trends in compute over time and what kind of picture do you see?
Jaime 12:13
So there were three main results that we gleaned from this paper. The first one has to do with the historical rate of growth of compute, which had been doubling every 20 months or so. This corresponds roughly to Moore’s law. So this indicates that up until 2010, what was happening is that researchers and developers were not investing more money into running these machine learning systems, but rather they were just getting more powerful computers, which allowed them to put more compute into these machine learning systems.
Moore’s law
Fin 12:52
Can you remind me, by the way, what exactly Moore’s Law says?
Jaime 12:55
Roughly, Moore’s Law is an empirical observation that is stated in a few different ways. The traditional way has to do with the transistors in computers, in processors, saying that the amount of transistors that go into a processor doubles, like, every 20 months or so. And then there are analogous laws for other types of processors, like GPUs.
Fin 13:26
Got it. And the idea is that something like the number of transistors is a pretty good proxy for how powerful, how efficient the computing hardware is per cost or per unit.
Jaime 13:35
Exactly. And we see the same trend in the actual performance, the amount of operations that you can get out of a given processor.
Three eras of ML
Fin 13:44
So up until 2010, what you observed is that the amount of compute used for machine learning models increased just in line with Moore’s Law, which suggests that it wasn’t going faster than Moore’s Law for some reason, for instance because a larger share of compute was being used, or something like that.
Jaime 14:05
Yeah. So then our second finding is that around 2010, something happened. People started to wake up to the potential of massively scaling up machine learning systems. There were some landmark achievements in that era, like, for example, AlexNet, which baffled everyone by surpassing all other systems in image recognition. And people started upping their investment in machine learning. Now, it is no longer that their computers are getting better over time, but also they’re investing more money. They’re willing to buy more GPUs.
Fin 14:42
Got it. Okay, so first era: compute used in ML models increases with Moore’s Law. Then we have a second era starting around 2010 where it increases faster. And what was the doubling time of compute use?
Jaime 14:56
Around that time, our paper found a six month doubling time. That means, like, every year the amount of compute that is used to train these state of the art machine learning systems is like four times as much.
Fin 15:09
Okay. All right. Got it. So exponentially faster than compute itself is becoming cheaper. But is that the end of the story or is there more to come?
Jaime 15:20
So this second thing that we found is already outstanding in its own right, and it’s probably the figure that I would want the audience to remember: the trend of compute, it doubles, like, every six months. And this trend has kept up to today. The third thing that we found is that circa 2014, there was a discontinuity. It seems like industry started realizing that artificial intelligence had some really important business applications, and they upped their investment in machine learning quite massively, so we see an increasing variance. This is the point where there is a split between small machine learning labs and academics, who have low budgets, and big industry labs that were able to put like 100 times as much compute as everyone else into training these really useful machine learning systems.
Jaime 16:20
For example, an early example of this is like the Google translation neural network which was used in production and it has this landmark feature of having been trained on way more compute than other systems around that era.
Fin 16:42
Okay, got it. And I don’t know if you have numbers in your head, so maybe this is putting you on the spot. But just to give a sense of the absolute magnitude of improvement since 2014, can you quote something like the amount of compute used for models around 2014, and then the amount used for state of the art models around now, like the latest text models?
Jaime 17:02
Ten to the 19 floating point operations. That was what was being used around 2014. And the industry labs, which also came in circa 2014, upped this by an order of magnitude or more. For example, AlphaGo was trained on about ten to the 20 floating point operations.
Fin 17:25
Okay, that is quite a big difference. All right, nice. Yes.
Jaime 17:28
And this grew a lot, right? Like right now, the amount of compute being used to train machine learning systems is around ten to the 24 floating point operations. That’s like the biggest systems that we have today are being trained with around that amount of compute.
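As a sanity check on those figures, the implied doubling time from roughly 10^19 FLOP around 2014 to roughly 10^24 FLOP around 2022 (my assumed year for the ten-to-the-24 figure) can be worked out directly:

```python
import math

flop_2014, flop_recent = 1e19, 1e24   # rough endpoints quoted above
years = 2022 - 2014

doublings = math.log2(flop_recent / flop_2014)    # ~16.6 doublings
doubling_time_months = 12 * years / doublings     # ~5.8 months
print(f"{doublings:.1f} doublings, ~{doubling_time_months:.1f}-month doubling time")
```

which lines up with the roughly six-month doubling time mentioned earlier.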
Scaling laws for compute
Fin 17:44
Okay, I actually have a question which just occurred to me. Maybe it’s just like too big to answer now, but we’re seeing like many orders of magnitude worth of improvement with doubling times between like six months and a year, right? You’re just going to get that very quickly. Presumably that doesn’t translate to orders of magnitude of improvement on kind of common sense measures of performance, right? Because that would be insane. So maybe my question is, like, do we have kind of measures of performance for things like image classification and text generation? And then can we say anything about how this improvement in compute has translated into improvements in performance? Like, what’s the kind of elasticity there or something?
Jaime 18:33
So we’re seeing that these increases in inputs are being translated into increases in performance. And there are different ways that you can measure performance; the most typical ones are some more technical metrics, things like cross-entropy. And in the end, what we find is that for every exponential increase in the inputs, we’re seeing a linear increase on these metrics. So the improvement is still blazingly fast. But we need to take care, when we’re using these inputs to try to forecast how artificial intelligence is going to overperform or underperform in the future, to keep in mind, like, okay, we need to translate the scale of these inputs to the scales that we care about, to the metrics that we care about.
Jaime 19:37
Maybe a more interesting observation here is that for some of these metrics you’re going to find that there is a smooth relationship between the inputs and the outputs, and for some others there is not. Right. If you’re measuring something like, for example, does this machine learning system, this computer vision system, correctly classify this picture of a cat, then this is like a binary. This metric is going to be like yes or no. So there’s going to be a sharp point at which suddenly the machine learning system gets it and actually gets the picture right. But this is like a misleading way of looking at it, in a sense.
Jaime 20:21
I claim that for many of these sharp left turns, there is a better way of looking at it, in which instead of looking at whether it got it correctly, you can look at the internal weights of the model and what was the probability that it was assigning that the picture was a cat. And for that you will see a smooth relationship.
Fin 20:40
Okay, so to try saying that back: we often observe discontinuities in measures of performance. So as a model gradually scales up, there is often a point where it quite suddenly starts performing way better. But the thought is that often it’s not like the weights of the model suddenly change or something like that, but instead maybe there is a discontinuity between how a smoothly improving underlying measure of performance translates into the kind of performance that we’re measuring. Is that right?
Jaime 21:14
Exactly. And then there’s like two things that we care about. One of them is how often it holds that there is a smooth metric of performance that we can also use to forecast these less continuous metrics of performance, which is critical to see if what we’re doing is actually going to be a good approach to forecasting the future of AI. And then there’s the second question of whether the metrics that we are going to care about are themselves going to be things that are discontinuous, or are themselves going to be things that are more continuous in nature.
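A toy illustration of the point about smooth versus sharp metrics, with entirely made-up numbers: the probability a model assigns to the correct label can improve smoothly with scale, while a pass/fail accuracy-style metric jumps suddenly once that probability crosses a threshold.

```python
import math

# Toy model: the log-odds assigned to the correct label ("cat") improve
# smoothly (here, linearly) with log-compute. The accuracy-style metric
# only switches on once the implied probability crosses 0.5.
for log10_compute in range(18, 25):
    logit = 1.5 * (log10_compute - 21)     # made-up smooth relationship
    p_cat = 1 / (1 + math.exp(-logit))     # smooth underlying metric
    got_it_right = p_cat > 0.5             # sharp downstream metric
    print(f"10^{log10_compute} FLOP: p(cat)={p_cat:.3f}, correct={got_it_right}")
```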
Forecasting compute used in ML — will the trend slow down?
Fin 21:45
Okay, cool. So we’ve talked about this compute trends paper. You mentioned that we’ve reached something like a kind of third era in compute trends, where there’s this kind of big explosion in variance in how much compute different projects use. In particular, the well resourced industry projects can just throw a ton of compute at them. And the amount used for state of the art models is growing faster than Moore’s Law. Significantly faster. One reason this might be important is if it implies anything for how compute continues to grow in the future. So another big impossible question, but how do you start thinking about extrapolating that trend into the future?
Jaime 22:36
So when we first released this paper, we shortly afterwards released a rather naive model in which we try to extrapolate these trends into the future, where essentially what we’re looking at is: okay, this trend of compute is essentially driven by the amount of investment that has been put into the field and by improvements in hardware. And we will say, okay, let’s imagine that the improvements in hardware are going to be kept constant in the future, but investment is going to keep growing up until a point where it reaches such a big amount of money that companies are not going to be able to keep up with the spending. And we try to extend this trend under these two assumptions and see what kind of distribution we will get, like what prognosis we will get for compute in the future.
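A minimal sketch of that kind of two-factor extrapolation. The starting values, growth rates, and spending cap below are placeholders of my own choosing, not the parameters used in Epoch’s report:

```python
import math

# Largest training run = (FLOP you can buy per dollar) x (dollars spent on it).
# Hardware price-performance improves at a steady rate; spending grows quickly
# until it hits a cap that no single actor is willing or able to exceed.
flop_per_dollar = 1e17       # placeholder starting value
spend = 1e7                  # placeholder: dollars spent on the largest run
SPEND_CAP = 1e10             # assumed spending ceiling
HARDWARE_GROWTH = 1.35       # assumed yearly improvement in FLOP per dollar
SPEND_GROWTH = 3.0           # assumed yearly growth in spending, until capped

for year in range(2023, 2034):
    largest_run = flop_per_dollar * spend
    print(f"{year}: ~10^{math.log10(largest_run):.1f} FLOP")
    flop_per_dollar *= HARDWARE_GROWTH
    spend = min(spend * SPEND_GROWTH, SPEND_CAP)
```

The qualitative shape is the point: fast growth while spending can still scale, then a slowdown towards the hardware-only rate once the spending cap binds.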
Fin 23:35
Okay, cool. Just quickly, what is the name of this paper in case people want to see it?
Jaime 23:40
So this is a report, not a paper, more...
Fin 23:43
Like a report okay.
Jaime 23:44
It’s Projecting Compute Trends in Machine Learning, and it can be found on our website.
Fin 23:50
Got it. Cool. And we’ll link to it as well. Okay, and then what did you find when you just ran this kind of simple extrapolation?
Jaime 23:57
So running this simple extrapolation, what we find is that you would expect the amount of compute in the next ten years to end up roughly five orders of magnitude bigger than what we have today, which is roughly what you would expect given that the trend is doubling, like, every six months. And then it keeps growing and growing. And it’s interesting to put this in context with some estimates that other researchers have done about some interesting amounts of compute.
Fin 24:39
Right? Yeah. Could you say something about what these different compute benchmarks based on biology were, and then what your extrapolations say about when they might be hit?
Jaime 24:50
So this kind of idea of Cotra’s was trying to estimate what is the amount of compute that is used, for example, to train a human being. How much data is a human being exposed to from when they are born up until they reach the age of majority, and how much compute would you need in order to process that data? That’s kind of like the rough, crazy idea. And for that very basic benchmark, what we will say is that we will reach that critical amount of compute and data processing capability for our machine learning systems near the end of this decade, according to this naive extrapolation.
Fin 25:41
Okay. So again, on this naive model, the idea is there is some benchmark which is like a rough guess about the amount of compute that it takes to effectively train a human, like you’re training a model to a point where it is competent and generally intelligent. And the idea is that, yeah, on your extrapolations we reach that amount of compute near the end of this decade. So less than ten years’ time.
Jaime 26:12
That is correct. And it’s not exactly clear what this means for us if you are trying to use it to forecast when we’re going to reach transformative artificial intelligence. A very obvious objection to make is that, well, humans do not start at zero. Right. We have a lot of pre-encoded information in our genes. Also, we learn a lot through cultural evolution. Not everyone has to figure out everything for themselves. So I also estimated some other critical amounts of compute that you will need for some other landmarks. And perhaps another notable landmark is the amount of compute that was used by evolution to create human beings. For that, this naive extrapolation will say that we’ll reach it somewhere between 2070 and 2080.
Jaime 27:10
But this is now getting to the regime of extrapolation, which I will trust very little or not at all.
Fin 27:18
But I guess to take the thought back, the idea is if you care about forecasting when transformative AI might arrive, on different versions of that, then it might be a bit misleading just to look at the amount of quote unquote compute that a human uses from when they’re a baby to when they’re an adult. Because maybe some of our intelligence is kind of borrowed from other compute-like processes, such as evolution and natural selection, such as cultural evolution. So we might care about forecasting these things as well. Cool. I think you said something about it like five minutes ago. But when you are doing this exercise of extrapolating compute trends in ML models into the future, one consideration you might hit up against is just that the fraction of compute used in ML models
Fin 28:20
as a share of all the compute in the world is increasing, and you can’t use more than 100% of all the compute in the world. And so maybe that means that the growth over time has to slow down to the limit of just the growth in all the compute in the world. And is there some thought there that means that we might expect compute trends to slow down once they become so enormous, they’re just like an appreciable and expensive fraction of all the resources we have?
Jaime 28:47
Absolutely. I think the general form of this argument is: this can’t go on, right? This rate of growth is so massive that at some point it not only devours the whole economy, but it implies that you will need more resources than the whole economy in order to sustain it. Maybe the way that I would phrase it is maybe not so much with the specific fraction of compute that is being dedicated, because as a matter of fact, the kind of hardware that is being used for machine learning systems tends to be quite specialized in some regards. But instead, the frame that I have more in my head is: these chips, they are physical objects. They need some time to be produced.
Jaime 29:36
And sure, you can scale up production up until some point, and you can also divert resources from other types of R&D up until some point, but eventually you’re going to hit that limit.
Fin 29:48
All right, got it. Okay, so that was a whistle-stop tour of Epoch’s work on compute trends over time. But there are other factors that plug into these questions about forecasting the performance of ML models. Another factor is data. Right. This is just the information which you train models on. Doesn’t matter if you have a supercomputer, if you only have like a couple of pages’ worth of text to train it on. And Epoch has also looked at trends in the size of data sets, which I guess is the relevant thing here. Yeah. One question is just: what exactly is a data set in this context?
Jaime 30:31
So for machine learning to learn, you need to have something it can learn on. These days, within the paradigm that we live in, which is mostly these kind of language generation models, the kind of text that you need is text from the Internet, text from books, text from Wikipedia. That kind of text is used to help the systems play this game of: okay, I’m going to give you the text truncated up until a point, and then you need to guess what’s the next word that comes. And then you compare its guess with what actually comes in the data set, and you use that kind of signal in order to train better models. Maybe a nuance that I will add to what you said: we’re focusing on the size of the data set.
Jaime 31:16
I see this as more by necessity than by choice. I think also the quality of the data set is an important factor in determining how much learning you can get out of it. But since that’s a whole other dimension to capture, and it’s really hard to measure, we just focused on the easy dimension, which is the size of the data set.
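As a toy illustration of the next-word guessing game Jaime describes, here is a deliberately crude sketch; real models learn continuous parameters rather than counting bigrams, so this is only meant to show the shape of the training signal:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Count, for each word, which words tend to follow it.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

# "Truncate the text up until a point, then guess what comes next."
context = "the"
guess = next_word_counts[context].most_common(1)[0][0]
print(f"After '{context}', the toy model guesses '{guess}'")   # guesses 'cat'
```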
Fin 31:39
That’s a fair point. I guess you could just generate a bunch of just random data. That’s not very useful. So what matters is it’s actually useful data. Okay, so earlier you mentioned that the availability of even high quality data historically hasn’t been a major bottleneck on scaling up performance of ML models, or at least maybe not until recently. Yeah, maybe we could just talk about what the state of the art looks like now. How big are the data sets, maybe for image or text models, for the very top models these days?
Jaime 32:16
So for the very top models these days focusing on text, the kind of data sets that are being used are like sort of like a trillion words of training data.
Fin 32:30
Okay, I feel like I don’t have a very good sense of what a trillion tokens or words actually means. So can you say something about what that is as a fraction of something like, I don’t know, all the books ever written, or all of Wikipedia, or all of the Internet?
Jaime 32:45
So roughly, this will be about a thousandth of all the text that is on indexed websites on the Internet.
Will training data become a bottleneck soon?
Fin 32:55
Okay. All right. So that’s a pretty decent fraction. Okay. And then I guess that raises a natural question, which is: if we pass through three orders of magnitude of scaling up the data sets that we use, we’re maybe running up against the limits of what the entire Internet can offer. So I understand you’ve investigated this question, and I’m curious what you found when we’re trying to think about whether data might become a bottleneck soon and whether we might run out of high quality data for ML training.
Jaime 33:34
So we recently published an investigation that was led by my colleague Pablo Villalobos, where we investigated exactly this question. And what we found is that for quite high quality data, things like books, things like Wikipedia articles, there isn’t actually that much data. And it seems likely that we’re going to hit the limits somewhere in this decade.
Fin 34:01
Okay, all right.
Jaime 34:02
And this already takes into account that new books are written and new websites are published. Even taking that into account, it seems likely that that increase in data is not going to be enough to make up for the increasing demands of machine learning.
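A rough way to see how that kind of deadline gets estimated: project the growth of training-dataset sizes against the much slower growth of the stock of available text, and find where the curves cross. Every number below is a placeholder of my own, not a figure from the Epoch report:

```python
stock_of_text = 1e14      # assumed tokens of usable text available today
stock_growth = 1.10       # assumed ~10% yearly growth of the stock
dataset_size = 1e12       # assumed tokens used by today's largest training runs
dataset_growth = 2.0      # assumed yearly growth in training dataset size

year = 2023
while dataset_size < stock_of_text and year < 2060:
    year += 1
    dataset_size *= dataset_growth
    stock_of_text *= stock_growth
print(f"Under these assumptions, dataset size catches up with the stock around {year}.")
```

With these made-up inputs the crossing point lands in the early 2030s; the real report’s answer depends on its measured growth rates and on how strictly "high quality" is defined.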
Fin 34:19
Okay, interesting. As a complete side note, a few minutes ago I tried to figure out whether it would be possible for a person to read all of English Wikipedia. And I think the answer is just about like if you spent your entire life from the age of literacy and you lived a long enough life and you spent all your waking hours reading, you might be able to read all of Wikipedia. Which was kind of surprising to me. Like, it felt like I thought Wikipedia was a little bit bigger than that.
Jaime 34:44
Seems like a fun side project.
Fin 34:46
Yeah, some side project. One question is: you mentioned this is true even accounting for the fact that the stock of text data on the Internet is growing over time. I’m curious how fast the Internet is growing in this sense, because there are more people using it, there are more ways to use it, there are more ways to generate a ton of data on the Internet. So presumably it’s not growing linearly, but it sounds like it’s not growing fast enough.
Jaime 35:14
Okay, so the amount of data that is being uploaded to the Internet is still growing exponentially. It’s still growing very fast. Populations are still increasing. Access to the Internet is also a quantity that is thankfully increasing over time. And roughly what we find is that the stock of data is growing at a rate of maybe somewhere between 6% and 70% per year.
Fin 35:44
Okay, those are some pretty wide error bars there.
Jaime 35:47
Yeah, it necessarily has to be. There isn’t like a centralized repository, right, which keeps track of all the data that’s on the Internet, so it is based on very rough estimates. Yeah.
Fin 36:00
Got it. But in any case, it sounds like it’s very likely growing slower than, for instance, transistors on chips, Moore’s Law, and also slower than the rate of compute used by ML models. Got it. And then just to try saying the idea back: yeah, is the thought that up until now the availability of data hasn’t been a bottleneck because there’s always been a ton more data than it’s really been feasible to do anything with? And so the main question has just been how much data can we practically use, given the constraints we have on the amount of compute we have access to. But you can imagine that once access to compute scales fast enough, then you really could feasibly use more than an entire internet’s worth of data.
Fin 36:55
And at that point access to just more data becomes the bottleneck because it’s hard to get more performance, like squeeze more performance out of a data set which doesn’t grow. Is that the idea?
Jaime 37:07
Exactly. That is exactly right. And then the question is, like, what happens then?
Fin 37:11
Right.
Jaime 37:12
And the first thing is that at this point we have been talking a lot about high quality data. So, data that is very well structured: things like books, things like Wikipedia. Presumably there’s much more data available on the internet, right? Like you can still use social media posts, you can use Reddit, you can take YouTube videos and transcribe them and use this as data. If you’re willing to use that kind of more low quality data to train your machine learning systems, then presumably you can extend this deadline. Right. It seems likely that this low quality data stock can last us up until the next decade, so it probably will run out somewhere in the next decade. Unclear when exactly.
Fin 37:59
Got it. One idea I’ve heard a couple of people mention is maybe you could just generate your own data. Like you already have a model which is capable of generating kind of coherent works. Why don’t you just kind of feed that back into the training? Does that make sense or is that ridiculous?
Jaime 38:13
It does make a lot of sense, and there is actually a plethora of papers that explore this kind of idea of bootstrapping your own models by generating data that you can then later feed into the model itself. And I do expect that we’re going to see a lot of cleverness in that regard in the next few years. As data becomes more of a bottleneck, there is going to be a big incentive to investigate ways of increasing the efficiency with which you process the data that you have.
Fin 38:42
I see.
Jaime 38:43
So I do think that we haven’t yet had to try hard enough to overcome these kinds of data limitations. You could always just use a bigger data set, but now, you know, you’re going to have to make do with what you have. And this is going to lead to a lot of innovation. There are some other promising papers that have been released. There was this recent paper at NeurIPS that was looking at data pruning: how you could achieve a similar level of performance using only a fraction of the data set.
Ways to overcome the training data bottleneck
Fin 39:17
Okay, got it. So I guess we’re talking about ways in which this limitation on data might be overcome, or at least you might push back the point where it becomes a real bottleneck by a few years. One example is dealing with the data that we already have more efficiently and pruning it in several ways. Maybe there’s something which looks a bit like bootstrapping our own data, although I guess there’s a limit to how much you can do that. Are there any other ways in which it might be possible to kind of overcome this looming bottleneck?
Jaime 39:47
Well, I think that the ones we’ve said cover the kind of ideas I have in mind. I’m pretty sure that researchers in the future and people who specialize in this field have way more clever ideas in the works. So, yeah, the things that I will say is, like, we might see people using YouTube as a source of data, transcribing the conversations there, and using this kind of synthetic data.
How much text data are humans trained on?
Fin 40:18
Okay, so when we’re talking about compute, we mentioned Ajeya Cotra’s Biological Anchors report. And in that report there’s an estimate of how much compute a human uses right up to the point of adulthood or something like that. Can we ask the same question about how much data I was, quote unquote, trained on, let’s say, before my 18th birthday or something?
Jaime 40:44
Yes, we can ask the same question of how many words a kid listens to between when they are born and when they reach the age of majority. And we did a back-of-the-envelope calculation at Epoch just for fun internally, and it comes out to about 140 million words.
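That figure is easy to reproduce as a back-of-the-envelope calculation; the words-per-day number below is my own assumption (commonly cited estimates are in the tens of thousands), not necessarily the one Epoch used:

```python
words_heard_per_day = 21_000       # assumed average across childhood
years_to_adulthood = 18

total_words = words_heard_per_day * 365 * years_to_adulthood
print(f"~{total_words / 1e6:.0f} million words")   # ~138 million
```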
Fin 41:16
So sounds like large language models now even the very biggest ones are trained on more words than humans need to be trained on to reach, I guess, a higher level of performance right now. It’s kind of, I guess, interesting.
Jaime 41:31
Exactly. Which would also lead us to think that data efficiency improvements are certainly possible; humans are the living proof of that.
Fin 41:40
Right, got it. Or I guess maybe we are trained on other kinds of data, which are harder to...
Jaime 41:44
Also, that’s a fair point. Yeah, multimodal.
Algorithmic efficiency
Fin 41:52
Okay, cool. So we have talked about trends in compute. We’ve talked about trends in the size of the data sets which ML models have trained on. Let’s talk about this third factor now, which you mentioned, which is something like algorithmic efficiency or progress in algorithms. What exactly is that? I guess it’s a bit harder to imagine exactly what the definition is.
Jaime 42:20
So exactly what I mean by this is how you can decrease the requirements in compute and data that you need in order to reach a given level of performance.
Fin 42:33
Okay, got it. So I guess it’s like the special sauce that’s left over once you’ve accounted for data and compute. Cool. And a related question: I guess it’s kind of clear to me how you might measure the size of a data set. Right. Also for compute, it’s clear how you might operationalize that. But how do you measure algorithmic progress?
Jaime 42:57
Yeah, that is an excellent question, and I think that is one that it is easy to struggle with. Essentially, the traditional way of defining this has been thinking about, like, okay, if we take an architecture from ten years ago and we train it up until it reaches a certain level of performance on, for example, ImageNet and image recognition, then you can measure the amount of compute that you need in order to do that.
Fin 43:26
Right.
Jaime 43:26
And then what you can do is that you can take a novel architecture, you can take an architecture from today, and then you can train it on the same data set, and then see, okay, how much less compute do you need with this novel architecture which uses compute more efficiently in order to reach the same level of performance?
Fin 43:45
I see.
Jaime 43:46
And that’s kind of like that factor by which you decrease the compute but still get the same performance. This is, in a sense, a measure of algorithmic progress. I don’t think that this is the best way to measure this. There is a big problem with it, which has to do with the fact that modern architectures are more efficient at really large scales. Right. So if you are comparing a transformer from today, for example, with AlexNet from ten years ago, this is going to understate how much algorithmic progress has happened, because transformers are not that efficient when you’re training them to the level of AlexNet performance. You really need to train them on harder tasks in order for them to shine.
Fin 44:30
I see. Okay. So the idea is, if you’re just comparing AlexNet to a big state of the art transformer, one way you could do that is you could train up AlexNet to its performance on some benchmark, and then train a more recent transformer model on the same data to reach the same performance as AlexNet. And then maybe it’ll take less compute. But that’s really underrating where the progress is between AlexNet and newer models. And the progress is when you massively scale up the amount of data and compute that you’re using.
Jaime 45:11
Exactly. So then you need to adjust for this. And essentially you need to extrapolate: if we could grow AlexNet up until it reaches the performance of modern transformers, how much compute would we need, and then compare that to what transformers actually need. And that’s what my colleagues Ege Erdil and Tamay Besiroglu did in their recent investigation of algorithmic progress in computer vision.
Fin 45:37
All right, and what do they find?
Jaime 45:39
So what they find is that the rate of algorithmic progress is way faster than what was previously believed. Roughly, the amount of compute that you need in order to reach a given level of performance is halving every nine months. That’s less than a year: every nine months you need half the amount of compute.
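Combining this with the compute trend from earlier gives a rough sense of how fast "effective compute" grows. This back-of-the-envelope combination is my own, not a figure from the paper: if physical compute doubles every ~6 months and compute requirements halve every ~9 months, the rates add as reciprocals:

```latex
\frac{1}{T_{\text{effective}}}
  = \frac{1}{T_{\text{compute}}} + \frac{1}{T_{\text{algorithms}}}
  = \frac{1}{6~\text{months}} + \frac{1}{9~\text{months}}
  \quad\Longrightarrow\quad
  T_{\text{effective}} \approx 3.6~\text{months}
```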
Fin 46:03
Okay, that’s pretty fascinating. So it sounds like the trends in compute and algorithmic efficiency are doubling and halving on something like the order of a year or even less than that, which in both cases is faster than Moore’s Law. And then I guess in some sense you’re multiplying these two things, right? Or roughly speaking, you’re doing that, and so you have this much faster growth than Moore’s Law in just the performance of models. Got it. And yeah, I guess I’m curious to zoom in a bit more on how algorithmic efficiency relates to specifically efficiency with respect to compute and efficiency with respect to data. Because I can imagine maybe you come up with an algorithm where you really still need the same amount of data, but it just requires less compute, or vice versa. So, yeah, I guess what’s going on there?
Jaime 46:54
Yeah, so different algorithmic improvements tend to improve different parts of the pipeline and have different effects on the amount of compute and the amount of data that you need. Primarily, what we have seen in this last decade has been improvements on compute. So most improvements, when it comes down to it, seem to be improving the performance that we have because they improve our usage of compute, and not so much of data. With data, there seems to be currently much less innovation. But I want to emphasize again, I think that the reason is exactly that there has been less of a need for those kinds of data efficiency improvements.
Fin 47:43
Okay, got it. One thing I should have asked a bit earlier is just for examples of improvements in algorithmic efficiency, are there like, architectures that I should know about or specific models where there’s some innovation that really helps move things along?
Jaime 47:58
Well, these days, you definitely cannot talk about machine learning without mentioning transformers, which has been this really general architecture that has overtaken many fields in artificial intelligence. They were a particularly big improvement as compared to recurrent neural networks in text generation. But I will give you others that are maybe a bit more reduced in scope and maybe a bit easier to understand in terms of how they improved compute efficiency. Another improvement was the improvement in the activations of the neural network. So in a neural network, essentially you have two types of operations: you have matrix multiplications, and then you have some nonlinearities. And before, what we tended to use was some sort of complicated nonlinearity that took a lot of compute.
Jaime 48:49
And we have replaced that by what’s called a rectified linear unit, where essentially you just take the neuron, and if it’s less than zero, you set it to zero, and otherwise you leave it as it is. It’s a very simple operation. It requires very little computation. It turns out that kind of simple function was enough to still train machine learning systems, but at a reduced compute cost.
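For concreteness, the rectified linear unit Jaime describes is just the following; the contrast with a sigmoid is my own illustration of an older, more expensive style of nonlinearity:

```python
import math

def relu(x):
    """Rectified linear unit: zero out negative inputs, pass positive ones through."""
    return max(0.0, x)

def sigmoid(x):
    """An older style of nonlinearity: needs an exponential rather than a single comparison."""
    return 1 / (1 + math.exp(-x))
```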
Fin 49:16
So it sounds like you can get big step changes in algorithmic efficiency from inventing a new architecture, but you also get these small innovations which just improve existing architectures and just kind of it’s a more gradual thing.
Jaime 49:32
Yeah. Another way in which we have learned to improve our machine learning models is by improving the recipe by which these machine learning models are trained. Right. So when you need to train a machine learning model, you need to decide how big it is going to be, how many parameters you’re going to use, how much data you’re going to use, and what’s going to be the relationship between the parameters and the data. And recent work on scaling laws has improved a lot the way that this recipe works, the way that we train these machine learning models in order to get better performance using the same level of compute, just by tuning this ratio between the amount of data that is used and the size of the model.
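The best-known example of this kind of recipe improvement is probably DeepMind's "Chinchilla" scaling-law result, which roughly recommends growing parameters and training tokens together, on the order of twenty tokens per parameter. Here is a sketch of how that ratio turns a compute budget into a training recipe; the ~20 tokens-per-parameter and 6·N·D figures are widely quoted rules of thumb, used only for illustration:

```python
import math

def chinchilla_style_allocation(compute_budget_flop, tokens_per_param=20):
    """Split a training compute budget between model size and data,
    using compute ~= 6 * params * tokens and tokens ~= 20 * params."""
    params = math.sqrt(compute_budget_flop / (6 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

params, tokens = chinchilla_style_allocation(1e24)
print(f"~{params / 1e9:.0f}B parameters trained on ~{tokens / 1e12:.1f}T tokens")
```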
Fin 50:17
Okay, so you have like a bunch of dials in your model, and one thing you can do is just tweak the dials so you can get more out of it. Yeah. Interesting. I feel like I have a kind of vague question I want to ask. Maybe this is just asking about your own intuitions rather than anything more rigorous. But when you look ahead to thinking about how performance in ML models might improve over the next decade or so, do you imagine most of that improvement coming from these kind of dial-tweaking, small innovation type improvements? Or do you imagine it coming from one or two big changes to the paradigm, where you have a new thing like a transformer which just gets you a lot more?
Jaime 50:55
I feel like I don’t have a super good intuition on this. Part of it, I think, is because some of our homework at Epoch is doing a more systematic study of the kind of innovations that have happened over the last decade. We have studied their aggregate effects, but not so much the individual innovations that have happened and how important they have been relative to one another.
Do current methods scale to AGI?
Fin 51:21
Okay, got it. I mean, I guess a related question, which is an enormous question and not one that it is reasonable to expect any one person to have the answer to, is: do the current architectures and paradigms scale to the truly transformative or general kinds of AI? Right? So if you just turn up the data and you turn up the amount of compute you’re throwing at the kinds of models we have right now, like transformers, do they reach something like general intelligence, or is there some secret extra improvement in algorithms needed?
Jaime 51:56
Good question. Many people are asking that question. There are different opinions even within Epoch. My personal take is that the scaling laws have held up until now, and it would be surprising for them to suddenly stop working. So I expect that we’re going to keep seeing improvements, at the very least in the short term, probably for the next decade, just by scaling the inputs to these machine learning systems. That being said, I think that there are some key limitations with current machine learning systems. The fuzzier question for me is how difficult it is going to be to overcome these limitations. Like, for example, with transformer models right now, they have a very limited context length. Right.
Jaime 52:50
You cannot have a transformer right now write a book for you, because, in a sense, it’s going to run out of memory. That’s maybe one way of describing it, right?
Fin 52:59
Once it’s on chapter two, it’s kind of forgotten about whatever was in chapter one. So the direction it takes is a little bit random.
Jaime 53:05
Exactly. And then there is the question of how we can overcome this limitation. It can be as simple as designing a scheme in which this transformer model writes some intermediate thoughts into some sort of scratchpad memory that it can consult afterwards. Will that be enough? Maybe. There are also some other limitations. Like, for example, you can formally prove that transformers are not capable of adding digits together, sort of, in a sense: essentially, at some point there is going to be some input that is large enough that the system is not going to be big enough to process it. Right. There seems to be some loop missing that would allow the system to actually do this kind of operation.
Jaime 54:04
Maybe what’s going to happen is that in the future these transformer systems might be hooked up to a terminal, a Python terminal, a programming terminal, and it will write the program that will solve the task for it. Will that be enough? I don’t know. Maybe the fact that I cannot say, like, no, this is ridiculous, should tell you something about my beliefs about scaling laws and the kind of progress we might see in the future.
Fin 54:32
Yeah, super interesting. I have a bunch of questions, but it sounds like, okay, one kind of progress we can imagine is not coming up with some incredibly elegant new paradigm which solves all the problems. It’s like maybe we take the architectures which perform very well in certain ways and we kind of kludge them together with other things, or plug them into other things, and then we have this kind of more complete system. So I’m thinking about the analogy to humans, right? So my short term memory is not nearly the length of a chapter of a book, but I at least hope that I could write a book that is somewhat coherent across chapters. And why is that? Well, I can write a plan down on a piece of paper and then check back the paper.
Fin 55:13
And similarly, at least I haven’t checked recently, but hopefully I can still add eight digit numbers together. But I can’t really do this in my head. But again, I can write the numbers down on a piece of paper and apply these rules and it’s kind of this recursive loop that I can just kind of apply arbitrarily many times and then yeah, similarly, maybe we can just plug these different capabilities into existing architectures. That’s the thought, right?
Jaime 55:40
Yeah. Maybe an interesting reflection is like this can be the baseline. Right. There is like the possibility that the current paradigm is like something powerful enough to get us to some kinds of transformative artificial intelligence. But this is a really recent paradigm. Deep learning has been around for just ten years. There exists a possibility that we will come up with an even better way of conceptualizing machine learning that will be even more flexible than the current paradigm.
Scaling laws
Fin 56:13
Interesting. Got it. And then again, you mentioned scaling laws. Can you remind me exactly what scaling laws say?
Jaime 56:20
Yeah. So scaling laws are this regularity between the compute that is being used to train the systems, the data that is used to train the systems, and the size of the models, and the observed performance of them. Many people have been studying: okay, if you were to tell me just the amount of data that is being used, the amount of parameters, the amount of compute, and you were to train them in an optimal way according to current theory, what performance should we expect out of it?
Fin 56:58
Got it. I guess this is more or less what we’ve been talking about, right? But just to kind of say it again, it sounds like one finding is that the relationship between performance and these inputs is surprisingly regular so far. Such that if you just throw in more compute and more data, then you get this reliable improvement in performance, and maybe the performance scales, depending on your measure, something like logarithmically with total data or compute.
Jaime 57:31
Yeah, that’s the rough idea. These kinds of scaling laws are usually studied in specific domains, right? So we have a pretty good understanding, for text generation specifically, of what this regularity looks like, and for image generation specifically, of what this regularity looks like. It’s less clear, like, okay, what about inter-domain progress? Should we expect this kind of regularity? Can we somehow anticipate when we are going to see advancements in other fields? This is more speculative.
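For reference, the kind of regularity Jaime describes is usually written as a power law plus an irreducible term. The particular parametric form below is the one popularized by the Chinchilla paper, shown here as background rather than something discussed in the episode:

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where N is the number of parameters, D the number of training tokens, E an irreducible loss term, and A, B, α, β fitted constants.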
Fin 58:01
Got it. Okay. Maybe we could try summarizing what we’ve talked about so far. At least that’d be useful for me. So we talked about these inputs to the performance of ML models, which are compute, data and algorithmic efficiency, and then you could extrapolate all of these things forwards. But yeah, I’m just curious to hear again how these things add up for you, or for Epoch’s work, in terms of what kinds of even rough or naive predictions they yield for the next, let’s say, decade of progress in ML, and especially what they might tell us about what we should expect to happen between now and reaching something like transformative AI.
Jaime 58:43
So let me summarize the three takeaways that I would want the listener to remember from this conversation. Compute: it’s doubling every six months. That trend has continued for like ten years, and it will possibly continue in the future. Data: we might run out of data in either this decade or the next one, but this does not necessarily mean the end of scaling in machine learning. This probably means that data efficiency is going to be more important. And third, regarding algorithmic innovation: in the last decade, the compute requirements for computer vision have halved every nine months. In other domains, it seems like this trend can be even faster, and this will possibly also continue in the future. What does this mean?
Jaime 59:39
Well, given what we have talked about, the regularity of scaling laws, the fact that these inputs keep growing and they seem like they’re not going to stop, it definitely means that for the next decade we should expect still very new, exciting developments every year.
Fin 59:57
Right, seems like a good answer. I guess I don’t want to put you on the spot, but, like, people enjoy talking about things like their timelines, their, you know, median expectation for when we reach some version of transformative AI. And people also like talking about their expectations for takeoff speeds, that is, what happens between now and that point. I don’t know if you have numbers that you can share about what your own timelines look like, or at least how you think about these things given what we’ve talked about.
Jaime 01:00:29
Yeah, excellent question. So to inform my thinking about timelines, at this point I’m quite anchored on previous work. Some people have put together quantitative models to try to estimate, in a very rough way, the amount of compute you would need in order to achieve a transformative capability, and then how fast our stock of compute is growing. And not only the stock of compute, but also how compute requirements are decreasing over time, as we talked about in the algorithmic improvement section. And I find these inside-view models somewhat compelling. I think that the weakest part is probably the part that has to do with estimating these transformative thresholds for compute, and I’m hoping that you will see some more work from Epoch trying to address this question.
Fin 01:01:29
Okay.
Jaime 01:01:29
But for the time being, I think that they are a reasonable starting point for conversation, compared to some other models. For example, we always have to refer here to Ajeya Cotra’s model, right, which predicted a median of around 2050 for when we will reach transformative artificial intelligence. I think there are some minor quibbles with the model, and Ajeya herself has pointed to some of its weaknesses. For example, it doesn’t take into account that, even before we get to a point where AI can perform all economically useful tasks, AI is going to be able to perform some economically useful tasks.
Jaime 01:01:30
Right?
Fin 01:02:20
And this presumably is going to redound in a speedup in the R and D in AI and other parts.
Fin 01:02:26
Of the economy like investment and interest flows in once they observe that some economically useful tasks are possible.
Jaime 01:02:33
And also, not only interest, that too, but also this has to do with, like, we’re going to be able to use these AIs.
Fin 01:02:42
Interesting.
Jaime 01:02:42
Yeah. Within Epoch we also sometimes use some AI to help us write papers and emails.
Fin 01:02:52
Okay, I guess that’s another example of bootstrapping, right? It’s like using the less capable AI to make the more capable AI.
Jaime 01:02:59
And taking that into account, I do believe my timelines are a bit shorter than that. Without entering into too much detail, I will say that my median right now is, like, somewhere around 2043. But then again, I want to say that even within Epoch there’s a huge diversity of opinions. There are people whose timelines are as short as, sort of, twelve years and as long as 100 years.
Fin 01:03:24
Okay, got it. Thanks. That’s still pretty useful. Okay.
Jaime 01:03:29
And the other question that you asked has to do with takeoff speeds. So first let me clarify a bit what we mean by takeoff speeds. So I have talked about how, in the future, before we get really powerful AI, we’re going to have less powerful AI, right? And that’s going to be a really critical period for us, because it’s going to be that time in which we’re going to have access to these AI systems that are precursors of the really powerful AI, that we’re going to use to automate large chunks of our economy, and we’re going to be able to do lots of experiments with them. And that’s going to be, I expect, a very productive period for alignment and strategy and understanding what this AI is going to mean for the world. So we’re really.
Jaime 01:04:13
Interested in understanding how long that critical period is going to be. How long is the period going to be between AI starting to get good and AI getting so good that it can perform any remote job?
Fin 01:04:26
Yeah, got it.
Jaime 01:04:27
And I think that right now, my take is that the time between AI that can perform, like, 20% of remote economic tasks and 100% of remote economic tasks is probably going to be something somewhere between five years and ten years, would be my guess. At this point, I definitely haven’t ruled out faster takeoffs, but that’s what I have in mind now.
Fin 01:04:57
Yeah. I want to know: what are the considerations when you’re thinking about takeoff speeds? So, for instance, what kinds of factors, if you learned about them, would make it more likely that we should expect faster takeoffs?
Jaime 01:05:09
I’m going to ask you to wait for a couple of months.
Comparing the costs of training vs inference
Fin 01:05:16
Okay. So that was extremely interesting. I have a bunch of extra questions, I guess, just kind of extending or asking about the things you’ve talked about. Here’s one. So far we’ve been talking about the requirements, in terms of compute and data, for training ML models. Right. But once you’ve trained an ML model, you also need some compute and other things to actually run it, to implement it, as it’s called, inference, right, rather than training. I’m interested in whether you can say anything about how the requirements are different between what you need to train a model compared to what you then need to actually run it. And I’m imagining that this might actually be relevant if, for instance, it turns out it’s much cheaper, in various ways, to run inference passes, to run a model once you’ve trained it. Right.
Fin 01:06:11
Because in that case, well, once you’ve trained the model, maybe you might be more worried about other people, just like stealing the weights. Or maybe you might expect, once the model has been trained, that it can get rolled out extremely quickly. Right. Because all the work has basically been done.
Jaime 01:06:29
So, yeah.
Fin 01:06:29
Curious about this kind of vague question about the difference in requirements between training and inference.
Jaime 01:06:36
Yeah, this is a really big question, and it’s a question that we are really interested in at Epoch as well. To give you a sense of how these two quantities compare: essentially, in order to train a system, what you need to do is process the sample forward, then process it backwards through the network to try to get back the signal of what it got wrong, essentially. And then you repeat this for every data point that you have in your data set. Right. So the ratio between the amount of compute that you need to train the system and the compute you need to run it is going to be roughly proportional to the size of your data set, loosely speaking.
Fin 01:07:27
Got it because to do one pass, one inference pass of my model, I just need to give it one bit of data, right? Like, for instance, it could be a prompt if it’s a language model. But to train the model, you need to give it all the data.
Jaime 01:07:39
Okay, so given that we now have data sets that, as I was saying, are bordering on a trillion words, that means that actually running the system to spit out a word is going to be like a trillion times cheaper than training the system itself. It seems like right now how this works is that you’re training these really large foundation models that have very general capabilities, and since the training cost vastly exceeds the inference cost, it seems likely that once we develop these really powerful systems, we’re going to be able to roll them out across society really quickly.
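As a rough worked example of that ratio, the sketch below uses the common rule-of-thumb cost estimates for dense transformer models (roughly 6 × parameters × tokens FLOP for training, and roughly 2 × parameters FLOP per generated token at inference). The particular model and dataset sizes are hypothetical placeholders, not figures from the conversation.

```python
# Rough comparison of total training compute vs the compute for a single
# inference pass, using common rule-of-thumb estimates for dense
# transformers:
#   training  ~ 6 * N * D FLOP   (N = parameters, D = training tokens)
#   inference ~ 2 * N FLOP per generated token
# N and D below are hypothetical placeholders.

N = 100e9   # 100 billion parameters (assumed)
D = 1e12    # one trillion training tokens (assumed)

training_flop = 6 * N * D
flop_per_token = 2 * N
ratio = training_flop / flop_per_token   # simplifies to 3 * D

print(f"Training FLOP:                  {training_flop:.2e}")
print(f"Inference FLOP per token:       {flop_per_token:.2e}")
print(f"Training / one generated token: {ratio:.1e}")
# With a ~trillion-token dataset, generating a single token is on the
# order of a trillion times cheaper than the full training run.
```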
Fin 01:08:29
Okay. Got it. And are there any particular upshots of that fact that seem relevant when we’re trying to imagine, when we’re just trying to think about what to expect, like what the world actually looks like once we’re able to train these very powerful models? Like, what does this big difference between training and inference practically mean?
Jaime 01:08:51
That’s a harder question than I can answer in a couple of minutes. But I suppose that what we’re going to see is that there is going to be a very small gap in time between demos being revealed and those demos being made widely available.
GPU cost predictions
Fin 01:09:07
Nice. Okay, cool. That’s useful. Another question I was interested in: Epoch did some work specifically predicting the performance of GPUs, which I guess is important because GPUs are often used in ML applications. Can you say something about that work?
Jaime 01:09:26
So as I was saying, we’re really interested in how the amount of compute that’s available to train machine learning systems is going to grow over time. One part of that is that people have more money that they can use to buy more GPUs, but the other side of that is that GPUs are becoming more efficient over time. You can get more operations for a given level of budget. Right. So we have a vested interest in understanding how this progress happens and whether it could possibly stop in the future.
Fin 01:09:56
Got it. Okay.
Jaime 01:09:57
And we have a few speculative reports on our website where we tackle these questions, and essentially we just plot the data. We just look at what the historical cost of a flop has been for different versions of GPUs in the last two decades and then try to extrapolate this forward. I would recommend taking all this work with a grain of salt. This is definitely more on the speculative side of what we are doing at Epoch.
Fin 01:10:30
Okay, got it. But with that grain of salt taken, what did you, or Epoch, find when you tried to plot the efficiency of GPUs?
Jaime 01:10:38
So essentially what we found is that there is indeed an exponential trend happening. The amount of compute is getting exponentially cheaper over time. The quantity we look at is the amount of operations that you can purchase for a dollar, the amount of performance that you can purchase for a dollar, the number of floating point operations per second per dollar.
Fin 01:11:11
Just to clarify, this is like, for instance, if I wanted to rent some time on a GPU, then this is what a dollar gets me for that time?
Jaime 01:11:20
Yes, more or less. It will be more like what you will get if you were to buy the GPUs yourself. If you were to rent it on cloud computing, then you will also take into account, like, other costs. There’s also, like a profit margin from the cloud provider and that kind of thing.
Fin 01:11:37
But I guess the reason I’m asking is like, okay, maybe I buy a GPU for like $500. I can just run that forever. Right. So the amount of compute per dollar is unclear to me.
Jaime 01:11:48
You can, but you have to take into account two things. One of them is, well, eventually your GPU is going to break. The second one is that eventually your GPU is going to be outdated. And in fact, we have another speculative analysis on our website where we try to think, okay, given that GPUs regularly get updated, can we use this fact to try to bound the length of a training run?
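To see why “compute per dollar” is still a meaningful quantity when you own the hardware outright, here is a toy amortization calculation. Every number in it (purchase price, sustained throughput, utilization, useful lifetime) is a hypothetical placeholder rather than a figure from Epoch’s reports, and it ignores electricity and other running costs.

```python
# Toy amortization: effective cost per FLOP when buying a GPU outright.
# All figures are hypothetical placeholders; electricity and other
# running costs are ignored.

purchase_price = 500.0        # dollars (hypothetical consumer GPU)
sustained_flops = 10e12       # 10 TFLOP/s sustained throughput (assumed)
utilization = 0.3             # fraction of time doing useful work (assumed)
useful_lifetime_years = 4     # before it breaks or is hopelessly outdated

seconds = useful_lifetime_years * 365 * 24 * 3600
total_flop = sustained_flops * utilization * seconds

print(f"Useful FLOP over the card's lifetime: {total_flop:.2e}")
print(f"Effective dollars per FLOP:           {purchase_price / total_flop:.2e}")
# Because the card eventually breaks or becomes outdated, the compute you
# get per dollar is finite, and it improves as newer GPUs deliver more
# FLOP/s for the same price.
```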
Fin 01:12:17
Right, okay. Is it like the longest training run that would be sensible to do? Because at some point you might as well have just started later.
Jaime 01:12:23
Exactly. If you start later, you have better hardware, and because you have that better hardware, you’re going to make up for the fact that you started later.
Fin 01:12:33
Got it. What did you find in that?
Jaime 01:12:36
So essentially we found that, at the current rates of improvement in hardware and software, and the current growth in investment budgets, it seems pretty likely that machine learning training runs will run for less than 15 months.
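Here is a toy version of the “start later on better hardware” argument behind that bound. Assume the effective compute you can buy per month grows exponentially, with a made-up doubling time; then, for a fixed finish date, waiting only fails to help when the run is short enough. Because the growth rate here is an arbitrary assumption, the number it produces will not match Epoch’s estimate exactly; it just shows the shape of the argument.

```python
import math

# Toy model of the "start later on better hardware" argument. Assume the
# effective compute you can buy per month grows exponentially (made-up
# doubling time below), and you must finish by a fixed deadline. Waiting
# w months means training for (deadline - w) months on better hardware.

doubling_time_months = 9.0  # made-up combined hardware/software/budget trend
growth = 2.0 ** (1.0 / doubling_time_months)  # per-month improvement factor

def compute_done(deadline_months: float, wait_months: float) -> float:
    """Compute accumulated if you wait, then train until the deadline."""
    return (deadline_months - wait_months) * growth ** wait_months

# Starting immediately is only optimal if no amount of waiting beats it.
# In this model that happens exactly when the run length is at most
# 1 / ln(growth) months, which bounds how long a sensible run can be.
max_sensible_run = 1.0 / math.log(growth)
print(f"Longest sensible training run: ~{max_sensible_run:.1f} months")

# Sanity check: for a run twice that long, some waiting strictly helps.
deadline = 2 * max_sensible_run
print(compute_done(deadline, 0.0) < compute_done(deadline, deadline / 2))
```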
Fin 01:12:57
Okay, wow.
Jaime 01:12:57
And I think that’s not even a tight upper bound. As a matter of fact, in reality, a 15-month training run is unheard of.
Fin 01:13:06
Yeah. How long are the training runs for, for instance, GPT-3?
Jaime 01:13:10
So I cannot remember off the top of my head, but the kind of training run length that we’re seeing these days is somewhere between two months and six months for really high end systems.
How long will the longest training runs take?
Fin 01:13:22
Okay, got it. Oh, yeah. Maybe one kind of detailed question: why could I not just start a training run with the best hardware I can get my hands on, and I want it to be a really long training run, and then, let’s say a couple of months in, I can get better hardware and I just, like, swap it out mid training run. Why is that not feasible?
Jaime 01:13:45
That could be in fact feasible. It could be the case that you can do these kind of changes. What’s going to be more complicated is incorporating advances in software. Like if a new architecture is developed in that time, well, you can’t really reuse the training that you have done so far. Also, there’s lots of complications associated with this kind of like hardware swapping. Right. It seems like you still need to make sure that the architecture is correctly transferred and all those things that could cause lots of issues.
Fin 01:14:28
Okay, cool. Yeah. Just sounds like a pain for hardware and then maybe in fact close to impossible if the architecture really significantly changes.
Jaime 01:14:37
It is a pain. It is definitely possible. I actually don’t have a really good sense of whether this is common practice. I think that maybe some companies are doing it, but I think that the norm is, like, you just don’t bother with it. You just buy new pieces of hardware.
AI-augmented R&D
Fin 01:14:54
Yeah, makes sense. Have you or has Epoch thought much about what we can kind of anticipate in terms of AI augmented R and D?
Jaime 01:15:06
So my colleague Tamay actually released a paper on this matter. I don’t have super good insights on it; it’s still pending on my homework list to familiarize myself with it. The bottom line is that, yes, it seems like R and D is going to be boosted by improvements in deep learning. And it is going to be this kind of recursive loop, in which more improvements are going to lead to more R and D, and more R and D leads to more improvements.
Fin 01:15:34
Right. Because I guess there’s a lot of loops here which I’m trying to get clear in my mind. Right. So one obvious loop is that AI can directly and specifically speed up the development of more capable models of like successive versions of itself, something like that. Another loop is that as AI systems are able to augment R and D and once they demonstrate success, then they might attract even more investment over time. And that’s a kind of like bootstrapping loop. And then maybe a third loop is just like if AI is able to augment R and D at such a large scale that it just means the world’s economy grows faster than it otherwise would. Well, that just means in some sense, everything that’s economically useful happens faster, including the development of AI.
Fin 01:16:36
Is that kind of roughly getting things right and there’s like lots of different loops going on?
Jaime 01:16:39
Yeah, that roughly matches my impressions. I think that at this point, the four main loops that I have in my mind when I think about artificial intelligence are what you were mentioning about increased attention and investment, improvements on the algorithmic side of it, and increases in the amount of money that there is around. Like, there’s going to be more productivity, more GDP to go around. And maybe the other one that I will add to the list is improvements to hardware, right? We’re going to have possibly AI-designed chips and improvements to the process of hardware production that are going to allow us to decrease the cost of compute.
Research Jaime wants to see
Fin 01:17:20
Okay, Jaime, this has been super interesting. I suggest we talk about some final questions now, which we ask everyone. The first question is one I think you could probably give an especially useful answer to: are there any ideas for research projects that you’d be really excited to see people take up, maybe people who are listening to this podcast?
Jaime 01:17:42
Way too many, way too many. But let me give maybe some ideas. First, some things that fall squarely into what Epoch is all about and that we will probably research in the future, but by no means let this be an impediment for you to look into them as well. And then I will talk about some things that are a bit more out there and do not fall squarely into our expertise. So, things that are very much within Epoch’s scope and that I’m pretty excited about: one of them is understanding which tasks are going to be automated first. I think for many it came as a surprise that art generation happened to be one of the first things automated, and I would want to ask the question of what’s going to fall next.
What is going to be the trajectory, in a sense, of which tasks get automated first? I think that this could have large implications for how fast the economy is going to be automated. The second question that I’m interested in is understanding better the availability of compute around the world. Like, how different labs have access to different levels of compute, how different countries have access to different levels of compute, how in the future we might expect that compute stock to grow, when we are going to hit limits, how fast we can build new factories to meet the growing demand, that kind of thing. The third thing that I’m really interested in is understanding better the returns to R and D.
So we have been talking a lot about how AI could speed up research and development in different areas, from hardware to AI itself. And I want to get a better understanding of how, historically, more inputs into these fields have been turned into improvements, and then try to understand what happens when you are able to increase these inputs by an order of magnitude. How much does this improve development? If you increase inputs to research and development now by a factor of ten, that doesn’t mean that research is going to be ten times faster going forward, right? There are lots of reasons to expect strong diminishing returns in research, and strong diminishing returns to allocating a large amount of resources to a particular field at the same time. There aren’t that many researchers who are going to be able to work on the problem. There aren’t that many GPUs to do experiments with. You run into these kinds of bottlenecks.
Fin 01:20:27
Sure. Okay. These are all great ideas. You also mentioned more speculative ideas, more kind of beyond Epoch’s scope. What about those?
Jaime 01:20:36
So these will be things that I don’t think we currently have expertise in at Epoch, but I would be really excited about other people doing them. One of them has to do with extreme value forecasting in statistics. I think that we have a lot we can learn from the science of how you predict how maxima evolve over time, and, without getting too technical about it, I think this is really important, for example, for predicting how benchmarks in machine learning are going to improve in the future.
Fin 01:21:10
Okay, got it.
Jaime 01:21:11
Another one has to do more with sociology. Right. So currently there is a very strong movement by traditional artists against art generated by artificial intelligence. And this is a force to be reckoned with in the world. Right. We could see that they could organize and try to pass some regulation that is going to affect the development of artificial intelligence in the future. It’s also going to affect the perceptions around artificial intelligence and how people choose to invest or not in these technologies. I would want to understand better analogues of these movements, and also try to understand better what the consequences of these social movements getting organized could be. Like, what are they going to achieve in the world?
Fin 01:22:09
That’s interesting, I guess. Yeah. One example that comes to mind is the Luddite movement, right, where they had these very legitimate grievances, where in some sense their livelihoods were being automated away, and now this word means something quite different. Right. Nice. Okay. Any other ideas?
Jaime 01:22:24
So another field that I’m really interested in getting a better understanding of, and there’s already some work on this, is studying historical analogues to artificial intelligence, in the sense of past technologies that ended up being revolutionary. It could be, like, electricity and the internet. There is a really good report on this by Ben Garfinkel. But I want to see more, I want to understand better. What can we learn from the management of nuclear bombs that could be applied to artificial intelligence? What could we learn from other technologies and how they were deployed and developed?
Reading recommendations
Fin 01:23:03
Sounds like there are ideas within ideas there. Fantastic. There is a ton of stuff there. We’ll try writing them all up so people can get involved if they want. All right. Here’s another question we ask everyone. Can you recommend some books, websites, or reports which people could go and read if they want to learn more about what we’ve been speaking about?
Jaime 01:23:24
Yes. So one resource that I would recommend a lot is Our World in Data. They just recently put together a collection of resources about artificial intelligence, which I think is a good introduction to the topic and a good place to look at some of the graphs that illustrate the topics we have been chatting about. Another one, and I don’t know if this is cheating, but I suppose I would recommend our own website, Epoch’s.
Fin 01:23:53
Totally.
Jaime 01:23:56
There are lots of publications there. We have visualizations and other tools that you might find interesting in order to understand this topic better. And maybe a third thing that I would recommend, to understand also the side of people who are thinking about what the consequences of developing this technology are going to be, is the introductory series called The Most Important Century by Holden Karnofsky. I think that more people should be thinking about what the consequences of these technologies being deployed in the real world are going to be, and I think that’s a really good attempt to help orient your own thinking.
AI art
Fin 01:24:36
Those are some great recommendations. Okay. All right, so for our last question: I happen to know that you have a side project of making AI art. Maybe one question is: it seems very likely that AI will make it easier for people who don’t have a background as a traditional artist to generate really good art, at least by many people’s standards. I’m also curious, I guess, on the flip side, whether you think it’ll lead to a niche for traditional, non-AI artists, or whether you think, let’s say, AI-augmented artists might reach a similar cultural position to artists today. I guess it’s a question about something like the sociology of art, given that AI can now generate art. Does that make sense?
Jaime 01:25:23
Yeah, I’d say the correct analogue to be thinking about when thinking about AI art is probably photography. So, like, when photography was developed, it was this new form of art. What’s the word I’m looking for? It was perceived at the time as, in a sense, replacing traditional painting.
Fin 01:25:50
Yeah.
Jaime 01:25:50
But in the end, it turned out that this was not the case. Right. Photography ended up as its own form of art, with its own developed techniques and tradition. And I do expect that with artificial intelligence we’re also going to see a similar thing. I’m quite excited about the potential for this technology to, in a sense, I want to say democratize art. Maybe that’s not the correct word, but it’s definitely going to allow many people to express themselves artistically, whereas before they didn’t have enough time to engage with some other forms of art.
Fin 01:26:28
Yeah, totally. I love the analogy to photography. I don’t know how much you know about the history, but is it the case that when photography was initially developed, it was viewed as artless or something like that?
Jaime 01:26:41
So I’m definitely not an expert. I have heard recently about something like that happening with the radio, when singers started to be replaced, which I also think is an interesting analogue, because with AI art there is a very interesting and important ethical and regulatory question here, which has to do with the data that these machine learning systems are being trained on. Right. And we’re going to have to have a conversation, as a society, about what the correct way of using that data is going to be, and what norms we are going to establish.
Jaime 01:27:21
And I believe that something similar happened when radio was first introduced to the public, where singers and artists at the beginning had a very strong reaction, but then a conversation developed to try to establish better copyright norms, so they could still practice their art and be fairly compensated for it.
Fin 01:27:50
Okay. What was the complaint that musical artists had when radio came about? Was it just, like, how do we make money now that it’s much easier to listen to us, or something?
Jaime 01:28:00
I believe there was a combination of them feeling that they were being replaced by these machines, and also a complaint of them saying that this kind of automated mass music can never reach the same level as a live artist; you’re depriving the public of that experience.
Fin 01:28:30
I guess part of what’s going on is that when I generate a piece of AI art, as long as I don’t include “in the style of” a specific artist in the prompt, then in some sense it’s just drawing on the training data, which includes just millions of actual artists. Right. The harm is so diffuse, right, because it’s kind of affecting everyone a tiny bit. It’s like if I took a couple of pixels from your artwork and a couple of pixels from another person’s artwork. And, yeah, I don’t know if there’s any precedent for that kind of dynamic, where the copyright complaint is so diffuse across so many people.
Jaime 01:29:03
I mean, the precedent that comes to mind, and this might be a bit unfair, is artists themselves. You do not learn art in a vacuum. Right. What you do is you experience lots of art from other artists, and that informs your own style. But again, there is a substantial difference between these cases, right? Because in this case, there is much clearer attribution. Right. You cannot prevent an artist from learning, but you can choose to not include a certain artwork in a training data set. Right. And right now, these artists have very little say in how their art is being used, right?
Outro
Fin 01:29:47
Of course. Okay. We could talk about this forever, but I think we better wrap up. This has been fascinating, Jaime. Thank you so much for being our first ever repeat guest on the podcast. Thank you.
Jaime 01:29:59
Yeah, thank you so much. I’m very glad to be here, and glad that you didn’t consider me too annoying to invite twice.
Fin 01:30:09
That was Jaime Sevilla on Trends in Machine Learning. If you find this podcast valuable in some way, one of the most effective ways to help is just to write a review wherever you’re listening to this, so maybe that’s Apple Podcasts or Spotify or somewhere else. You can also follow us on Twitter; we’re just @hearthisidea. And I’ll also mention that we still have a feedback form on our website. I can tell you we read every submission, and you’ll also receive a free book for filling it out. You can find that on our website, which is hearthisidea.com. Okay, as always, a big thanks to our producer Jason for editing these episodes. And thank you very much for listening.