Bostrom on Superintelligence (2): The Instrumental Convergence Thesis
This is the second post in my series on Nick Bostrom’s recent book Superintelligence: Paths, Dangers, Strategies. In the previous post, I looked at Bostrom’s defence of the orthogonality thesis. This thesis claimed that pretty much any level of intelligence — when “intelligence” is understood as skill at means-end reasoning — is compatible with pretty much any (final) goal. Thus, an artificial agent could have a very high level of intelligence, and nevertheless use that intelligence to pursue very odd final goals, including goals that are inimical to the survival of human beings. In other words, there is no guarantee that high levels of intelligence among AIs will lead to a better world for us.
The orthogonality thesis has to do with final goals. Today we are going to look at a related thesis: the instrumental convergence thesis. This thesis has to do with sub-goals. The thesis claims that although a superintelligent AI could, in theory, pursue pretty much any final goal, there are, nevertheless, certain sub-goals that it is likely to pursue. This is for the simple reason that certain sub-goals will enable it to achieve its final goals. Different agents are, consequently, likely to “converge” upon those sub-goals. This makes the future behaviour of superintelligent AIs slightly more predictable from a human standpoint.
In the remainder of this post, I’ll offer a more detailed characterisation of the instrumental convergence thesis, and look at some examples of convergent sub-goals.
1. What is the Instrumental Convergence Thesis?
Bostrom characterises the instrumental convergence thesis in the following manner:
Instrumental Convergence Thesis: Several instrumental values [or goals] can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realised for a wide range of final goals and a wide range of situations, implying that these instrumental values [or goals] are likely to be pursued by a broad spectrum of situated intelligent agents.
An analogy with evolutionary theory might help us to understand this idea. (I know, I’ve now used two analogies with evolution in the first two posts of this series. I promise it won’t become a trend.) In his work on evolution, the philosopher Daniel Dennett employs the concept of a “good trick”. Evolution by natural selection is a goal-directed process. The goal is to ensure the survival of different genotypes. Organisms (or more specifically the genotypes they carry) adapt to the environments in which they live in order to achieve that goal. The thing is, there is huge variation in those environments: what is adaptive in one may not be adaptive in another. Nevertheless, there are certain “good tricks” that will enable organisms to survive across a wide range of environments. For example, eyesight is useful in nearly all environments. Because they are so useful, different groups of organisms, often with very divergent evolutionary histories, tend to hit on these “good tricks” over and over again, across evolutionary time. This phenomenon is actually known as convergent evolution, though I am fond of Dennett’s label. (Dennett also uses the related concept of a “Forced Move”.)
I think Bostrom’s concept of instrumental convergence is very much like Dennett’s concept of a “good trick”, except that Bostrom’s concept is even broader. Dennett is dealing with evolution by natural selection, which involves one overarching final goal, being pursued in a variety of environments. Bostrom is concerned with agents who could have many possible final goals and who could be operating in many possible environments. Nevertheless, despite this added complexity, Bostrom is convinced that there are certain (very general) sub-goals that are useful to agents across a wide range of possible final goals and a wide range of possible environments. Consequently, we are likely to see even superintelligent agents hitting upon these “good tricks”.
So what might these “good tricks” be? The basic rule is:
If X is likely to increase an agent’s chances of achieving its final goals (no matter what those final goals might be) across a wide range of environments, then X is likely to be a (convergent) sub-goal of all agents.
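One rough way of making this rule more precise is sketched below. The notation is purely illustrative and is not taken from Bostrom's book: $\mathcal{G}$ stands for the set of possible final goals, $\mathcal{E}$ for the set of possible environments, and "for most" is left deliberately vague, in keeping with the "wide range" language of the thesis itself.

```latex
% Illustrative sketch only: the symbols \mathcal{G}, \mathcal{E}, g, e are assumptions,
% not notation used by Bostrom.
\[
  X \text{ is a convergent sub-goal if }\quad
  \Pr\bigl(g \text{ is achieved} \mid X,\, e\bigr)
  \;>\;
  \Pr\bigl(g \text{ is achieved} \mid \neg X,\, e\bigr)
  \quad \text{for most } (g, e) \in \mathcal{G} \times \mathcal{E}.
\]
```

Read this way, the thesis is a claim about how many goal and environment pairs satisfy the inequality, not a claim that every agent must pursue X.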
Let’s look at some possible examples. Each of these is discussed by Bostrom in his book.
2. Self-Preservation and Goal-Content Integrity
The first two mentioned by Bostrom are self-preservation and goal-content integrity. They are closely related, though the latter is more important when it comes to understanding superintelligent AIs.
The sub-goal of self-preservation is familiar to humans. Indeed, as Bostrom notes, humans tend to pursue this as a final goal: with certain exceptions, there is almost nothing more valuable to a human being than its own survival. The situation is slightly different for an AI. Unless it is deliberately created in such a way that it has no intrinsic final goals — i.e. it learns to acquire goals over time — or unless it is explicitly programmed with the final goal of self-preservation, the AI’s interest in its own survival will always play second fiddle to its interest in achieving its final goal. Nevertheless, with the exception of the goal of immediate self-destruction, most of those goals will take time to achieve. Consequently, it will be instrumentally beneficial for the AI to preserve its own existence until the goal is achieved (or until it is certain that its own destruction is necessary for achieving the goal).
Embedded in this is the more important convergent sub-goal of goal-content integrity. In essence, this is the idea that an agent needs to retain its present goals into the future, in order to ensure that its future self will pursue and attain those goals. Humans actually use a variety of tricks to ensure that they maintain their present goals. Smokers who really want to quit, for example, will adopt a range of incentives and constraints in order to ensure that their future selves will stick with the goal of quitting. We can imagine artificial agents needing to do the same sort of thing. Though when we imagine this we have to remember that artificial agents are unlikely to suffer from weakness of the will in the same way as human agents: just preserving the goal over time will be enough for them. Bostrom argues that goal-content integrity is more important than self-preservation for AIs. This is because, as noted, the need for self-preservation is highly contingent upon the nature of the final goal; whereas the integrity of the final goal itself is not.
That said, Bostrom does think there are scenarios in which an agent may change its final goals. He gives a few examples in the text. One is that it might change them in order to secure trusting partners to cooperative exchanges. The idea is that in order to pursue its goals, an agent may need to cooperate with other agents. But those other agents may not trust the agent unless it alters its goals. This may give the agent an incentive to change its final goals. It could also be the case that the agent’s final goal includes preferences about the content of its final goals. In other words, it may be programmed to ensure that it is motivated by certain values, rather than that it pursue a particular outcome. This could entail alteration of goals over time. Finally, the cost of maintaining a certain final goal, relative to the likelihood of achieving that goal, might be so large that the agent is incentivised to “delete” or “remove” that final goal.
I think the idea of an agent altering its final goals is a coherent one. Humans do it all the time. But I have some worries about these examples. For one thing, I am not sure they are internally coherent. The notion of an agent changing its final goals in order to secure cooperative partners seems pretty odd to me. It seems like its final goals would, in that case, simply be kept “in reserve” and a superficial mask of alteration put in place to appease the cooperative partners. Furthermore, in his defence of the orthogonality thesis, and later in his defence of the AI doomsday scenario (which we’ll look at in the next post), Bostrom seemed to assume that final goals would be stable and overwhelming. If they could be as easily altered as these examples seem to suggest, then the impact of those defences might be lessened.
3. Cognitive Enhancement and Technological Perfection
Another plausible convergent sub-goal for an intelligent agent would be the pursuit of its own cognitive enhancement. The argument is simple. An agent must have the ability to think and reason accurately about the world in order to pursue its goals. Surely, it can do this better if it enhances its own cognitive abilities? Enhancement technologies are an obvious way of doing this. Furthermore, the first AI that is in a position to become a superintelligence might place a very high instrumental value on its own cognitive enhancement. Why? Because doing so will enable it to obtain a decisive strategic advantage over all other agents, which will place it in a much better position to achieve its goals.
There are some exceptions to this. As noted in the discussion of the orthogonality thesis, it is possible that certain types of cognitive skill are unnecessary when it comes to the attainment of certain types of goal. Bostrom uses the example of “Dutch book arguments” to suggest that proficiency in probability theory is a valuable cognitive skill, but also notes that if the agent does not expect to encounter “Dutch book”-type scenarios, it may not be necessary to acquire all that proficiency. Similarly, an agent might be able to outsource some of its cognitive capacities to other agents. In fact, humans do this all the time: it’s one of the reasons we are creating AIs.
Another plausible convergent sub-goal is technological perfection. This would be the pursuit of advanced (“perfected”) forms of technology. We use technology to make things easier for ourselves all the time. Building and construction technologies, for example, enable architects and engineers to better realise their goals; medical technologies help us all to prevent and cure illnesses; computing software makes it easier for me to write articles and blog posts (in fact, the latter wouldn’t even be possible without technology). An AI is likely to view technology in the same way, constantly seeking to improve it and, since an AI is itself technological, trying to integrate the new forms of technology with itself. Again, this would seem to be particularly true in the case of a “singleton” (an AI with no other rivals or opposition). It is likely to use technology to obtain complete mastery over its environment. Bostrom suggests that this will encompass the development of space colonisation technologies (such as Von Neumann probes) and molecular/nano technologies.
Again, there will be exceptions to all this. The value of technological perfection will be contingent upon the agent’s final goals. The development of advanced technologies will be costly. The agent will need to be convinced that those costs are worth it. If it can pursue its goals in some technologically less efficient manner, with significant cost savings, it may not be inclined toward technological perfection.
4. Resource Acquisition
The final sub-goal discussed by Bostrom is resource acquisition. This too is an obvious one. Typically, agents need resources in order to achieve their goals. If I want to build a house, I need to acquire certain resources (physical capital, financial capital, human labour etc.). Similarly, if a superintelligent AI has the goal of, say, maximising the number of paperclips in the universe, it will need some plastic or metal that it can fashion into paperclips. AIs with different goals would try to acquire other kinds of resources. The possibilities are pretty endless.
There is perhaps one important difference between humans and AIs when it comes to resource acquisition. Humans often accumulate resources for reasons of social status. The bigger house, the bigger car, the bigger pile of money — these are all things that help to elevate the status of one human being over another. This can be useful to humans for a variety of reasons. Maybe they intrinsically enjoy the elevated status, or maybe the elevated status gets them other things. Given that an AI need not be subject to the same social pressures and psychological quirks, we might be inclined to think that they will be less avaricious in their acquisition of resources. We might be inclined to think that they will only accumulate a modest set of resources: whatever they need to achieve their final goals.
We would be wrong to think this. Or so, at least, Bostrom argues. Advances in technology could make it the case that virtually anything could be disassembled and reassembled (at an atomic or even sub-atomic level) into a valuable resource. Consequently, virtually everything in the universe could become a valuable resource to a sufficiently advanced AI. This could have pretty far-reaching implications. If the AI’s goal is to maximise some particular quantity or outcome, then it would surely try to acquire all the resources in the universe and put them to use in pursuing that goal. Furthermore, even if the AI’s goal is ostensibly more modest (i.e. doesn’t involve “maximisation”), the AI may still want to create backups and security barriers to ensure that goal attainment is preserved. This too could consume huge quantities of resources. Again, Bostrom points to the likelihood of the AI using Von Neumann probes to assist in this. With such probes they could colonise the universe and harvest its resources.
As you no doubt begin to see, the likelihood of convergence upon this final sub-goal is particularly important when it comes to AI doomsday arguments. If it is true that a superintelligent AI will try to colonise the universe and harvest all its resources, we could easily find ourselves among those “resources”.
So that’s a brief overview of the instrumental convergence thesis, along with some examples of instrumentally convergent sub-goals. I haven’t offered much in the way of critical commentary in this post. That’s partly because Bostrom qualifies his own examples quite a bit anyway, and also partly because this post is laying the groundwork for the next one. That post will deal with Bostrom’s initial defence of the claim that a superintelligence explosion could spell doom for humans. I’ll have some more critical comments when we look at that.
John Danaher is an academic with interests in the philosophy of technology, religion, ethics and law. John holds a PhD specialising in the philosophy of criminal law (specifically, criminal responsibility and game theory). He was formerly a lecturer in law at Keele University, interested in technology, ethics, philosophy and law. He is currently a lecturer at the National University of Ireland, Galway (starting July 2014).
He blogs at http://
This article previously appeared here. Republished under a Creative Commons license.