Stopping Superintelligence

Hi, I'm Tim Tyler - and today I will be discussing the possibility of constructing a superintelligent agent that doesn't object to being switched off.

Firstly an introduction:

Expected utility maximisers

An expected utility maximiser is a theoretical agent which considers its actions, computes their consequences and then rates the outcomes according to a utility function. It performs the action which it thinks is likely to produce the largest utility - and then iterates this process.
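The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a serious agent: the action names, probabilities and utilities are all made up for the example, and the agent's model of consequences is just a hand-written table.

```python
# Minimal sketch of an expected utility maximiser. The "model" maps each
# candidate action to the agent's forecast of its outcomes, as a list of
# (probability, utility) pairs. All names and numbers here are illustrative.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

def choose_action(model):
    """Pick the action whose forecast outcomes have the highest expected utility."""
    return max(model, key=lambda a: expected_utility(model[a]))

model = {
    "mine": [(0.9, 10.0), (0.1, -5.0)],  # expected utility 8.5
    "idle": [(1.0, 1.0)],                # expected utility 1.0
}
print(choose_action(model))  # prints "mine"
```

A real agent would then observe the result, update its model, and iterate.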

Expected utility maximisation is a general framework for modelling rational intentional agents.

Self-improving systems

Self-improving systems are dynamical systems with specified goals that attempt to improve their ability to reach their goals as time passes.

Utility Functions

Future superintelligences are likely to be accurately modelled as self-improving expected utility maximisers. However, it is not yet clear what utility function they are likely to use.

Looking at existing synthetic intelligent agents - such as Deep Blue - it seems possible that their utility functions may be extremely complex.

Getting the utility function right is important. A powerful superintelligent agent closely resembles a wish-granting genie - but as in traditional stories, it is necessary to be careful what you wish for.

For example, consider what happens if a gold-mining agent is constructed. The resulting agent might then mine the entire planet, converting it to rubble in its search for gold atoms. Attempts to switch it off would be strongly resisted - the machine reasoning that, if it is turned off, production of gold would slow down - a terrible state of affairs from the point of view of maximising gold production.

This type of runaway superintelligence is one kind of undesirable outcome that can arise from a careless choice of utility function.

Utility Engineering

There are some issues associated with whether various proposed architectures will allow us full control over a system's utility function. However, here, I will assume that we will be able to engineer systems with whatever utility function we choose.

Since poor choice of utility function can have negative consequences for humanity, there seem to be various options.

One is to get things right the first time - and build a superintelligence which is human-friendly - and shares human values.

However, that seems likely to prove to be a challenging engineering project. If that strategy is taken, it seems likely that other superintelligence construction projects - which are not so choosy about the exact details of the utility function - will materialise first.

Another is to provide a mechanism for dynamically updating the utility function to reflect human desires. That poses some technical problems - since superintelligent agents will naturally resist modifications to their utility functions. However, this is essentially the solution that Asimov originally proposed. Asimov's moral robots simply obeyed humans - and subsequent commands could override earlier ones.

To explore the possibilities in this area, I will consider a simpler problem here - the problem of whether we can make a superintelligence that suspends its activities after a specified time, or on request.

If you can stop a superintelligence that is misbehaving, then you can probably reprogram it and then start it up again - thereby obtaining a crude version of a machine intelligence with dynamically configurable goals.

Stopping superintelligence

Powerful expected utility maximisers naturally resist being turned off. Being turned off usually eliminates all chances of obtaining utility in the future - an extremely negative outcome.

However, by careful engineering of the utility function, it is possible to engineer systems that don't mind being turned off.

Steve Omohundro has analysed this type of system - in a talk entitled "AI and the Future of Human Morality". Here is some very brief footage of Steve describing the problem:

[Very brief footage of Steve Omohundro]

Then he describes an objection, which he attributes to Carl Shulman:

[Footage of Steve Omohundro]

I agree with Steve on very many things, but here I think the analysis he presents is sloppy. A correctly-constructed intelligent agent is not likely to conclude that it should not switch itself off because of its doubts - unless it has good evidence for those doubts. Steve argues that only a small doubt is enough, because the negative utility of switching yourself off incorrectly is so large. However, that analysis is not correct. The utility associated with being turned off is actually one of the parameters under the control of the designers. They can configure it so that it is dynamically equal to the expected utility of being switched on - at all times. In such cases, the machine will not have much objection to being switched off.
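The construction described here - pinning the utility of the "off" state to the expected utility of staying on - can be sketched numerically. The forecast probabilities, utilities and the resistance cost below are all invented for illustration; the point is only that, under this construction, resisting shutdown can never pay.

```python
# Sketch of the indifference construction: U(off) is dynamically set equal
# to the agent's expected utility of remaining on, so spending any resources
# resisting shutdown (cost c > 0) is never worthwhile. Numbers are made up.

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

on_outcomes = [(0.7, 10.0), (0.3, 2.0)]  # agent's forecast if it stays on
u_on = expected_utility(on_outcomes)     # 7.6
u_off = u_on                             # designers pin U(off) to E[U | on]

resist_cost = 0.5                        # resources burned fighting shutdown
u_allow_shutdown = u_off                 # 7.6
u_resist = u_on - resist_cost            # 7.1

assert u_allow_shutdown >= u_resist      # no incentive to resist being stopped
```

Because the equality is maintained dynamically, the comparison comes out the same whatever the agent forecasts for its "on" future.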

Then, Steve describes a possible way of resolving the perceived problem:

[Footage of Steve Omohundro]

Steve then goes on to poke holes in this argument, saying that such agents would not conclude that they are living in a low-utility real world - and would instead prefer the delusion of a high-utility simulation - drawing an analogy with Cypher wanting to stay in the Matrix.

Hypothetically, if we grant that conclusion, then that does, in fact, allow a resolution of the original "switching off" problem - simply make the utility associated with being switched off higher than the utility of concluding that the world is some kind of illusion or simulation.

To me, it seems reasonable to expect that such agents will, in fact, be built in a manner that makes them value real world utility much more highly than anything they can obtain via a simulation.

As you can see, I find Steve's analysis of this whole issue unconvincing.

So, without further ado, here is my own analysis:

To give my conclusion up front, I think that engineering a superintelligent machine that can switch itself off at a specified time is a reasonably tractable problem.

Engineering a machine that doesn't mind being switched off by humans - and engineering a machine that switches itself off when it has completed a specified task - are two closely-related problems with very similar solutions.

However, there are some issues in this area. Here is my assessment of them:


The first problem associated with switching such an agent off is specifying exactly what needs to be switched off to count as the agent being in an "off" state. This is the problem of the agent's identity.

Humans have an intuitive sense of their own identity, and the concept usually delineates a fleshy sack bounded by skin. However, phenotypes extend beyond that - as Richard Dawkins pointed out in his book, The Extended Phenotype.

For a machine intelligence, the problem is a thorny one. Machines may construct other machines and set these to work. They may sub-contract their activities to other agents. Telling a machine to turn itself off - and then being faced with an army of its minions and hired help still keen to perform the machine's original task - is an example of how this problem might manifest itself.

The agent may be actively motivated to perform subcontracting and to construct an army of minions. Such activities may magnify its powers.

Also, before such an agent is switched off it may well naturally want to avoid being switched off - and so might exhibit tendencies to distribute its sensors and actuators widely by subcontracting tasks out or similar - as a defense against its centre of operations being attacked.

Next, there is the meaning of the term "off". Assuming that we can specify the spatial and temporal extent of the agent, there remains the problem of what constitutes an "off" state. One intuitive approach is to assert that the machine must not take any positive actions after a specified date. Apart from the issue of what constitutes a "positive action", there is a problem with this - namely, it is likely to lead to the permanent destruction of the machine, as the machine ensures that it will stay permanently switched off. If the machine is destroyed, there may be side effects.

To give an example, side effects might arise if the machine uses a nuclear bomb to effect its own destruction. Also, a destroyed machine cannot be recycled and reused.

Finally, there is a dilemma - concerning whether such agents should look ahead beyond their own switch-off date. Preferences concerning the state of the world after they are switched off may motivate an agent to micromanage that subsequent state - e.g. by constructing minions or by subcontracting. In other words, it motivates the machine to continue its operations after it is supposed to be switched off.

However, the alternative has disadvantages as well. If there is a future time after which the agent's preferences are not considered, that simply provides a point in time beyond which problems can be concealed. For example, if an agent does not value anything that happens after the year 2100, it will not be properly concerned about protecting that future environment from the effects of its waste products.

This can be illustrated by an analogy with the antagonistic pleiotropy theory of aging. In that theory, there is a selection pressure to delay the date of expression of deleterious genes - which ultimately results in organisms exhibiting senescence. Similarly, if a superintelligent agent can obtain utility by putting its environmental problems beyond a future barrier which it cannot see beyond, that is probably what it will do.

Of course, this dilemma only applies to the case when the machine has some idea of when it will be turned off. If the switch-off date is down to the whim of humans, it may not have the option of not considering the future beyond a specified point in time.


How can this list of problems be addressed?

One thing that might help is to put the agent into a quiescent state before being switched off. In the quiescent state, utility depends on not taking any of its previous utility-producing actions. This helps to motivate the machine to ensure subcontractors and minions can be told to cease and desist. If the agent is doing nothing when it is switched off, hopefully, it will continue to do nothing.
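The quiescence scheme described above amounts to a time-dependent utility function: before the shutdown time, ordinary activity is rewarded; afterwards, inaction is rewarded and the old utility-producing actions are penalised. A toy version, with invented action names, times and payoffs:

```python
# Sketch of the quiescent-state scheme: after SHUTDOWN_TIME the utility
# function rewards doing nothing and penalises the agent's former
# utility-producing actions. All names and numbers are illustrative.

SHUTDOWN_TIME = 100  # assumed shutdown time, in the agent's time units

def utility(time, action):
    if time < SHUTDOWN_TIME:
        # Normal operation: reward the agent's ordinary activity.
        return 1.0 if action == "mine_gold" else 0.0
    # Quiescent phase: reward inaction; penalise former activities.
    return 1.0 if action == "do_nothing" else -10.0

def best_action(time):
    return max(["mine_gold", "do_nothing"], key=lambda a: utility(time, a))

print(best_action(50))   # prints "mine_gold"
print(best_action(150))  # prints "do_nothing"
```

Under this utility function, a maximiser that is idle at switch-off time has no incentive to resume activity - which is the point of the scheme.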

Problems with the agent's sense of identity can be partly addressed by engineering an explicit notion of identity into the agent. If it makes minions, it should count them as somatic tissue, and ensure they are switched off as well. Subcontractors should not be "switched off" - but should be tracked and told to desist - and so on.
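The bookkeeping this implies can be sketched as a small class: minions built by the agent are recorded as part of its own identity and are shut down recursively with it, while subcontractors are merely tracked and sent a desist notice. The class and method names are invented for illustration.

```python
# Sketch of identity bookkeeping: minions count as "somatic tissue" and are
# switched off with the agent; subcontractors are tracked and told to stop.
# All names here are illustrative.

class Agent:
    def __init__(self, name):
        self.name = name
        self.running = True
        self.minions = []         # machines it built: part of its identity
        self.subcontractors = []  # independent agents it hired

    def build_minion(self, name):
        minion = Agent(name)
        self.minions.append(minion)
        return minion

    def shut_down(self, notices):
        for minion in self.minions:      # somatic tissue: switched off too,
            minion.shut_down(notices)    # recursively
        for sub in self.subcontractors:  # not "off"-able: told to desist
            notices.append(f"cease and desist: {sub}")
        self.running = False

agent = Agent("root")
agent.build_minion("digger").build_minion("sub-digger")
agent.subcontractors.append("haulage firm")
notices = []
agent.shut_down(notices)
print(notices)  # prints ["cease and desist: haulage firm"]
```

Note that the shutdown recurses through minions-of-minions: a minion's own constructions are also counted as part of the original agent's extended identity.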

Problems with the definition of the off state can be partly addressed by laboriously specifying what constitutes an off state.

Lastly, make sure the machine is not left running for too long without being turned off and then inspected and reviewed.

Such steps would not remove all risk of a runaway scenario materialising - but they should be pretty effective.

I think this analysis shows that many concerns over runaway intelligent machines will prove to be relatively easily avoidable - assuming that we choose to prioritise safety.

One problem will be that there will be a strong motivation not to regularly turn off the superintelligent agents - because of how useful they are.



  1. Steve Omohundro - AI and the Future of Human Morality

  2. Steve Omohundro - AI and the Future of Human Morality - transcript
