The Wirehead Problem


The Wirehead Problem

Introduction

Hi, I'm Tim Tyler - and today I will be discussing the wirehead problem in the context of machine intelligence.

The term "Wireheads" refers to agents who have short-circuited their reward systems - in order to gain access to pleasurable sensations without performing useful tasks.

Drug addicts are one example of agents which have turned themselves into wireheads. Instead of eating, having sex, and other rewarding activities, these agents stimulate their pleasure centres directly - typically using chemical compounds that mimic the neurotransmitters which are involved in reinforcement learning, or pain-killing.

Another example illustrates the origin of the term "wirehead" - here rodents have their brains connected up to remote electrodes - and then have their pleasure centres repeatedly stimulated when they perform some specified action.

Here, a rat pleasures itself by pressing a metal bar. Self-stimulation is indicated by the flashing light. The rat immediately shows considerable interest.

Here a rat is kept in a Skinner box. The rat's self-stimulation continues for extended periods - displacing its normal interests in food, water and sex.

The problem

One question that arises in the context of constructing machine intellingence is: how can we prevent mechanical wireheads from arising?

The first relevant observation is that wireheads are fairly widespread. In addition to human drug addicts, other systems are vulnerable to similar types of corruption:

Money is intended to motivate people towards actions society regards as productive - but it motivates some people to perform bank robberies, and other people to engage in counterfitting activities - actions that go straight for the reward, and bypass the behaviours which society intended money to produce.

Other things besides money can be forged. Products, reputations, qualifications, citizenship and identity are other common targets for forgers - in each case something desirable is obtained while bypassing the normal means of its production.

On the other hand, the frequency of wirehead behaviour is typically kept relatively low: wireheads are usually in a minority, due to the use of anti-wirehead strategies.

Wireheads can be dangerous

A wirehead doesn't necessarily sit there, doing nothing in a state of sublime ecstacy. Its pleasurable state may still require effort to maintain. Consider heroin addicts - they are short-circuiting their pleasure centres, but they still need cash to fund their habit - and the result can be crime and violence.

Policing and surveillance

Looking at the anti-wirehead strategies used, police surveillance is one of the main approaches. It seems reasonable to expect surveillance to become ubiquitous in the transparent society of the future - which might be bad news for prospective wireheads.

Superintelligence and self-improving systems

Interest in the wirehead problem often centres around the issue of whether superintelligences can be defended against these kinds of problems once they can self-improve.

By assumption, the superintelligences under discussion have total access to their inner workings, and the power to change them as they wish.

If superintelligences tend to use their intelligence to find ways of stimulating their pleasure centres directly, it seems likely that this would compromise their reliability and limit their usefulness.

Exploring the problem

To defend against wireheading, a superintelligence needs two main architectural elements:

  • it must evaluate the desirability of the expected consequences of changes to its goal system with respect to its current goals;
  • its goals should accurately represent the state the agent is supposed to produce;

It is the second condition that leads to the main problems.

Say you build a superintelligence with the goal of cooling down the planet. If you give it access to a range of thermometers and tell it to minimise their temperature readings, the superintelligence may notice that it can best attain its goals by immersing the thermometers in liquid nitrogen.

How can you tell the superintelligence that it's actually the temperature that needs minimising - and not some proxy for it?

This turns out to be a non-trivial problem. You need to give the superintelligence a sophisticated understanding of its goal - including things like what the concept of temperature means and how it is measured.

The wirehead problem can be illustrated even in relatively simple systems.

Say you want the superintelligence to find as many prime numbers as possible. Here the utility function might reference a counter representing the number of prime numbers which it has identified so far. However, a wirehead might notice that it could increment the counter without even attempting to test the candidate numbers for primality.

How can you tell a self-modifying superintelligent agent to actually maximise the number of prime numbers found - and not simply poke values into a counter?

Failure modes

Some solutions to this problem do not appear to be promising:

Attempting to limit self-modification by walling off the agent's utility function seems destined to fail - this is getting in between a superintelligent agent and its utility - which is rarely a good idea.

Similarly, making the agent feel revulsion when it thinks about modifying its goal system is another unlikely solution. That just creates motivation for the agent to hire a third party to perform such modifications.

The proposed resolution

The key to the problem is widely thought to be to make the agent in such a way that it doesn't want to modify its goals - and so has a stable goal structure which it actively defends. This resolution was pioneered and promoted by Eliezer Yudkowsky. Here he is describing the basic idea:

[Eliezer Yudkowsky footage]

Another researcher who has investigated self-improving systems in depth is Steve Omohundro. He seems to have arrived at a position similar to Eliezer. Here is Steve describing the situation:

[Steve Omohundro footage]

Proof needed

Unfortunately, the problem of how to avoid wirehead behaviour is one where we have relatively little experimental evidence which bears directly on the issue.

Also, the wirehead problem is an extremely complicated one - and it is difficult to think clearly about all the issues - and so there is the possibility of making mistakes when reasoning about the topic. There is not yet any rigorous proof that real systems will actually act as though they have stable goals once they become capable of self-modification. Here is Eliezer on that topic:

[Eliezer Yudkowsky footage]

Real systems

The real intelligent systems which we can see exhibit varying levels of resistance to wireheading:

Some people refuse to take pleasure-inducing drugs - while others become addicts all too easily. Most people work honestly for a living - but there are some who rob banks. Most corporations perform a service to their stockholders - but some engage in accounting fraud to elevate their stock price, and then use insider-trading techniques to siphon off their money.

While humans normally appear to have stable goal systems, there is the phenomenon of religious conversion to consider - in which people's goal systems often appear to undergo dramatic changes.

Of course humans are handicapped by being evolved agents. Agents which are deliberately engineered to be resistant to the temptation to wirehead themselves may perform a lot better than humans do.

Implications for machine intelligence

It seems likely that some approaches to constructing machine intelligence will be prone to the wirehead problem. So, although there appears to be a theoretical solution, getting agents to the point where they have sufficent understanding of their own goals to know enough to avoid self-modifications that trash them looks likely to prove to be a non-trivial exercise.

One proposed architecture for synthetic intelligence involves a neural network surrounded by sensor and motor signals, with positive and negative critical feedback.

This type of architecture - which happens to be the only one for which we have practical proof that it works - along with other strategies based directly on reinforcement learning - seems especially likely to be vulnerable to the wirehead problem.

Significance of the problem

The details of this problem and its resolution make a difference for futurists, and the architects of intelligent machines.

I rate the problem as one of the most important and interesting philosophical issues in machine intelligence. If one percent of the people who talk about machine consciousness considered the problem, we might have a better understanding of it today.

Enjoy,

References


Tim Tyler | Contact | http://alife.co.uk/