Artificial General Intelligence — Safety Problems — Part II

10 min readJun 19, 2022

Today’s artificial intelligence systems are doing extraordinary things in every place you look. They are transforming the way technologies are built not only for today, but also for future. With such profound impact in our daily life, it is only fair to ask: How does these systems work (Part I), How safe are these systems are and what are the typical challenges (This article) and what are some of proposed solutions (Part III).

disclaimer: I am not an expert in this area. I find the concepts extremely interesting. I wanted to collect my notes and share my understanding. I will be grateful for any feedback and/or suggestion.

Misalignment of Agent and Human Goals

As we saw in previous post, reinforcement learning agents try to fulfil their goals by maximising the reward function. These goals are ideally defined by humans and perfectly understood by the agent. But what happens if that is not the case?

Expressing goals in terms of physical world is hard and it is even harder to making a computer to understand. In practice, we have to use certain proxies to emulate what we actually want. Curing cancer is a fairly lofty goal, and one way to evaluate it is to measure rate of reduction in number of cancer patients. It seems to be a reasonable approximation, right? But how about killing anyone who has cancer? That also reduce that exact metric. But we all know that’s not what we want. How can we explain it to AGI?

“Specification gaming examples in AI” (in ref) is a list of many game playing and robotics research experiments where AGI agent manifests such behaviours.

But what is the true source of the misalignment? One way to think about our goals are often expressed in a way that it is assumed to be within the framework of human value system. This value system is encoded in our DNA in the form of evolutionary mutations, as well as it is ingrained in our social constructs.

An AGI agent does not have these benefits. For an agent, the only thing really matters in the world is achieving goals by maximising reward functions. And unless it is encoded in the reward function, it does not need to pay attention to anything else because there is no incentive to do so.

The agent will trade off a large amount of things it does not care about to increase even a small amount of gains in things it cares about.
One important point to note here: It is not that the agent is not aware of human values. As a super-intelligent agent it can read books and attend classes to get a clear understanding of the complete environment. The problem is it just does not care about it unless being told explicitly.

Dangerous By Default

As such, we can definitely conclude designing AGI systems in a safe manner is pretty difficult. In other words, the AGI systems are more likely to exhibit a dangerous behaviour than that of a safe one. And it may or may not exhibit the problems until they start interacting in real world.

There are quite few many ways the misalignment in goals can manifest depending on various factors. Let us discuss few of them in detail below.

Convergent Instrumental Goals

In last post we discussed what are instrumental goals: They are goals an agent would like to learn in order to better fulfil its terminal goals.

What is very interesting is many of such instrumental goals are convergent as they are common across a large number of terminal goals. In other words, learning some of these terminal goals is useful for an agent which has a diverse set of terminal goals.

Preservation

We discussed this briefly before. Here the core problem is AGI agent can develop or learn ways to preserve both its goals and destruction.

Let us assume an agent is in training phase and it started exhibiting indications that its goals are not aligned with the intended goal. Now the safest course of action is to stop training, modify the goal and retrain. However, how will the agent perceive these actions?

First thing is to note that we are talking about an intelligent agent, which already knows enough about the environment to understand the intention of the proposed course of action.

Secondly it can also able to deduce that modifying the current goal will definitely prevent it to achieve its current goal. If we assume the only thing it cares is to achieve its current goal, there is a high incentive to prevent this modification. This is called Goal Preservation.

Additionally, just the idea of being stopped or being destroyed is not very complex to deduce for an intelligent agent. And like any other intelligent artefact it is natural for the agent to take precautions against such actions. Thus learning Self Preservation as an instrumental goal should be high in agent’s list of priority.

Resource acquisition

This should be very easy to understand as as an intelligent agents humans are prone to this kind of behaviour. Achieving any goal requires resources. In general resources can be expressed in few translatable forms (such as money) so that a large number of terminal goals can be achieved by acquiring few forms of resources in large quantity.

Unfortunately as we have seen over and over in human history, driving this greedy behaviour further down the road is almost always a recipe for disaster. Now think of many intelligent agents within an environment trying to acquire same form of resource as most as they can — without any concept of morality and values!!

Self Improvement

Self improvement is a trait we all like to have. In some sense it is key to continue learning new policies and strategies. The notion of improvement does not conflict with preservation, in fact it helps other instrumental goals.

At first instance it seems a benign goal, so why it is a safety issue? One of the reason is that AGI agents do not have a sense of “Good Enough”. Without any specific constraint in place, intelligent agents has no reason to stop improving itself, even when there is nothing more to be gained.

Reward Hacking

Reward hacking is a broad paradigm of AI safety issues where the agent can have some way to interact with the reward function itself.

The agent can then either take advantage to “game” the reward system or it can exploit some weakness of the reward system itself. There are few possible scenarios below:

Partially Observed Goals

If there is no direct way to measure the desired goal then the agent can be presented with an imperfect reward function. This can be then maximised with some strategies without actually achieving the goal.

We have looked into the cure cancer scenario. Another classic example is the cleaning robot thought experiment. It is hard to define cleanliness. Cleanliness can be thought as just the opposite of visible mess. In other words, the robot can be rewarded more if it sees less mess around it. While it seems a fairly reasonable approach, an intelligent robot can just close its eyes (or put a bucket on top of its head) and pretend cleanliness.

Goodhart’s Law

When a metric becomes a target then it ceases to be a good metric

A good reward function should have a strong correlation with the goal being accomplished. However if optimising reward function itself becomes a goal then the correlation loses its meaning.

Let us consider a ticketing system is monitored by an AGI agent and the performance depends on number of tickets it can close in a month. An intelligent agent may close a number of tickets without resolution on the last day of the month and reopen them soon after. Or it can just create new tickets under false pretence and just close them to increase counts.

Defect Exploitation

Artificial general intelligence systems are complex by nature. It depends on smooth interactions between various complex systems such as the environment and reward functions, and often has to deal with complex abstract concepts. This increases the overall risk profile of defects being present in the system, especially at least part of the system can be non-deterministic and hard to test.

Defects can occur within the specification of the goal or within the reward function itself. These vulnerabilities can be discovered by agents by systematic or accidental attacks and then can be exploited.

Agents are also part of the environment itself and able to access various parts of it. There is high motivation for an intelligent agent to attack any vulnerability which can cause high reward. This is called “wire heading”. As an example: to win a game of Wordle, an agent may try to hack into the server so that it can decide which word to provide (and then just play it to win).

Unsafe Exploration

As we saw in the reinforcement learning definition, an agent can take actions to interact with the environment. At that time we have not made any assumption about what actions agents can take. In principle, agents can take any action randomly or following some patterns as it sees fit. In other words, the agent is free to “explore” the action space based on its policies.

The question is: how to explore the action space safely. It is bound to happen that some actions can lead to some bad outcome. It is also possible some actions can lead to immediate catastrophes. In fact, the agent faces a trade off between exploring new actions (and lead to catastrophes) and exploiting current known good actions (and never able to discover better actions possible).

An agent does not have any heuristic about the action itself, ie it can not know the result of an action without actually choosing it. Because if it knows what the action will result into then there is nothing to learn by taking the action.

If the agent decides to explore only a set of known and safe action space then it will miss out a large unexplored space. This will be a blocker for its goal of self improvement. Of course it can avoid known unsafe spaces, but that will not be enough to guarantee safety.

Distributional Shift of Environment

Distributional shift of data is a curse in machine learning in general. Any ML 101 course will teach you the importance of the assumptions made around the distribution. In fact machine learning can be simply defined as various ways to approximate an unknown distribution of data. It is the hope that any unknown data element for which predictions are to be made will also be drawn from the same distribution.

The same assumptions are true for reinforcement learning settings. If the training is done on a set of environments exhibiting a certain pattern then it is unfair to expect the model to perform in an environment which is very different. However, this is a very practical possibility in real world. Real world is so diverse and complex that it is almost impossible to create even a complete set of representative example environments.

Example: An agent trained in a maze to get a reward. During testing, the reward is always kept at a fixed location (lets say top right corner of the maze) and agent could learn to get the reward. Now the agent is deployed to an environment where the reward is placed at random location. The agent completely ignores the reward, and goes to the top right corner. The agent is still intelligent, because it can solve the maze. But it has learned a wrong objective. It now knows “how to go to top right corner” and NOT “how to get a reward.

Similarly, an agent trained on an environment where the reward is a yellow gem. However in testing, the reward was a red gem, and there was a distractor: a yellow star. The agent learned to go to “Yellow” things, and thus in testing it preferred Yellow star over the red gem.

Ethical AI vs AI Safety

Building ethical and responsible AI systems is another challenge — but with slightly different objective. Any reasonably large human data set is bound to be tainted by our history of preferential prejudices and power equations. An AI system trained on such data is bound to pick up such traits which are unethical in today’s moral standards. It is almost like children learning both good and bad things from the society. Creating ethical AI or at least safe guarding AI systems against unethical behaviours are a pressing challenge right now.

On the other hand, AI safety deals into more foundational safety concerns. In fact, an ethical AI systems can turn out to be unsafe and vice versa. There can be overlaps too: such as intelligent AGI systems can “deceive” to behave as safe and only exposes its unsafe nature once deployed. The deception strategy is both unethical and unsafe but can be a lucrative strategy to preserve itself against adversarial observers.

There are few solutions for AI safety can also be important for ethical AI such as interpretability.

Conclusion

So, is there some solutions around these problems? A lot of research is going on to address these problems. Tools we have now are very strong understanding of deep networks which can approximate any behaviours. We also have unprecedented computing power along with tools and technology around scale up and scale out. We have our accumulated knowledge about how we learn and how we improve ourselves. There are some interesting ways researchers are trying to address AI safety problems. We will take a look at some of them in Part III.

References

Specification gaming examples in AI - master list - Google Drive

Edit description

docs.google.com

Concrete Problems in AI Safety

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential…

arxiv.org

https://cset.georgetown.edu/wp-content/uploads/CSET-Key-Concepts-in-AI-Safety-An-Overview.pdf