Artificial General Intelligence — Safety Problems — Part III

Guha Ayan
Jun 23, 2022

Today’s artificial intelligence systems are doing extraordinary things everywhere you look. They are transforming the way technologies are built, not only for today but also for the future. With such a profound impact on our daily lives, it is only fair to ask: how do these systems work (Part I), how safe are they and what are the typical challenges (Part II), and what are some of the proposed solutions (this article)?

Disclaimer: I am not an expert in this area. I find the concepts extremely interesting, and I wanted to collect my notes and share my understanding. I will be grateful for any feedback and/or suggestions.

Repeating themes

We want AGI agents to explore strategies and policies such that they can perform better than humans in general. But we also want the consequences of their actions to not be harmful. Resolving this adversarial relationship between reward and risk, like many other things in life, is the key to AI safety.

To arrive at a reasonable answer to this problem, a number of active research efforts are under way to find clever, better, faster and feasible solutions around some repeating themes:

  • Better understanding of an agent’s learning path, and ways to improve it
  • Better ways to balance exploration against exploitation
  • Ways to scale and automate oversight and safety controls
  • Decomposing systems into smaller sub-systems which can then be designed to be stable and safe

Interpretability

Interpretability, or explainability, is now deemed critical for any AI system deployed in the real world. It helps with a few critical AI safety issues:

  • Understanding: Sometimes it is more important to understand why a machine learning system makes a decision than what the decision actually is. This understanding of the inner workings may point to areas where the agent is not performing well, so that action can be taken to close those gaps.
  • Debugging: In the grand scheme of things, agents can be seen as complex software programs. Debugging and testing software is a generally well-understood field, and bringing in its best practices can help improve overall performance.
  • Accountability: Interpretable systems are easier to reason about when it comes to accountability for any decision taken by the agent. In the case of catastrophic outcomes, though, accountability after the fact may not count for much.

In general, an interpretable system helps to at least quantify the associated risk profile. This information can then be used to measure risk in the real world, and can be a key input when deciding whether an agent is “ready” for deployment.

Interpretability does not make an AGI safe by itself, but it is a key ingredient of a safe system

Adversarial Reward Function — Reward Modelling

If we go back to the basic reinforcement learning model, we find that the agent and the reward function are in an adversarial relationship: the agent tries to find policies that increase the reward, while the reward function tries to penalise any unintended policies.

One implementation of such a strategy is to define a reward model. In this case a neural network tries to predict how a human would reward the agent’s actions. The reward model interacts with a human observer for a limited time, in a periodic manner. This way the network learns human preferences in the context of the reward function, and it consults the human only when it is really uncertain, which typically happens when the agent tries some new policy. This is a novel way to incorporate human preferences into the rewarding mechanism.
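Below is a minimal sketch of this idea in Python. The ensemble-of-linear-predictors design, the `ask_human` callback and the uncertainty threshold are assumptions made purely for illustration, not the architecture used in actual reward modelling research:

```python
import numpy as np

class RewardModel:
    """Minimal sketch of a learned reward model: an ensemble of linear
    predictors scores (state, action) features, and disagreement between
    the ensemble members is the uncertainty signal that decides when to
    consult the human."""

    def __init__(self, n_features, n_heads=5, lr=0.01, ask_threshold=0.5):
        self.weights = [np.random.randn(n_features) * 0.1 for _ in range(n_heads)]
        self.lr = lr
        self.ask_threshold = ask_threshold

    def predict(self, features):
        preds = np.array([w @ features for w in self.weights])
        return preds.mean(), preds.std()          # reward estimate and uncertainty

    def reward(self, features, ask_human):
        estimate, uncertainty = self.predict(features)
        if uncertainty > self.ask_threshold:
            # The ensemble disagrees, e.g. because the agent tried a new
            # policy: ask the human and learn from their answer.
            label = ask_human(features)
            for w in self.weights:
                w += self.lr * (label - w @ features) * features
            return label
        return estimate
```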

Safe Exploration

As we discussed in Part II, exploring the configuration space is essential if an agent is to learn interesting and efficient policies. On the other hand, exploration can, by definition, be dangerous. So how can an agent explore safely? There are a few suggestions:

  • Simulation: This is probably the best-known and most sensible way of exploring, though simulating the real world correctly is not a trivial task in itself.
  • Demonstration: One option to limit the need for exploration is to use human experts as a point of reference and learn from them. This can help the agent learn a generally safe baseline policy. The obvious downside is that the agent cannot become better than a human expert, but in some high-risk scenarios that may be an acceptable outcome.
  • Oversight: Providing oversight while the agent explores can be a safe way to train. It is similar to the second set of brakes in a car used to teach driving; see the sketch after this list.
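Here is a rough sketch of the oversight option in Python. The environment interface (reset/sample_action/step), the `is_safe` check and the `safe_fallback` action are all hypothetical placeholders, not a real API:

```python
import random

def safe_explore(env, policy, is_safe, safe_fallback, episodes=10, epsilon=0.1):
    """Sketch of exploration with an oversight hook: every candidate action
    is screened by `is_safe` (the "second set of brakes") and replaced by a
    known-safe baseline action if it is rejected."""
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                action = env.sample_action()      # exploratory action
            else:
                action = policy(state)            # current best policy
            if not is_safe(state, action):
                action = safe_fallback(state)     # overseer overrides the action
            state, reward, done = env.step(action)
```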

Scalable Supervision

Scaling human oversight is expensive and slow. If a human is appointed to supervise an agent in every step it takes, it is probably cheaper to hire the human to do the work!

One possible way is to ask the human to check the agent once in a while, at random. This way, some episodes of the agent’s work will be “labelled” and some will be “unlabelled”. If the agent can learn quickly enough from the sparse set of labelled episodes, then this mechanism of semi-supervised reinforcement learning can be a safe alternative.
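A minimal sketch of the sparse-labelling idea, assuming each episode carries a numeric feature vector and that a `human_label` callback returns the human’s score; the least-squares regressor is a stand-in for whatever model would actually be used:

```python
import random
import numpy as np

def semi_supervised_returns(episodes, human_label, label_prob=0.1):
    """Sketch of sparse human labelling: the human scores only a random
    fraction of episodes, and a simple regressor fitted on those labelled
    episodes estimates returns for the rest."""
    labelled, unlabelled = [], []
    for ep in episodes:
        if random.random() < label_prob:
            labelled.append((ep["features"], human_label(ep)))   # human-scored episode
        else:
            unlabelled.append(ep)

    # Fit a reward predictor on the sparse labelled set.
    X = np.array([f for f, _ in labelled])
    y = np.array([r for _, r in labelled])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Predicted returns stand in for missing human labels on the rest.
    return [(ep, float(ep["features"] @ w)) for ep in unlabelled]
```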

Another possible approach borrows ideas from traditional system integration testing. In this hierarchical reinforcement learning model, sub-agents are grouped under higher-level agents. The higher-level agents can train on a sparse set of reward signals while creating denser reward signals for the sub-agents.

Hierarchical RL seems a particularly promising approach to oversight, especially given the potential promise of combining ideas from hierarchical RL with neural network function approximators
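As a rough illustration of the hierarchical idea, here is a sketch in which a manager picks subgoals using only the sparse external reward, while the worker gets a dense reward for progress toward the subgoal. The environment and policy callables, and the assumption that states and subgoals are numeric vectors, are placeholders for the example:

```python
import numpy as np

def hierarchical_rollout(env, manager_policy, worker_policy, steps=100):
    """Sketch of hierarchical reward shaping: sparse reward for the manager,
    dense subgoal-progress reward for the worker."""
    state = env.reset()
    for _ in range(steps):
        subgoal = manager_policy(state)             # high-level decision
        action = worker_policy(state, subgoal)      # low-level decision
        next_state, sparse_reward, done = env.step(action)
        # Dense signal for the sub-agent: how much closer it got to the subgoal.
        dense_reward = (np.linalg.norm(state - subgoal)
                        - np.linalg.norm(next_state - subgoal))
        # ... update the worker with dense_reward and the manager with sparse_reward ...
        state = next_state
        if done:
            break
```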

Iterative Distillation and Amplification

Can you recall how you were taught algebra (or any other kind of problem)? First we are assigned some simple problems and the teacher shows us how to do them. Progressively we get better at the simple problems, and at the same time the teacher introduces more complex problems and patterns. This is a simple analogy for iterative distillation and amplification (IDA).

In this model an agent distils knowledge from a computationally expensive model. The distilled knowledge then amplifies the agent’s capability to take better decisions at lower cost. The agent can continue the process iteratively, trading off the cost of running the expensive model against how often it is run. As with adversarial modelling, the agent can focus on the areas where it is doing badly, which speeds up the learning process.
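A minimal sketch of the loop, under the assumption that `amplify(agent)` returns a slower but more capable model (for example, the agent assisted by a human or by many copies of itself) and `distill(agent, tasks, targets)` returns a fast agent trained to imitate it; all of these callables are placeholders:

```python
def ida_loop(agent, amplify, distill, sample_tasks, rounds=5, batch=64):
    """Sketch of the iterative distillation and amplification loop."""
    for _ in range(rounds):
        strong = amplify(agent)                    # expensive, more capable model
        tasks = sample_tasks(batch)
        targets = [strong(t) for t in tasks]       # run it sparingly, it is costly
        agent = distill(agent, tasks, targets)     # cheap agent imitates the targets
    return agent
```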

Response to Out of Distribution Scenarios

One of the core assumptions of machine learning systems is that the training and deployment environments will be close, in terms of the things that matter to the model. In practice, and especially in a general intelligence setting, seemingly similar environments can have very different properties. For example, a self-driving system trained solely on highways can exhibit unsafe behaviours if deployed in busy city traffic, despite both environments being considered “road”.

There are a few possible ways to approach this problem, one of which is to extend the training process using ensemble techniques to increase robustness.

  • Single model, multiple variations: Define and train a set of models and calibrate the outcome over a weighted population of samples.
  • Single model, multiple distributions: A model can be trained over various distributions. If the model performs well across those distributions, there is a good chance it will perform well enough on a novel distribution.

Another approach is to expect outliers and design safety overrides into the system for situations where the model is really uncertain. This can be done by falling back to human oversight, systematic overrides, or a combination of both.
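A sketch that combines the two ideas, using ensemble disagreement as a crude uncertainty signal and falling back to a human when it is too high; the models and the `ask_human` callback are hypothetical placeholders:

```python
import numpy as np

def predict_with_fallback(models, features, ask_human, disagreement_threshold=0.5):
    """Sketch of an out-of-distribution guard: an ensemble of models votes,
    and if the predictions disagree too much the decision is handed back
    to a human or another safe override."""
    preds = np.array([m(features) for m in models])
    if preds.std() > disagreement_threshold:
        # Likely a novel input the models were not trained on: fall back.
        return ask_human(features)
    return float(preds.mean())
```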

Quantilisation

If we look at the basic reinforcement learning setup, one of the biggest safety issues is the agent’s obsession with maximising the reward function. There is absolutely no incentive for the agent not to do so, even for a tiny fractional increase.

But we as human agents do not do that. One reason is that we have a different value system, one that is not comparable to the value system the agent has. Still, we want agents to honour at least part of our value system, such as “do not kill” or “do not steal”. In other words, we want agents to behave more like humans when taking decisions.

At the same time, we do not want to give up the immense potential an agent can uncover by exploring alternative strategies that a human would never consider, or could not pursue due to physical constraints.

Quantilisers are devised to combine the best of both worlds. They can be thought of as a set of controls that define how an agent will behave. At one end of the spectrum is the most risk-averse, human-like model; at the other end, the agent is free to maximise the reward function in any way it can. Given the risk profile of a task, such a mechanism can provide a safety control.
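A minimal sketch of a q-quantiliser, assuming a `base_sample()` that draws actions from a human-like policy and a learned `utility` estimate; both are placeholders for the example:

```python
import random

def quantilise(base_sample, utility, q=0.1, n=1000):
    """Sketch of a q-quantiliser: draw candidate actions from a human-like
    base distribution, keep only the top q fraction by estimated utility,
    and pick one of those at random instead of the single maximiser.
    q near 1 behaves like imitating the base policy; q near 0 approaches
    pure reward maximisation."""
    samples = [base_sample() for _ in range(n)]        # human-plausible actions
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n))]                # best q fraction
    return random.choice(top)                          # avoid extreme optimisation
```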

Conclusion

How do we typically build large, complex structures? We design them carefully and meticulously test each component. So why can’t we do the same for artificial general intelligence? Unfortunately, it is a trade-off between our ambition and our risk appetite. We humans have developed a value system over many years of biological and social evolution. It is unrealistic to expect that the same will somehow be inherited by a new breed of systems and agents.

If we assume AGI will become a reality and that AGI systems can be as powerful as we want them to be, then being safe must be a non-negotiable trait. Because it is such a distant possibility and there are so many roadblocks to cross before we have a realistic chance, AI safety is often ignored as a priority. I hope this series helps us all understand why AI safety is important and how we should mature it over time, so we will be ready when AGIs are.

References

https://www.youtube.com/c/RobertMilesAI

https://ai-alignment.com/semi-supervised-reinforcement-learning-cf7d5375197f
