Artificial intelligence is part of our day-to-day lives, even if we do not always realize it. When you search for keywords on Google, AI is there to give you the best results. When you ask your smart speaker "what is the weather today?", AI is there for you. When your e-mail client identifies spam emails and moves them to a separate folder, AI is there for you. All of those AI systems are focused on doing only one thing; a spam AI system (in technical terms, a spam classifier) cannot be used for any other purpose. With today's technology, we have only these domain-specific AI systems. What about an AI system that can generalize?

An artificial intelligence system that can understand or learn any task is called artificial general intelligence (AGI). Here, we are not talking about a system that does one thing; we are talking about a system that can learn like a human. So the question arises: how can we develop AGI? Scientists from DeepMind published a paper, Reward is enough [1], where they hypothesize that "intelligence, and its associated abilities, can be understood as subserving the maximisation of reward" [1].

To understand the concept of reward, we need to understand reinforcement learning. That is why, before diving into the Reward is enough paper, we will explain reinforcement learning.

Reinforcement Learning

Reinforcement learning is a machine learning method where an agent tries to learn by interacting with its environment. An agent can be a robot or a software bot. The best way to understand reinforcement learning is with an example. Let's say you want to develop a chess bot. Every time the chess bot makes a wrong move, we will give it a negative reward (let's say -1); every time it makes a correct move, we will give it a positive reward (+1); and if it wins the match, we will reward it with +10.
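To make this reward scheme concrete, here is a minimal Python sketch. The Board stub and its methods are invented purely for illustration; a real chess bot would rely on an actual chess engine or library for rule checking.

```python
class Board:
    """Stub standing in for a real chess board/engine; invented for illustration."""

    def is_legal(self, move: str) -> bool:
        return move != "illegal"          # placeholder rule check

    def is_checkmate_after(self, move: str) -> bool:
        return move == "Qh7#"             # placeholder win check


def reward_for(board: Board, move: str) -> int:
    """Map a move to the scalar reward described above."""
    if not board.is_legal(move):
        return -1                         # wrong action
    if board.is_checkmate_after(move):
        return +10                        # winning the match
    return +1                             # an ordinary correct move


board = Board()
print(reward_for(board, "e2e4"))     # +1
print(reward_for(board, "illegal"))  # -1
print(reward_for(board, "Qh7#"))     # +10
```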

At the beginning, the chess bot will make random moves that do not make sense to anybody. After millions of iterations, the chess bot's moves will start to make sense: it will learn the rules of the game, develop new strategies, and start winning matches. That is what happened with AlphaGo [2]. AlphaGo is a Go-playing machine learning system developed by DeepMind using reinforcement learning methods. AlphaGo won the match against 18-time world champion Lee Sedol in 2016 [3].

How reinforcement learning works can be described with a single simple image [Image #1]. The reinforcement learning loop can be explained with these steps (a minimal Python sketch follows the diagram):

  1. The agent takes an action in the environment.
  2. It receives a reward based on that action.
  3. The agent is now in a new state, different from the previous one.
  4. The agent learns from this experience.
  5. Go back to step 1.
A diagram of how reinforcement learning works
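The same loop can be written as a short Python sketch. Everything here is a toy stand-in (a random agent in a made-up counter environment), assumed only for illustration; what matters is the shape of the loop, which follows the five steps above.

```python
import random


class ToyEnvironment:
    """A made-up environment: the state is a counter we try to drive to 10."""

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        self.state += action                      # the action changes the state
        reward = 1 if self.state == 10 else -1    # step 2: reward for that action
        done = self.state >= 10                   # episode ends at (or past) 10
        return self.state, reward, done           # step 3: the new state


class RandomAgent:
    """A placeholder agent: acts randomly and does not really learn yet."""

    def act(self, state: int) -> int:
        return random.choice([1, 2, 3])           # step 1: pick an action

    def learn(self, state, action, reward, next_state) -> None:
        pass                                      # step 4: update from experience


env, agent = ToyEnvironment(), RandomAgent()
state = env.reset()
done = False
while not done:                                   # step 5: go back to step 1
    action = agent.act(state)
    next_state, reward, done = env.step(action)
    agent.learn(state, action, reward, next_state)
    state = next_state
```

In a real project, the environment would be a game, a simulation, or the real world, and learn() would update the agent's policy instead of doing nothing.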

Since we now have an idea of how reinforcement learning works, we can continue with our main topic: Reward is enough [1].

Reward is Enough

The paper's abstract opens with these sentences:

In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation.

The main point of the article is this: artificial general intelligence can be achieved if an agent tries to maximize a reward in a complex environment, because the complexity of the environment will force the agent to learn complex abilities such as social intelligence, language, and so on.


The paper uses a squirrel and a kitchen robot as examples of how complex abilities can be learned by maximizing a single reward [Image #2].

If the goal of a squirrel is to minimize hunger (that is, to maximize negative hunger):

the squirrel-brain must presumably have abilities of perception (to identify good nuts), knowledge (to understand nuts), motor control (to collect nuts), planning (to choose where to cache nuts), memory (to recall locations of cached nuts) and social intelligence (to bluff about locations of cached nuts, to ensure they are not stolen). [1]

The squirrel develops all of those skills only to minimize hunger. The complex environment forces it to learn them. If it cannot identify nuts (perception), it may collect the wrong objects, which do not help it minimize hunger. If it cannot develop motor control, it cannot collect and carry nuts. If it cannot develop planning skills, it will not cache nuts, and whenever it is hungry it will have to search for nuts all over again. It has to develop memory so it can remember where it cached its nuts.
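As a toy illustration of "reward = negative hunger", here is a hedged Python sketch. The hunger dynamics, the actions, and the numbers are all invented; the only point is that every reward the agent sees is simply the negative of its hunger.

```python
# Toy sketch of "minimize hunger == maximize negative hunger".
# The actions and hunger dynamics are invented purely for illustration.

def reward(hunger: float) -> float:
    """Minimizing hunger is the same as maximizing negative hunger."""
    return -hunger

hunger = 5.0
for action in ["search", "eat_nut", "eat_nut", "search"]:
    if action == "eat_nut":
        hunger = max(0.0, hunger - 2.0)   # eating a cached nut reduces hunger
    else:
        hunger += 0.5                     # searching costs time and energy
    print(action, "-> reward:", reward(hunger))
```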

The paper continues by explaining reinforcement learning, which we covered in the previous section. After that, it presents the main hypothesis of the paper.

The Reward-is-Enough Hypothesis

Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

In seven subsections, the paper explains why reward is enough for various abilities:

  • Reward is enough for knowledge and learning
  • Reward is enough for perception
  • Reward is enough for social intelligence
  • Reward is enough for language
  • Reward is enough for generalization
  • Reward is enough for imitation
  • Reward is enough for general intelligence

All of those subsections come down to one main point: the ability in question has to be developed in order to maximize the reward, which is what I tried to illustrate with the squirrel example.

AlphaGo Zero Example

This paper does not present any new technical details; it is a philosophical paper rather than a technical one. So we are left with this question: can reward maximization help build skills that were never explicitly specified? For anyone who has done some research on reinforcement learning, the answer is obviously yes.

AlphaGo Zero was asked to maximize wins in the game of Go, and it learned the strategies necessary to win. The AlphaGo Zero paper describes it as follows:

AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. [4]
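The training signal behind this is remarkably sparse: in self-play, the agent essentially receives only the final game outcome (+1 for a win, -1 for a loss) as reward. Here is a hedged Python sketch of that idea; the coin-flip "game" and the tabular update are hypothetical stand-ins, not AlphaGo Zero's actual algorithm, which trains a neural network guided by Monte Carlo tree search [4].

```python
import random

# Hedged sketch of "learn only from the final outcome", the sparse reward
# behind self-play systems like AlphaGo Zero. The pretend game and the
# tabular update below are invented stand-ins for illustration.

def play_self_play_game():
    """Pretend self-play game: (state, action) pairs plus a final +1/-1 outcome."""
    trajectory = [(f"state{t}", random.choice(["a", "b"])) for t in range(5)]
    outcome = random.choice([+1, -1])     # the reward arrives only at the end
    return trajectory, outcome

value = {}                                # crude value estimate per (state, action)
for _ in range(1000):
    trajectory, z = play_self_play_game()
    for state, action in trajectory:
        key = (state, action)
        old = value.get(key, 0.0)
        value[key] = old + 0.1 * (z - old)   # nudge every move toward the outcome
```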

Technically, we know that a reinforcement learning agent can develop skills to maximize its reward in environments that are much simpler than the complex real world. But here is the catch: we, as humans, have never yet seen a reinforcement learning agent interact with the complex real world and develop artificial general intelligence. That is something we will probably find out over time.

Conclusion

In this article, we talked about the Reward is enough paper written by DeepMind scientists. The paper is not a technical one; it argues that if a reinforcement learning agent interacts with a complex environment, it will learn the necessary skills that were not introduced at the beginning, in order to maximize its reward.

References

[1] https://www.sciencedirect.com/science/article/pii/S0004370221000862

[2] https://deepmind.com/research/case-studies/alphago-the-story-so-far

[3] https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

[4] https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf