Snake Played by a Deep Reinforcement Learning Agent
Ever since I watched the Netflix documentary AlphaGo, I have been fascinated by Reinforcement Learning. Reinforcement Learning is comparable to learning in real life: you see something, you do something, and your actions have positive or negative consequences. You learn from the consequences and adjust your behavior accordingly. Reinforcement Learning has many applications, such as autonomous driving, robotics, trading and gaming. In this post, I will show how a computer can learn to play the game Snake using Deep Reinforcement Learning.
The Basics
If you are familiar with Deep Reinforcement Learning, you can skip the following two sections.
Reinforcement Learning
The concept behind Reinforcement Learning (RL) is easy to grasp. An agent learns by interacting with an environment. The agent chooses an action, and receives feedback from the environment in the form of states (or observations) and rewards. This cycle continues forever or until the agent ends in a terminal state. Then a new episode of learning starts. Schematically, it looks like this:
Reinforcement Learning: an agent interacts with the environment by choosing actions and receiving observations (or states) and rewards.
The goal of the agent is to maximize the sum of the rewards during an episode. At the beginning of the learning phase the agent explores a lot: it tries different actions in the same state. It needs this information to find the best possible action for each state. As learning continues, exploration decreases and the agent starts to exploit what it has learned: it chooses the action that maximizes the expected reward, based on its experience.
Deep Reinforcement Learning
Deep Learning uses artificial neural networks to map inputs to outputs. It is powerful because, in theory, a network with only one hidden layer can approximate any continuous function¹. How does it work? The network consists of layers of nodes. The first layer is the input layer. The hidden layers then transform the data with weights and activation functions, and the last layer is the output layer, where the target is predicted. By adjusting the weights, the network learns patterns and improves its predictions.
As the name suggests, Deep Reinforcement Learning is a combination of Deep Learning and Reinforcement Learning. By using the states as input, the values of the actions as output, and the rewards for adjusting the weights in the right direction, the agent learns to predict the best action for a given state.
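To make this a bit more concrete, here is a minimal sketch of such a Q-network in Keras. The function name and the exact sizes (12 state features, 4 actions) are illustrative assumptions, not necessarily the architecture used in my code:

# illustrative Q-network: maps a state vector to one Q-value per action
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_q_network(state_size=12, n_actions=4,
                    layer_sizes=(128, 128, 128), learning_rate=0.00025):
    model = Sequential()
    model.add(Dense(layer_sizes[0], input_dim=state_size, activation='relu'))
    for size in layer_sizes[1:]:
        model.add(Dense(size, activation='relu'))
    # linear output layer: one Q-value for each action (up, right, down, left)
    model.add(Dense(n_actions, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
    return model

The agent then simply picks the action with the highest predicted Q-value for the current state.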
Deep Reinforcement Learning in Action
Let’s apply these techniques to the famous game Snake. I bet you know the game: the goal is to grab as many apples as possible without running into a wall or the snake’s body. I built the game in Python with the turtle library.
Me playing Snake.
Defining Actions, Rewards and States
To prepare the game for an RL agent, let’s formalize the problem. Defining the actions is easy: the agent can choose between going up, right, down or left. The rewards and the state space are a bit harder. There are multiple options, and some will work better than others. For now, let’s try the following: if the snake grabs an apple, give a reward of 10; if the snake dies, give a reward of -100. To help the agent, give a reward of 1 if the snake moves closer to the apple, and a reward of -1 if the snake moves away from the apple.
There are a lot of options for the state: you can give the scaled coordinates of the snake and the apple, or give the directions to the location of the apple. An important thing is to include the location of obstacles (the wall and the body), so the agent learns to avoid dying. Below is a summary of the actions, state and rewards. Later in the article you can see how adjustments to the state affect performance.
Actions, rewards and state
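To make the reward scheme concrete, here is a sketch of how the reward could be computed at every step. The helper name and the Manhattan-distance check are my own choices for illustration; the actual calculation lives in run_game(self):

def compute_reward(snake_head, apple, prev_distance, ate_apple, died):
    # illustrative reward scheme: +10 for an apple, -100 for dying,
    # +1 for moving closer to the apple, -1 for moving away from it
    if died:
        return -100
    if ate_apple:
        return 10
    # Manhattan distance between the snake's head and the apple
    distance = abs(snake_head[0] - apple[0]) + abs(snake_head[1] - apple[1])
    return 1 if distance < prev_distance else -1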
Creating the Environment and the Agent
By adding some methods to the Snake program, it’s possible to turn it into a Reinforcement Learning environment. The added methods are reset(self), step(self, action) and get_state(self). Besides this, it’s necessary to calculate the reward every time the agent takes a step (check out run_game(self)).
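Sketched out, the environment interface could look something like this (the class name is mine and the method bodies are placeholders; the full implementation is on GitHub):

class SnakeEnv:
    # sketch of the environment interface added to the Snake program

    def reset(self):
        # start a new episode: reset the snake, place a new apple
        # and return the initial state
        ...

    def step(self, action):
        # apply the chosen action (up, right, down or left), move the snake,
        # compute the reward and return (next_state, reward, done)
        ...

    def get_state(self):
        # encode the current situation (apple direction, nearby obstacles,
        # current direction) as a vector the network can use
        ...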
The agent uses a Deep Q Network to find the best actions. The parameters are:
# dictionary holding the parameters of the agent
param = dict()
# epsilon sets the level of exploration and decreases over time
param['epsilon'] = 1
param['epsilon_min'] = .01
param['epsilon_decay'] = .995
# gamma: value immediate (gamma=0) or future (gamma=1) rewards
param['gamma'] = .95
# the batch size is needed for replaying previous experiences
param['batch_size'] = 500
# neural network parameters
param['learning_rate'] = 0.00025
param['layer_sizes'] = [128, 128, 128]
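As an illustration of how epsilon is used, here is a sketch of epsilon-greedy action selection with decay (the function name is mine; model is the Q-network sketched earlier):

import numpy as np

def choose_action(model, state, epsilon, n_actions=4):
    # explore with probability epsilon, otherwise exploit the Q-network
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = model.predict(state.reshape(1, -1), verbose=0)
    return int(np.argmax(q_values[0]))

# after every training step, exploration slowly decreases:
# if epsilon > param['epsilon_min']:
#     epsilon *= param['epsilon_decay']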
If you are interested in the code, you can find it on my GitHub.
Snake Played by the Agent
Now it is time for the key question! Does the agent learn to play the game? Let’s find out by observing how the agent interacts with the environment.
In the first games, the agent has no clue:
The first games.
The first apple! It still seems like the agent doesn’t know what it is doing…
Finds the first apple… and hits the wall.
End of game 13 and beginning of game 14:
Improving!
The agent learns: it doesn’t take the shortest path, but it finds its way to the apples.
Game 30:
Good job! New high score!
Wow, the agent avoids the body of the snake and finds a fast way to the apples, after playing only 30 games!
Playing with the State Space
The agent learns to play snake (with Experience Replay), but maybe it’s possible to change the state space and achieve similar or better performance. Let’s try the following four state spaces:
State space ‘no direction’: don’t give the agent the direction the Snake is going.
State space ‘coordinates’: replace the location of the apple (up, right, down and/or left) with the coordinates of the apple (x, y) and the snake (x, y). The coordinates are scaled between 0 and 1.
State space ‘direction 0 or 1’: the original state space.
State space ‘only walls’: don’t tell the agent when the body is up, right, down or left, only tell it if there’s a wall.
Can you make a guess and rank them from the best state space to the worst after playing 50 games?
An agent playing Snake, to keep you from seeing the answer right away.
Made your guess?
Here is a graph with the performance using the different state spaces:
Defining the right state accelerates learning! This graph shows the mean return of the last twenty games for the different state spaces.
It is clear that the state space with the directions (the original state space) learns fast and achieves the highest return. The state space using the coordinates is still improving, and maybe it can reach the same performance when it trains longer. A reason for the slower learning might be the number of possible states: 20⁴ × 2⁴ × 4 = 10,240,000 different states are possible (the snake canvas is 20 by 20 steps, there are 2⁴ options for obstacles, and 4 options for the current direction). For the original state space the number of possible states is only 3² × 2⁴ × 4 = 576 (3 options each for above/below and left/right). 576 is more than 17,000 times smaller than 10,240,000, and that influences the learning process.
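A quick back-of-the-envelope check of those numbers:

# coordinate state space: apple (x, y) and snake head (x, y) on a 20 by 20 grid,
# 2^4 obstacle combinations and 4 current directions
coordinate_states = 20**4 * 2**4 * 4   # 10,240,000

# original state space: 3 options each for above/below and left/right,
# 2^4 obstacle combinations and 4 current directions
direction_states = 3**2 * 2**4 * 4     # 576

print(coordinate_states / direction_states)   # roughly 17,800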
Playing with the Rewards
What about the rewards? Is there a better way to program them?
Recall the rewards we used: 10 for grabbing an apple, -100 for dying, 1 for moving closer to the apple, and -1 for moving away from it.
Blooper #1: Walk in Circles
What if we change the reward of -1 (for moving away from the apple) to 1? The agent then receives a reward of 1 for every time step it survives. This can slow down learning in the beginning, but in the end the agent won’t die, and that’s a pretty important part of the game!
Well, does it work? The agent quickly learns how to avoid dying:
Agent receives a reward of 1 for surviving a time step.
-1, come back please!
Blooper #2: Hit the Wall
Next try: change the reward for coming closer to the apple to -1, and the reward for grabbing an apple to 100. What will happen? You might think: the agent receives a -1 for every time step, so it will run to the apples as fast as possible! That could happen, but there is something else that might happen…
The agent runs into the nearest wall to minimize the negative return.
Experience Replay
One secret behind the agent’s fast learning (it only needs about 30 games) is experience replay. With experience replay the agent stores previous experiences and uses them to learn faster. At every normal step, a number of replay steps (the batch_size parameter) is performed. This works so well for Snake because, given the same state-action pair, there is low variance in the reward and the next state.
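A minimal sketch of what experience replay can look like, assuming a deque-based memory and the Keras Q-network sketched earlier (the exact implementation in my code may differ):

import random
from collections import deque
import numpy as np

memory = deque(maxlen=2500)  # illustrative buffer size

def remember(state, action, reward, next_state, done):
    # store one experience tuple
    memory.append((state, action, reward, next_state, done))

def replay(model, batch_size, gamma=0.95):
    # sample previous experiences and fit the Q-network on them
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    states = np.array([sample[0] for sample in batch])
    next_states = np.array([sample[3] for sample in batch])
    q_current = model.predict(states, verbose=0)
    q_next = model.predict(next_states, verbose=0)
    for i, (state, action, reward, next_state, done) in enumerate(batch):
        target = reward if done else reward + gamma * np.max(q_next[i])
        q_current[i][action] = target
    model.fit(states, q_current, epochs=1, verbose=0)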
Blooper #3: No Experience Replay
Is experience replay really that important? Let’s remove it! For this experiment, a reward of 100 for eating an apple is used.
This is the agent without experience replay after playing 2,500 games:
Training without experience replay. Even though the agent played 2,500 (!) games, it can’t play Snake. The game is played at high speed; otherwise it would take days to reach 10,000 games.
After 3,000 games, the highest number of apples caught in one game is 2.
After 10,000 games, the highest number is 3… Was that 3 a sign of learning, or just luck?
It seems that experience replay helps a lot, at least for these parameters, rewards and this state space. How many replay steps per step are necessary? The answer might surprise you. To answer this question we can play with the batch_size parameter (mentioned in the section Creating the Environment and the Agent). In the original experiment the value of batch_size was 500.
An overview of returns with different experience replay batch sizes:
Training 200 games with 3 different batch sizes: 1 (no experience replay), 2 and 4. Mean return of previous 20 episodes.
Even with batch size 2 the agent learns to play the game. In the graph you can see the impact of increasing the batch size: the same performance is reached more than 100 games earlier when batch size 4 is used instead of batch size 2.
Conclusions
The solution presented in this article gives good results: the agent learns to play Snake and reaches a high score (the number of apples eaten) between 40 and 60 after playing 50 games. That is way better than a random agent!
The attentive reader might say: ‘The maximum score for this game is 399. Why doesn’t the agent achieve anything close to that? There’s a huge difference between 60 and 399!’
That’s right! And there is a problem with the solution from this article: the agent does not learn to avoid enclosing itself. It learns to avoid obstacles directly surrounding the snake’s head, but it can’t see the rest of the game. So the agent will enclose itself and die, especially when the snake is longer.
Enclosing.
An interesting way to solve this problem is to use the raw pixels of the game as the state and a Convolutional Neural Network². Then it is possible for the agent to ‘see’ the whole game, instead of just nearby obstacles. It can learn to recognize the places it should go to avoid enclosing itself and to reach the maximum score.
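As a rough illustration of that idea, a convolutional Q-network over the whole board could look like the sketch below. The 20 by 20 by 3 input encoding (one channel each for the body, the head and the apple) is my own assumption, in the spirit of [2]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

def build_conv_q_network(board_shape=(20, 20, 3), n_actions=4):
    # the input is the whole board, so the agent can 'see' the full game
    model = Sequential([
        Conv2D(16, (3, 3), activation='relu', input_shape=board_shape),
        Conv2D(32, (3, 3), activation='relu'),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(n_actions, activation='linear'),
    ])
    model.compile(loss='mse', optimizer='adam')
    return model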
Related
How I taught my computer to play Spot it! using OpenCV and Deep Learning
Solving Nonograms with 120 Lines of Code
[1] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators (1989), Neural Networks 2(5): 359–366
[2] V. Mnih et al., Playing Atari with Deep Reinforcement Learning (2013), arXiv:1312.5602