I used the deep Q-learning algorithm to train my agent. In deep Q-learning, the agent's policy is dictated by a neural network, called the Q-network, that takes an environment state s as input and outputs the Q-value of each possible action a. The Q-value Q(s,a) estimates the expected total reward for taking action a in state s. To learn Q(s,a), the network is trained to minimize the mean-squared-error loss between the current estimate Q(s,a) and the target return Y, which is the sum of the reward received for taking action a in state s and the discounted maximum Q-value of the next state. The discount factor can be lowered to prioritize immediate rewards or raised to prioritize long-term reward.
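As a minimal sketch of the target computation, assuming the standard Bellman form and an illustrative discount factor of 0.9 (the text does not give a value):

```python
import numpy as np

def td_target(reward, next_q_values, done, gamma=0.9):
    # Y = r + gamma * max_a' Q(s', a'); no bootstrapping on terminal states.
    # gamma=0.9 is an assumed value for illustration.
    return reward + gamma * np.max(next_q_values, axis=1) * (1.0 - done)

# One transition: reward 1.0, best next-state Q-value 2.0, episode not done.
y = td_target(np.array([1.0]), np.array([[0.5, 2.0]]), np.array([0.0]))
```

The loss is then the mean squared difference between Y and the Q-network's current output for the chosen action.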
The reward function, which dictates the goal to the actor, returns 1 if the snake eats the fruit, -1 if the snake crashes, and the inverse of the distance between the snake's head and the fruit otherwise. This incentivizes the actor to guide the snake away from the walls and its own body and towards the fruit. The added reward for being close to the fruit reduces the sparsity of the reward signal, which makes learning easier.
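This reward scheme can be sketched as follows; the text does not specify the distance metric, so Euclidean distance is an assumption:

```python
import math

def snake_reward(ate_fruit, crashed, head, fruit):
    # Shaped reward: +1 for eating the fruit, -1 for crashing,
    # otherwise the inverse of the head-to-fruit distance.
    if ate_fruit:
        return 1.0
    if crashed:
        return -1.0
    dist = math.dist(head, fruit)  # Euclidean distance (assumed metric)
    return 1.0 / dist
```

The inverse-distance term grows as the snake approaches the fruit, giving a dense gradient of reward even before the fruit is eaten.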
To improve stability, two networks are trained instead of one. The main network is used by the actor to propose an action, while the target network is used to estimate the return Y. During training, the main network is updated by minimizing the loss, while the target network is updated as a weighted average of the current target network and the updated main network. For the optimization, I used the Adam optimizer with learning rate 0.001.
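The weighted-average (Polyak) target update can be sketched like this; the mixing weight tau=0.005 is an assumed value, as the text gives none:

```python
def soft_update(target_params, main_params, tau=0.005):
    # target <- tau * main + (1 - tau) * target, applied elementwise.
    # tau=0.005 is an illustrative mixing weight, not a value from the text.
    return [tau * m + (1.0 - tau) * t
            for t, m in zip(target_params, main_params)]
```

With a small tau, the target network drifts slowly toward the main network, which keeps the regression target Y from changing abruptly between updates.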
The state space of Snake is a grid of pixels represented as a rank-3 tensor. To allow for efficient simulation, I kept the grid size at 8x8, so with 3 RGB channels there are 192 elements in each state. To preserve spatial relations, I use a convolutional neural network architecture for the Q-network. The architecture consists of a convolutional layer with kernel size 3 and padding, followed by a convolutional layer with kernel size 5 and no padding, followed by a flattening layer and two dense layers with 16 and 4 units, respectively.
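The spatial dimensions through the two convolutions can be checked with the standard output-size formula; padding of 1 for the first layer is an assumption (the text only says "with padding", which with kernel size 3 preserves the 8x8 grid):

```python
def conv_out(size, kernel, padding=0, stride=1):
    # Standard convolution output-size formula.
    return (size + 2 * padding - kernel) // stride + 1

h = conv_out(8, 3, padding=1)  # kernel 3 with padding 1 keeps the 8x8 grid
h = conv_out(h, 5, padding=0)  # kernel 5 without padding shrinks it to 4x4
```

After flattening the 4x4 feature maps, the two dense layers reduce the representation to 16 units and then to 4 outputs, one Q-value per movement direction.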