# Training Tic-Tac-Toe Expert with Temporal Difference Learning

Reinforcement learning is a technique to obtain an intelligent system using feedback information from the environment. I see a lot of interesting applications in robotics, mastering a game, website optimization, and so on. So, in other to get comprehensive understanding of reinforcement learning, I decided to review the methods and techniques of reinforcement learning reading the book "Reinforcement Learning: An Introduction" and summarize my learnings here in a regular basis by writing a blog post per chapter. This first post is about "Chapter 1 The Reinforcement Learning Problem". Sometimes I may cite a sentence from this book with a notation like "This is a cited paragraph. (page xx)".

Also, I will try to include some interactive demo in every article, which should be helpful for readers to understand how it works, and of course, to deepen my comprehension. The demo code is available at GitHub, so take a look at if you're interested in how they are coded. Let's get it started!

## The Reinforcement Learning Problem

As a gazelle calf starts running at 20 miles per hour right half an hour after it's born struggling to its feet, an intelligent agent learns how to behave interacting with its environment.

"learning from interaction is a foundational idea underlying nearly all throries of learning and intelligence (page 21)".

Reinforcement learning is different from supervised learning or unsupervised learning. In supervised learning setting, a training set of labeled examples is given to the system and the objective is to optimize parameters of the system in order to predict the label of data, even if they are not present in the training set. On the other hand, in reinforcement learning setting, the evaluation of one action may not be provided until the last minute when the system reaches to the goal of the problem. Also, the situation may change over time according to the action the system take. This is another unique property of reinforcement learning.

Reinforcement learning is also different from unsupervised learning, which aims to capture the structure of unlabelled data (e.g. clustering, dimension reduction). Unsupervised learning does not provide a good framework to capture the good behavior which tries to maximize a reward in the given environment.

## Elements of Reinforcement Learning

In reinforcement learning, an agent tries to obtain a good behavior by interacting with its environment. Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system.

• A policy.
• A reward signal.
• A value function.
• A model of the environment (optional).

A policy is about the mapping from perceived status to actions to be taken. This defined how the learning agent behave over time. Selecting the promising action all the time can be one policy (a greedy policy), while selecting at random may be the other extreme side of a policy (exploration always).

A reward signal defines the goal in a reinforcement learning problem. An agent cannot modify the mechanism how the environment generate rewards, so agent needs to change their action and belonging state in the environment to maximize a reward.

A value function specifies the value of a state in the long run while a reward describes instant desirability. Imagine a table which lists every possible state and its favourableness. One state might yield low or zero reward, but is regularly followed by anoother state which yield high rewards. In that case, the value of that regularly visited state needs to be estimated high. In fact, reinforcement learning is almost all about how to estimate values efficiently. "The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades. (page 27)"

So, why not estimate the optimal values directly, let's say, using general optimization algorithms such as genetic algorithms or simulated annealing? Of course, it is possible to optimize the values using them. By iterating the trial over and over, nearly optimal value function could be found. However, it can be very inefficient when it comes to reinforcement learning because it ignores the history of the interaction every single agent took. The history may include important information about the value of a state (e.g. "Everything went well right before I stepped into that situation..."), but meta-heuristic algorithms just care the final rewards each agent achieved. That's why learning methods which utilize the history of interaction is popular in reinforcement learning.

## Tic-Tac-Toe Demo

Okay, here's the fun part! I will describe how we can train an agent to play tic-tac-toe well. The demo code is available at GitHub.

First, we need to define the possible states the agent can take. Naively, we can think of 3^9 = 19683 states because there are 9 grids and each grid can take 3 values (O, X or vacant). The number is way larger than imagined, isn't it? Of course, it is including unfeasible states (e.g. all grids are filled by O) and some states are considered to be equivalent if we rotate the board. For now, let's consider these all states for simplicity. We initialize the values of them as zero.

How we decide the reward of the goal state? Winning the game should be positive thing, while losing should be negative. We put +1 for winning and -1 for losing here. For a draw, we put a small negative reward, -0.5, to motivate the agent to win.

Regarding policy, we use a policy called epsilon-greedy. With epsilon-greedy policy, an agent take the most promising action greedily with a large probability (1 - $\varepsilon$), while chose an action randomly with a small probability ($\varepsilon$). Please note that epsilon means a very small number conventionally.

Finally, we use a learning method called temporal-difference to train the agent. In temporal-difference method, we iterate to update the value of a state like below.

$V(s) = V(s) + \alpha (V'(s) - V(s))$

$s$ denotes a state and $V(s)$ denotes value of the state $s$. While $s$ denotes the state before greedy move is made, $s'$ denotes the state after the move. $\alpha$ denotes a parameter called step size. This controls how much we update the value according to the new findings (=reward). By iterating this update over time, the value function becomes accurate.

In this demo, the agent is trained by playing against another artificial agent. The opponent agent is also trained by the same learning method. By competing each other, it gets smarter gradually.

### Playing against itself

ITERATION: ((iter))
ITERATION SPEED
slow fast

### Playing against you

Versus #((copied_iter)) generation's agent

By tapping "copy", you can bring in the trained agent here. Around after 2000 iterations, it becomes harder to beat.

As shown in the demo, it plays the fool in the early stage. Seems like there's no intelligent here. Please note that they even don't know the rule of tic-tac-toe at first. They put pieces almost randomly and sometimes get rewarded by chance. That's why you'll get irritated seeing this behavior ("What are you doing!!??"). After they iterate the game more than 2000 times, it becomes smarter gradually. You can speed up the iteration by shifting the ITERATION SPEED range, by the way.

You wanna try how they are smart? By hitting COPY button in another blank board, you can import the trained agent and play against it. Do you feel they are getting better after some iterations?

If you're curious about the actual implementation in Javascript, please take a look at code. It is around 200 lines of code including the demo implementation. I really appreciate your pull request if you find any error in the implementation. Demo is implemented with Vue.js.

That's all! Thank you for reading. I will write the next article base on the next chapter in the book, "Multi-armed Bandits".

<< Back to index