Waiter_group/Reinforcement_learning.md

# Reinforcement learning for route planning in restaurant
##### Tao-Sen Chang    s442720

###### We did the route planning by special algorithm on last task. In this machine learning sub-project I try to show different approach for the agent who can traversal multiple destinations on the grid system, and of course, get the shortest path of the route. What I want to use is called reinforcement learning.

## What is reinforcement learning?
###### Reinforcement learning is how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. The agent makes a sequence of decisions, and learn to perform the best actions every step. For example, in my project there is a waiter in the grid and he has to reach many tables for serving the meal, so he must learn the shortest path to get a table.

## How to do that?
###### The idea of how to complete the reinforcement learning is not quite easy. However, there is an example - rat in a maze. Instead of writing the whole algorithm at the beginning, I use the existing codes(tools) by adjusting some parameters to train the agent on our 16x16 grid.
https://www.samyzaf.com/ML/rl/qmaze.html
###### I train the agent(waiter) with rewards and penalties, the waiter in the above grid gets a small penalty for every legal move. The reason is that we want it to get to the target table in the shortest possible path. However, the shortest path to the target table is sometimes long and winding, and our agent (the waiter) may have to endure many errors until he gets to the table.
###### For example, one of the training parameters(rewards) are:
```
if rat_row == win_target_x and rat_col == win_target_y: # if reach the final target
    return 1.0
if mode == 'blocked':   # move to the block in the grid (blocks are tables or kitchen in our grid)
    return -1.0
if (rat_row, rat_col) in self.visited: # when get to the visited grid point
    return -0.5
if mode == 'invalid': # when move to the boundary
    return -0.75
if mode == 'valid': # to make the route shorter, we give a penalty by moving to valid grid point
    return -0.04
if (rat_row, rat_col) in self.curr_win_targets: # if reach any table
    return 1.0
```
```
self.min_reward = -0.5 * self.maze.size
```
## Q-learning
###### We want to get the maximum reward from each action in a state. Here defines action=π(s).
###### Q(s,a) = the maximum total reward we can get by choosing action a in state s. Hence it's obvious that we get the function π(s)=argmaxQ(s,ai)   Now the question is how to get Q(s,a)?
###### There is a solution called Bellman's Equation: Q(s,a) = R(s,a) + maxQ(s′,ai)
###### R(s,a) is the reward in current state s, action a. And s′ means the next state, so maxQ(s′,ai) means the maximum reward in 4 actions from next state. In the code we have the Experience Class to memorize each "episode", but the memory is limited, therefore if reach the max_memory, then delete the old episode which has lower effect to current episode.
###### There is a coefficient called discount factor, usually denoted by γ which is required for the Bellman equation for stochastic environments. So the new Bellman's Equation can be written as Q(s,a) = R(s,a) + γ * maxQ(s′,ai). This discount factor is to diminish the effects which are far from current state.
```
class Experience(object):
    def __init__(self, model, max_memory=100, discount=0.95):
        self.model = model
        self.max_memory = max_memory
        self.discount = discount
        self.memory = list()
        self.num_actions = model.output_shape[-1]

    def remember(self, episode):
        # episode = [envstate, action, reward, envstate_next, game_over]
        # memory[i] = episode
        # envstate == flattened 1d maze cells info, including rat cell (see method: observe)
        self.memory.append(episode)
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def predict(self, envstate):
        return self.model.predict(envstate)[0]

    def get_data(self, data_size=10):
        env_size = self.memory[0][0].shape[1]   # envstate 1d size (1st element of episode)
        mem_size = len(self.memory)
        data_size = min(mem_size, data_size)
        inputs = np.zeros((data_size, env_size))
        targets = np.zeros((data_size, self.num_actions))
        for i, j in enumerate(np.random.choice(range(mem_size), data_size, replace=False)):
            envstate, action, reward, envstate_next, game_over = self.memory[j]
            inputs[i] = envstate
            # There should be no target values for actions not taken.
            targets[i] = self.predict(envstate)
            # Q_sa = derived policy = max quality env/action = max_a' Q(s', a')
            Q_sa = np.max(self.predict(envstate_next))
            if game_over:
                targets[i, action] = reward
            else:
                # reward + gamma * max_a' Q(s', a')
                targets[i, action] = reward + self.discount * Q_sa
        return inputs, targets
```

## Training
###### Following is the algorithm for training neural network model to solve the problem. One epoch means one loop of the training, and in each epoch the agent will finally become either "win" or "lose".
###### Another coefficient "epsilon" is exploration factor which decides the probability of whether the agent will perform new actions instead of following the previous experiences (which is called exploitation). By this way the agent could not only collect better rewards from previous experiences, but also have the chances to explore unknown area where might get more rewards. If one of the strategies is determined, then let's start training it by neural network. (inputs: size equals to the maze size, targets: size is the same as the number of actions (4 in our case)).
```
# Exploration factor
epsilon = 0.1
def qtrain(model, maze, **opt):
    global epsilon
    n_epoch = opt.get('n_epoch', 15000)
    max_memory = opt.get('max_memory', 1000)
    data_size = opt.get('data_size', 50)
    weights_file = opt.get('weights_file', "")
    name = opt.get('name', 'model')
    start_time = datetime.datetime.now()

    # If you want to continue training from a previous model,
    # just supply the h5 file name to weights_file option
    if weights_file:
        print("loading weights from file: %s" % (weights_file,))
        model.load_weights(weights_file)

    # Construct environment/game from numpy array: maze (see above)
    qmaze = Qmaze(maze)

    # Initialize experience replay object
    experience = Experience(model, max_memory=max_memory)

    win_history = []   # history of win/lose game
    n_free_cells = len(qmaze.free_cells)
    hsize = qmaze.maze.size//2   # history window size
    win_rate = 0.0
    imctr = 1
    pre_episodes = 2**31 - 1

    for epoch in range(n_epoch):
        loss = 0.0
        #rat_cell = random.choice(qmaze.free_cells)
        #rat_cell = (0, 0)
        rat_cell = (12, 12)

        qmaze.reset(rat_cell)
        game_over = False

        # get initial envstate (1d flattened canvas)
        envstate = qmaze.observe()

        n_episodes = 0
        while not game_over:
            valid_actions = qmaze.valid_actions()
            if not valid_actions: break
            prev_envstate = envstate
            # Get next action
            if np.random.rand() < epsilon:
                action = random.choice(valid_actions)
            else:
                action = np.argmax(experience.predict(prev_envstate))

            # Apply action, get reward and new envstate
            envstate, reward, game_status = qmaze.act(action)
            if game_status == 'win':
                print("win")
                win_history.append(1)
                game_over = True
                # save_pic(qmaze)
                if n_episodes <= pre_episodes:
                    # output_route(qmaze)
                    print(qmaze.visited)
                    with open('res.data', 'wb') as filehandle:
                        pickle.dump(qmaze.visited, filehandle)
                    pre_episodes = n_episodes

            elif game_status == 'lose':
                print("lose")
                win_history.append(0)
                game_over = True
                # save_pic(qmaze)
            else:
                game_over = False

            # Store episode (experience)
            episode = [prev_envstate, action, reward, envstate, game_over]
            experience.remember(episode)
            n_episodes += 1

            # Train neural network model
            inputs, targets = experience.get_data(data_size=data_size)
            h = model.fit(
                inputs,
                targets,
                epochs=8,
                batch_size=16,
                verbose=0,
            )
            loss = model.evaluate(inputs, targets, verbose=0)


        if len(win_history) > hsize:
            win_rate = sum(win_history[-hsize:]) / hsize

        dt = datetime.datetime.now() - start_time
        t = format_time(dt.total_seconds())

        template = "Epoch: {:03d}/{:d} | Loss: {:.4f} | Episodes: {:d} | Win count: {:d} | Win rate: {:.3f} | time: {}"
        print(template.format(epoch, n_epoch-1, loss, n_episodes, sum(win_history), win_rate, t))
```

## Testing
###### Use this algorithm to our 16x16 grid and train.
```
grid = [[1 for x in range(16)] for y in range(16)]
table1 = Table(2, 2)
table2 = Table (2,7)
table3 = Table(2, 12)
table4 = Table(7, 2)
table5 = Table(7, 7)
table6 = Table(7, 12)
table7 = Table(12, 2)
table8 = Table(12, 7)

kitchen = Kitchen(13, 13)
maze = np.array(grid)
model = build_model(maze)
qtrain(model, maze, epochs=1000, max_memory=8*maze.size, data_size=32)
```
###### Also I create a list called win_targets to put the position of tables in the grid.
```
win_targets = [(4, 4),(4, 9),(4, 14),(9, 4),(9, 9),(9, 14),(14, 4),(14, 9)]
```
###### After tons of training, I realize it is not an easy task to obtain the shortest route in every training - that means most of the training are failed - especially in the case that the win_targets has more targets. For example, the result of training 8 targets is like this(part of result):
```
...
Epoch: 167/14999 | Loss: 0.0299 | Episodes: 407 | Win count: 63 | Win rate: 0.422 | time: 2.44 hours
Epoch: 168/14999 | Loss: 0.0112 | Episodes: 650 | Win count: 63 | Win rate: 0.414 | time: 2.46 hours
Epoch: 169/14999 | Loss: 0.0147 | Episodes: 392 | Win count: 64 | Win rate: 0.422 | time: 2.47 hours
Epoch: 170/14999 | Loss: 0.0112 | Episodes: 668 | Win count: 65 | Win rate: 0.422 | time: 2.48 hours
Epoch: 171/14999 | Loss: 0.0101 | Episodes: 487 | Win count: 66 | Win rate: 0.430 | time: 2.50 hours
Epoch: 172/14999 | Loss: 0.0121 | Episodes: 362 | Win count: 67 | Win rate: 0.438 | time: 2.51 hours
Epoch: 173/14999 | Loss: 0.0101 | Episodes: 484 | Win count: 68 | Win rate: 0.445 | time: 2.52 hours
...
```
###### The only one which is successful contains 4 targets(win_targets = [(4, 4),(4, 9),(4, 14),(9, 4)])
```
...
Epoch: 223/14999 | Loss: 0.0228 | Episodes: 30 | Win count: 165 | Win rate: 0.906 | time: 64.02 minutes
Epoch: 224/14999 | Loss: 0.0160 | Episodes: 52 | Win count: 166 | Win rate: 0.906 | time: 64.09 minutes
Epoch: 225/14999 | Loss: 0.0702 | Episodes: 34 | Win count: 167 | Win rate: 0.914 | time: 64.14 minutes
Epoch: 226/14999 | Loss: 0.0175 | Episodes: 40 | Win count: 168 | Win rate: 0.922 | time: 64.19 minutes
Epoch: 227/14999 | Loss: 0.0271 | Episodes: 46 | Win count: 169 | Win rate: 0.930 | time: 64.25 minutes
Epoch: 228/14999 | Loss: 0.0194 | Episodes: 40 | Win count: 170 | Win rate: 0.938 | time: 64.30 minutes
...
Epoch: 460/14999 | Loss: 0.0236 | Episodes: 60 | Win count: 401 | Win rate: 1.000 | time: 1.48 hours
Reached 100% win rate at epoch: 460
n_epoch: 460, max_mem: 2048, data: 32, time: 1.48 hours
```
###### In my opinion, there are 3 reasons cause such bad results.
###### 1. The parameters in the algorithm are not optimal including the rewards, exploration rate, and discount factor. To adjust the parameters and to validate them costs lots of time, and the most intuitive way is always not the best solution. For example, the parameters of 4 targets are fine, but if the number of targets expanded to 8, the parameters are not just 1/2 of the original ones.
###### 2. Because of the exploration rate, every time the same training and testing data may have a different result. It increases the difficulty to verify our result. The only way to check whether the parameters generate ideal results is training continuously until we collect sufficient data.
###### 3. The algorithm is for a rat in a maze at the beginning, and the number of default target is only one. If we apply it for multiple targets, there may be inadequate for some reason. Moreover, the default size is 7x7. It is possible that the 16x16 grid is too huge for this algorithm.