Although reinforcement learning has a history dating back to the 1970’s (think Temporal Difference Learning), the deep component – of using large multi-layered neural network as underlying models – is still a relatively new addition. Pivotal research in Deep Reinforcement Learning (DRL) such as AlphaGo has spurred a growing interest in the field. And though not yet perfect due to for example sample inefficiency, a difficult reward design and generalization issues, DRL has also prompted an increasing, albeit still timid, number of real-world applications.

One of these applications is using DRL agents in traffic light control systems. In comparison to traditional controlling policy, these DRL agents typically involve taking real-time traffic information and dynamically adjusting the light’s phase program, thus ensuring the efficiency of the system. For this project, I try to implement a similar learning agent on an existing intersection.

This would involve correctly reproducing the intersection in a simulation software, integrating it in a custom DRL environment, selecting a reinforcement learning algorithm and designing an appropriate neural network for the underlying model. After everything is setup, the model can be fitted and evaluated.

For selecting an existing intersection I look no further than my own city. One particular set of traffic lights has caused me a great many headaches in the past and perfectly fits the scope of the project. Not too complex nor too simple, it offers sufficient opportunity for improvement compared to a traditional controlling policy while not exploding the necessary computation time.

Simulation of Urban MObility (SUMO) is an open source road traffic simulator that amongst others allows to set up detailed traffic light systems and offers through TraCI a handy interface for retrieving and setting various simulation parameters (e.g. the current traffic light phase, position of the vehicles, waiting times and so on) on-line. For converting a desired intersection from OpenStreetMap to an equivalent SUMO simulation a number of steps are required which are laid out below.

From the desired intersection location in OpenStreetMap we export an .osm extension file, which we then convert to a .net file purposed for using in SUMO. We'll prefix our files with trone

from the name of the street:

> netconvert --osm-files trone.osm -o trone.net.xmlDepending on the level of detail provided in OpenStreetMap, and after filtering any irrelevant information, the resulting .net file should more or less accurately represent the intersection.

After loading the .net file within SUMO we verify that the road network is correctly defined. A first check could for example be to examine and if necessary change the allowed maximum speed on the different lanes: a 120 km/h allowed speed in an urban area might suggest erroneous data. Another check could pertain to the allowed lane changes at the traffic light as well as the default phase program, making sure there are no conflicts (see article thumbnail).

Once the traffic parameters have been checked, a simulation needs to be provided in the form of predefined vehicle-routes in a .rou file. These routes can easily be created using the SUMO-provided randomTrips.py script with the previously created .net file as the input. One should also define the starting and ending times of the simulation as well as the number of routes to produce for each time-unit. Furthermore weights (in the form of .xml files) *can* be used to fine-tune the probability of specific sources and destinations for all the routes.

> python randomTrips.py -n trone.net.xml -r trone.rou.xml -b 0 -e 600 -p 1 --weights-prefix trone

For picking out some minor remaining errors, nothing beats visualizing the simulation in SUMO (with not much more than the .net and .rou files defined in the .cfg file):

> sumo-gui -c trone.sumo.cfg

For some bells and whistles, we can also export all polygons from the .osm file for a prettier representation of our intersection:

> polyconvert --net-file trone.net.xml --osm-files trone.osm --type-file typemap.xml -o trone.poly.xml

We can even take it a step further and take a snapshot of our crossing from Google Maps and overlay it with our road network. This result can be seen in the final run.

Considering the many resources that are available covering the theory behind reinforcement learning environments, we will only introduce it briefly here.

In reinforcement learning, we define an agent which performs rewarded actions in an environment: At some point in time _{t} the agent observes the *state* (*S _{t}*) of the environment and takes an

The design of the environment is influenced by recent research on the subject, while making changes where necessary in order to accommodate the scope and particularities of this project.

The *state* of the environment is mainly represented by the current position and speed of the vehicles near the intersection. This is done by decoding the intersection into a grid of equally sized squares. One matrix then encodes the positions of the vehicles with binary values in the respective squares: 0 for the absence of a vehicle and 1 otherwise. Another matrix encodes the speed of the vehicles in a similar manner (see picture below). Next to the speed and position of the vehicles, the state of the environment also includes a matrix indicating the last three active phases, therefore providing memory of past actions.

The *action* space is defined by selecting which phase to activate in the next cycle. There are four available phases to choose from: North⇅South, East⇄West, North↘East & West↖South and finally South↙East & West↗North . Once a phase is chosen it remains active for 3 seconds after which a new action needs to be taken. If the chosen phase is different from the last one, a yellow phase is required to bridge the two phases, safely bringing any incoming cars to a stop. Defining the action space as a selection of the next phase rather than the selection of the duration of the next phase within a fixed program allows more flexibility to the agent. Indeed within a fixed program the agent will always need to cycle through the predefined sequence of phases, potentially activating phases (even for a short time) in the absence of cars.

The first step in computing the *reward* is to look at all the vehicles near the intersection. We then define waiting time as the accumulated time needed for a car to cross the intersection. To prevent the agent from taking strategies with too much preference for specific lanes, we borrow the concept of disutility from economics which is defined as the adverse or harmful effects associated with a particular activity or process,

. We thus state that a driver's disutility increases more if that driver has already waited for a long period of time. In other words, some person *especially* when carried out over a long period_{i} won't mind too much waiting for a minute more if he's been waiting for only a few seconds yet, while this might not be true had he been waiting for 5 minutes already. The disutility function is given as:

$$D(w_{i,t}) = w_{i,t}^m$$

where *w _{i,t}* is the accumulated waiting time for person

$$r_t = D_t-D_{t+1}$$

where

$$D_t = \sum_{i_t=1}^{N_t} D(w_{i,t})$$

Note that for consistency the full code is not displayed, but rather the general idea and key lines. For a comprehensive overview, refer to the GitHub repo.

The increased interest in DRL has resulted in the creation of new open source frameworks for training DRL models. One such toolkit is OpenAI Gym which amongst other supports the simulation of many environments as well as a handy base class for custom environments. To support the application of our traffic light system, we subclass the `Env`

class which at the very least requires the definition of the `__init__()`

constructor and a `step()`

& `reset()`

method:

```
from gym import Env
class SumoEnv(Env):
def __init__(self):
...
def step(self, action):
...
def reset(self):
...
```

In the `__init__()`

constructor we use gym's predefined spaces to define ours. The discrete space is appropriate for our 4 available actions and we use box spaces for our speed, position and phase history matrices. These are furthermore defined within a dictionary space to eventually allow specific modeling for each type of observation. We also automatically set the seed for reproducibility which will enable comparing models over similarly generated simulations. A state can also be provided which, together with the seed, allows for intermittent training over multiple sessions.

```
from gym import spaces
from gym.utils import seeding
class SumoEnv(Env):
def __init__(self, seed=None, state=None):
self.action_space = spaces.Discrete(4)
self.observation_space = spaces.Dict({'phase':spaces.Box(low=0, high=1, shape=(4,1,3), dtype=np.float32),
'vehicle':spaces.Box(low=0, high=14, shape=(29,29,2), dtype=np.float32)})
self.seed(seed, state)
def seed(self, seed, state):
'''
Returns:
Float of the seed used.
'''
self.rng, seed = seeding.np_random(seed)
if state != None:
self.rng.set_state(state)
return [seed]
```

The `reset()`

method will be called at the beginning of the training and after each episode. It generates the environment by closing any previously made SUMO connections (in case it wasn't the first episode) and opening a new one.

In between we use the trips.py script to generate new weights .xml files (based on the given seed and state) from which in turn new routes are generated, just as we have done before. Only this time it is automated so that we can use this (reproducible) randomness to limit our agent from overfitting and maximizing the chances of better generalization. For the script to work we need to provide it a list of names of the SUMO source and destination (street)edges to be used.

The method finally requires a starting observation which can be fetched from the gym spaces.

```
import traci
from sumolib import checkBinary
from trips import generate, delete
class SumoEnv(Env):
...
def reset(self):
'''
Returns:
Nd.array of starting observation.
'''
traci.close()
delete(prefix="trone")
generate(prefix="trone", src=["src"+str(n) for n in range(5)], dst=["dst"+str(n) for n in range(9)], rng=self.rng, scale=(10,10))
traci.start([checkBinary("sumo"), "-c", "trone.cfg"])
phase_start = self.observation_space.spaces['phase'].low
position_speed_start = self.observation_space.spaces['vehicle'].low
observation = {'phase':phase_start, 'vehicle':position_speed_start}
return observation
```

The `step()`

method forms the core of the `Env`

class by matching the diagram presented earlier: it requires an action as input, and outputs a reward and observation. The action defines the next desired phase, and if different from the current one (i.e. an active action was taken) we set the duration to zero in order to trigger the bridging yellow phase. Over an interval of 3 seconds, which neatly overlaps the yellow phase in case of an active action, we then track all cars entering and exiting the intersection and compute the total added disutility through the `trackSim()`

function. After that, the phase is changed to the desired one if needed.

For the model training we also need to know if the episode is finished, so we create a Boolean variable which tests if the elapsed simulation time is larger than our predefined end time, in this case one hour.

Finally we compute the current state of the environment with the `getVehicleInfo()`

function which returns our speed and position matrices. On top of these two we add the next phase to our phase history matrix and also add it to our observation.

```
import traci
import numpy as np
class SumoEnv(Env):
...
def step(self, action, observe=True):
'''
Args:
action: the desired phase.
Returns:
observation: Dictionary of the observation space after the action has been taken.
reward: Float of the reward the action taken has resulted in.
done: Boolean indicating if the current episode is done.
'''
# Action-dependent phase changing
desired_phase = action
current_phase = traci.trafficlight.getPhase("tls_id")
if desired_phase != current_phase:
traci.trafficlight.setPhaseDuration("tls_id", 0)
# Action-independent simulation tracking
sim_info = self.trackSim(sec=3)
reward = sim_info['added_disutility']
# Action-dependent phase changing
if desired_phase != current_phase:
traci.trafficlight.setPhase("tls_id", desired_phase)
# Episode end check
done = self._elapsedTime() >= 3600
# Observation
position_speed_mat = self.getVehicleInfo()
self.phase_hist = [desired_phase]+self.phase_hist[:2]
phase_hist = np.dstack(self.phase_hist)
observation = {'phase':phase_hist, 'vehicle':position_speed_mat}
return observation, reward, done, {}
```

There are two main parts to the modeling of our agent. The first part pertains to the reinforcement learning technique itself, which will learn the appropriate policy telling the agent which action to take. The second part of the modeling is the function approximator. Many resources are available covering the theory behind reinforcement learning and deep neural networks. As such we will mainly focus on the project's specificities in relation to these subjects.

When the agent decides to take a specific action we want him to not only look at the impending reward but also to future rewards, as long as those rewards don't lie in some *too* distant future. It might indeed be necessary to cause some vehicles to stop right now in order to improve the overall flow of the intersection after some time. How can we influence the agent to take such decisions? Instead of assigning only the immediate reward to a specific action, we also take into account potential reward from the best possible action it can take thereafter, as well as the one after that, and so on.

This idea of propagating potential rewards resulting from the best future actions lies at the core of Q-learning: a reinforcement learning algorithm used to find the optimal policy for selecting actions using a Q function. The Q function tells the agent the value of a possible action *a* in a particular state *s* is computed as the immediate reward plus the expected discounted cumulative rewards, if the agent followed a greedy policy afterwards. In other words if the agent, after taking action *a*, took those subsequent actions which would maximize the values of Q. The computation of the Q values for each `<action,state>`

pair is an iterative process of updating values defined by:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha(r + \gamma\max_{a_{t+1}}Q(s_{t+1},a_{t+1})-Q(s_t,a_t)) $$

where

the *Q*(*s _{t}*,

The *Q* values are initialized by the user and will only reach better approximations as more `<action,state>`

pairs are explored. Would the agent also follow a greedy policy for its behavior - the actual action taken - then it might easily get stuck in a sub-optimal strategy, due to limited confidence in *Q* values. For this an ε-greedy policy is used: at each step the agent randomly picks an action with a probability ε, otherwise it does so greedily. We use a linearly annealing policy where we start with high values of ε (high exploration) and linearly decrease to low values of ε (high exploitation).

For a state *s* and action *a*, the *Q* values don't get updated when the reward plus discounted best future *Q* values are identical to the previous *Q* value. In such a case the goal is achieved in that the agent can accurately predict which action will result in the highest value. Note that this is an optimization problem which can be expressed with a loss function and solved with gradient descent. Considering the type of input data we choose to use a deep neural network (more on that later). The model thus predicts two pieces of information in the previous equation: the current *Q* value and maximum possible *Q* value for the next state (i.e. the target).

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha(\underset{\color{gray}{loss}}{\bbox[5px,border:1px solid gray]{r + \gamma\underset{\color{blue}{predicted\ max\ next\ state}}{\bbox[5px,border:2px solid blue]{\max_{a_{t+1}}Q(s_{t+1},a_{t+1}}})-\underset{\color{red}{predicted\ current}}{\bbox[5px,border:2px solid red]{Q(s_t,a_t)}}}}) $$

Notice from the previous equation that the definition is recursive in that our network ends up predicting its own output. At each batch our *Q* values shift but so does our target values: we get closer to our target but also move it, causing our network to converge with more difficulty. The introduction made by DeepMind to alleviate this problem is to fix the target network by intermittently copying the *Q* network, thus offering a more stable learning. We implement this type of fixed target network below.

The first part of the code defines the neural network's architecture. It has been proven that convolutional neural networks work well on data with some spatial relationship. Considering our speed and position matrices it would seem only appropriate to use such a network. More particularly we put the data through two convolutional layers by sliding a 4x4 filter with a stride of 1 (i.e. square-by-square). The first layer consists of 32 filters and the second layer of 64 filters. We also use a pooling layer to remove less important information and reduce the dimensionality. The phase history data on the other hand will be fed as a flattened array. Both outputs are then combined and used as inputs for two fully-connected layers of 128 and then 64 neurons.

The ε, γ and fixed target network parameters as discussed previously are defined in the subsequent lines. Another parameter is the number of warm-up steps which sets the number of steps the experience is stored before any training is performed. The batch size shows the number of samples fed through the network at every step.

```
from keras.models import Sequential, Model
from keras.layers import Input, Flatten, LeakyReLU, Conv2D, MaxPooling2D, Activation, Dense, concatenate
from keras.optimizers import Adam
from rl_.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl_.multiInputProcessor import MultiInputProcessor
from sumoEnv import SumoEnv
env = SumoEnv(...)
### Extract action space and shapes from the environment.
nb_actions = env.action_space.n
shape_phase = env.observation_space.spaces['phase'].shape
shape_vehicle = env.observation_space.spaces['vehicle'].shape
### Phase model & input.
model_phase = Sequential()
model_phase.add(Flatten(data_format='channels_last', input_shape=shape_phase))
model_phase_input = Input(shape=shape_phase, name='phase')
model_phase_encoded = model_phase(model_phase_input)
### Vehicle model & input.
model_vehicle = Sequential()
model_vehicle.add(Conv2D(32, kernel_size=(4,4), strides=(1,1), data_format='channels_last', input_shape=shape_vehicle))
model_vehicle.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), data_format='channels_last'))
model_vehicle.add(LeakyReLU())
model_vehicle.add(Conv2D(64, kernel_size=(4,4), strides=(1,1), data_format='channels_last'))
model_vehicle.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), data_format='channels_last'))
model_vehicle.add(LeakyReLU())
model_vehicle.add(Flatten(data_format='channels_last'))
model_vehicle_input = Input(shape=shape_vehicle, name='vehicle')
model_vehicle_encoded = model_vehicle(model_vehicle_input)
### Concatenation and final model.
conc = concatenate([model_phase_encoded, model_vehicle_encoded])
hidden = Dense(128)(conc)
hidden = LeakyReLU()(hidden)
hidden = Dense(64)(hidden)
hidden = LeakyReLU()(hidden)
output = Dense(nb_actions, activation='linear')(hidden)
model = Model(inputs=[model_phase_input, model_vehicle_input], outputs=output)
### Policy, Memory & Agent set-up.
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.01, value_test=.01, nb_steps=100000)
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, policy=policy, batch_size=64, gamma=.95, nb_steps_warmup=2000, target_model_update=.001)
dqn.processor = MultiInputProcessor(2)
dqn.compile(optimizer=Adam(lr=.001))
### Fit.
hist = dqn.fit(env, nb_episodes=200, verbose=0)
```

We train the *(fixed) Target Deep Q-Network* (TDQN) model for 200 episodes and use a fixed duration program of 20 seconds as a baseline. Our best-in-class model achieves a cumulative reward of -1019065 over its full episode, slightly better than the -2512711 reward in the exact same simulation for the fixed duration alternative. We use a Savitzky–Golay filter to smooth out the results presented below. At first glance the result might not look that impressive.

Yet, remember that the reward is based on an exponential disutility function. To frame the results better, below is the same model's results on the average waiting time. The results look much more compelling here, going from an average of over 40s average waiting time to an average of less than 30s.

To conclude we present below a short clip of what the agent's actions look like after 200 episodes of training. The color represents the cumulative waiting time for each vehicle going from a low blue to a high red.

Published on 6 January 2019