UC Berkeley EECS MEng '21

Published May 21, 2021

This article summarizes my final project for UC Berkeley’s CS 267, Applications of Parallel Computers. Our report can be accessed here. The main idea of the project is to speed up reinforcement learning (RL) training by parallelizing the execution of environments. I implemented and tested one parallelized CartPole environment using PyTorch tensors and another using Numba CUDA kernels, then ran scaling tests on both alongside Stable Baselines’ vectorized environments.

Stable Baselines’ multiprocess vectorized environment (`SubprocVecEnv`) leverages Python’s multiprocessing library, which spawns worker processes. Each worker process executes one environment and sends the results to the main process, which introduces overhead from inter-process communication.

RLlib provides remote environments, which create env instances in Ray actors and step them in parallel. These remote processes also introduce communication overhead and only help if the environment is very expensive to step or reset.

NVIDIA has announced Isaac Gym, a physics simulation environment for reinforcement learning research. Isaac Gym enables a complete end-to-end GPU RL pipeline by leveraging NVIDIA’s PhysX GPU-accelerated simulation engine, which allows it to gather the experience data required for robotics RL. Isaac Gym also lets observation and reward calculations take place on the GPU, which eliminates costly data transfers between the GPU and the CPU and so avoids a significant performance bottleneck. NVIDIA claims that researchers can achieve the same level of success as OpenAI’s supercomputer on a single GPU in about 10 hours.

I used the CartPole environment for testing. I first tested Stable Baselines’ vectorized environments, which wrap multiple environments and call each one in sequence in the current Python process. In the experiment, the number of environments was capped at 32 on Google Colab, and the run time of an environment step increased almost linearly with the number of environments even below that cap, so this approach could not improve throughput effectively.
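
To make the linear scaling concrete, here is a minimal sketch of a sequential vectorized wrapper. `ToyEnv`, `SequentialVecEnv`, and `time_step` are hypothetical names for illustration, not Stable Baselines classes. The sketch only shows that one `step` call loops over every environment in the current process, so its cost grows with the number of environments.

```python
import time


class ToyEnv:
    """Hypothetical stand-in for an environment; not a Gym or Stable Baselines class."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        # The episode ends after 10 steps; the action is ignored in this toy.
        self.t += 1
        done = self.t >= 10
        return self.t, 1.0, done, {}


class SequentialVecEnv:
    """Steps each wrapped environment one after another in the current process."""

    def __init__(self, num_envs):
        self.envs = [ToyEnv() for _ in range(num_envs)]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, _ = env.step(action)
            if done:
                obs = env.reset()  # auto-reset finished environments
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


def time_step(num_envs, repeats=200):
    """Average wall-clock seconds for one vectorized step."""
    vec_env = SequentialVecEnv(num_envs)
    vec_env.reset()
    actions = [0] * num_envs
    start = time.perf_counter()
    for _ in range(repeats):
        vec_env.step(actions)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    for n in (1, 8, 32):
        print(n, time_step(n))
```

Because the loop body runs once per environment, adding environments adds work to every step instead of spreading it across cores.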

Inspired by this post, I rewrote CartPole-v1 using PyTorch tensor operations. All state vectors are stored and all calculations are performed on a CUDA GPU, which significantly improves performance. However, the tensor operations are executed one after another, which means threads that finish an operation early have to wait for the other threads to finish.

```
import math

import numpy as np
import torch
from gym import spaces


class CartPole:
    """Batched CartPole-v1: env_count environments stepped together as tensors."""

    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 50
    }

    def __init__(self, env_count=1, device="cpu"):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = (self.masspole + self.masscart)
        self.length = 0.5  # actually half the pole's length
        self.polemass_length = (self.masspole * self.length)
        self.force_mag = 10.0
        self.tau = 0.02  # seconds between state updates
        self.kinematics_integrator = 'euler'
        # Angle at which to fail the episode
        self.theta_threshold_radians = 12 * 2 * math.pi / 360
        self.x_threshold = 2.4
        self.env_count = env_count
        # Angle limit set to 2 * theta_threshold_radians so failing observation
        # is still within bounds.
        high = np.array([self.x_threshold * 2,
                         np.finfo(np.float32).max,
                         self.theta_threshold_radians * 2,
                         np.finfo(np.float32).max],
                        dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.seed()
        self.viewer = None
        self.device = device
        # Every environment starts as "done" so the first reset() initializes all of them.
        self.done = torch.full([env_count], True, dtype=torch.bool, device=device)
        self.state = torch.zeros([self.env_count, 4], dtype=torch.float32, device=device)

    def seed(self, seed=None):
        return [seed]

    def step(self, action, auto_reset=True):
        # All envs must already have been reset.
        # `action` is a tensor of 0/1 values, one per environment, on self.device.
        self.done[:] = False
        x, x_dot, theta, theta_dot = self.state[:, 0], self.state[:, 1], self.state[:, 2], self.state[:, 3]
        force = self.force_mag * ((action * 2.) - 1.)
        costheta = torch.cos(theta)
        sintheta = torch.sin(theta)
        # For the interested reader:
        # https://coneural.org/florian/papers/05_cart_pole.pdf
        temp = (force + self.polemass_length * theta_dot ** 2 * sintheta) / self.total_mass
        thetaacc = ((self.gravity * sintheta - costheta * temp)
                    / (self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass)))
        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
        if self.kinematics_integrator == 'euler':
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot
        self.state[:, 0], self.state[:, 1], self.state[:, 2], self.state[:, 3] = x, x_dot, theta, theta_dot
        # An environment is done once the cart or the pole leaves the allowed range.
        self.done = (
            (x < -self.x_threshold)
            | (x > self.x_threshold)
            | (theta < -self.theta_threshold_radians)
            | (theta > self.theta_threshold_radians)
        )
        # Boolean reward: True for every environment that is still running.
        reward = ~self.done
        if auto_reset:
            self.state = self.reset()
        return self.state, reward, self.done, {}

    def reset(self):
        # Re-sample only the environments flagged as done; the rest keep their state.
        fresh = (torch.rand(self.env_count, 4, device=self.device) - 0.5) / 10.
        self.state = torch.where(self.done.unsqueeze(1), fresh, self.state)
        return self.state
```
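
The auto-reset in `reset` relies on a per-environment mask: only rows whose `done` flag is set receive a fresh random state, while the others keep their current one. The sketch below isolates that idea, using NumPy's `np.where` in place of `torch.where` so that it runs without PyTorch or a GPU. `masked_reset` is a hypothetical helper name, not part of the class above.

```python
import numpy as np


def masked_reset(state, done, rng):
    """Return a copy of `state` in which rows flagged in `done` are re-sampled.

    state: float32 array of shape (env_count, 4)
    done:  bool array of shape (env_count,)
    rng:   numpy.random.Generator used to draw the fresh start states
    """
    # Same start distribution as the class above: uniform in [-0.05, 0.05).
    fresh = (rng.random(state.shape, dtype=np.float32) - 0.5) / 10.0
    # done[:, None] has shape (env_count, 1) and broadcasts across the 4 state columns.
    return np.where(done[:, None], fresh, state)


rng = np.random.default_rng(0)
state = np.full((3, 4), 2.0, dtype=np.float32)
done = np.array([True, False, True])
new_state = masked_reset(state, done, rng)
print(new_state)
```

Rows 0 and 2 come back with small random values, while row 1 keeps its value of 2.0. Note that random numbers are drawn for every row and then discarded for the rows that are not done, which trades some wasted work for a branch-free update.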

Numba’s CUDA kernels provide a convenient interface for using the GPU from a Python program. This approach gave the best performance among the three approaches. There is one hyperparameter to specify: the number of threads per block (TPB). The number of thread blocks can then be calculated from the tensor size and TPB. I tested the run time of a single environment step with different TPBs and found that values of 64 and above give the best performance.

```
import math

import numpy as np
from gym import spaces
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

rng_limit = int(1e7)  # upper bound on the number of RNG states allocated
# True: explicit Euler; False: semi-implicit Euler.
# A plain boolean avoids string handling inside the CUDA kernel.
use_euler = True


@cuda.jit(device=True)
def sample_start_state(state, rng_states, i):
    # Draw each of the 4 state variables uniformly from [-0.05, 0.05).
    for j in range(4):
        state[i, j] = (xoroshiro128p_uniform_float32(rng_states, i % rng_limit) - 0.5) / 10.


@cuda.jit
def reset_kernel(state, rng_states):
    i = cuda.grid(1)
    if i < state.shape[0]:  # threads past the last environment do nothing
        sample_start_state(state, rng_states, i)


@cuda.jit
def step_kernel(state, action, done, reward, rng_states):
    gravity = 9.8
    masscart = 1.0
    masspole = 0.1
    total_mass = (masspole + masscart)
    length = 0.5  # actually half the pole's length
    polemass_length = (masspole * length)
    force_mag = 10.0
    tau = 0.02  # seconds between state updates
    # Angle at which to fail the episode
    theta_threshold_radians = 12 * 2 * math.pi / 360
    x_threshold = 2.4
    i = cuda.grid(1)  # one thread per environment
    if i < state.shape[0]:
        # All envs must already have been reset.
        x, x_dot, theta, theta_dot = state[i, 0], state[i, 1], state[i, 2], state[i, 3]
        force = force_mag * ((action[i] * 2.) - 1.)
        costheta = math.cos(theta)
        sintheta = math.sin(theta)
        # For the interested reader:
        # https://coneural.org/florian/papers/05_cart_pole.pdf
        temp = (force + polemass_length * theta_dot ** 2 * sintheta) / total_mass
        thetaacc = ((gravity * sintheta - costheta * temp)
                    / (length * (4.0 / 3.0 - masspole * costheta ** 2 / total_mass)))
        xacc = temp - polemass_length * thetaacc * costheta / total_mass
        if use_euler:
            x = x + tau * x_dot
            x_dot = x_dot + tau * xacc
            theta = theta + tau * theta_dot
            theta_dot = theta_dot + tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + tau * xacc
            x = x + tau * x_dot
            theta_dot = theta_dot + tau * thetaacc
            theta = theta + tau * theta_dot
        state[i, 0] = x
        state[i, 1] = x_dot
        state[i, 2] = theta
        state[i, 3] = theta_dot
        failed = (
            x < -x_threshold
            or x > x_threshold
            or theta < -theta_threshold_radians
            or theta > theta_threshold_radians
        )
        done[i] = failed
        # Reward is 1 while the environment is still running, 0 once it has failed.
        reward[i] = 0.0 if failed else 1.0
        if failed:
            # Auto-reset the finished environment inside the same kernel launch.
            sample_start_state(state, rng_states, i)


class CartPole2:
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 50
    }

    def __init__(self, env_count=1, tpb=1):
        # The physics constants live in step_kernel; only the values needed
        # for the observation space are kept here.
        self.theta_threshold_radians = 12 * 2 * math.pi / 360
        self.x_threshold = 2.4
        self.env_count = env_count
        self.num_envs = env_count
        self.tpb = tpb  # threads per block
        # Angle limit set to 2 * theta_threshold_radians so failing observation
        # is still within bounds.
        high = np.array([self.x_threshold * 2,
                         np.finfo(np.float32).max,
                         self.theta_threshold_radians * 2,
                         np.finfo(np.float32).max],
                        dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.seed()
        self.viewer = None
        # np.bool_ replaces the np.bool alias, which newer NumPy releases removed.
        self.done = cuda.to_device(np.full([env_count], True, dtype=np.bool_))
        self.state = cuda.device_array((env_count, 4), dtype=np.float32)
        self.reward = cuda.to_device(np.zeros((env_count,)))
        self.rng_states = create_xoroshiro128p_states(min(env_count, rng_limit), seed=0)

    def seed(self, seed=None):
        return [seed]

    def step(self, action, copy=True):
        # `action` holds one 0/1 value per environment (NumPy or device array).
        blocks = math.ceil(self.env_count / self.tpb)
        step_kernel[blocks, self.tpb](self.state, action, self.done, self.reward, self.rng_states)
        done = self.done.copy_to_host() if copy else self.done
        return self.state, self.reward, done, {}

    def reset(self):
        blocks = math.ceil(self.env_count / self.tpb)
        reset_kernel[blocks, self.tpb](self.state, self.rng_states)
        return self.state
```
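
The launch configuration used in `step` and `reset` follows directly from the environment count and TPB. As a small worked example in plain Python (no GPU required), the sketch below computes the number of blocks and the number of surplus threads that the bounds check `i < state.shape[0]` has to mask out. `launch_config` and `idle_threads` are hypothetical helper names, and the environment count of 1000 is an arbitrary illustration.

```python
import math


def launch_config(env_count, threads_per_block):
    """Return (blocks_per_grid, threads_per_block) covering env_count threads."""
    blocks_per_grid = math.ceil(env_count / threads_per_block)
    return blocks_per_grid, threads_per_block


def idle_threads(env_count, threads_per_block):
    """Threads launched beyond env_count; the kernel's bounds check masks them."""
    blocks, tpb = launch_config(env_count, threads_per_block)
    return blocks * tpb - env_count


for tpb in (1, 32, 64, 256):
    print(tpb, launch_config(1000, tpb), idle_threads(1000, tpb))
```

With 1000 environments, a TPB of 64 gives 16 blocks and 24 surplus threads. A TPB of 1 launches no surplus threads, but each block then holds a single thread, and NVIDIA GPUs schedule threads in warps of 32, so most of each warp sits idle. That is one plausible reason why larger TPB values performed better in the test, though the measurements above do not isolate the cause.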

Another student on our project team tested the environments with an on-policy RL algorithm, PPO. The results showed that although more parallelization yields more samples, it does not necessarily improve the training rewards. One possible reason is that the learner processed all the transitions in one batch. One way to avoid this is to use a distributed RL algorithm such as R2D2.
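
A rough way to see why a single large batch could limit learning is to count gradient updates per iteration. The sketch below is an illustration under assumed values, not our experimental setup: `updates_per_iteration` is a hypothetical helper, and the rollout length of 128, the 4 epochs, and the minibatch size of 256 are arbitrary numbers chosen for the example.

```python
import math


def updates_per_iteration(num_envs, rollout_length, epochs, minibatch_size=None):
    """Gradient updates in one PPO-style iteration.

    minibatch_size=None models a learner that consumes the whole rollout as one batch.
    """
    batch_size = num_envs * rollout_length
    if minibatch_size is None:
        minibatches = 1
    else:
        minibatches = math.ceil(batch_size / minibatch_size)
    return epochs * minibatches


for num_envs in (1, 32, 1024):
    samples = num_envs * 128
    single = updates_per_iteration(num_envs, 128, epochs=4)
    split = updates_per_iteration(num_envs, 128, epochs=4, minibatch_size=256)
    print(num_envs, samples, single, split)
```

Under these assumptions, a single-batch learner performs 4 updates per iteration whether it collects 128 samples or 131,072, so the extra samples from more environments yield no additional updates. Splitting the same data into minibatches makes the number of updates grow with the sample count.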

Using CUDA can improve the throughput of reinforcement learning by speeding up environment steps. The next step is to test its impact on training time and rewards for different RL algorithms.