Deep Deterministic Policy Gradient (DDPG)
Overview
DDPG is a popular DRL algorithm for continuous control. It is an off-policy actor-critic method: a deterministic actor outputs continuous actions directly, and a Q-network critic is trained with a replay buffer and target networks, as in DQN. Unlike DQN, it supports only continuous (Box) action spaces.
Original paper:
Reference resources:
Implemented Variants
Variants Implemented | Description |
---|---|
ddpg_continuous_action.py, docs | For continuous action space. Also implemented MuJoCo-specific code-level optimizations |
Below are our single-file implementations of DDPG:
ddpg_continuous_action.py
The ddpg_continuous_action.py has the following features:
- For continuous action space. Also implemented MuJoCo-specific code-level optimizations
- Works with the Box observation space of low-level features
- Works with the Box (continuous) action space
Usage
poetry install
poetry install -E pybullet
python cleanrl/ddpg_continuous_action.py --help
python cleanrl/ddpg_continuous_action.py --env-id HopperBulletEnv-v0
poetry install -E mujoco # only works in Linux
python cleanrl/ddpg_continuous_action.py --env-id Hopper-v3
Implementation details
Our ddpg_continuous_action.py is based on the OurDDPG.py from sfujim/TD3, which presents the following implementation differences from (Lillicrap et al., 2016)1:
- ddpg_continuous_action.py uses a Gaussian exploration noise \(\mathcal{N}(0, 0.1)\), while (Lillicrap et al., 2016)1 uses an Ornstein-Uhlenbeck process with \(\theta=0.15\) and \(\sigma=0.2\).
- ddpg_continuous_action.py runs the experiments using the openai/gym MuJoCo environments, while (Lillicrap et al., 2016)1 uses their proprietary MuJoCo environments.
- ddpg_continuous_action.py uses the following architecture:

      class QNetwork(nn.Module):
          def __init__(self, env):
              super(QNetwork, self).__init__()
              # Observation and action are concatenated before the first layer.
              self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod() + np.prod(env.single_action_space.shape), 256)
              self.fc2 = nn.Linear(256, 256)
              self.fc3 = nn.Linear(256, 1)

          def forward(self, x, a):
              x = torch.cat([x, a], 1)
              x = F.relu(self.fc1(x))
              x = F.relu(self.fc2(x))
              x = self.fc3(x)
              return x

      class Actor(nn.Module):
          def __init__(self, env):
              super(Actor, self).__init__()
              self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
              self.fc2 = nn.Linear(256, 256)
              self.fc_mu = nn.Linear(256, np.prod(env.single_action_space.shape))

          def forward(self, x):
              x = F.relu(self.fc1(x))
              x = F.relu(self.fc2(x))
              return torch.tanh(self.fc_mu(x))

  while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses the following architecture, which differs in the hidden layer sizes (400 and 300) and in feeding the action into the second layer of the Q-network:

      class QNetwork(nn.Module):
          def __init__(self, env):
              super(QNetwork, self).__init__()
              # The first layer sees only the observation.
              self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
              # The action is concatenated to the input of the second layer.
              self.fc2 = nn.Linear(400 + np.prod(env.single_action_space.shape), 300)
              self.fc3 = nn.Linear(300, 1)

          def forward(self, x, a):
              x = F.relu(self.fc1(x))
              x = torch.cat([x, a], 1)
              x = F.relu(self.fc2(x))
              x = self.fc3(x)
              return x

      class Actor(nn.Module):
          def __init__(self, env):
              super(Actor, self).__init__()
              self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
              self.fc2 = nn.Linear(400, 300)
              self.fc_mu = nn.Linear(300, np.prod(env.single_action_space.shape))

          def forward(self, x):
              x = F.relu(self.fc1(x))
              x = F.relu(self.fc2(x))
              return torch.tanh(self.fc_mu(x))
- ddpg_continuous_action.py uses the following learning rates:

      q_optimizer = optim.Adam(list(qf1.parameters()), lr=3e-4)
      actor_optimizer = optim.Adam(list(actor.parameters()), lr=3e-4)

  while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses a learning rate of 1e-3 for the critic and 1e-4 for the actor:

      q_optimizer = optim.Adam(list(qf1.parameters()), lr=1e-3)
      actor_optimizer = optim.Adam(list(actor.parameters()), lr=1e-4)

- ddpg_continuous_action.py uses --batch-size=256 --tau=0.005, while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses --batch-size=64 --tau=0.001.
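Two of the differences listed above, the exploration noise and the soft target update rate, can be illustrated with a minimal standard-library sketch. This is not the code from ddpg_continuous_action.py: the helper names (`gaussian_noise`, `OrnsteinUhlenbeckNoise`, `soft_update`, `half_life`, `lag1_autocorrelation`) are hypothetical, the OU step size `dt = 1` and mean `mu = 0` are assumptions, and 0.1 in \(\mathcal{N}(0, 0.1)\) is assumed to be a standard deviation.

```python
import math
import random


def gaussian_noise(rng, sigma=0.1):
    """Uncorrelated exploration noise: an independent draw per step.

    Assumption: sigma is a standard deviation.
    """
    return rng.gauss(0.0, sigma)


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise that is pulled back toward mu.

    Euler step: x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1).
    theta and sigma follow the values quoted above; mu and dt are assumptions.
    """

    def __init__(self, rng, theta=0.15, sigma=0.2, mu=0.0, dt=1.0):
        self.rng = rng
        self.theta = theta
        self.sigma = sigma
        self.mu = mu
        self.dt = dt
        self.x = mu

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]


def half_life(tau):
    """Number of soft updates after which half of the initial gap remains."""
    return math.log(0.5) / math.log(1.0 - tau)


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive elements of xs."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


rng = random.Random(0)
gaussian_samples = [gaussian_noise(rng) for _ in range(5000)]
ou = OrnsteinUhlenbeckNoise(rng)
ou_samples = [ou.sample() for _ in range(5000)]

print("lag-1 autocorrelation (Gaussian):", round(lag1_autocorrelation(gaussian_samples), 2))
print("lag-1 autocorrelation (OU):", round(lag1_autocorrelation(ou_samples), 2))
print("half-life in updates, tau=0.005:", round(half_life(0.005)))
print("half-life in updates, tau=0.001:", round(half_life(0.001)))
```

With `dt = 1`, the OU process is an AR(1) sequence with coefficient \(1 - \theta = 0.85\), so consecutive samples are strongly correlated, whereas the Gaussian draws are independent. For the target networks, \(\tau = 0.005\) closes half of the gap to the online parameters in roughly 138 updates, against roughly 693 updates for \(\tau = 0.001\).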
Experiment results
PR vwxyzjn/cleanrl#120 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/ppo.
Below are the average episodic returns for ppo.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.
Environment | ppo.py | openai/baselines' PPO |
---|---|---|
CartPole-v1 | 488.75 ± 18.40 | 497.54 ± 4.02 |
Acrobot-v1 | -82.48 ± 5.93 | -81.82 ± 5.58 |
MountainCar-v0 | -200.00 ± 0.00 | -200.00 ± 0.00 |
Learning curves:
Tracked experiments and game play videos: