MDPs, Q-Learning, Policy Gradient, DQN, Actor-Critic, PPO, environments, and real-world applications.
Reinforcement Learning (RL) trains agents to make sequential decisions by maximizing cumulative reward through trial and error interaction with an environment.
import gymnasium as gym
import numpy as np
# ── OpenAI Gym / Gymnasium Basics ──
env = gym.make('CartPole-v1', render_mode=None)
observation, info = env.reset(seed=42)
print(f"Observation space: {env.observation_space}") # Box(4,)
print(f"Action space: {env.action_space}") # Discrete(2)
for step in range(100):
    action = env.action_space.sample()  # Random action
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
# ── Key RL Concepts ──
# State (s): current environment observation
# Action (a): agent's decision
# Reward (r): feedback from environment
# Policy (pi): strategy mapping states to actions
# Value function V(s): expected cumulative reward from state s
# Q-function Q(s,a): expected reward from state s, action a
# Return: sum of discounted rewards G_t = r_t + gamma*r_{t+1} + ...
# Discount factor gamma: 0.99 typical (values future rewards slightly less)
Q-Learning learns the optimal action-value function Q*(s,a) without requiring a model of the environment. DQN extends Q-Learning with deep neural networks for high-dimensional state spaces.
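The discounted return defined in the concepts above is just a backward accumulation; a minimal sketch (`discounted_return` is an illustrative helper, not a Gym function):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for t = 0."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backward
        g = r + gamma * g
    return g

# Three steps of reward 1.0: 1 + 0.99 + 0.99**2 = 2.9701
print(discounted_return([1.0, 1.0, 1.0]))
```

Iterating backward avoids recomputing powers of gamma for every timestep.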
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np
from collections import deque
import random
# ── Q-Learning (Tabular) ──
# Q(s,a) <- Q(s,a) + alpha * [r + gamma * max Q(s',a') - Q(s,a)]
# Epsilon-greedy: epsilon probability random, 1-epsilon greedy
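The tabular update rule above can be run end to end on a toy problem. A minimal sketch on a hypothetical 5-state corridor (an illustrative environment, not a Gym one): actions move left/right, and reaching the last state pays reward 1.0.

```python
import random
import numpy as np

random.seed(0)
# Hypothetical corridor: states 0..4, actions 0=left / 1=right, reward 1.0 at state 4
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.2

def corridor_step(s, a):
    ns = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return ns, float(ns == n_states - 1), ns == n_states - 1  # next state, reward, done

for _ in range(2000):
    s = random.randrange(n_states - 1)  # random starts help values propagate backward
    for _ in range(20):
        a = random.randrange(n_actions) if random.random() < epsilon else int(Q[s].argmax())
        ns, r, done = corridor_step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[ns].max() * (not done) - Q[s, a])
        s = ns
        if done:
            break

print(Q.argmax(axis=1))  # greedy policy should prefer "right" in states 0-3
```

The learned values also reflect discounting: states farther from the goal end up with smaller Q-values.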
# ── Deep Q-Network (DQN) ──
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        return self.net(x)
# ── Replay Buffer ──
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))
# ── Training Loop ──
env = gym.make('CartPole-v1')
policy_net = DQN(4, 2)
target_net = DQN(4, 2)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)
buffer = ReplayBuffer(10000)
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01
gamma = 0.99
batch_size = 64
for episode in range(500):
    state, _ = env.reset()
    total_reward = 0
    for t in range(500):
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = policy_net(torch.FloatTensor(state)).argmax().item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        # Train from replay buffer
        if len(buffer.buffer) > batch_size:
            s, a, r, ns, d = buffer.sample(batch_size)
            s = torch.FloatTensor(s)
            a = torch.LongTensor(a)
            r = torch.FloatTensor(r)
            ns = torch.FloatTensor(ns)
            d = torch.FloatTensor(d)
            q_values = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                max_q = target_net(ns).max(1)[0]
                target_q = r + gamma * max_q * (1 - d)
            loss = nn.MSELoss()(q_values, target_q)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if done:
            break
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    # Update target network periodically
    if episode % 10 == 0:
        target_net.load_state_dict(policy_net.state_dict())

| Improvement | Description | Effect |
|---|---|---|
| Target Network | Separate slow-updating network for stable targets | Prevents oscillation, stabilizes training |
| Experience Replay | Store and sample past transitions randomly | Breaks correlation, improves data efficiency |
| Double DQN | Use policy net to select, target net to evaluate | Reduces overestimation of Q-values |
| Dueling DQN | Separate state-value and advantage streams | Better value estimation |
| Prioritized Replay | Sample important transitions more often | Faster learning on rare events |
| Noisy DQN | Noise in network weights for exploration | Eliminates need for epsilon-greedy |
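The Double DQN row changes only the target computation relative to the training loop above: the policy network *selects* the next action and the target network *evaluates* it. A minimal sketch with random placeholder networks and batch values (all names here are illustrative):

```python
import torch
import torch.nn as nn

state_dim, action_dim, batch = 4, 2, 8
policy_net = nn.Linear(state_dim, action_dim)  # placeholder for the DQN policy net
target_net = nn.Linear(state_dim, action_dim)  # placeholder for the target net
ns = torch.randn(batch, state_dim)   # next states
r = torch.randn(batch)               # rewards
d = torch.zeros(batch)               # done flags
gamma = 0.99

with torch.no_grad():
    # Select with the policy net, evaluate with the target net
    next_actions = policy_net(ns).argmax(1)
    next_q = target_net(ns).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    target = r + gamma * next_q * (1 - d)
print(target.shape)  # torch.Size([8])
```

Compare with vanilla DQN, where `target_net(ns).max(1)[0]` both selects and evaluates, which is what produces the overestimation bias.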
Policy gradient methods directly optimize a parameterized policy rather than learning action values. PPO (Proximal Policy Optimization) is among the most widely used RL algorithms at scale, notably for RLHF and robotics.
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
from torch.distributions import Categorical
# ── Actor-Critic Network ──
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.actor(state), self.critic(state)
# ── PPO Training ──
env = gym.make('CartPole-v1')
model = ActorCritic(4, 2)
optimizer = optim.Adam(model.parameters(), lr=3e-4)
gamma = 0.99
eps_clip = 0.2 # PPO clipping parameter
gae_lambda = 0.95
K_epochs = 4 # PPO update epochs
def compute_gae(rewards, values, dones, next_value):
    advantages = []
    gae = 0
    values = list(values) + [next_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t+1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    return advantages
for episode in range(1000):
    state, _ = env.reset()
    states, actions, log_probs, values, rewards, dones = [], [], [], [], [], []
    for t in range(200):
        state_t = torch.FloatTensor(state)
        with torch.no_grad():
            probs, value = model(state_t)
        dist = Categorical(probs)
        action = dist.sample()
        states.append(state_t)
        actions.append(action)
        log_probs.append(dist.log_prob(action))
        values.append(value.item())
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        dones.append(float(terminated or truncated))
        state = next_state
        if terminated or truncated:
            break
    # Bootstrap value of the final state (zero if the episode ended there)
    with torch.no_grad():
        _, next_value = model(torch.FloatTensor(next_state))
    next_value = next_value.item() * (1 - dones[-1])
    # Compute returns and advantages
    returns = []
    R = next_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    advantages = compute_gae(rewards, values, dones, next_value)
    # Convert rollout to tensors
    old_states = torch.stack(states)
    old_actions = torch.stack(actions)
    old_log_probs = torch.stack(log_probs)
    returns = torch.FloatTensor(returns)
    advantages = torch.FloatTensor(advantages)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # PPO update: re-evaluate stored states, clip the probability ratio
    for _ in range(K_epochs):
        probs, new_values = model(old_states)
        dist = Categorical(probs)
        new_log_probs = dist.log_prob(old_actions)
        ratio = torch.exp(new_log_probs - old_log_probs)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()
        critic_loss = nn.MSELoss()(new_values.squeeze(-1), returns)
        loss = actor_loss + 0.5 * critic_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# This is a minimal single-episode PPO; use stable-baselines3 for production

| Algorithm | Type | Sample Efficiency | Stability | Best For |
|---|---|---|---|---|
| Q-Learning | Value-based | Moderate | Moderate | Small discrete environments |
| DQN | Value-based (Deep) | Low | Moderate | Atari games, image inputs |
| REINFORCE | Policy gradient | Very Low | Low | Simple policy gradient baseline |
| A2C | Actor-Critic | Moderate | Moderate | Parallel environments, continuous control |
| PPO | Actor-Critic (clipped) | Moderate | High | General purpose, LLM RLHF, robotics |
| SAC | Actor-Critic (max entropy) | High | High | Continuous control, exploration |
| TD3 | Actor-Critic (twin critics) | High | High | Continuous control, deterministic policy |
| Environment | State Space | Action Space | Difficulty |
|---|---|---|---|
| CartPole-v1 | Box(4) | Discrete(2) | Easy (solved with DQN) |
| MountainCar-v0 | Box(2) | Discrete(3) | Medium (sparse reward) |
| Pendulum-v1 | Box(3) | Box(1) | Medium (continuous actions) |
| Acrobot-v1 | Box(6) | Discrete(3) | Medium |
| LunarLander-v2 | Box(8) | Discrete(4) | Easy-Medium |
| Environment | Description | State Shape | Benchmark Score |
|---|---|---|---|
| Breakout-v4 | Break bricks with paddle | (210, 160, 3) | 300+ (random: 1.2) |
| Pong-v4 | Classic Pong game | (210, 160, 3) | 20+ (random: -20) |
| SpaceInvaders-v4 | Shoot aliens | (210, 160, 3) | 1000+ (random: 150) |
| MsPacman-v4 | Maze-based eating game | (210, 160, 3) | 2000+ (random: 200) |
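DQN pipelines rarely feed the raw (210, 160, 3) frames from the table directly; they first convert to small grayscale images. Gymnasium's `AtariPreprocessing` wrapper does this properly (with frame-skipping and max-pooling); below is a minimal NumPy sketch on a dummy frame, with illustrative crop offsets:

```python
import numpy as np

def preprocess(frame):
    gray = frame.mean(axis=2)       # RGB (210, 160, 3) -> grayscale (210, 160)
    cropped = gray[34:194]          # drop score bar / bottom border -> (160, 160)
    downsampled = cropped[::2, ::2] # stride-2 subsample -> (80, 80)
    return (downsampled / 255.0).astype(np.float32)  # scale to [0, 1]

frame = np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8)  # dummy Atari frame
print(preprocess(frame).shape)  # (80, 80)
```

In practice several consecutive preprocessed frames are stacked along the channel axis so the network can infer velocities.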