MDPs, Q-Learning, Policy Gradient, DQN, Actor-Critic, PPO, environments, and real-world applications.
Reinforcement Learning (RL) trains agents to make sequential decisions by maximizing cumulative reward through trial and error interaction with an environment.
import gymnasium as gym
import numpy as np
# ── OpenAI Gym / Gymnasium Basics ──
env = gym.make('CartPole-v1', render_mode=None)
observation, info = env.reset(seed=42)
print(f"Observation space: {env.observation_space}") # Box(4,)
print(f"Action space: {env.action_space}") # Discrete(2)
for step in range(100):
    action = env.action_space.sample()  # Random action
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
# ── Key RL Concepts ──
# State (s): current environment observation
# Action (a): agent's decision
# Reward (r): feedback from environment
# Policy (pi): strategy mapping states to actions
# Value function V(s): expected cumulative reward from state s
# Q-function Q(s,a): expected reward from state s, action a
# Return: sum of discounted rewards G_t = r_t + gamma*r_{t+1} + ...
# Discount factor gamma: 0.99 typical (values future rewards slightly less)
Q-Learning learns the optimal action-value function Q*(s,a) without requiring a model of the environment. DQN extends Q-Learning with deep neural networks for high-dimensional state spaces.
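The discounted return defined in the concepts above is just a backward accumulation; a minimal sketch (`discounted_return` is an illustrative helper, not a Gym function):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for t = 0."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backward
        g = r + gamma * g
    return g

# Three steps of reward 1.0: 1 + 0.99 + 0.99**2 = 2.9701
print(discounted_return([1.0, 1.0, 1.0]))
```

Iterating backward avoids recomputing powers of gamma for every timestep.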
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np
from collections import deque
import random
# ── Q-Learning (Tabular) ──
# Q(s,a) <- Q(s,a) + alpha * [r + gamma * max Q(s',a') - Q(s,a)]
# Epsilon-greedy: epsilon probability random, 1-epsilon greedy
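The tabular update rule above can be run end to end on a toy problem. A minimal sketch on a hypothetical 5-state corridor (an illustrative environment, not a Gym one): actions move left/right, and reaching the last state pays reward 1.0.

```python
import random
import numpy as np

random.seed(0)
# Hypothetical corridor: states 0..4, actions 0=left / 1=right, reward 1.0 at state 4
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.2

def corridor_step(s, a):
    ns = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return ns, float(ns == n_states - 1), ns == n_states - 1  # next state, reward, done

for _ in range(2000):
    s = random.randrange(n_states - 1)  # random starts help values propagate backward
    for _ in range(20):
        a = random.randrange(n_actions) if random.random() < epsilon else int(Q[s].argmax())
        ns, r, done = corridor_step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[ns].max() * (not done) - Q[s, a])
        s = ns
        if done:
            break

print(Q.argmax(axis=1))  # greedy policy should prefer "right" in states 0-3
```

The learned values also reflect discounting: states farther from the goal end up with smaller Q-values.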
# ── Deep Q-Network (DQN) ──
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        return self.net(x)
# ── Replay Buffer ──
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))
# ── Training Loop ──
env = gym.make('CartPole-v1')
policy_net = DQN(4, 2)
target_net = DQN(4, 2)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)
buffer = ReplayBuffer(10000)
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01
gamma = 0.99
batch_size = 64
for episode in range(500):
    state, _ = env.reset()
    total_reward = 0
    for t in range(500):
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = policy_net(torch.FloatTensor(state)).argmax().item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        # Train from replay buffer
        if len(buffer.buffer) > batch_size:
            s, a, r, ns, d = buffer.sample(batch_size)
            s = torch.FloatTensor(s)
            a = torch.LongTensor(a)
            r = torch.FloatTensor(r)
            ns = torch.FloatTensor(ns)
            d = torch.FloatTensor(d)
            q_values = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                max_q = target_net(ns).max(1)[0]
                target_q = r + gamma * max_q * (1 - d)
            loss = nn.MSELoss()(q_values, target_q)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if done:
            break
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    # Update target network periodically
    if episode % 10 == 0:
        target_net.load_state_dict(policy_net.state_dict())

| Improvement | Description | Effect |
|---|---|---|
| Target Network | Separate slow-updating network for stable targets | Prevents oscillation, stabilizes training |
| Experience Replay | Store and sample past transitions randomly | Breaks correlation, improves data efficiency |
| Double DQN | Use policy net to select, target net to evaluate | Reduces overestimation of Q-values |
| Dueling DQN | Separate state-value and advantage streams | Better value estimation |
| Prioritized Replay | Sample important transitions more often | Faster learning on rare events |
| Noisy DQN | Noise in network weights for exploration | Eliminates need for epsilon-greedy |
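The Double DQN row changes only the target computation relative to the training loop above: the policy network *selects* the next action and the target network *evaluates* it. A minimal sketch with random placeholder networks and batch values (all names here are illustrative):

```python
import torch
import torch.nn as nn

state_dim, action_dim, batch = 4, 2, 8
policy_net = nn.Linear(state_dim, action_dim)  # placeholder for the DQN policy net
target_net = nn.Linear(state_dim, action_dim)  # placeholder for the target net
ns = torch.randn(batch, state_dim)   # next states
r = torch.randn(batch)               # rewards
d = torch.zeros(batch)               # done flags
gamma = 0.99

with torch.no_grad():
    # Select with the policy net, evaluate with the target net
    next_actions = policy_net(ns).argmax(1)
    next_q = target_net(ns).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    target = r + gamma * next_q * (1 - d)
print(target.shape)  # torch.Size([8])
```

Compare with vanilla DQN, where `target_net(ns).max(1)[0]` both selects and evaluates, which is what produces the overestimation bias.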
Policy gradient methods directly optimize a parameterized policy rather than learning action values. PPO (Proximal Policy Optimization) is among the most widely used RL algorithms at scale, notably for RLHF and robotics.
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
from torch.distributions import Categorical
# ── Actor-Critic Network ──
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.actor(state), self.critic(state)
# ── PPO Training ──
env = gym.make('CartPole-v1')
model = ActorCritic(4, 2)
optimizer = optim.Adam(model.parameters(), lr=3e-4)
gamma = 0.99
eps_clip = 0.2 # PPO clipping parameter
gae_lambda = 0.95
K_epochs = 4 # PPO update epochs
def compute_gae(rewards, values, dones, next_value):
    advantages = []
    gae = 0
    values = list(values) + [next_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t+1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    return advantages
for episode in range(1000):
    state, _ = env.reset()
    states, actions, log_probs, values, rewards, dones = [], [], [], [], [], []
    for t in range(200):
        state_t = torch.FloatTensor(state)
        with torch.no_grad():
            probs, value = model(state_t)
        dist = Categorical(probs)
        action = dist.sample()
        states.append(state_t)
        actions.append(action)
        log_probs.append(dist.log_prob(action))
        values.append(value.item())
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        dones.append(float(terminated or truncated))
        state = next_state
        if terminated or truncated:
            break
    # Bootstrap value of the final state (zero if the episode ended there)
    with torch.no_grad():
        _, next_value = model(torch.FloatTensor(next_state))
    next_value = next_value.item() * (1 - dones[-1])
    # Compute returns and advantages
    returns = []
    R = next_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    advantages = compute_gae(rewards, values, dones, next_value)
    # Convert rollout to tensors
    old_states = torch.stack(states)
    old_actions = torch.stack(actions)
    old_log_probs = torch.stack(log_probs)
    returns = torch.FloatTensor(returns)
    advantages = torch.FloatTensor(advantages)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # PPO update: re-evaluate stored states, clip the probability ratio
    for _ in range(K_epochs):
        probs, new_values = model(old_states)
        dist = Categorical(probs)
        new_log_probs = dist.log_prob(old_actions)
        ratio = torch.exp(new_log_probs - old_log_probs)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()
        critic_loss = nn.MSELoss()(new_values.squeeze(-1), returns)
        loss = actor_loss + 0.5 * critic_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# This is a minimal single-episode PPO; use stable-baselines3 for production

| Algorithm | Type | Sample Efficiency | Stability | Best For |
|---|---|---|---|---|
| Q-Learning | Value-based | Moderate | Moderate | Small discrete environments |
| DQN | Value-based (Deep) | Low | Moderate | Atari games, image inputs |
| REINFORCE | Policy gradient | Very Low | Low | Simple policy gradient baseline |
| A2C | Actor-Critic | Moderate | Moderate | Parallel environments, continuous control |
| PPO | Actor-Critic (clipped) | Moderate | High | General purpose, LLM RLHF, robotics |
| SAC | Actor-Critic (max entropy) | High | High | Continuous control, exploration |
| TD3 | Actor-Critic (twin critics) | High | High | Continuous control, deterministic policy |
| Environment | State Space | Action Space | Difficulty |
|---|---|---|---|
| CartPole-v1 | Box(4) | Discrete(2) | Easy (solved with DQN) |
| MountainCar-v0 | Box(2) | Discrete(3) | Medium (sparse reward) |
| Pendulum-v1 | Box(3) | Box(1) | Medium (continuous actions) |
| Acrobot-v1 | Box(6) | Discrete(3) | Medium |
| LunarLander-v2 | Box(8) | Discrete(4) | Easy-Medium |
| Environment | Description | State Shape | Benchmark Score |
|---|---|---|---|
| Breakout-v4 | Break bricks with paddle | (210, 160, 3) | 300+ (random: 1.2) |
| Pong-v4 | Classic Pong game | (210, 160, 3) | 20+ (random: -20) |
| SpaceInvaders-v4 | Shoot aliens | (210, 160, 3) | 1000+ (random: 150) |
| MsPacman-v4 | Maze-based eating game | (210, 160, 3) | 2000+ (random: 200) |
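DQN pipelines rarely feed the raw (210, 160, 3) frames from the table directly; they first convert to small grayscale images. Gymnasium's `AtariPreprocessing` wrapper does this properly (with frame-skipping and max-pooling); below is a minimal NumPy sketch on a dummy frame, with illustrative crop offsets:

```python
import numpy as np

def preprocess(frame):
    gray = frame.mean(axis=2)       # RGB (210, 160, 3) -> grayscale (210, 160)
    cropped = gray[34:194]          # drop score bar / bottom border -> (160, 160)
    downsampled = cropped[::2, ::2] # stride-2 subsample -> (80, 80)
    return (downsampled / 255.0).astype(np.float32)  # scale to [0, 1]

frame = np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8)  # dummy Atari frame
print(preprocess(frame).shape)  # (80, 80)
```

In practice several consecutive preprocessed frames are stacked along the channel axis so the network can infer velocities.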