
Module 11: Learning-Based Control

Introduction

Learning enables robots to acquire skills that are difficult to program explicitly. This module covers reinforcement learning, imitation learning, and their application to robotic control.

Section 1: Reinforcement Learning Foundations

1.1 Markov Decision Processes

Markov Decision Process (MDP): A mathematical framework for sequential decision-making defined by states, actions, transitions, and rewards.

The optimal state-value function satisfies the Bellman optimality equation:

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \right]
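
As a concrete illustration, here is a minimal value-iteration sketch that solves this equation for a tabular MDP. The transition tensor P, reward matrix R, and discount gamma are hypothetical inputs for illustration, not defined elsewhere in this module:

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    # P: transition probabilities, shape (S, A, S'); R: rewards, shape (S, A)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new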

1.2 Policy Gradient Methods

def policy_gradient_update(policy, optimizer, trajectories):
    # One REINFORCE update: gradient ascent on return-weighted log-likelihood
    optimizer.zero_grad()
    loss = 0.0
    for traj in trajectories:
        returns = compute_returns(traj.rewards)  # discounted return-to-go per step
        for s, a, R in zip(traj.states, traj.actions, returns):
            log_prob = policy.log_prob(s, a)
            loss -= log_prob * R  # REINFORCE: minimizing -log_prob * R maximizes return
    loss.backward()
    optimizer.step()
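
The compute_returns helper above is not defined in this module; a plausible implementation of the discounted return-to-go might look like this:

def compute_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backward over the trajectory
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns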

Section 2: Deep RL Algorithms

2.1 PPO (Proximal Policy Optimization)

Stable policy updates through clipping:

L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \right) \right]

where r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t) is the probability ratio and \hat{A}_t is an advantage estimate.
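
A minimal sketch of the clipped surrogate loss in PyTorch-style code; the tensor names (log_probs, old_log_probs, advantages) are assumptions for illustration:

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Negate: optimizers minimize, but we want to maximize the surrogate
    return -torch.min(unclipped, clipped).mean()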

2.2 SAC (Soft Actor-Critic)

SAC maximizes expected return plus an entropy bonus, which encourages exploration and yields robust policies. Because it is off-policy, it reuses past experience and is comparatively sample-efficient, which suits robotics.
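
The maximum entropy objective, in its standard formulation (the temperature \alpha trades off reward against entropy):

J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \right]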

RL is sample-hungry: deep RL algorithms often need millions of environment interactions, so policies are typically trained in simulation and then transferred to hardware (sim-to-real).

Section 3: Imitation Learning

3.1 Behavioral Cloning

Supervised learning from demonstrations:

import torch
import torch.nn.functional as F

def behavioral_cloning(demonstrations, epochs=10, lr=1e-3):
    model = PolicyNetwork()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for state, action in demonstrations:
            optimizer.zero_grad()
            pred_action = model(state)
            loss = F.mse_loss(pred_action, action)  # regress onto expert actions
            loss.backward()
            optimizer.step()
    return model

3.2 DAgger

Dataset Aggregation (DAgger) addresses the distribution shift that behavioral cloning suffers from: the learned policy drifts into states the expert never demonstrated. DAgger rolls out the current policy, asks the expert to label the states the policy actually visits, aggregates those labels into the dataset, and retrains.
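
A minimal DAgger loop sketch; collect_expert_demos and rollout are assumed helpers, and behavioral_cloning is the function defined above:

def dagger(env, expert, iterations=10):
    dataset = collect_expert_demos(env, expert)  # seed with expert data (assumed helper)
    policy = behavioral_cloning(dataset)
    for _ in range(iterations):
        states = rollout(env, policy)            # visit states under the learner's policy
        labels = [expert(s) for s in states]     # expert relabels those states
        dataset += list(zip(states, labels))     # aggregate old and new data
        policy = behavioral_cloning(dataset)     # retrain on the full dataset
    return policy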

Section 4: Reward Engineering

4.1 Reward Design

Challenges:

  • Sparse rewards: the agent receives signal only on success, so exploration can stall
  • Dense, hand-shaped rewards: easier to optimize but prone to reward hacking, where the agent exploits the reward function instead of solving the task (a shaped-reward sketch follows this list)
  • Multi-objective tradeoffs: speed, accuracy, energy use, and safety must be balanced within a single scalar
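
As a sketch, a shaped reward for a hypothetical reaching task; the terms, names, and coefficients are illustrative assumptions, not from this module:

import numpy as np

def reaching_reward(ee_pos, goal_pos, action, reached_tol=0.02):
    # Dense shaping term: negative distance pulls the end-effector toward the goal
    dist = np.linalg.norm(ee_pos - goal_pos)
    reward = -dist
    reward -= 0.01 * np.sum(np.square(action))  # penalize large torques (energy/smoothness)
    if dist < reached_tol:
        reward += 10.0                          # sparse bonus on task success
    return reward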

4.2 Learning from Preferences

When a reward is hard to specify by hand, it can be learned: collect human preferences between pairs of trajectory segments, fit a reward model that explains those judgments, and then run RL against the learned reward.
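
A common formulation fits the reward model with a Bradley-Terry loss over preferred/non-preferred segment pairs. This sketch assumes a reward_model that maps a trajectory segment to per-step scalar rewards:

import torch
import torch.nn.functional as F

def preference_loss(reward_model, seg_a, seg_b, human_prefers_a):
    # Sum predicted reward over each trajectory segment
    r_a = reward_model(seg_a).sum()
    r_b = reward_model(seg_b).sum()
    # Bradley-Terry: P(a preferred) = exp(r_a) / (exp(r_a) + exp(r_b))
    logits = torch.stack([r_a, r_b]).unsqueeze(0)          # shape (1, 2)
    target = torch.tensor(0 if human_prefers_a else 1).unsqueeze(0)
    return F.cross_entropy(logits, target)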

Summary

Key takeaways:

  1. RL enables skill acquisition through trial and error
  2. PPO and SAC are practical algorithms for robotics
  3. Imitation learning leverages human expertise
  4. Reward design significantly impacts learning outcomes

Key Concepts

  • Policy Gradient: Learning by gradient ascent on expected return
  • Actor-Critic: Combining policy and value function learning
  • Imitation Learning: Learning from demonstrations
  • Reward Shaping: Designing rewards for desired behavior

Further Reading

  1. Sutton, R.S. & Barto, A.G. (2018). "Reinforcement Learning: An Introduction"
  2. Levine, S. et al. (2016). "End-to-End Training of Deep Visuomotor Policies"