
Module 11: Learning-Based Control

Introduction

Learning enables robots to acquire skills that are difficult to program explicitly. This module covers reinforcement learning, imitation learning, and their application to robotic control.

Section 1: Reinforcement Learning Foundations

1.1 Markov Decision Processes

Markov Decision Process (MDP): A mathematical framework for sequential decision-making defined by states, actions, transitions, and rewards.

The optimal state-value function satisfies the Bellman optimality equation:

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \right]
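
As a concrete illustration, here is a minimal value-iteration sketch that solves this equation for a tabular MDP. The transition tensor P, reward matrix R, and discount gamma are hypothetical inputs for illustration, not defined elsewhere in this module:

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    # P: transition probabilities, shape (S, A, S'); R: rewards, shape (S, A)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new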

1.2 Policy Gradient Methods

def policy_gradient_update(policy, optimizer, trajectories):
    # One REINFORCE update: gradient ascent on return-weighted log-likelihood
    optimizer.zero_grad()
    loss = 0.0
    for traj in trajectories:
        returns = compute_returns(traj.rewards)  # discounted return-to-go per step
        for s, a, R in zip(traj.states, traj.actions, returns):
            log_prob = policy.log_prob(s, a)
            loss -= log_prob * R  # REINFORCE: minimizing -log_prob * R maximizes return
    loss.backward()
    optimizer.step()
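
The compute_returns helper above is not defined in this module; a plausible implementation of the discounted return-to-go might look like this:

def compute_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backward over the trajectory
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns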

Section 2: Deep RL Algorithms

2.1 PPO (Proximal Policy Optimization)

Stable policy updates through clipping:

L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \right) \right]

where r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t) is the probability ratio and \hat{A}_t is an advantage estimate.
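
A minimal sketch of the clipped surrogate loss in PyTorch-style code; the tensor names (log_probs, old_log_probs, advantages) are assumptions for illustration:

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Negate: optimizers minimize, but we want to maximize the surrogate
    return -torch.min(unclipped, clipped).mean()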

2.2 SAC (Soft Actor-Critic)

SAC maximizes expected return plus an entropy bonus, which encourages exploration and yields robust policies. Because it is off-policy, it reuses past experience and is comparatively sample-efficient, which suits robotics.
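
The maximum entropy objective, in its standard formulation (the temperature \alpha trades off reward against entropy):

J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \right]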

RL is sample-hungry: deep RL algorithms often need millions of environment interactions, so policies are typically trained in simulation and then transferred to hardware (sim-to-real).

Section 3: Imitation Learning

3.1 Behavioral Cloning

Supervised learning from demonstrations:

import torch
import torch.nn.functional as F

def behavioral_cloning(demonstrations, epochs=10, lr=1e-3):
    model = PolicyNetwork()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for state, action in demonstrations:
            optimizer.zero_grad()
            pred_action = model(state)
            loss = F.mse_loss(pred_action, action)  # regress onto expert actions
            loss.backward()
            optimizer.step()
    return model

3.2 DAgger

Dataset Aggregation (DAgger) addresses the distribution shift that behavioral cloning suffers from: the learned policy drifts into states the expert never demonstrated. DAgger rolls out the current policy, asks the expert to label the states the policy actually visits, aggregates those labels into the dataset, and retrains.
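
A minimal DAgger loop sketch; collect_expert_demos and rollout are assumed helpers, and behavioral_cloning is the function defined above:

def dagger(env, expert, iterations=10):
    dataset = collect_expert_demos(env, expert)  # seed with expert data (assumed helper)
    policy = behavioral_cloning(dataset)
    for _ in range(iterations):
        states = rollout(env, policy)            # visit states under the learner's policy
        labels = [expert(s) for s in states]     # expert relabels those states
        dataset += list(zip(states, labels))     # aggregate old and new data
        policy = behavioral_cloning(dataset)     # retrain on the full dataset
    return policy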

Section 4: Reward Engineering

4.1 Reward Design

Challenges:

  • Sparse rewards: the agent receives signal only on success, so exploration can stall
  • Dense, hand-shaped rewards: easier to optimize but prone to reward hacking, where the agent exploits the reward function instead of solving the task (a shaped-reward sketch follows this list)
  • Multi-objective tradeoffs: speed, accuracy, energy use, and safety must be balanced within a single scalar
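
As a sketch, a shaped reward for a hypothetical reaching task; the terms, names, and coefficients are illustrative assumptions, not from this module:

import numpy as np

def reaching_reward(ee_pos, goal_pos, action, reached_tol=0.02):
    # Dense shaping term: negative distance pulls the end-effector toward the goal
    dist = np.linalg.norm(ee_pos - goal_pos)
    reward = -dist
    reward -= 0.01 * np.sum(np.square(action))  # penalize large torques (energy/smoothness)
    if dist < reached_tol:
        reward += 10.0                          # sparse bonus on task success
    return reward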

4.2 Learning from Preferences

When a reward is hard to specify by hand, it can be learned: collect human preferences between pairs of trajectory segments, fit a reward model that explains those judgments, and then run RL against the learned reward.
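
A common formulation fits the reward model with a Bradley-Terry loss over preferred/non-preferred segment pairs. This sketch assumes a reward_model that maps a trajectory segment to per-step scalar rewards:

import torch
import torch.nn.functional as F

def preference_loss(reward_model, seg_a, seg_b, human_prefers_a):
    # Sum predicted reward over each trajectory segment
    r_a = reward_model(seg_a).sum()
    r_b = reward_model(seg_b).sum()
    # Bradley-Terry: P(a preferred) = exp(r_a) / (exp(r_a) + exp(r_b))
    logits = torch.stack([r_a, r_b]).unsqueeze(0)          # shape (1, 2)
    target = torch.tensor(0 if human_prefers_a else 1).unsqueeze(0)
    return F.cross_entropy(logits, target)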

Summary

Key takeaways:

  1. RL enables skill acquisition through trial and error
  2. PPO and SAC are practical algorithms for robotics
  3. Imitation learning leverages human expertise
  4. Reward design significantly impacts learning outcomes

Key Concepts

  • Policy Gradient: Learning by gradient ascent on expected return
  • Actor-Critic: Combining policy and value function learning
  • Imitation Learning: Learning from demonstrations
  • Reward Shaping: Designing rewards for desired behavior

Further Reading

  1. Sutton, R.S. & Barto, A.G. (2018). "Reinforcement Learning: An Introduction"
  2. Levine, S. et al. (2016). "End-to-End Training of Deep Visuomotor Policies"