
Motion Mimic PPO

Module: rsl_rl.algorithms.mmppo

MMPPO Architecture

🎯 Algorithm Overview

DAgger-MMPPO (Motion Mimic Proximal Policy Optimization) is a two-stage reinforcement learning framework that combines the Dataset Aggregation (DAgger) algorithm with PPO for whole-body humanoid robot motion learning. Unlike direct teleoperation, MMPPO enables robots to learn human-like motion patterns through behavior cloning, addressing the challenge of mapping reference motions to executable robot actions.

๐Ÿ—๏ธ Core Conceptsโ€‹

🎭 Two-Stage Training Framework

Stage 1: DAgger Policy Training

  • Simplified physical environment with fixed robot base
  • Focus on joint-level motion cloning using expert demonstrations
  • Learn an initial policy π_D that approximates the optimal imitation policy
  • State substitution: Replace agent base state with expert reference state

Stage 2: Policy Training with DAgger

  • Full physical environment with task and imitation rewards
  • LoRA-enhanced DAgger network for adaptive expert reference
  • Dual optimization: actor improvement + DAgger network fine-tuning
  • Integration with AMP and symmetry constraints

🎯 Key Innovations

  • State Space Decomposition: Split robot state into base and joint components
  • Reference State Mapping: Expert state substitution for ideal base control simulation
  • LoRA Integration: Low-rank adaptation for efficient DAgger network fine-tuning
  • Multi-Modal Learning: Support for various auxiliary learning modules (RND, AMP, Symmetry)

🧮 Mathematical Foundation

📊 MDP Formulation

The robot behavior cloning task is formulated as an MDP: (S, Ŝ, A, R, R̂, T, γ)

  • S: Agent's state space
  • Ŝ: Expert reference state space
  • A: Agent's action space
  • R(s_t, a_t, s_{t+1}): Intrinsic task reward
  • R̂((s_t, ŝ_t), a_t, (s_{t+1}, ŝ_{t+1})): Imitation reward
  • T(s_{t+1} | s_t, a_t): State transition probability
  • γ ∈ [0,1): Discount factor
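
Both reward channels are folded into a single scalar reward before the PPO update. A minimal sketch of this weighting, consistent with the Stage 1 pseudocode further below (the weight names w_task and w_imitation are illustrative, not the implementation's):

import torch

def combined_reward(task_reward: torch.Tensor,
                    imitation_reward: torch.Tensor,
                    w_task: float = 1.0,
                    w_imitation: float = 1.0) -> torch.Tensor:
    # Weighted sum of the intrinsic task reward R and the imitation reward R_hat.
    # In Stage 1 the imitation term dominates; in Stage 2 both terms are active.
    return w_task * task_reward + w_imitation * imitation_reward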

🎯 Policy Definition

The policy is defined as:

a_t = π(o_t, ô_t)   if ô_t ≠ None
a_t = π(o_t)        if ô_t = None

Where:

  • a_t: Action at time t
  • ฯ€: Policy function
  • o_t: Current observation
  • ô_t: Reference observation (may be None)
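
A minimal sketch of this optional-reference dispatch (the function and argument names are illustrative; the concrete interface is the act() method documented under Core Methods):

def select_action(policy, o_t, o_hat_t=None):
    # Reference observation present: condition the policy on both streams.
    if o_hat_t is not None:
        return policy(o_t, o_hat_t)
    # No reference available: fall back to the plain observation.
    return policy(o_t)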

📈 Loss Functions

Imitation Loss (KL Divergence):

L_KL(π_θ ‖ π̂_θ) = E_{s_t ~ D}[-log π_θ(â_t | s_t)]

DAgger Loss:

L_DAgger(â_t^Dagger, a_t^Actor, ρ) = (1 - ρ) * ‖â_t^Dagger - a_t^Actor‖₂²
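
A literal translation of the two formulas above into PyTorch, assuming actor_dist is the actor's torch.distributions.Normal action distribution and a_hat are the DAgger target actions (the function names are illustrative, not the module's API):

import torch

def imitation_kl_loss(actor_dist, a_hat):
    # L_KL: negative log-likelihood of the DAgger action under the actor's distribution.
    return -actor_dist.log_prob(a_hat).sum(dim=-1).mean()

def dagger_loss(a_hat, a_actor, rho):
    # L_DAgger: (1 - rho) * ||a_hat - a_actor||_2^2, averaged over the batch.
    return (1.0 - rho) * (a_hat - a_actor).pow(2).sum(dim=-1).mean()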

🧠 Core Components

๐Ÿ“ Loss Functionsโ€‹

🔧 SmoothL2Loss

Module Name: rsl_rl.algorithms.mmppo.SmoothL2Loss

Definition:

class SmoothL2Loss(nn.Module):
    def __init__(self, delta: float = 1.0)

📥 Parameters:

  • delta (float): Threshold for switching between quadratic and linear loss. Default is 1.0

🔧 Functionality: Huber loss implementation that combines L2 loss for small errors and L1 loss for large errors:

loss(x) = 0.5 * x²                   if |x| < delta
          delta * (|x| - 0.5*delta)  otherwise
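
A reference implementation consistent with the formula above (a sketch; the module's actual code may differ in reduction or naming):

import torch
import torch.nn as nn

class SmoothL2Loss(nn.Module):
    def __init__(self, delta: float = 1.0):
        super().__init__()
        self.delta = delta

    def forward(self, input: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        error = input - target
        abs_error = error.abs()
        # Quadratic branch for small errors, linear branch for large errors.
        quadratic = 0.5 * error.pow(2)
        linear = self.delta * (abs_error - 0.5 * self.delta)
        return torch.where(abs_error < self.delta, quadratic, linear).mean()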

🔧 L2Loss

Module Name: rsl_rl.algorithms.mmppo.L2Loss

Definition:

class L2Loss(nn.Module):
    def __init__(self)

🔧 Functionality: Standard L2 (mean squared error) loss function: loss = mean((input - target)²)


🤖 MMPPO Algorithm

Module Name: rsl_rl.algorithms.mmppo.MMPPO

Definition:

class MMPPO:
    def __init__(self, actor_critic, num_learning_epochs=1, num_mini_batches=1,
                 clip_param=0.2, gamma=0.998, lam=0.95, value_loss_coef=1.0,
                 entropy_coef=0.0, learning_rate=1e-3, max_lr=1e-2, min_lr=1e-4,
                 # DAgger parameters
                 teacher_coef=None, teacher_loss_coef=None, teacher_lr=5e-4,
                 teacher_supervising_intervals=0, teacher_updating_intervals=0,
                 teacher_apply_interval=5, teacher_coef_mode="kl",
                 # Auxiliary modules
                 rnd_cfg=None, symmetry_cfg=None, amp_cfg=None,
                 # Training configuration
                 multi_gpu_cfg=None, auto_mix_precision=False, **kwargs)

📥 Core Parameters

PPO Configuration:

  • actor_critic (ActorCriticMMTransformer): Multi-modal actor-critic network
  • num_learning_epochs (int): Number of learning epochs per update. Default is 1
  • num_mini_batches (int): Number of mini-batches per epoch. Default is 1
  • clip_param (float): PPO clipping parameter. Default is 0.2
  • gamma (float): Discount factor. Default is 0.998
  • lam (float): GAE lambda parameter. Default is 0.95
  • value_loss_coef (float): Value function loss coefficient. Default is 1.0
  • entropy_coef (float): Entropy regularization coefficient. Default is 0.0

Learning Rate Management:

  • learning_rate (float): Base learning rate. Default is 1e-3
  • max_lr (float): Maximum learning rate. Default is 1e-2
  • min_lr (float): Minimum learning rate. Default is 1e-4
  • max_lr_after_certain_epoch (float): Max LR after restriction epoch. Default is 5e-3
  • max_lr_restriction_epoch (int): Epoch to start max LR restriction. Default is 25000

📥 DAgger Parameters

Core DAgger Configuration:

  • teacher_coef (float | None): DAgger coefficient (None disables DAgger). Default is None
  • teacher_loss_coef (float | None): Imitation loss coefficient. Default is None
  • teacher_lr (float): DAgger network learning rate. Default is 5e-4
  • teacher_coef_mode (str): DAgger coefficient mode ("kl", "norm", "original_kl"). Default is "kl"

Training Schedule:

  • teacher_supervising_intervals (int): Epochs before DAgger supervision starts. Default is 0
  • teacher_updating_intervals (int): Epochs before DAgger updates start. Default is 0
  • teacher_apply_interval (int): Interval for applying imitation loss. Default is 5
  • teacher_update_interval (int): Interval for DAgger updates. Default is 1

📥 Auxiliary Module Configuration

Random Network Distillation (RND):

  • rnd_cfg (dict | None): RND configuration dictionary. Default is None

Symmetry Learning:

  • symmetry_cfg (dict | None): Symmetry learning configuration. Default is None

Adversarial Motion Prior (AMP):

  • amp_cfg (dict | None): AMP discriminator configuration. Default is None

Distributed Training:

  • multi_gpu_cfg (dict | None): Multi-GPU training configuration. Default is None
  • auto_mix_precision (bool): Enable automatic mixed precision. Default is False

🎯 Core Methods

🔄 Training Lifecycle

📊 Storage Initialization

def init_storage(self, training_type, num_envs, num_transitions_per_env,
                 actor_obs_shape, actor_ref_obs_shape, critic_obs_shape,
                 critic_ref_obs_shape, action_shape)

🔧 Function: Initializes rollout storage for experience collection with multi-modal observations.

🎭 Mode Control

def test_mode(self)
def train_mode(self)

🔧 Function: Switch between training and evaluation modes for all network components.

🚀 Action Generation

🎯 Action Selection

def act(self, obs, ref_obs, critic_obs, ref_critic_obs)

📥 Input:

  • obs (torch.Tensor): Current observations
  • ref_obs (tuple): Reference observations (data, mask)
  • critic_obs (torch.Tensor): Critic observations
  • ref_critic_obs (tuple): Reference critic observations (data, mask)

📤 Output: Selected actions from the policy

🔧 Function: Generates actions using the actor-critic network and stores transition data, including DAgger actions if enabled.

🔄 Environment Step Processing

def process_env_step(self, rewards, dones, infos)

📥 Input:

  • rewards (torch.Tensor): Environment rewards
  • dones (torch.Tensor): Episode termination flags
  • infos (dict): Additional environment information

🔧 Function: Processes the environment step, adds intrinsic rewards (RND, AMP), handles bootstrapping, and stores transitions.

📈 Learning Updates

🧮 Return Computation

def compute_returns(self, last_critic_obs)

📥 Input: last_critic_obs (torch.Tensor): Final critic observations for bootstrapping

🔧 Function: Computes returns and advantages using GAE (Generalized Advantage Estimation).
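
For reference, the GAE recursion this step implements conceptually, using gamma and lam as configured above (a generic sketch with tensors of shape [num_steps, num_envs]; names are illustrative):

import torch

def gae_returns(rewards, values, dones, last_values, gamma=0.998, lam=0.95):
    # rewards, values, dones: [num_steps, num_envs]; last_values: [num_envs]
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    next_advantage = torch.zeros_like(last_values)
    next_values = last_values
    for t in reversed(range(num_steps)):
        not_done = 1.0 - dones[t].float()
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        # GAE: A_t = delta_t + gamma * lam * A_{t+1}
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_values = values[t]
    returns = advantages + values
    return returns, advantages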

🔄 Main Update Loop

def update(self, epoch=0) -> dict

📥 Input: epoch (int): Current training epoch

📤 Output: Dictionary containing loss statistics

🔧 Function: Main training update including:

  • PPO policy and value updates
  • DAgger network fine-tuning with LoRA
  • Auxiliary module updates (RND, AMP, symmetry)
  • Learning rate scheduling
  • Multi-GPU synchronization

๐Ÿค Multi-GPU Supportโ€‹

📡 Parameter Broadcasting

def broadcast_parameters(self)
def broadcast_parameters_dagger(self)

🔧 Function: Broadcast model parameters from rank 0 to all GPUs.

🔄 Gradient Reduction

def reduce_parameters(self)
def reduce_parameters_dagger(self)

🔧 Function: Collect and average gradients across all GPUs.
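
Conceptually, this amounts to an all-reduce over each parameter's gradient followed by division by the world size. A generic torch.distributed sketch (not the module's exact code; assumes the process group is already initialized):

import torch.distributed as dist

def reduce_gradients(parameters, world_size: int):
    # Sum gradients across all ranks, then divide to obtain the average.
    for param in parameters:
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size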

🎯 Loss Computation

๐Ÿ” Imitation Lossโ€‹

def _imitation_loss(self, actions_batch, obs_batch, ref_obs_batch, dagger_actions_batch)

📥 Input:

  • actions_batch (torch.Tensor): Actor actions
  • obs_batch (torch.Tensor): Observation batch
  • ref_obs_batch (tuple): Reference observation batch
  • dagger_actions_batch (torch.Tensor): DAgger network actions

📤 Output: Imitation loss value

🔧 Function: Computes the KL divergence loss between the actor policy and DAgger targets using different modes:

  • "kl": Standard KL divergence
  • "original_kl": Original implementation KL
  • "norm": L2 norm-based loss

🎓 DAgger Update

def update_dagger(self, actions_batch, obs_batch, ref_obs_batch, dagger_actions_batch)

📥 Input: Same as imitation loss

📤 Output: DAgger loss value

🔧 Function: Updates the DAgger network using LoRA fine-tuning with a smoothed L2 loss between DAgger predictions and teacher-student blended actions.
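
One plausible reading of that description, shown as a schematic update step. It assumes only the LoRA parameters of the DAgger network are registered with the optimizer; dagger_optimizer, smooth_l2, and rho are illustrative names, not the module's attributes:

def dagger_update_step(dagger_net, dagger_optimizer, smooth_l2,
                       obs, ref_obs, actor_actions, rho):
    # Re-predict DAgger actions with gradients flowing through the LoRA parameters.
    dagger_actions = dagger_net(obs, ref_obs)
    # Blend teacher (DAgger) and student (actor) actions as the regression target.
    target = rho * dagger_actions.detach() + (1.0 - rho) * actor_actions.detach()
    loss = smooth_l2(dagger_actions, target)
    dagger_optimizer.zero_grad()
    loss.backward()  # only LoRA parameters are in the optimizer's param groups
    dagger_optimizer.step()
    return loss.item()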


🎮 Training Algorithms

🎯 Stage 1: DAgger Policy Training

# Pseudocode for Stage 1
def stage1_training():
    # Initialize with simplified environment (fixed base)
    configure_simplified_env(fixed_base=True, prioritize_imitation=True)

    for iteration in range(N_iter):
        # Collect rollouts
        batch = collect_rollouts(policy=pi_theta)

        # State substitution: replace agent base with expert base
        for transition in batch:
            transition.state = (expert_base_state, agent_joint_state)

        # Compute rewards (heavily weighted towards imitation)
        rewards = w_R * task_reward + w_imitation * imitation_reward

        # PPO update
        for epoch in range(K):
            update_policy_and_value(batch, mini_batches=M)

    return trained_dagger_policy

🎯 Stage 2: Policy Training with DAgger

# Pseudocode for Stage 2
def stage2_training():
    # Initialize actor from Stage 1, create LoRA-enhanced DAgger network
    actor = initialize_from_stage1()
    dagger_net = create_lora_enhanced_network(stage1_policy)

    for iteration in range(N_iter):
        # Collect rollouts in full environment
        batch = collect_rollouts(actor, full_environment=True)

        for epoch in range(K):
            for mini_batch in shuffle(batch):
                # Standard PPO losses
                surrogate_loss = compute_ppo_loss(actor, mini_batch)
                value_loss = compute_value_loss(critic, mini_batch)

                # Generate DAgger actions
                dagger_actions = dagger_net(observations, ref_observations)

                # Imitation loss (KL divergence)
                imitation_loss = compute_kl_loss(actor.distribution, dagger_actions)

                # Combined actor loss
                actor_loss = surrogate_loss + w_im * imitation_loss + value_loss

                # Update actor and critic
                update_networks(actor_loss)

                # DAgger network LoRA fine-tuning
                dagger_loss = (1 - rho) * norm_2(dagger_actions - actor_actions) ** 2
                update_lora_parameters(dagger_loss)

        # Anneal DAgger coefficient
        rho = anneal_schedule(rho, iteration)

    return trained_policy

🚀 Usage Example

import torch
from rsl_rl.algorithms.mmppo import MMPPO
from rsl_rl.modules.actor_critic_mm_transformer import ActorCriticMMTransformerV2

# Define observation structure
term_dict = {
    "base_lin_vel": 3, "base_ang_vel": 3,
    "joint_pos": 19, "joint_vel": 19,
    "feet_contact": 4
}

ref_term_dict = {
    "ref_base_lin_vel": 3, "ref_joint_pos": 19,
    "ref_feet_contact": 4
}

# Create actor-critic model
actor_critic = ActorCriticMMTransformerV2(
    term_dict=term_dict,
    ref_term_dict=ref_term_dict,
    num_actions=19,
    history_length=5,
    dim_model=256,
    enable_lora=True  # Enable LoRA for DAgger
)

# Configure auxiliary modules
rnd_cfg = {
    "num_states": 48,
    "learning_rate": 1e-3,
    "reward_scale": 0.1
}

amp_cfg = {
    "net_cfg": {
        "backbone_input_dim": 96,  # 2 * state_dim
        "backbone_output_dim": 256,
        "activation": "elu"
    },
    "learning_rate": 1e-3,
    "gradient_penalty_coeff": 10.0,
    "amp_update_interval": 1
}

symmetry_cfg = {
    "use_data_augmentation": True,
    "use_mirror_loss": True,
    "data_augmentation_func": "my_module.mirror_observations"
}

# Create MMPPO algorithm
mmppo = MMPPO(
    actor_critic=actor_critic,
    # PPO hyperparameters
    num_learning_epochs=5,
    num_mini_batches=4,
    clip_param=0.2,
    gamma=0.99,
    lam=0.95,
    learning_rate=3e-4,

    # DAgger configuration
    teacher_coef=0.8,          # DAgger coefficient
    teacher_loss_coef=0.5,     # Imitation loss weight
    teacher_lr=1e-4,           # DAgger network learning rate
    teacher_coef_mode="kl",    # Use KL divergence mode
    teacher_apply_interval=5,  # Apply imitation loss every 5 iterations

    # Auxiliary modules
    rnd_cfg=rnd_cfg,
    amp_cfg=amp_cfg,
    symmetry_cfg=symmetry_cfg,

    # Training optimization
    auto_mix_precision=True,
    device="cuda"
)

# Initialize storage
mmppo.init_storage(
    training_type="motion_mimic",
    num_envs=4096,
    num_transitions_per_env=24,
    actor_obs_shape=(sum(term_dict.values()) * 5,),  # With history
    actor_ref_obs_shape=(sum(ref_term_dict.values()),),
    critic_obs_shape=(sum(term_dict.values()) * 5,),
    critic_ref_obs_shape=(sum(ref_term_dict.values()),),
    action_shape=(19,)
)

# Training loop
for epoch in range(10000):
    # Collect rollouts
    for step in range(24):  # num_transitions_per_env
        actions = mmppo.act(obs, ref_obs, critic_obs, ref_critic_obs)
        obs, rewards, dones, infos = env.step(actions)
        mmppo.process_env_step(rewards, dones, infos)

    # Compute returns and update
    mmppo.compute_returns(final_critic_obs)
    loss_dict = mmppo.update(epoch)

    print(f"Epoch {epoch}:")
    print(f"  Surrogate Loss: {loss_dict['mean_surrogate_loss']:.4f}")
    print(f"  Value Loss: {loss_dict['mean_value_loss']:.4f}")
    print(f"  Imitation Loss: {loss_dict['mean_imitation_loss']:.4f}")
    print(f"  DAgger Loss: {loss_dict['mean_dagger_loss']:.4f}")

    if 'mean_amp_loss' in loss_dict:
        print(f"  AMP Loss: {loss_dict['mean_amp_loss']:.4f}")

🎯 Key Features

🔧 Auxiliary Learning Modules

Random Network Distillation (RND):

  • Intrinsic motivation through prediction error
  • Encourages exploration in sparse reward environments
  • Configurable reward scaling and update frequency

Adversarial Motion Prior (AMP):

  • Style-based motion learning through discriminator
  • Automatic motion quality assessment
  • Gradient penalty for training stability

Symmetry Learning:

  • Data augmentation with mirrored observations
  • Mirror loss for consistent symmetric behaviors
  • Automatic handling of symmetric robot configurations
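
The data_augmentation_func referenced in symmetry_cfg (e.g. "my_module.mirror_observations" in the usage example) is resolved by name; its exact call signature depends on the MMPPO implementation. As a purely illustrative sketch with hypothetical index and sign tables, such a function reflects observations about the robot's sagittal plane:

import torch

# Hypothetical index permutation and sign flips for a left/right symmetric robot.
MIRROR_INDEX = [0, 1, 2, 3, 4, 5]   # placeholder: swap left/right joint columns
MIRROR_SIGN = [1, -1, 1, 1, -1, 1]  # placeholder: negate laterally-defined quantities

def mirror_observations(obs: torch.Tensor) -> torch.Tensor:
    # Reorder observation columns to swap left/right limbs, then flip the sign
    # of laterally-defined quantities (e.g. y-velocities, roll, yaw).
    sign = torch.tensor(MIRROR_SIGN, dtype=obs.dtype, device=obs.device)
    return obs[..., MIRROR_INDEX] * sign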

🚀 Performance Optimizations

Mixed Precision Training:

  • Automatic mixed precision for faster training
  • Gradient scaling for numerical stability
  • Memory efficiency improvements

Multi-GPU Support:

  • Distributed training across multiple GPUs
  • Gradient synchronization and parameter broadcasting
  • Scalable to large-scale robot learning

Memory Management:

  • GPU memory monitoring and optimization
  • Efficient batch processing and storage
  • Automatic garbage collection integration

🔧 Implementation Notes

โš ๏ธ Important Considerationsโ€‹

Stage Transition:

  • Stage 1 focuses on imitation with simplified physics
  • Stage 2 introduces full environment complexity
  • Proper initialization crucial for stable training

Hyperparameter Tuning:

  • teacher_coef: Start with 0.5-0.8, anneal gradually
  • teacher_loss_coef: 0.1-0.5 depending on task complexity
  • teacher_apply_interval: 5-10 for stable training
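
One simple way to realize the "anneal gradually" advice for teacher_coef is a linear schedule; the bounds and horizon below are illustrative only, not values taken from the implementation:

def anneal_teacher_coef(iteration: int,
                        start: float = 0.8,
                        end: float = 0.1,
                        anneal_iters: int = 10000) -> float:
    # Linearly decay the DAgger coefficient from `start` to `end` over
    # `anneal_iters` iterations, then hold it constant.
    frac = min(iteration / anneal_iters, 1.0)
    return start + (end - start) * frac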

DAgger Network:

  • LoRA rank typically 8-16 for efficiency
  • Alpha parameter 16-32 for adaptation strength
  • Regular LoRA updates prevent overfitting
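
For reference, the low-rank adaptation idea behind the DAgger network fine-tuning: a frozen base linear layer plus a trainable rank-r update scaled by alpha/r. A generic sketch, not the module's actual LoRA implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())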

🚨 Common Issues

Training Instability:

  • Monitor DAgger coefficient annealing schedule
  • Ensure proper gradient clipping (max_grad_norm=1.0)
  • Check for proper learning rate scheduling

Memory Issues:

  • Use auto_mix_precision=True for large models
  • Adjust num_mini_batches based on GPU memory
  • Monitor GPU memory usage with built-in debugging

Convergence Problems:

  • Verify expert data quality and coverage
  • Check state normalization consistency
  • Ensure proper reward balancing between task and imitation

📖 References

Original Research:

MMPPO: Motion Mimic Proximal Policy Optimization for whole-body humanoid robot learning with expert demonstrations and multi-modal observations.

Key Algorithms:

  • PPO: Proximal Policy Optimization for stable on-policy learning
  • DAgger: Dataset Aggregation for expert-guided policy learning
  • LoRA: Low-Rank Adaptation for efficient network fine-tuning
  • GAE: Generalized Advantage Estimation for variance reduction

Related Methods:

  • AMP: Adversarial Motion Priors for stylized motion learning
  • RND: Random Network Distillation for exploration
  • Symmetry Learning: Data augmentation for symmetric behaviors

💡 Best Practice: Start with Stage 1 training in a simplified environment, then gradually transition to Stage 2 with full physics. Monitor both imitation and task performance throughout training.