On Policy Runner for Motion Mimic PPO (OnPolicyRunnerMM)

Module: rsl_rl.runners.on_policy_runner_mm

🎯 Overview

OnPolicyRunnerMM is the core training orchestrator for Multi-Modal Proximal Policy Optimization (MMPPO) and distillation algorithms. It serves as the bridge between the environment, policy networks, and training algorithms, managing the complete lifecycle of training including data collection, learning updates, logging, and model persistence.

🏗️ Core Functionality

🎭 Multi-Modal Training Support

Multi-Modal Observations:

  • Actor Observations: Primary observations for policy decision making
  • Reference Observations: Expert demonstration data with masking support
  • Critic Observations: Privileged information for value function learning
  • Reference Critic Observations: Expert data for critic training
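Reference observations are typically paired with an availability mask, matching the tuple form used later in the Usage Example. A minimal illustration of the expected shape, with illustrative sizes and names (not the exact environment API):

import torch

num_envs, num_ref_obs = 4096, 60                        # illustrative sizes
ref_obs = torch.randn(num_envs, num_ref_obs)            # reference (expert) features
ref_mask = torch.ones(num_envs, dtype=torch.bool)       # True where reference data is available
ref_obs_with_mask = (ref_obs, ref_mask)                 # tuple form passed to the policy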

Training Types:

  • RL Training: Reinforcement learning with MMPPO algorithm
  • Distillation Training: Teacher-student knowledge distillation

🔧 Key Features

  • Multi-GPU Support: Distributed training across multiple GPUs with parameter synchronization
  • Empirical Normalization: Adaptive observation normalization for stable training
  • Auxiliary Learning: Integration with RND, AMP, and symmetry learning modules
  • Flexible Logging: Support for TensorBoard, Wandb, and Neptune logging
  • Robust Checkpointing: Complete state saving and loading for training resumption

🧠 Core Components

🤖 OnPolicyRunnerMM Class

Class: rsl_rl.runners.on_policy_runner_mm.OnPolicyRunnerMM

Definition:

class OnPolicyRunnerMM:
    def __init__(self, env: MMVecEnv, train_cfg, log_dir=None, device="cpu")

📥 Parameters:

  • env (MMVecEnv): Multi-modal vectorized environment
  • train_cfg (dict): Training configuration dictionary
  • log_dir (str | None): Directory for logging and checkpoints. Default is None
  • device (str): Compute device ("cpu" or "cuda"). Default is "cpu"

🔧 Functionality: Complete training orchestrator that handles environment interaction, model training, logging, and checkpointing for multi-modal reinforcement learning.


🎯 Core Methods

🔄 Training Lifecycle

🚀 Main Training Loop

def learn(self, num_learning_iterations: int, init_at_random_ep_len: bool = False)

📥 Input:

  • num_learning_iterations (int): Number of training iterations to run
  • init_at_random_ep_len (bool): Initialize episodes at random lengths. Default is False

🔧 Function: Main training loop that orchestrates:

  • Environment rollout collection with multi-modal observations
  • Learning updates using MMPPO or distillation algorithms
  • Performance logging and metrics tracking
  • Model checkpointing and state management
  • Multi-GPU synchronization

💡 Key Features:

  • Multi-Modal Data Collection: Collects actor/critic observations and reference data
  • Auxiliary Learning Integration: Supports RND, AMP, and symmetry learning
  • Performance Monitoring: Tracks rewards, episode lengths, and learning metrics
  • Distributed Training: Handles multi-GPU parameter synchronization
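The steps listed above can be summarized as a rollout/update cycle. The sketch below is conceptual only: the method names on env and alg (act, step, process_env_step, compute_returns, update) follow the upstream rsl_rl runner convention and are assumptions, not the exact MMPPO implementation.

def learn_outline(runner, env, alg, num_learning_iterations, num_steps_per_env,
                  save_interval, log_dir):
    """Simplified sketch of the rollout/update cycle orchestrated by learn()."""
    obs, ref_obs = env.get_observations()  # hypothetical accessor for actor + reference obs
    for it in range(num_learning_iterations):
        # 1. Rollout: collect multi-modal transitions into the rollout storage
        for _ in range(num_steps_per_env):
            actions = alg.act(obs, ref_obs)
            obs, ref_obs, rewards, dones, infos = env.step(actions)
            alg.process_env_step(rewards, dones, infos)
        # 2. Learning update: MMPPO (or distillation) epochs over mini-batches
        alg.compute_returns(obs, ref_obs)
        loss_dict = alg.update()
        # 3. Logging, checkpointing, and multi-GPU synchronization
        if it % save_interval == 0:
            runner.save(f"{log_dir}/model_{it}.pt")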

💾 Model Management

📁 Save Model State

def save(self, path, infos=None)

📥 Input:

  • path (str): File path for saving the model
  • infos (dict | None): Additional information to save. Default is None

🔧 Function: Saves complete training state including:

  • Model parameters (actor-critic, DAgger networks)
  • Optimizer states (PPO, imitation, DAgger optimizers)
  • Observation normalizers
  • Auxiliary modules (RND, AMP)
  • Training iteration counter
  • Mixed precision scaler state
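Conceptually, the saved file is a single dictionary written with torch.save. The key names below are illustrative placeholders for the components listed above, not the exact keys written by save():

import torch

checkpoint = {
    "model_state_dict": {},        # actor-critic / DAgger network weights
    "optimizer_state_dict": {},    # PPO / imitation / DAgger optimizer states
    "obs_norm_state_dict": {},     # empirical observation normalizers
    "scaler_state_dict": {},       # mixed precision (GradScaler) state
    "iter": 1000,                  # training iteration counter
    "infos": None,                 # extra info passed to save()
}
# Auxiliary RND / AMP module states would be stored in the same dictionary.
torch.save(checkpoint, "./logs/model_1000.pt")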

📂 Load Model State

def load(self, path, load_optimizer=True)

📥 Input:

  • path (str): File path to load the model from
  • load_optimizer (bool): Whether to load optimizer states. Default is True

📤 Output: Additional information dictionary from saved state

🔧 Function: Restores complete training state for resuming training or inference.
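A typical resume-and-continue flow, assuming a runner constructed as in the Usage Example further below and an existing checkpoint (the path here is illustrative):

# runner: an OnPolicyRunnerMM instance (see the Usage Example below)
infos = runner.load("./logs/model_1000.pt", load_optimizer=True)   # restore full training state
runner.learn(num_learning_iterations=5000)                         # continue training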

🎯 Inference Support

🔍 Get Inference Policy

def get_inference_policy(self, device=None)

📥 Input: device (str | None): Target device for inference. Default is None

📤 Output: Inference-ready policy function

🔧 Function: Returns a policy function optimized for inference with:

  • Evaluation mode (dropout disabled, batch norm using running statistics)
  • Proper observation normalization
  • Device-specific optimization

🎭 Mode Control

🏋️ Training Mode

def train_mode(self)

🔧 Function: Sets all components to training mode (enables dropout, batch norm training, etc.)

🎯 Evaluation Mode

def eval_mode(self)

🔧 Function: Sets all components to evaluation mode (disables dropout, switches batch norm to running statistics, etc.)

📊 Logging and Monitoring

📈 Performance Logging

def log(self, locs: dict, width: int = 80, pad: int = 35)

📥 Input:

  • locs (dict): Local variables from training iteration
  • width (int): Console output width. Default is 80
  • pad (int): Padding for formatted output. Default is 35

🔧 Function: Comprehensive logging including:

  • Loss Metrics: Value, surrogate, imitation, DAgger losses
  • Performance Metrics: FPS, collection/learning times
  • Reward Tracking: Episode rewards, lengths, success rates
  • Auxiliary Metrics: RND rewards, AMP scores, symmetry losses
  • Policy Metrics: Action noise, learning rates

🔧 Utility Methods

📚 Add Git Repository

def add_git_repo_to_log(self, repo_file_path)

📥 Input: repo_file_path (str): Path to the git repository for version tracking

🔧 Function: Adds the repository to version tracking for reproducibility.


🎮 Training Configuration

📋 Configuration Structure

Training Configuration (train_cfg):

train_cfg = {
    "algorithm": {
        "class_name": "MMPPO",  # or "Distillation"
        # MMPPO-specific parameters
        "num_learning_epochs": 5,
        "num_mini_batches": 4,
        "clip_param": 0.2,
        "gamma": 0.99,
        "lam": 0.95,
        "learning_rate": 3e-4,

        # DAgger configuration
        "teacher_coef": 0.8,
        "teacher_loss_coef": 0.5,
        "teacher_apply_interval": 5,

        # Auxiliary modules
        "rnd_cfg": {...},
        "amp_cfg": {...},
        "symmetry_cfg": {...},

        # Multi-GPU configuration
        "multi_gpu_cfg": {...},
    },
    "policy": {
        "class_name": "ActorCriticMMTransformerV2",
        # Network architecture parameters
        "dim_model": 256,
        "num_layers": 6,
        "num_heads": 8,
        "history_length": 5,
        "enable_lora": True,
    },

    # Training parameters
    "num_steps_per_env": 24,
    "save_interval": 500,
    "empirical_normalization": True,

    # Logging configuration
    "logger": "tensorboard",  # or "wandb", "neptune"
}

🔧 Multi-Modal Environment Integration

Actor-Critic Dimension Mapping:

# Standard MM Transformer
actor_critic_dim = {
    "num_actor_obs": num_obs,
    "num_actor_ref_obs": num_ref_obs,
    "num_critic_obs": num_critic_obs,
    "num_critic_ref_obs": num_critic_ref_obs,
}

# MM Transformer V2 (with term dictionaries)
actor_critic_dim = {
    "term_dict": {
        "base_lin_vel": 3,
        "joint_pos": 19,
        # ... other observation terms
    },
    "ref_term_dict": {
        "ref_base_lin_vel": 3,
        "ref_joint_pos": 19,
        # ... other reference terms
    },
}
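The flat observation width consumed by the network is simply the sum of the per-term dimensions; the term names and sizes below are illustrative, not a required schema:

term_dict = {"base_lin_vel": 3, "base_ang_vel": 3, "projected_gravity": 3, "joint_pos": 19}
num_actor_obs = sum(term_dict.values())   # 28 for this illustrative dictionary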

🚀 Usage Example

import torch
from rsl_rl.runners.on_policy_runner_mm import OnPolicyRunnerMM
from rsl_rl.env import MMVecEnv

# Environment setup (observation managers are placeholders configured elsewhere)
env = MMVecEnv(
    num_envs=4096,
    observation_manager=observation_manager,
    ref_observation_manager=ref_observation_manager,
    # ... other environment configuration
)

# Training configuration
train_cfg = {
    "algorithm": {
        "class_name": "MMPPO",
        "num_learning_epochs": 5,
        "num_mini_batches": 4,
        "clip_param": 0.2,
        "gamma": 0.99,
        "lam": 0.95,
        "learning_rate": 3e-4,

        # DAgger configuration
        "teacher_coef": 0.8,
        "teacher_loss_coef": 0.5,
        "teacher_lr": 1e-4,
        "teacher_apply_interval": 5,

        # Auxiliary modules
        "rnd_cfg": {
            "learning_rate": 1e-3,
            "weight": 0.1,
        },
        "amp_cfg": {
            "net_cfg": {
                "backbone_input_dim": 96,
                "backbone_output_dim": 256,
            },
            "learning_rate": 1e-3,
            "gradient_penalty_coeff": 10.0,
        },
        "symmetry_cfg": {
            "use_data_augmentation": True,
            "use_mirror_loss": True,
        },
    },
    "policy": {
        "class_name": "ActorCriticMMTransformerV2",
        "dim_model": 256,
        "num_layers": 6,
        "num_heads": 8,
        "history_length": 5,
        "enable_lora": True,
    },
    "num_steps_per_env": 24,
    "save_interval": 500,
    "empirical_normalization": True,
    "logger": "tensorboard",
}

# Create runner
runner = OnPolicyRunnerMM(
    env=env,
    train_cfg=train_cfg,
    log_dir="./logs/mmppo_experiment",
    device="cuda",
)

# Add version tracking
runner.add_git_repo_to_log("/path/to/your/code")

# Training
runner.learn(num_learning_iterations=10000)

# Save final model
runner.save("./logs/final_model.pt")

# Get inference policy
policy = runner.get_inference_policy(device="cuda")

# Use the policy for inference (num_obs / num_ref_obs are the environment's observation sizes)
obs = torch.randn(1, num_obs)
ref_obs = (torch.randn(1, num_ref_obs), torch.ones(1, dtype=torch.bool))
with torch.no_grad():
    actions = policy(obs, ref_obs)

🎯 Multi-Modal Features

🔍 Observation Management

Multi-Modal Observation Types:

  • Actor Observations: Primary sensory input for policy decisions
  • Reference Observations: Expert demonstration data with availability masks
  • Critic Observations: Privileged information (e.g., true state, ground truth)
  • Reference Critic Observations: Expert privileged information

Observation Normalization:

  • Empirical Normalization: Adaptive running statistics normalization
  • Separate Normalizers: Independent normalization for each observation type
  • Training/Evaluation Modes: Proper mode switching for inference
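Upstream rsl_rl provides an EmpiricalNormalization module for this purpose. The self-contained sketch below illustrates the underlying running-statistics idea only; it is not the library class, and separate instances would be kept per observation type:

import torch

class RunningNormalizer:
    """Illustrative running mean/variance normalizer (concept sketch only)."""

    def __init__(self, shape, eps=1e-8):
        self.mean = torch.zeros(shape)
        self.var = torch.ones(shape)
        self.count = eps

    def update(self, x):
        # Merge batch statistics into running statistics (parallel variance formula)
        batch_mean = x.mean(dim=0)
        batch_var = x.var(dim=0, unbiased=False)
        batch_count = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / torch.sqrt(self.var + 1e-8)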

🎭 Training Type Support

RL Training (MMPPO):

  • On-policy reinforcement learning with expert guidance
  • DAgger integration for expert action supervision
  • Multi-modal reward composition (task + imitation + auxiliary)

Distillation Training:

  • Teacher-student knowledge transfer
  • Compact student networks learning from large teacher models
  • Privileged information utilization during training

🔧 Auxiliary Learning Integration

Random Network Distillation (RND):

  • Intrinsic motivation through prediction error
  • Automatic reward scaling based on environment timestep
  • Integration with extrinsic reward tracking
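A generic RND sketch of the prediction-error idea (illustrative network sizes, not the module's actual architecture or code): a frozen random target network defines features, and a trained predictor's error serves as the intrinsic reward.

import torch
import torch.nn as nn

obs_dim, embed_dim = 48, 32
target = nn.Linear(obs_dim, embed_dim).requires_grad_(False)   # fixed random target network
predictor = nn.Linear(obs_dim, embed_dim)                      # trained online to match the target

obs = torch.randn(16, obs_dim)
intrinsic_reward = (predictor(obs) - target(obs)).pow(2).mean(dim=1)   # larger on novel observations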

Adversarial Motion Prior (AMP):

  • Style-aware motion learning through discriminator
  • Gradient penalty for training stability
  • Motion quality assessment and reward generation

Symmetry Learning:

  • Data augmentation through observation mirroring
  • Symmetry loss for consistent symmetric behaviors
  • Environment-specific symmetry function integration
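Mirroring-based augmentation is commonly implemented with a per-dimension index permutation plus sign flip supplied by the environment's symmetry function. A generic sketch under that assumption (the exact interface is environment-specific):

import torch

def mirror_observations(obs, mirror_idx, mirror_sign):
    """Mirror a batch of observations: permute dimensions and flip signs."""
    return obs[:, mirror_idx] * mirror_sign

# Illustrative 4-dim observation: [forward_vel, lateral_vel, left_hip, right_hip]
obs = torch.tensor([[1.0, 0.2, 0.3, -0.1]])
mirror_idx = torch.tensor([0, 1, 3, 2])             # swap left/right joints
mirror_sign = torch.tensor([1.0, -1.0, 1.0, 1.0])   # lateral velocity flips sign
mirrored = mirror_observations(obs, mirror_idx, mirror_sign)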

🔧 Performance Optimizations

🚀 Multi-GPU Training

Distributed Training Features:

  • Parameter Synchronization: Broadcast parameters from rank 0
  • Gradient Aggregation: Collect and average gradients across GPUs
  • Logging Coordination: Single-GPU logging to avoid conflicts
  • NCCL Backend: Optimized GPU-to-GPU communication

Configuration:

import os

# Environment variables for multi-GPU training:
#   WORLD_SIZE: total number of GPUs
#   RANK: global rank of the current process
#   LOCAL_RANK: local rank within the node

# Automatic configuration in the runner
multi_gpu_cfg = {
    "global_rank": int(os.getenv("RANK", "0")),
    "local_rank": int(os.getenv("LOCAL_RANK", "0")),
    "world_size": int(os.getenv("WORLD_SIZE", "1")),
}
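The gradient-aggregation step described above typically follows the standard all-reduce pattern; the sketch below shows that generic pattern, not the runner's exact code:

import torch.distributed as dist

def average_gradients(model, world_size):
    """Average gradients across all ranks after backward(), before optimizer.step()."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size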

💾 Memory Management

Efficient Storage:

  • Multi-modal rollout storage with proper memory layout
  • Separate storage for different observation types
  • Memory-efficient batch processing

Mixed Precision Support:

  • Automatic mixed precision training integration
  • Gradient scaling for numerical stability
  • Memory usage optimization
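The runner integrates mixed precision internally; for reference, the standard PyTorch AMP pattern (autocast plus GradScaler) on a toy model looks like this:

import torch
import torch.nn as nn

model = nn.Linear(64, 8).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

obs = torch.randn(128, 64, device="cuda")
target = torch.randn(128, 8, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                        # run the forward pass in mixed precision
    loss = nn.functional.mse_loss(model(obs), target)
scaler.scale(loss).backward()                          # scale the loss to avoid fp16 underflow
scaler.step(optimizer)                                 # unscale gradients, then step
scaler.update()                                        # adjust the scale factor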

📊 Logging and Monitoring

📈 Comprehensive Metrics

Training Metrics:

  • Loss Components: Value, surrogate, imitation, DAgger losses
  • Performance: FPS, collection/learning times, iteration progress
  • Policy Behavior: Action noise std, learning rates, entropy

Environment Metrics:

  • Episode Statistics: Mean rewards, episode lengths, success rates
  • Custom Metrics: Environment-specific performance indicators
  • Reward Decomposition: Task, imitation, and auxiliary reward tracking

Auxiliary Module Metrics:

  • RND: Extrinsic/intrinsic rewards, exploration weights
  • AMP: Discriminator accuracy, gradient penalties, motion scores
  • Symmetry: Symmetry loss values, augmentation effects

🔧 Logger Support

TensorBoard:

  • Standard scalar/histogram logging
  • Real-time training monitoring
  • Model graph visualization

Weights & Biases:

  • Cloud-based experiment tracking
  • Advanced visualization and comparison
  • Model artifact management

Neptune:

  • Experiment management and collaboration
  • Advanced hyperparameter tracking
  • Integration with model versioning

🔧 Implementation Notes

⚠️ Important Considerations

Environment Compatibility:

  • Requires MMVecEnv with multi-modal observation support
  • Proper observation manager configuration essential
  • Reference observation availability masking

Training Stability:

  • Empirical normalization recommended for stable training
  • Proper learning rate scheduling for long training runs
  • Regular checkpointing for training robustness

Multi-GPU Setup:

  • Proper environment variable configuration required
  • Device consistency across all components
  • Memory synchronization considerations

🚨 Common Issues

Memory Problems:

  • Large batch sizes with multi-modal observations
  • Multiple auxiliary modules increasing memory usage
  • Solution: Reduce batch size, enable mixed precision

Convergence Issues:

  • Improper observation normalization
  • Conflicting auxiliary learning objectives
  • Solution: Monitor individual loss components, adjust weights

Multi-GPU Synchronization:

  • Parameter broadcasting failures
  • Gradient synchronization errors
  • Solution: Check NCCL setup, verify environment variables

📖 References

Core Algorithms:

  • MMPPO: Multi-Modal Proximal Policy Optimization for humanoid robot learning
  • DAgger: Dataset Aggregation for expert-guided policy learning
  • Distillation: Teacher-student knowledge transfer for model compression

Auxiliary Methods:

  • RND: Random Network Distillation for exploration
  • AMP: Adversarial Motion Priors for motion style learning
  • Symmetry Learning: Data augmentation for symmetric behaviors

Infrastructure:

  • Multi-GPU Training: Distributed training with NCCL backend
  • Mixed Precision: Automatic mixed precision for memory efficiency
  • Empirical Normalization: Adaptive observation normalization

💡 Best Practice: Start with single-GPU training to verify configuration, then scale to multi-GPU. Monitor all loss components individually to ensure balanced learning across different objectives.