On-Policy Runner for Motion Mimic PPO (OnPolicyRunnerMM)
Module: rsl_rl.runners.on_policy_runner_mm
🎯 Overview
OnPolicyRunnerMM is the core training orchestrator for Multi-Modal Proximal Policy Optimization (MMPPO) and distillation algorithms. It serves as the bridge between the environment, policy networks, and training algorithms, managing the complete lifecycle of training including data collection, learning updates, logging, and model persistence.
🏗️ Core Functionality
🎭 Multi-Modal Training Support
Multi-Modal Observations:
- Actor Observations: Primary observations for policy decision making
- Reference Observations: Expert demonstration data with masking support
- Critic Observations: Privileged information for value function learning
- Reference Critic Observations: Expert data for critic training
Training Types:
- RL Training: Reinforcement learning with MMPPO algorithm
- Distillation Training: Teacher-student knowledge distillation
🔧 Key Features
- Multi-GPU Support: Distributed training across multiple GPUs with parameter synchronization
- Empirical Normalization: Adaptive observation normalization for stable training
- Auxiliary Learning: Integration with RND, AMP, and symmetry learning modules
- Flexible Logging: Support for TensorBoard, Wandb, and Neptune logging
- Robust Checkpointing: Complete state saving and loading for training resumption
🧠 Core Components
🤖 OnPolicyRunnerMM Class
Class: rsl_rl.runners.on_policy_runner_mm.OnPolicyRunnerMM
Definition:
class OnPolicyRunnerMM:
def __init__(self, env: MMVecEnv, train_cfg, log_dir=None, device="cpu")
📥 Parameters:
- env (MMVecEnv): Multi-modal vectorized environment
- train_cfg (dict): Training configuration dictionary
- log_dir (str | None): Directory for logging and checkpoints. Default is None
- device (str): Compute device ("cpu" or "cuda"). Default is "cpu"
🔧 Functionality: Complete training orchestrator that handles environment interaction, model training, logging, and checkpointing for multi-modal reinforcement learning.
🎯 Core Methods
🔄 Training Lifecycle
🚀 Main Training Loop
def learn(self, num_learning_iterations: int, init_at_random_ep_len: bool = False)
📥 Input:
- num_learning_iterations (int): Number of training iterations to run
- init_at_random_ep_len (bool): Initialize episodes at random lengths. Default is False
🔧 Function: Main training loop that orchestrates:
- Environment rollout collection with multi-modal observations
- Learning updates using MMPPO or distillation algorithms
- Performance logging and metrics tracking
- Model checkpointing and state management
- Multi-GPU synchronization
💡 Key Features:
- Multi-Modal Data Collection: Collects actor/critic observations and reference data
- Auxiliary Learning Integration: Supports RND, AMP, and symmetry learning
- Performance Monitoring: Tracks rewards, episode lengths, and learning metrics
- Distributed Training: Handles multi-GPU parameter synchronization
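The sketch below outlines the general shape of such a collect-then-update iteration. It is a conceptual illustration only: policy, algorithm, and the storage/update calls are placeholders, not the runner's actual interfaces.
import torch

def training_iteration_sketch(env, policy, algorithm, num_steps_per_env):
    # Conceptual outline of one on-policy iteration: collect a rollout, then update.
    # `policy`, `algorithm`, and their methods are placeholders for illustration.
    obs = env.reset()
    with torch.no_grad():
        for _ in range(num_steps_per_env):
            actions = policy(obs)                            # actor forward pass
            obs, rewards, dones, infos = env.step(actions)   # environment step
            algorithm.store_transition(obs, rewards, dones)  # placeholder storage call
    losses = algorithm.update()                              # PPO-style learning update
    return losses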
💾 Model Management
📁 Save Model State
def save(self, path, infos=None)
📥 Input:
- path (str): File path for saving the model
- infos (dict | None): Additional information to save. Default is None
🔧 Function: Saves complete training state including:
- Model parameters (actor-critic, DAgger networks)
- Optimizer states (PPO, imitation, DAgger optimizers)
- Observation normalizers
- Auxiliary modules (RND, AMP)
- Training iteration counter
- Mixed precision scaler state
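For illustration, the snippet below assembles and saves a comparable checkpoint dictionary with torch.save; the key names are illustrative placeholders rather than the runner's actual checkpoint schema.
import torch
import torch.nn as nn

# Stand-in network and optimizer for illustration only.
actor_critic = nn.Linear(8, 2)
optimizer = torch.optim.Adam(actor_critic.parameters(), lr=3e-4)

checkpoint = {
    "model_state_dict": actor_critic.state_dict(),   # network weights
    "optimizer_state_dict": optimizer.state_dict(),  # optimizer state for resumption
    "iter": 0,                                       # training iteration counter
    "infos": None,                                   # optional extra metadata
}
torch.save(checkpoint, "./example_checkpoint.pt")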
📂 Load Model State
def load(self, path, load_optimizer=True)
📥 Input:
- path (str): File path to load the model from
- load_optimizer (bool): Whether to load optimizer states. Default is True
📤 Output: Additional information dictionary from saved state
🔧 Function: Restores complete training state for resuming training or inference.
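A typical resume flow, assuming a runner constructed as in the usage example later in this document (the checkpoint path is a placeholder):
# Resume training from a previously saved checkpoint.
infos = runner.load("./logs/model_5000.pt", load_optimizer=True)
runner.learn(num_learning_iterations=5000)  # continues from the restored iteration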
🎯 Inference Support
🔍 Get Inference Policy
def get_inference_policy(self, device=None)
📥 Input:
- device (str | None): Target device for inference. Default is None
📤 Output: Inference-ready policy function
🔧 Function: Returns a policy function optimized for inference with:
- Evaluation mode (dropout disabled, batch norm using running statistics)
- Proper observation normalization
- Device-specific optimization
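A minimal inference sketch, mirroring the call signature used in the usage example later in this document (runner construction omitted; observation sizes are placeholders):
import torch

# `runner` as constructed in the usage example further below.
policy = runner.get_inference_policy(device="cuda")

with torch.no_grad():
    obs = torch.randn(1, 64, device="cuda")                      # placeholder actor observation
    ref_obs = (torch.randn(1, 32, device="cuda"),                # placeholder reference observation
               torch.ones(1, dtype=torch.bool, device="cuda"))   # availability mask
    actions = policy(obs, ref_obs)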
🎭 Mode Control
🏋️ Training Mode
def train_mode(self)
🔧 Function: Sets all components to training mode (enables dropout, batch norm uses batch statistics, etc.)
🎯 Evaluation Mode
def eval_mode(self)
🔧 Function: Sets all components to evaluation mode (disables dropout, batch norm uses running statistics, etc.)
📊 Logging and Monitoring
📈 Performance Logging
def log(self, locs: dict, width: int = 80, pad: int = 35)
📥 Input:
- locs (dict): Local variables from the training iteration
- width (int): Console output width. Default is 80
- pad (int): Padding for formatted output. Default is 35
🔧 Function: Comprehensive logging including:
- Loss Metrics: Value, surrogate, imitation, DAgger losses
- Performance Metrics: FPS, collection/learning times
- Reward Tracking: Episode rewards, lengths, success rates
- Auxiliary Metrics: RND rewards, AMP scores, symmetry losses
- Policy Metrics: Action noise, learning rates
🔧 Utility Methods
📚 Add Git Repository
def add_git_repo_to_log(self, repo_file_path)
📥 Input:
- repo_file_path (str): Path to the git repository for version tracking
🔧 Function: Adds repository to version tracking for reproducibility.
🎮 Training Configuration
📋 Configuration Structure
Training Configuration (train_cfg):
train_cfg = {
"algorithm": {
"class_name": "MMPPO", # or "Distillation"
# MMPPO-specific parameters
"num_learning_epochs": 5,
"num_mini_batches": 4,
"clip_param": 0.2,
"gamma": 0.99,
"lam": 0.95,
"learning_rate": 3e-4,
# DAgger configuration
"teacher_coef": 0.8,
"teacher_loss_coef": 0.5,
"teacher_apply_interval": 5,
# Auxiliary modules
"rnd_cfg": {...},
"amp_cfg": {...},
"symmetry_cfg": {...},
# Multi-GPU configuration
"multi_gpu_cfg": {...}
},
"policy": {
"class_name": "ActorCriticMMTransformerV2",
# Network architecture parameters
"dim_model": 256,
"num_layers": 6,
"num_heads": 8,
"history_length": 5,
"enable_lora": True
},
# Training parameters
"num_steps_per_env": 24,
"save_interval": 500,
"empirical_normalization": True,
# Logging configuration
"logger": "tensorboard" # or "wandb", "neptune"
}
🔧 Multi-Modal Environment Integration
Actor-Critic Dimension Mapping:
# Standard MM Transformer
actor_critic_dim = {
"num_actor_obs": num_obs,
"num_actor_ref_obs": num_ref_obs,
"num_critic_obs": num_critic_obs,
"num_critic_ref_obs": num_critic_ref_obs,
}
# MM Transformer V2 (with term dictionaries)
actor_critic_dim = {
"term_dict": {
"base_lin_vel": 3,
"joint_pos": 19,
# ... other observation terms
},
"ref_term_dict": {
"ref_base_lin_vel": 3,
"ref_joint_pos": 19,
# ... other reference terms
}
}
🚀 Usage Example
import torch
from rsl_rl.runners.on_policy_runner_mm import OnPolicyRunnerMM
from rsl_rl.env import MMVecEnv
# Environment setup
env = MMVecEnv(
num_envs=4096,
observation_manager=observation_manager,
ref_observation_manager=ref_observation_manager,
# ... other environment configuration
)
# Training configuration
train_cfg = {
"algorithm": {
"class_name": "MMPPO",
"num_learning_epochs": 5,
"num_mini_batches": 4,
"clip_param": 0.2,
"gamma": 0.99,
"lam": 0.95,
"learning_rate": 3e-4,
# DAgger configuration
"teacher_coef": 0.8,
"teacher_loss_coef": 0.5,
"teacher_lr": 1e-4,
"teacher_apply_interval": 5,
# Auxiliary modules
"rnd_cfg": {
"learning_rate": 1e-3,
"weight": 0.1
},
"amp_cfg": {
"net_cfg": {
"backbone_input_dim": 96,
"backbone_output_dim": 256
},
"learning_rate": 1e-3,
"gradient_penalty_coeff": 10.0
},
"symmetry_cfg": {
"use_data_augmentation": True,
"use_mirror_loss": True
}
},
"policy": {
"class_name": "ActorCriticMMTransformerV2",
"dim_model": 256,
"num_layers": 6,
"num_heads": 8,
"history_length": 5,
"enable_lora": True
},
"num_steps_per_env": 24,
"save_interval": 500,
"empirical_normalization": True,
"logger": "tensorboard"
}
# Create runner
runner = OnPolicyRunnerMM(
env=env,
train_cfg=train_cfg,
log_dir="./logs/mmppo_experiment",
device="cuda"
)
# Add version tracking
runner.add_git_repo_to_log("/path/to/your/code")
# Training
runner.learn(num_learning_iterations=10000)
# Save final model
runner.save("./logs/final_model.pt")
# Get inference policy
policy = runner.get_inference_policy(device="cuda")
# Use the policy for inference (num_obs / num_ref_obs are the observation
# dimensions reported by the environment)
obs = torch.randn(1, num_obs)
ref_obs = (torch.randn(1, num_ref_obs), torch.ones(1, dtype=torch.bool))  # (reference obs, availability mask)
actions = policy(obs, ref_obs)
🎯 Multi-Modal Features
🔍 Observation Management
Multi-Modal Observation Types:
- Actor Observations: Primary sensory input for policy decisions
- Reference Observations: Expert demonstration data with availability masks
- Critic Observations: Privileged information (e.g., true state, ground truth)
- Reference Critic Observations: Expert privileged information
Observation Normalization:
- Empirical Normalization: Adaptive running statistics normalization
- Separate Normalizers: Independent normalization for each observation type
- Training/Evaluation Modes: Proper mode switching for inference
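The sketch below shows the general idea behind running-statistics (empirical) normalization; it is a simplified stand-in, not the normalizer class used by the runner.
import torch

class RunningNormalizer:
    """Normalize observations with running mean/variance (simplified sketch)."""

    def __init__(self, shape, eps=1e-8):
        self.mean = torch.zeros(shape)
        self.var = torch.ones(shape)
        self.count = eps

    def update(self, batch):
        # Welford-style combination of batch statistics with the running statistics.
        batch_mean = batch.mean(dim=0)
        batch_var = batch.var(dim=0, unbiased=False)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / torch.sqrt(self.var + 1e-8)

# Usage: update statistics during rollouts, then normalize observations.
normalizer = RunningNormalizer(shape=(64,))
batch = torch.randn(128, 64)
normalizer.update(batch)
normalized = normalizer(batch)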
🎭 Training Type Support
RL Training (MMPPO):
- On-policy reinforcement learning with expert guidance
- DAgger integration for expert action supervision
- Multi-modal reward composition (task + imitation + auxiliary)
Distillation Training:
- Teacher-student knowledge transfer
- Compact student networks learning from large teacher models
- Privileged information utilization during training
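A minimal sketch of the teacher-student idea, assuming deterministic action outputs and a simple MSE behaviour-cloning loss; the actual objective is defined by the library's distillation algorithm and may differ.
import torch
import torch.nn as nn

# Hypothetical teacher (privileged obs) and compact student (deployable obs) networks.
teacher = nn.Sequential(nn.Linear(96, 256), nn.ELU(), nn.Linear(256, 19))
student = nn.Sequential(nn.Linear(64, 128), nn.ELU(), nn.Linear(128, 19))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

privileged_obs = torch.randn(4096, 96)   # teacher sees privileged information
student_obs = torch.randn(4096, 64)      # student sees deployable observations only

with torch.no_grad():
    teacher_actions = teacher(privileged_obs)   # supervision targets

student_actions = student(student_obs)
loss = nn.functional.mse_loss(student_actions, teacher_actions)  # behaviour-cloning loss
optimizer.zero_grad()
loss.backward()
optimizer.step()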
🔧 Auxiliary Learning Integration
Random Network Distillation (RND):
- Intrinsic motivation through prediction error
- Automatic reward scaling based on environment timestep
- Integration with extrinsic reward tracking
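A compact sketch of the RND idea: a fixed random target network, a trained predictor, and the prediction error used as an intrinsic reward. This illustrates the concept, not the library's RND implementation.
import torch
import torch.nn as nn

obs_dim, embed_dim = 64, 32
target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, embed_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, embed_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the target network stays fixed

obs = torch.randn(4096, obs_dim)
with torch.no_grad():
    target_embed = target(obs)
pred_embed = predictor(obs)

# Per-environment prediction error serves as the intrinsic (exploration) reward.
intrinsic_reward = (pred_embed - target_embed).pow(2).mean(dim=1).detach()

# The predictor is trained to minimize the same error.
rnd_loss = (pred_embed - target_embed).pow(2).mean()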
Adversarial Motion Prior (AMP):
- Style-aware motion learning through discriminator
- Gradient penalty for training stability
- Motion quality assessment and reward generation
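The gradient penalty mentioned above is the usual regularizer on the discriminator's gradient at reference-motion samples; a minimal sketch with an illustrative discriminator follows (not the library's AMP module).
import torch
import torch.nn as nn

# Illustrative discriminator; AMP scores transitions from reference motions vs. policy motions.
discriminator = nn.Sequential(nn.Linear(96, 256), nn.ELU(), nn.Linear(256, 1))

expert_transitions = torch.randn(256, 96, requires_grad=True)  # reference-motion features
expert_logits = discriminator(expert_transitions)

# Gradient penalty: penalize the norm of the discriminator's gradient at expert samples.
grad = torch.autograd.grad(
    outputs=expert_logits.sum(), inputs=expert_transitions, create_graph=True
)[0]
gradient_penalty = grad.pow(2).sum(dim=1).mean()
# This term is scaled (e.g. by gradient_penalty_coeff) and added to the discriminator loss.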
Symmetry Learning:
- Data augmentation through observation mirroring
- Symmetry loss for consistent symmetric behaviors
- Environment-specific symmetry function integration
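A minimal sketch of mirror-based data augmentation, assuming a hypothetical environment-specific mirroring function; the permutation and sign pattern below are placeholders for the robot's actual kinematic symmetry.
import torch

def mirror_fn(obs: torch.Tensor) -> torch.Tensor:
    # Hypothetical mirroring: a fixed index permutation plus per-channel sign flips,
    # both defined by the robot's left/right symmetry.
    perm = torch.arange(obs.shape[-1]).flip(0)   # placeholder permutation
    sign = torch.ones(obs.shape[-1])             # placeholder sign pattern
    return obs[..., perm] * sign

obs = torch.randn(4096, 64)
augmented_obs = torch.cat([obs, mirror_fn(obs)], dim=0)  # doubled batch for data augmentation

# A mirror/symmetry loss would additionally require that the policy's action for a mirrored
# observation equals the mirrored action for the original observation (omitted here).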
🔧 Performance Optimizations
🚀 Multi-GPU Training
Distributed Training Features:
- Parameter Synchronization: Broadcast parameters from rank 0
- Gradient Aggregation: Collect and average gradients across GPUs
- Logging Coordination: Single-GPU logging to avoid conflicts
- NCCL Backend: Optimized GPU-to-GPU communication
Configuration:
# Environment variables for multi-GPU training
# WORLD_SIZE: Total number of GPUs
# RANK: Global rank of the current process
# LOCAL_RANK: Local rank within the node
import os

# The runner derives its multi-GPU configuration automatically:
multi_gpu_cfg = {
    "global_rank": int(os.getenv("RANK", "0")),
    "local_rank": int(os.getenv("LOCAL_RANK", "0")),
    "world_size": int(os.getenv("WORLD_SIZE", "1")),
}
💾 Memory Management
Efficient Storage:
- Multi-modal rollout storage with proper memory layout
- Separate storage for different observation types
- Memory-efficient batch processing
Mixed Precision Support:
- Automatic mixed precision training integration
- Gradient scaling for numerical stability
- Memory usage optimization
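Mixed precision in PyTorch typically combines autocast with a gradient scaler; a minimal sketch of that pattern (not the runner's exact code):
import torch
import torch.nn as nn

model = nn.Linear(64, 19).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

obs = torch.randn(4096, 64, device="cuda")
targets = torch.randn(4096, 19, device="cuda")

with torch.cuda.amp.autocast():
    loss = nn.functional.mse_loss(model(obs), targets)  # forward pass in reduced precision

scaler.scale(loss).backward()   # scale the loss to avoid underflowing low-precision gradients
scaler.step(optimizer)          # unscale gradients and apply the optimizer step
scaler.update()                 # adjust the scale factor for the next iteration
optimizer.zero_grad()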
📊 Logging and Monitoring
📈 Comprehensive Metrics
Training Metrics:
- Loss Components: Value, surrogate, imitation, DAgger losses
- Performance: FPS, collection/learning times, iteration progress
- Policy Behavior: Action noise std, learning rates, entropy
Environment Metrics:
- Episode Statistics: Mean rewards, episode lengths, success rates
- Custom Metrics: Environment-specific performance indicators
- Reward Decomposition: Task, imitation, and auxiliary reward tracking
Auxiliary Module Metrics:
- RND: Extrinsic/intrinsic rewards, exploration weights
- AMP: Discriminator accuracy, gradient penalties, motion scores
- Symmetry: Symmetry loss values, augmentation effects
🔧 Logger Support
TensorBoard:
- Standard scalar/histogram logging
- Real-time training monitoring
- Model graph visualization
Weights & Biases:
- Cloud-based experiment tracking
- Advanced visualization and comparison
- Model artifact management
Neptune:
- Experiment management and collaboration
- Advanced hyperparameter tracking
- Integration with model versioning
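Scalar logging with the TensorBoard backend, for instance, follows the standard SummaryWriter pattern shown below (a generic illustration with made-up values, not the runner's own logging code):
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./logs/mmppo_experiment")
writer.add_scalar("Loss/value_function", 0.42, global_step=100)  # illustrative values
writer.add_scalar("Train/mean_reward", 12.3, global_step=100)
writer.close()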
🔧 Implementation Notes
⚠️ Important Considerations
Environment Compatibility:
- Requires MMVecEnv with multi-modal observation support
- Proper observation manager configuration essential
- Reference observation availability masking
Training Stability:
- Empirical normalization recommended for stable training
- Proper learning rate scheduling for long training runs
- Regular checkpointing for training robustness
Multi-GPU Setup:
- Proper environment variable configuration required
- Device consistency across all components
- Memory synchronization considerations
🚨 Common Issues
Memory Problems:
- Large batch sizes with multi-modal observations
- Multiple auxiliary modules increasing memory usage
- Solution: Reduce batch size, enable mixed precision
Convergence Issues:
- Improper observation normalization
- Conflicting auxiliary learning objectives
- Solution: Monitor individual loss components, adjust weights
Multi-GPU Synchronization:
- Parameter broadcasting failures
- Gradient synchronization errors
- Solution: Check NCCL setup, verify environment variables
📖 References
Core Algorithms:
- MMPPO: Multi-Modal Proximal Policy Optimization for humanoid robot learning
- DAgger: Dataset Aggregation for expert-guided policy learning
- Distillation: Teacher-student knowledge transfer for model compression
Auxiliary Methods:
- RND: Random Network Distillation for exploration
- AMP: Adversarial Motion Priors for motion style learning
- Symmetry Learning: Data augmentation for symmetric behaviors
Infrastructure:
- Multi-GPU Training: Distributed training with NCCL backend
- Mixed Precision: Automatic mixed precision for memory efficiency
- Empirical Normalization: Adaptive observation normalization
💡 Best Practice: Start with single-GPU training to verify configuration, then scale to multi-GPU. Monitor all loss components individually to ensure balanced learning across different objectives.