rl_ref_cfg
Module: GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg
🎯 Overview
Configuration classes for MMPPO (Multi-Modal PPO) and related algorithms used in reference-based robot training. These dataclasses define the initialization parameters for the imitation-learning pipeline, including advanced features such as AMP, DAgger-style teacher-student training, and symmetry learning.
📋 Configuration Classes
🧠 RslRlRefPpoActorCriticCfg
Purpose: Neural network architecture configuration for actor-critic models.
Core Architecture
- class_name: str - Network class (e.g., "ActorCriticMMTransformerV2")
- max_len: int - Sequence length for transformer models
- dim_model: int - Model dimension
- num_layers: int - Number of transformer layers
- num_heads: int - Number of attention heads
- init_noise_std: float - Initial action noise standard deviation
Advanced Features
- load_dagger: bool = False - Enable DAgger checkpoint loading
- load_dagger_path: str | None - Path to DAgger pre-trained model
- load_actor_path: str | None - Path to actor checkpoint
- apply_mlp_residual: bool = False - Use residual connections
- history_length: int = 1 - Observation history length
Observation Processing
- concatenate_term_names: dict[str, list[list[str]]] | None - Policy/critic observation grouping
- concatenate_ref_term_names: dict[str, list[list[str]]] | None - Reference observation grouping
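The sketch below shows one way these fields might be filled in, assuming the config can be constructed with keyword arguments for the fields listed above. The import path follows the module listed at the top of this page; all values are illustrative assumptions, not recommended settings.

```python
# Illustrative sketch only -- every value here is an assumption, not a recommended setting.
from GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg import (
    RslRlRefPpoActorCriticCfg,
)

policy_cfg = RslRlRefPpoActorCriticCfg(
    class_name="ActorCriticMMTransformerV2",  # transformer-based actor-critic
    max_len=32,            # sequence length seen by the transformer
    dim_model=256,         # model (embedding) dimension
    num_layers=4,          # number of transformer layers
    num_heads=8,           # attention heads per layer
    init_noise_std=1.0,    # initial std of the action distribution
    load_dagger=False,     # no DAgger checkpoint to start from
    history_length=1,      # single-step observation history
    concatenate_term_names=None,      # keep default policy/critic observation grouping
    concatenate_ref_term_names=None,  # keep default reference observation grouping
)
```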
🎭 RslRlRefPpoAmpNetCfg
Purpose: AMP (Adversarial Motion Prior) discriminator network configuration.
Network Structure
- backbone_input_dim: int - Input dimension for AMP observations
- backbone_output_dim: int - Network output dimension
- backbone: str = "mlp" - Network backbone type
- activation: str = "elu" - Hidden layer activation function
- out_activation: str = "tanh" - Output activation function
- net_kwargs: dict | None - Additional network parameters
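A minimal sketch of the discriminator network config, again assuming keyword-argument construction; the input dimension is a hypothetical value that depends on your AMP observation space.

```python
# Illustrative sketch -- backbone_input_dim is hypothetical and task-dependent.
from GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg import (
    RslRlRefPpoAmpNetCfg,
)

amp_net_cfg = RslRlRefPpoAmpNetCfg(
    backbone_input_dim=60,   # size of the flattened AMP observation vector
    backbone_output_dim=1,   # scalar discriminator output
    backbone="mlp",          # MLP backbone
    activation="elu",        # hidden layer activation
    out_activation="tanh",   # output activation
    net_kwargs=None,         # extra backbone parameters, if the backbone accepts any
)
```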
🏆 RslRlPpoAmpCfg
Purpose: Complete AMP system configuration.
Core Settings
- net_cfg: RslRlRefPpoAmpNetCfg - Network configuration
- learning_rate: float = 1e-3 - Discriminator learning rate
- amp_obs_extractor: callable - Policy observation extractor function
- amp_ref_obs_extractor: callable - Reference observation extractor function
Training Parameters
- epsilon: float = 1e-4 - Numerical stability parameter
- amp_reward_scale: float = 1.0 - AMP reward scaling factor
- gradient_penalty_coeff: float = 10.0 - Gradient penalty coefficient
- amp_update_interval: int = 10 - Discriminator update frequency
- amp_pretrain_steps: int = 50 - Pre-training steps
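Putting the two pieces together might look like the sketch below. The extractor lambdas (and the "amp" keys they index) are hypothetical placeholders for whatever slices your policy and reference observation structures actually expose.

```python
# Illustrative sketch -- the extractors and their "amp" keys are hypothetical placeholders.
from GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg import RslRlPpoAmpCfg

amp_cfg = RslRlPpoAmpCfg(
    net_cfg=amp_net_cfg,            # discriminator network config from the sketch above
    learning_rate=1e-3,             # discriminator learning rate
    amp_obs_extractor=lambda obs: obs["amp"],               # policy observation extractor
    amp_ref_obs_extractor=lambda ref_obs: ref_obs["amp"],   # reference observation extractor
    epsilon=1e-4,                   # numerical stability term
    amp_reward_scale=1.0,           # scaling of the AMP (style) reward
    gradient_penalty_coeff=10.0,    # discriminator gradient penalty coefficient
    amp_update_interval=10,         # discriminator update frequency
    amp_pretrain_steps=50,          # discriminator pre-training steps
)
```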
🚀 RslRlRefPpoAlgorithmCfg
Purpose: MMPPO algorithm configuration with advanced features.
Standard PPO Parameters
- class_name: str - Algorithm class (e.g., "MMPPO")
- value_loss_coef: float - Value function loss coefficient
- clip_param: float - PPO clipping parameter
- use_clipped_value_loss: bool - Enable clipped value loss
- desired_kl: float - Target KL divergence
- entropy_coef: float - Entropy bonus coefficient
- gamma: float - Discount factor
- lam: float - GAE lambda parameter
- max_grad_norm: float - Gradient clipping threshold
Learning Schedule
- learning_rate: float - Base learning rate
- normalize_advantage_per_mini_batch: bool = False - Normalize advantages per mini-batch
- num_learning_epochs: int - Epochs per iteration
- num_mini_batches: int - Mini-batches per epoch
- schedule: str - Learning rate schedule type
Advanced Learning Rate Control
- max_lr: float = 1e-2 - Maximum learning rate
- min_lr: float = 1e-4 - Minimum learning rate
- max_lr_after_certain_epoch: float = 1e-3 - Maximum learning rate after the restriction epoch
- max_lr_restriction_epoch: int = 5000 - Epoch at which the maximum-LR restriction takes effect
- min_lr_after_certain_epoch: float = 1e-5 - Minimum learning rate after the restriction epoch
- min_lr_restriction_epoch: int = 5000 - Epoch at which the minimum-LR restriction takes effect
DAgger (Teacher-Student) Configuration
- teacher_coef: float | None - Teacher coefficient for DAgger
- teacher_coef_range: tuple[float, float] | None - Teacher coefficient range
- teacher_coef_decay: float | None - Teacher coefficient decay rate
- teacher_coef_decay_interval: int = 100 - Teacher coefficient decay interval
- teacher_coef_mode: str = "kl" - Teacher coefficient mode ("kl" or "norm")
- teacher_loss_coef: float | None - Teacher loss coefficient
- teacher_loss_coef_range: tuple[float, float] | None - Teacher loss coefficient range
- teacher_loss_coef_decay: float | None - Teacher loss coefficient decay rate
- teacher_loss_coef_decay_interval: int = 100 - Teacher loss coefficient decay interval
- teacher_update_interval: int = 1 - Teacher update frequency
- teacher_supervising_intervals: int = 0 - Supervision duration
- teacher_updating_intervals: int = 0 - Teacher update duration
- teacher_lr: float = 5e-4 - Teacher learning rate
- teacher_only_interval: int = 0 - Teacher-only training steps
Extension Configurations
- rnd_cfg: dict | None - Random Network Distillation configuration
- symmetry_cfg: dict | None - Symmetry learning configuration
- amp_cfg: RslRlPpoAmpCfg | None - AMP configuration
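A sketch of a full algorithm config is shown below. The PPO hyperparameters are generic, commonly used values rather than defaults taken from this module, and the DAgger and extension fields are simply left disabled.

```python
# Illustrative sketch -- hyperparameter values are generic PPO choices, not module defaults.
from GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg import (
    RslRlRefPpoAlgorithmCfg,
)

algorithm_cfg = RslRlRefPpoAlgorithmCfg(
    class_name="MMPPO",
    value_loss_coef=1.0,
    clip_param=0.2,
    use_clipped_value_loss=True,
    desired_kl=0.01,
    entropy_coef=0.005,
    gamma=0.99,
    lam=0.95,
    max_grad_norm=1.0,
    learning_rate=1e-3,
    num_learning_epochs=5,
    num_mini_batches=4,
    schedule="adaptive",
    # teacher-student (DAgger) terms are left unset, so no teacher is used
    teacher_coef=None,
    teacher_loss_coef=None,
    # optional extensions
    rnd_cfg=None,
    symmetry_cfg=None,
    amp_cfg=amp_cfg,   # AMP config from the sketch above; use None to disable AMP
)
```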
🏃‍♂️ RslRlRefOnPolicyRunnerCfg
Purpose: Complete training runner configuration.
Training Setup
- seed: int = 42 - Random seed
- device: str = "cuda:0" - Training device
- num_steps_per_env: int - Steps per environment per iteration
- max_iterations: int - Total training iterations
- empirical_normalization: bool - Use empirical observation normalization
Component Configurations
- policy: RslRlRefPpoActorCriticCfg - Actor-critic configuration
- algorithm: RslRlRefPpoAlgorithmCfg - Algorithm configuration
Experiment Management
- save_interval: int - Model save frequency
- experiment_name: str - Experiment identifier
- run_name: str = "" - Specific run name
- logger: Literal["tensorboard", "neptune", "wandb"] = "tensorboard" - Logging backend
Checkpoint Management
- resume: bool = False - Resume from checkpoint
- load_run: str = ".*" - Run pattern to load from
- load_checkpoint: str = "model_.*.pt" - Checkpoint file pattern
- abs_checkpoint_path: str = "" - Absolute checkpoint path
🔗 Usage Context
These configuration classes are used in the Basic Guide tutorial for setting up sophisticated robot training experiments. They provide comprehensive control over all aspects of the MMPPO algorithm and its advanced features.
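In Isaac Lab based projects, configs like these are typically specialized per task by subclassing and overriding fields rather than instantiating them inline. The sketch below assumes the standard isaaclab.utils.configclass decorator and a hypothetical task name, so treat it as a pattern rather than a verbatim recipe.

```python
# Hypothetical pattern sketch -- the configclass import and all overridden values are assumptions.
from isaaclab.utils import configclass

from GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg import (
    RslRlRefOnPolicyRunnerCfg,
)


@configclass
class MyRobotRefPpoRunnerCfg(RslRlRefOnPolicyRunnerCfg):
    """Task-specific override of the generic runner configuration (hypothetical task)."""

    experiment_name = "my_robot_ref_tracking"  # placeholder experiment identifier
    max_iterations = 30000                     # longer training run for this task
    num_steps_per_env = 24                     # rollout length per environment
    save_interval = 200                        # checkpoint less frequently
```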