rl_ref_cfg
Module: GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg
🎯 Overview
Configuration classes for MMPPO (Multi-Modal PPO) and related algorithms used in reference-based robot training. These dataclasses define the initialization parameters for advanced imitation learning algorithms with features like AMP, DAgger, and symmetry learning.
📋 Configuration Classes
🧠 RslRlRefPpoActorCriticCfg
Purpose: Neural network architecture configuration for actor-critic models.
Core Architecture
class_name: str
- Network class (e.g., "ActorCriticMMTransformerV2")
max_len: int
- Sequence length for transformer models
dim_model: int
- Model dimension
num_layers: int
- Number of transformer layers
num_heads: int
- Number of attention heads
init_noise_std: float
- Initial action noise standard deviation
Advanced Features
load_dagger: bool = False
- Enable DAgger checkpoint loading
load_dagger_path: str | None
- Path to DAgger pre-trained model
load_actor_path: str | None
- Path to actor checkpoint
apply_mlp_residual: bool = False
- Use residual connections
history_length: int = 1
- Observation history length
Observation Processing
concatenate_term_names: dict[str, list[list[str]]] | None
- Policy/critic observation grouping
concatenate_ref_term_names: dict[str, list[list[str]]] | None
- Reference observation grouping
🎭 RslRlRefPpoAmpNetCfg
Purpose: AMP (Adversarial Motion Prior) discriminator network configuration.
Network Structure
backbone_input_dim: int
- Input dimension for AMP observations
backbone_output_dim: int
- Network output dimension
backbone: str = "mlp"
- Network backbone type
activation: str = "elu"
- Hidden-layer activation function
out_activation: str = "tanh"
- Output activation function
net_kwargs: dict | None
- Additional network parameters
🏆 RslRlPpoAmpCfg
Purpose: Complete AMP system configuration.
Core Settings
net_cfg: RslRlRefPpoAmpNetCfg
- Discriminator network configuration
learning_rate: float = 1e-3
- Discriminator learning rate
amp_obs_extractor: callable
- Policy observation extractor function
amp_ref_obs_extractor: callable
- Reference observation extractor function
Training Parameters
epsilon: float = 1e-4
- Numerical stability parameter
amp_reward_scale: float = 1.0
- AMP reward scaling factor
gradient_penalty_coeff: float = 10.0
- Gradient penalty coefficient
amp_update_interval: int = 10
- Discriminator update frequency
amp_pretrain_steps: int = 50
- Number of discriminator pre-training steps
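The two extractor callables map raw observations to the slices the discriminator consumes. A self-contained sketch (the local dataclass mirrors the documented fields, and the "amp" / "ref_amp" observation keys are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative stand-in for RslRlPpoAmpCfg; defaults follow the
# documentation above. The net_cfg field is omitted for brevity.
@dataclass
class AmpCfgSketch:
    amp_obs_extractor: Callable
    amp_ref_obs_extractor: Callable
    learning_rate: float = 1e-3
    epsilon: float = 1e-4
    amp_reward_scale: float = 1.0
    gradient_penalty_coeff: float = 10.0
    amp_update_interval: int = 10
    amp_pretrain_steps: int = 50

# Extractors pull the AMP-relevant slices out of the observation dict;
# the key names here are placeholders.
amp_cfg = AmpCfgSketch(
    amp_obs_extractor=lambda obs: obs["amp"],
    amp_ref_obs_extractor=lambda obs: obs["ref_amp"],
)
```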
🚀 RslRlRefPpoAlgorithmCfg
Purpose: MMPPO algorithm configuration with advanced features.
Standard PPO Parameters
class_name: str
- Algorithm class (e.g., "MMPPO")
value_loss_coef: float
- Value function loss coefficient
clip_param: float
- PPO clipping parameter
use_clipped_value_loss: bool
- Enable clipped value loss
desired_kl: float
- Target KL divergence
entropy_coef: float
- Entropy bonus coefficient
gamma: float
- Discount factor
lam: float
- GAE lambda parameter
max_grad_norm: float
- Gradient clipping threshold
Learning Schedule
learning_rate: float
- Base learning rate
normalize_advantage_per_mini_batch: bool = False
- Normalize advantages per mini-batch
num_learning_epochs: int
- Epochs per iteration
num_mini_batches: int
- Mini-batches per epoch
schedule: str
- Learning rate schedule type
Advanced Learning Rate Control
max_lr: float = 1e-2
- Maximum learning rate
min_lr: float = 1e-4
- Minimum learning rate
max_lr_after_certain_epoch: float = 1e-3
- Maximum learning rate after the restriction epoch
max_lr_restriction_epoch: int = 5000
- Epoch at which the max-LR restriction takes effect
min_lr_after_certain_epoch: float = 1e-5
- Minimum learning rate after the restriction epoch
min_lr_restriction_epoch: int = 5000
- Epoch at which the min-LR restriction takes effect
DAgger (Teacher-Student) Configuration
teacher_coef: float | None
- Teacher coefficient for DAgger
teacher_coef_range: tuple[float, float] | None
- Teacher coefficient range
teacher_coef_decay: float | None
- Teacher coefficient decay rate
teacher_coef_decay_interval: int = 100
- Decay interval
teacher_coef_mode: str = "kl"
- Teacher coefficient mode ("kl" or "norm")
teacher_loss_coef: float | None
- Teacher loss coefficient
teacher_loss_coef_range: tuple[float, float] | None
- Teacher loss coefficient range
teacher_loss_coef_decay: float | None
- Teacher loss decay rate
teacher_loss_coef_decay_interval: int = 100
- Teacher loss decay interval
teacher_update_interval: int = 1
- Teacher update frequency
teacher_supervising_intervals: int = 0
- Duration of teacher supervision
teacher_updating_intervals: int = 0
- Duration of teacher updates
teacher_lr: float = 5e-4
- Teacher learning rate
teacher_only_interval: int = 0
- Number of teacher-only training steps
Extension Configurations
rnd_cfg: dict | None
- Random Network Distillation configuration
symmetry_cfg: dict | None
- Symmetry learning configuration
amp_cfg: RslRlPpoAmpCfg | None
- AMP configuration
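A sketch of the standard PPO portion of this configuration, using a local mirror dataclass (the hyperparameter values shown are common PPO choices, not defaults taken from the source; the DAgger and extension fields are omitted for brevity):

```python
from dataclasses import dataclass

# Illustrative stand-in for RslRlRefPpoAlgorithmCfg, restricted to the
# standard PPO and learning-schedule fields documented above.
@dataclass
class AlgorithmCfgSketch:
    class_name: str
    value_loss_coef: float
    clip_param: float
    use_clipped_value_loss: bool
    desired_kl: float
    entropy_coef: float
    gamma: float
    lam: float
    max_grad_norm: float
    learning_rate: float
    num_learning_epochs: int
    num_mini_batches: int
    schedule: str

# Typical PPO hyperparameters; tune per task.
algo_cfg = AlgorithmCfgSketch(
    class_name="MMPPO",
    value_loss_coef=1.0,
    clip_param=0.2,
    use_clipped_value_loss=True,
    desired_kl=0.01,
    entropy_coef=0.005,
    gamma=0.99,
    lam=0.95,
    max_grad_norm=1.0,
    learning_rate=1e-3,
    num_learning_epochs=5,
    num_mini_batches=4,
    schedule="adaptive",
)
```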
🏃‍♂️ RslRlRefOnPolicyRunnerCfg
Purpose: Complete training runner configuration.
Training Setup
seed: int = 42
- Random seed
device: str = "cuda:0"
- Training device
num_steps_per_env: int
- Steps per environment per iteration
max_iterations: int
- Total training iterations
empirical_normalization: bool
- Use empirical observation normalization
Component Configurations
policy: RslRlRefPpoActorCriticCfg
- Actor-critic configuration
algorithm: RslRlRefPpoAlgorithmCfg
- Algorithm configuration
Experiment Management
save_interval: int
- Model save frequency
experiment_name: str
- Experiment identifier
run_name: str = ""
- Specific run name
logger: Literal["tensorboard", "neptune", "wandb"] = "tensorboard"
- Logging backend
Checkpoint Management
resume: bool = False
- Resume from checkpoint
load_run: str = ".*"
- Regex pattern for the run to load
load_checkpoint: str = "model_.*.pt"
- Regex pattern for the checkpoint file
abs_checkpoint_path: str = ""
- Absolute checkpoint path
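Putting the runner configuration together might look like the following self-contained sketch (a local mirror of the documented fields; the `policy` and `algorithm` slots would hold the actor-critic and algorithm configs in practice, and the experiment values are placeholders):

```python
from dataclasses import dataclass
from typing import Any, Literal

# Illustrative stand-in for RslRlRefOnPolicyRunnerCfg; defaults follow
# the documentation above.
@dataclass
class RunnerCfgSketch:
    num_steps_per_env: int
    max_iterations: int
    empirical_normalization: bool
    policy: Any
    algorithm: Any
    save_interval: int
    experiment_name: str
    seed: int = 42
    device: str = "cuda:0"
    run_name: str = ""
    logger: Literal["tensorboard", "neptune", "wandb"] = "tensorboard"
    resume: bool = False
    load_run: str = ".*"
    load_checkpoint: str = "model_.*.pt"
    abs_checkpoint_path: str = ""

runner_cfg = RunnerCfgSketch(
    num_steps_per_env=24,
    max_iterations=20000,
    empirical_normalization=True,
    policy=None,      # an RslRlRefPpoActorCriticCfg instance in practice
    algorithm=None,   # an RslRlRefPpoAlgorithmCfg instance in practice
    save_interval=100,
    experiment_name="mmppo_ref_tracking",
)
```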
🔗 Usage Context
These configuration classes are used in the Basic Guide tutorial for setting up sophisticated robot training experiments. They provide comprehensive control over all aspects of the MMPPO algorithm and its advanced features.