rl_ref_cfg

Module: GBC.gyms.isaaclab_45.lab_tasks.utils.wrappers.rsl_rl.rl_ref_cfg

🎯 Overview

Configuration classes for MMPPO (Multi-Modal PPO) and related algorithms used in reference-based robot training. These dataclasses define the initialization parameters for the actor-critic network, the AMP discriminator, the PPO algorithm, and the training runner, including optional extensions such as AMP, DAgger, and symmetry learning.

📋 Configuration Classes

🧠 RslRlRefPpoActorCriticCfg

Purpose: Neural network architecture configuration for actor-critic models.

Core Architecture

  • class_name: str - Network class (e.g., "ActorCriticMMTransformerV2")
  • max_len: int - Sequence length for transformer models
  • dim_model: int - Model dimension
  • num_layers: int - Number of transformer layers
  • num_heads: int - Number of attention heads
  • init_noise_std: float - Initial action noise standard deviation

Advanced Features

  • load_dagger: bool = False - Enable DAgger checkpoint loading
  • load_dagger_path: str | None - Path to DAgger pre-trained model
  • load_actor_path: str | None - Path to actor checkpoint
  • apply_mlp_residual: bool = False - Use residual connections
  • history_length: int = 1 - Observation history length

Observation Processing

  • concatenate_term_names: dict[str, list[list[str]]] | None - Policy/critic observation grouping
  • concatenate_ref_term_names: dict[str, list[list[str]]] | None - Reference observation grouping
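
To make the field layout concrete, the following is a minimal sketch of a stand-in dataclass that mirrors the fields listed above. `ActorCriticCfgSketch` is a hypothetical name, not the real `RslRlRefPpoActorCriticCfg`. The numeric values, the `None` defaults for the path and grouping fields, the `"policy"`/`"critic"` keys, and the observation term names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in mirroring the documented fields; the real class lives
# in rl_ref_cfg and may differ in defaults and validation.
@dataclass
class ActorCriticCfgSketch:
    class_name: str
    max_len: int
    dim_model: int
    num_layers: int
    num_heads: int
    init_noise_std: float
    load_dagger: bool = False
    load_dagger_path: Optional[str] = None  # assumed default
    load_actor_path: Optional[str] = None  # assumed default
    apply_mlp_residual: bool = False
    history_length: int = 1
    concatenate_term_names: Optional[dict[str, list[list[str]]]] = None
    concatenate_ref_term_names: Optional[dict[str, list[list[str]]]] = None


policy_cfg = ActorCriticCfgSketch(
    class_name="ActorCriticMMTransformerV2",
    max_len=16,  # illustrative values, not recommended settings
    dim_model=128,
    num_layers=4,
    num_heads=8,
    init_noise_std=1.0,
    # Each inner list is one group of observation terms (names are made up).
    concatenate_term_names={
        "policy": [["joint_pos", "joint_vel"], ["base_ang_vel"]],
        "critic": [["joint_pos", "joint_vel", "base_lin_vel"]],
    },
)
```

The nested-list shape follows the documented type `dict[str, list[list[str]]]`; how the network consumes each group is not specified here.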

🎭 RslRlRefPpoAmpNetCfg

Purpose: AMP (Adversarial Motion Prior) discriminator network configuration.

Network Structure

  • backbone_input_dim: int - Input dimension for AMP observations
  • backbone_output_dim: int - Network output dimension
  • backbone: str = "mlp" - Network backbone type
  • activation: str = "elu" - Hidden layer activation function
  • out_activation: str = "tanh" - Output activation function
  • net_kwargs: dict | None - Additional network parameters

🏆 RslRlPpoAmpCfg

Purpose: AMP configuration that combines the discriminator network, the observation extractors, and the discriminator training parameters.

Core Settings

  • net_cfg: RslRlRefPpoAmpNetCfg - Network configuration
  • learning_rate: float = 1e-3 - Discriminator learning rate
  • amp_obs_extractor: callable - Policy observation extractor function
  • amp_ref_obs_extractor: callable - Reference observation extractor function

Training Parameters

  • epsilon: float = 1e-4 - Numerical stability parameter
  • amp_reward_scale: float = 1.0 - AMP reward scaling factor
  • gradient_penalty_coeff: float = 10.0 - Gradient penalty coefficient
  • amp_update_interval: int = 10 - Discriminator update frequency
  • amp_pretrain_steps: int = 50 - Pre-training steps
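
As a rough illustration of how the network configuration nests inside the AMP configuration, here is a sketch built from stand-in dataclasses. `AmpNetCfgSketch`, `AmpCfgSketch`, both extractor functions, and the observation keys are hypothetical. The extractor signature (a dict of observations in, a flat list out) is an assumption, since the source only states that the fields are callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


# Hypothetical stand-ins for RslRlRefPpoAmpNetCfg and RslRlPpoAmpCfg.
@dataclass
class AmpNetCfgSketch:
    backbone_input_dim: int
    backbone_output_dim: int
    backbone: str = "mlp"
    activation: str = "elu"
    out_activation: str = "tanh"
    net_kwargs: Optional[dict] = None


@dataclass
class AmpCfgSketch:
    # Fields without defaults must come first in a dataclass, so the order
    # here differs from the documented listing.
    net_cfg: AmpNetCfgSketch
    amp_obs_extractor: Callable[..., Any]
    amp_ref_obs_extractor: Callable[..., Any]
    learning_rate: float = 1e-3
    epsilon: float = 1e-4
    amp_reward_scale: float = 1.0
    gradient_penalty_coeff: float = 10.0
    amp_update_interval: int = 10
    amp_pretrain_steps: int = 50


def extract_policy_amp_obs(obs: dict) -> list:
    """Assumed extractor: flatten joint state from policy observations."""
    return list(obs["joint_pos"]) + list(obs["joint_vel"])


def extract_ref_amp_obs(ref_obs: dict) -> list:
    """Assumed extractor: flatten joint state from reference observations."""
    return list(ref_obs["ref_joint_pos"]) + list(ref_obs["ref_joint_vel"])


amp_cfg = AmpCfgSketch(
    net_cfg=AmpNetCfgSketch(backbone_input_dim=4, backbone_output_dim=1),
    amp_obs_extractor=extract_policy_amp_obs,
    amp_ref_obs_extractor=extract_ref_amp_obs,
)

# The extractor output width should agree with backbone_input_dim.
sample = amp_cfg.amp_obs_extractor({"joint_pos": [0.1, 0.2], "joint_vel": [0.0, 0.0]})
assert len(sample) == amp_cfg.net_cfg.backbone_input_dim
```

Keeping both extractors in the config lets the policy and reference observations be reduced to the same feature layout before they reach the discriminator.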

🚀 RslRlRefPpoAlgorithmCfg

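Purpose: MMPPO algorithm configuration covering standard PPO hyperparameters, learning-rate bounds, DAgger teacher-student settings, and optional extensions (RND, symmetry, AMP).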

Standard PPO Parameters

  • class_name: str - Algorithm class (e.g., "MMPPO")
  • value_loss_coef: float - Value function loss coefficient
  • clip_param: float - PPO clipping parameter
  • use_clipped_value_loss: bool - Enable clipped value loss
  • desired_kl: float - Target KL divergence
  • entropy_coef: float - Entropy bonus coefficient
  • gamma: float - Discount factor
  • lam: float - GAE lambda parameter
  • max_grad_norm: float - Gradient clipping threshold
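
For readers unfamiliar with the roles of `gamma` and `lam`, the sketch below shows textbook Generalized Advantage Estimation in plain Python. It is the standard formulation, not necessarily how MMPPO computes advantages. The function name `compute_gae` and the default values 0.99 and 0.95 are assumptions (common choices, not taken from this module).

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Textbook GAE over one rollout of length T.

    rewards, values, dones are sequences of length T; last_value is the value
    estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors with weight gamma * lam.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    last_value=0.5,
    dones=[False, False, True],
)
```

A `lam` closer to 1 weights long-horizon returns more heavily (lower bias, higher variance), while a smaller `lam` leans on the value estimates.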

Learning Schedule

  • learning_rate: float - Base learning rate
  • normalize_advantage_per_mini_batch: bool = False - Advantage normalization
  • num_learning_epochs: int - Epochs per iteration
  • num_mini_batches: int - Mini-batches per epoch
  • schedule: str - Learning rate schedule type

Advanced Learning Rate Control

  • max_lr: float = 1e-2 - Maximum learning rate
  • min_lr: float = 1e-4 - Minimum learning rate
  • max_lr_after_certain_epoch: float = 1e-3 - Max LR after restriction epoch
  • max_lr_restriction_epoch: int = 5000 - Epoch to apply max LR restriction
  • min_lr_after_certain_epoch: float = 1e-5 - Min LR after restriction epoch
  • min_lr_restriction_epoch: int = 5000 - Epoch to apply min LR restriction
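
One plausible reading of these six fields is that the learning rate is clamped to `[min_lr, max_lr]` early in training and to the tighter "after certain epoch" bounds once the restriction epoch is reached. The sketch below encodes that reading. `LrBoundsSketch`, `lr_bounds`, and `clamp_lr` are hypothetical names, and the inclusive `>=` comparison at the restriction epoch is an assumption rather than confirmed behavior.

```python
from dataclasses import dataclass


# Hypothetical container using the documented default values.
@dataclass
class LrBoundsSketch:
    max_lr: float = 1e-2
    min_lr: float = 1e-4
    max_lr_after_certain_epoch: float = 1e-3
    max_lr_restriction_epoch: int = 5000
    min_lr_after_certain_epoch: float = 1e-5
    min_lr_restriction_epoch: int = 5000


def lr_bounds(cfg: LrBoundsSketch, epoch: int) -> tuple:
    """Return (lower, upper) learning-rate bounds for the given epoch."""
    # Assumption: the restricted bound applies from the restriction epoch on.
    upper = (
        cfg.max_lr_after_certain_epoch
        if epoch >= cfg.max_lr_restriction_epoch
        else cfg.max_lr
    )
    lower = (
        cfg.min_lr_after_certain_epoch
        if epoch >= cfg.min_lr_restriction_epoch
        else cfg.min_lr
    )
    return lower, upper


def clamp_lr(cfg: LrBoundsSketch, epoch: int, proposed_lr: float) -> float:
    """Clamp a scheduler-proposed learning rate into the active bounds."""
    lower, upper = lr_bounds(cfg, epoch)
    return min(max(proposed_lr, lower), upper)


cfg = LrBoundsSketch()
early = clamp_lr(cfg, epoch=0, proposed_lr=5e-2)    # capped by max_lr
late = clamp_lr(cfg, epoch=6000, proposed_lr=5e-2)  # capped by the restricted max
```

With the documented defaults, both bounds drop by a factor of ten at epoch 5000, so an adaptive schedule can take large steps early and is held to smaller ones later.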

DAgger (Teacher-Student) Configuration

  • teacher_coef: float | None - Teacher coefficient for DAgger
  • teacher_coef_range: tuple[float, float] | None - Teacher coefficient range
  • teacher_coef_decay: float | None - Teacher coefficient decay rate
  • teacher_coef_decay_interval: int = 100 - Decay interval
  • teacher_coef_mode: str = "kl" - Teacher coefficient mode ("kl" or "norm")
  • teacher_loss_coef: float | None - Teacher loss coefficient
  • teacher_loss_coef_range: tuple[float, float] | None - Teacher loss coefficient range
  • teacher_loss_coef_decay: float | None - Teacher loss decay rate
  • teacher_loss_coef_decay_interval: int = 100 - Teacher loss decay interval
  • teacher_update_interval: int = 1 - Teacher update frequency
  • teacher_supervising_intervals: int = 0 - Supervision duration
  • teacher_updating_intervals: int = 0 - Teacher update duration
  • teacher_lr: float = 5e-4 - Teacher learning rate
  • teacher_only_interval: int = 0 - Teacher-only training steps
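
The interaction between `teacher_coef`, `teacher_coef_decay`, `teacher_coef_decay_interval`, and `teacher_coef_range` is not spelled out above. A common pattern, shown here purely as an assumption, is stepwise multiplicative decay clamped to the configured range. The function name `decayed_teacher_coef` and all example values are hypothetical.

```python
def decayed_teacher_coef(initial, decay, interval, coef_range, iteration):
    """Stepwise multiplicative decay, clamped to coef_range.

    Assumed behavior: the coefficient is multiplied by `decay` once every
    `interval` iterations and never leaves [low, high].
    """
    num_decays = iteration // interval
    coef = initial * decay ** num_decays
    low, high = coef_range
    return min(max(coef, low), high)


# Illustrative settings: start at 1.0, multiply by 0.9 every 100 iterations,
# and never go below 0.1.
schedule = [
    decayed_teacher_coef(1.0, 0.9, 100, (0.1, 1.0), it)
    for it in (0, 99, 100, 250, 5000)
]
```

Under this reading, the same shape would apply to `teacher_loss_coef` with its own decay, interval, and range fields.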

Extension Configurations

  • rnd_cfg: dict | None - Random Network Distillation configuration
  • symmetry_cfg: dict | None - Symmetry learning configuration
  • amp_cfg: RslRlPpoAmpCfg | None - AMP configuration

🏃‍♂️ RslRlRefOnPolicyRunnerCfg

Purpose: Complete training runner configuration.

Training Setup

  • seed: int = 42 - Random seed
  • device: str = "cuda:0" - Training device
  • num_steps_per_env: int - Steps per environment per iteration
  • max_iterations: int - Total training iterations
  • empirical_normalization: bool - Use empirical observation normalization

Component Configurations

  • policy: RslRlRefPpoActorCriticCfg - Actor-critic configuration
  • algorithm: RslRlRefPpoAlgorithmCfg - Algorithm configuration

Experiment Management

  • save_interval: int - Model save frequency
  • experiment_name: str - Experiment identifier
  • run_name: str = "" - Specific run name
  • logger: Literal["tensorboard", "neptune", "wandb"] = "tensorboard" - Logging backend

Checkpoint Management

  • resume: bool = False - Resume from checkpoint
  • load_run: str = ".*" - Run pattern to load from
  • load_checkpoint: str = "model_.*.pt" - Checkpoint file pattern
  • abs_checkpoint_path: str = "" - Absolute checkpoint path
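
The defaults for `load_run` and `load_checkpoint` are regular expressions, so resuming involves matching them against run directories and checkpoint files. The sketch below shows one way such a lookup could work using only the standard library. `resolve_checkpoint` is a hypothetical helper, and the choices to take the alphabetically last run and the highest-numbered checkpoint are assumptions. The directory and file names are made up for the demo.

```python
import os
import re
import tempfile


def resolve_checkpoint(log_root, load_run=".*", load_checkpoint="model_.*.pt"):
    """Pick the last matching run directory and its highest-numbered checkpoint."""
    runs = sorted(
        d for d in os.listdir(log_root)
        if os.path.isdir(os.path.join(log_root, d)) and re.match(load_run, d)
    )
    if not runs:
        raise FileNotFoundError(f"no run matching {load_run!r} in {log_root}")
    run_dir = os.path.join(log_root, runs[-1])

    files = [f for f in os.listdir(run_dir) if re.match(load_checkpoint, f)]
    if not files:
        raise FileNotFoundError(f"no checkpoint matching {load_checkpoint!r}")

    def iteration(name):
        # Compare by the numeric part so model_1000.pt beats model_500.pt.
        match = re.search(r"(\d+)", name)
        return int(match.group(1)) if match else -1

    return os.path.join(run_dir, max(files, key=iteration))


# Build a throwaway log tree to exercise the helper.
with tempfile.TemporaryDirectory() as root:
    for run in ("2024-01-01_10-00-00", "2024-01-02_09-30-00"):
        os.makedirs(os.path.join(root, run))
        for it in (100, 1000, 500):
            open(os.path.join(root, run, f"model_{it}.pt"), "w").close()
    picked = resolve_checkpoint(root)
    picked_run = os.path.basename(os.path.dirname(picked))
    picked_file = os.path.basename(picked)
```

Sorting by the numeric iteration rather than by string avoids `model_500.pt` ranking above `model_1000.pt`.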

🔗 Usage Context

These configuration classes are used in the Basic Guide tutorial to set up robot training experiments. Together they cover the network architecture, the MMPPO algorithm and its optional extensions (AMP, DAgger, RND, symmetry), and the training runner.