Training Principles
Master the art and science of robot training with GBC! 🎯 This comprehensive guide distills years of research and practical experience into actionable principles that will accelerate your robot's learning and improve training outcomes. Whether you're training lightweight quadrupeds or full-scale humanoids, these principles will guide you to success.
All code in this chapter is pseudocode, used only to demonstrate the ideas. To use these features, please refer to the API documentation.
🎯 Core Training Philosophy
Start Simple, Scale Smart - Begin with basic capabilities and progressively add complexity. Every successful robot starts with fundamentals before mastering advanced behaviors.
The key to successful robot training lies in understanding the synergy between reinforcement learning (RL) and imitation learning (IL). Rather than diving straight into complex imitation tasks, we follow a proven progression that builds robust foundations before adding sophisticated behaviors.
🏗️ Progressive Training Strategy
🎮 Phase 1: Establish RL Foundation First
Starting with reinforcement learning before imitation learning is a proven strategy that provides:
- Stable Control Foundation: RL creates robust basic locomotion policies
- Reward Function Baseline: Well-tuned RL rewards serve as a solid foundation
- Convergence Acceleration: Adding IL rewards to stable RL policies speeds up learning
- Better Generalization: RL-trained policies handle unexpected situations more gracefully
🔧 Implementation Strategy:
- Train Pure RL First: Use only locomotion rewards (velocity tracking, stability, energy efficiency)
- Achieve Stable Walking: Ensure robot can walk reliably in various conditions
- Save Checkpoint: Create a solid baseline model for future enhancement
- Add IL Gradually: Introduce imitation learning rewards to the proven RL foundation
# Example: Progressive reward introduction
# Phase 1: Pure RL
rewards_rl_only = {
'track_lin_vel_xy': 1.0,
'track_ang_vel_z': 0.5,
'feet_air_time': 0.5,
'orientation': 0.2
}
# Phase 2: RL + IL hybrid
rewards_rl_il_hybrid = {
**rewards_rl_only, # Keep RL foundation
'tracking_target_actions': 2.0, # Add IL rewards
'amp_reward': 0.8
}
🤖 Phase 2: Heavy Robots Need DAgger First
For full-scale humanoid robots (>50kg), the training progression requires special attention:
📈 Recommended Progression:
- DAgger Training: Learn motion tracking in controlled conditions
- Transfer to Environment: Apply learned skills to full physics simulation
- Gradual Complexity: Add terrain, disturbances, and real-world challenges
- Continuous Refinement: Iterate between DAgger and environment training
# Training sequence for heavy robots
training_sequence = [
"Isaac-Velocity-Dagger-YourRobot-v0", # Phase 1: DAgger
"Isaac-Velocity-Flat-YourRobot-Reference-v0", # Phase 2: Flat + IL
"Isaac-Velocity-Rough-YourRobot-Reference-v0" # Phase 3: Full challenge
]
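A minimal sketch of how this sequence could be chained is shown below, assuming a hypothetical `train_task` entry point that trains one task ID and returns its best checkpoint; substitute your actual training script or CLI.
# Sketch: run the phases in order, resuming each from the previous checkpoint
def run_training_sequence(training_sequence, train_task):
    checkpoint = None
    for task_id in training_sequence:
        # `train_task` is a hypothetical stand-in for your training entry point
        checkpoint = train_task(task_id, resume_from=checkpoint)
    return checkpoint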
⚖️ Physics-Assisted Learning
🪄 External Force Assistance for Heavy Robots
For heavier robots, implement a curriculum-based external force system that acts like a "training harness" - providing upward assistance that gradually reduces as the robot learns to support itself.
🔧 Configuration Principles:
# Example: Adaptive external force configuration
external_force_config = {
"max_force": robot_weight * 0.3, # Start with 30% weight assistance
"force_decay_rate": 0.995, # Gradual reduction per episode
"min_force": robot_weight * 0.05, # Minimum assistance (5%)
"spring_like_behavior": True, # Natural, responsive assistance
"apply_offset_range": 0.2, # Height-based activation
}
🎯 Implementation Guidelines:
- Force Magnitude: Start with 20-40% of robot weight
- Decay Schedule: Reduce by 0.5-1% per successful episode
- Activation Conditions: Apply when robot drops below normal height
- Real Robot Validation: Test force levels on physical hardware if available
- Spring Dynamics: Use natural spring-like forces, not constant assistance
- Don't Overdo It: Excessive assistance creates dependency and poor real-world transfer
- Gradual Reduction: Sudden force removal can cause catastrophic failures
- Height-Based Triggering: Only activate when robot actually needs help
- Natural Physics: Maintain realistic dynamics even with assistance
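The sketch below ties these guidelines to the configuration above: a spring-like assistance force that only engages when the base sags below its nominal height and whose ceiling decays each episode. The function name and the scalar, per-robot output are illustrative assumptions, not part of the GBC API.
# Sketch: curriculum assistance force (spring-like, height-triggered, decaying)
import numpy as np

def assistance_force(base_height, nominal_height, episode_count, config):
    # Decay the force ceiling toward the minimum assistance level
    max_force = max(
        config["min_force"],
        config["max_force"] * config["force_decay_rate"] ** episode_count,
    )
    sag = nominal_height - base_height
    if sag <= 0.0:
        return 0.0  # Height-based triggering: no help while standing tall
    # Spring-like response that saturates at the current ceiling
    stiffness = max_force / config["apply_offset_range"]
    return float(np.clip(stiffness * sag, 0.0, max_force))

# Example: assistance_force(0.65, 0.78, episode_count=200, config=external_force_config)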
🎁 Reward Function Design Principles
🎯 Simplicity and Precision Over Complexity
"Reward functions should be precise, not numerous" - Each reward should serve a specific, measurable purpose. Quality trumps quantity every time.
🔧 Design Process:
- Start Minimal: Begin with 3-5 essential rewards
- Add Incrementally: Introduce one new reward at a time
- Measure Impact: Quantify the effect of each addition
- Remove Ineffective: Delete rewards that don't improve performance
- Iterate Carefully: Change only one parameter at a time
# Priority order for reward implementation
reward_priority = [
"survival_rewards", # 1. Keep robot alive and stable
"task_rewards", # 2. Primary objective (velocity tracking)
"efficiency_rewards", # 3. Energy and smoothness
"style_rewards", # 4. Motion quality and naturalness
"auxiliary_rewards" # 5. Secondary objectives
]
📊 Testing Protocol:
# Systematic reward testing approach
def test_new_reward(base_config, new_reward, threshold=0.0, test_episodes=1000):
    """
    Test the impact of adding a new reward component
    """
    baseline_performance = train_and_evaluate(base_config, test_episodes)
    enhanced_config = base_config.copy()
    enhanced_config.rewards[new_reward.name] = new_reward
    enhanced_performance = train_and_evaluate(enhanced_config, test_episodes)
    improvement = enhanced_performance - baseline_performance
    if improvement > threshold:
        return "keep_reward"
    else:
        return "discard_reward"
📈 Curriculum Learning Integration
Implement curriculum learning for rewards to progressively increase task difficulty as the robot improves.
🎯 Curriculum Examples:
# Tracking precision curriculum
tracking_curriculum = {
"initial_sigma": 1.0, # Lenient early tracking
"final_sigma": 0.1, # Precise late tracking
"decay_schedule": "exponential",
"performance_threshold": 0.8 # Trigger for progression
}
# Contact timing curriculum
contact_curriculum = {
"initial_tolerance": 0.2, # 20% timing tolerance
"final_tolerance": 0.05, # 5% precision requirement
"adaptation_rate": "adaptive" # Based on success rate
}
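One possible way to drive the tracking curriculum above is sketched below: sigma tightens exponentially, but only after the policy clears the performance threshold. The 0.99 step factor is an illustrative choice, not a GBC default.
# Sketch: tighten tracking sigma only after performance clears the threshold
def update_tracking_sigma(sigma, curriculum, recent_tracking_score):
    if recent_tracking_score < curriculum["performance_threshold"]:
        return sigma  # Hold the current, more lenient tolerance
    # Exponential step toward the strictest tolerance (0.99 is an assumed factor)
    return max(curriculum["final_sigma"], sigma * 0.99)

# Usage: start lenient, tighten as the policy improves
sigma = tracking_curriculum["initial_sigma"]
sigma = update_tracking_sigma(sigma, tracking_curriculum, recent_tracking_score=0.85)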
👁️ Observation Design Strategy
🧠 Policy vs Critic Observation Distribution
"Keep Policy Simple, Make Critic Informed" - Give complex, detailed observations to the critic while keeping policy observations clean and essential.
📊 Distribution Strategy:
# Optimal observation distribution
policy_observations = [
"base_lin_vel", # Essential for control
"base_ang_vel",
"projected_gravity",
"joint_pos",
"joint_vel",
"foot_contacts_continuous", # Processed to continuous values
"command_velocity"
]
critic_observations = [
*policy_observations, # Include all policy obs
"terrain_height_scan", # Complex environmental info
"robot_height_map",
"previous_actions_history",
"detailed_contact_forces",
"joint_torques",
"imu_raw_data"
]
🌊 Continuous Value Processing
Neural networks learn more effectively from continuous signals. Convert discrete states to continuous representations whenever possible.
🔧 Continuous Conversion Examples:
import numpy as np
import torch

# Foot contact: 0/1 → sin/cos representation
def discrete_to_continuous_contact(contact_binary):
    """Convert binary contact to continuous phase signal"""
    phase = contact_binary * np.pi  # Map to phase space
    return np.stack([np.sin(phase), np.cos(phase)], axis=-1)
# Joint limits: clipped → normalized sigmoid
def joint_limit_normalization(joint_pos, joint_limits):
"""Smooth joint limit representation"""
normalized = (joint_pos - joint_limits.min) / (joint_limits.max - joint_limits.min)
return torch.sigmoid(6 * (normalized - 0.5)) # Smooth S-curve
# Gait phase: discrete → continuous
def gait_phase_encoding(phase_discrete, num_phases=4):
"""Convert discrete gait phase to smooth encoding"""
phase_rad = phase_discrete * 2 * np.pi / num_phases
return np.array([np.sin(phase_rad), np.cos(phase_rad)])
🎯 Preprocessing Pipeline:
# Comprehensive observation preprocessing
preprocessing_pipeline = [
"normalization", # Scale to standard ranges
"continuization", # Convert discrete to continuous
"history_stacking", # Add temporal context
"noise_filtering", # Remove sensor noise
"outlier_clipping" # Handle extreme values
]
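A minimal sketch of such a pipeline follows, assuming NumPy observations; the running-statistics normalization, 4-step history, and clip value are illustrative defaults rather than GBC settings (continuization is handled by the conversion functions above, and noise filtering is omitted for brevity).
# Sketch: normalize with running statistics, clip outliers, stack history
import collections
import numpy as np

class ObservationPreprocessor:
    def __init__(self, obs_dim, history_length=4, clip_value=10.0):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = 1e-4
        self.history = collections.deque(maxlen=history_length)
        self.clip_value = clip_value

    def __call__(self, obs):
        # Update running mean/variance with a streaming (Welford-style) estimate
        self.count += 1
        delta = obs - self.mean
        self.mean = self.mean + delta / self.count
        self.var = self.var + (delta * (obs - self.mean) - self.var) / self.count
        normalized = (obs - self.mean) / np.sqrt(self.var + 1e-8)
        # Clip extreme values, then stack recent frames for temporal context
        clipped = np.clip(normalized, -self.clip_value, self.clip_value)
        if not self.history:
            self.history.extend([clipped] * self.history.maxlen)  # Pad on first call
        else:
            self.history.append(clipped)
        return np.concatenate(list(self.history))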
🎭 Advanced Learning Techniques
🤖 Adversarial Motion Prior (AMP) Integration
For complex, lifelike motions, integrate AMP to learn from motion capture data while maintaining reactive control capabilities.
🔧 AMP Setup Requirements:
# Observation matching for AMP
amp_observation_pairs = {
# Policy obs → Reference obs mapping (must match exactly!)
"base_lin_vel": "ref_base_lin_vel",
"base_ang_vel": "ref_base_ang_vel",
"projected_gravity": "ref_projected_gravity",
"joint_pos": "ref_joint_pos",
"joint_vel": "ref_joint_vel",
"foot_contacts": "ref_foot_contacts"
}
# Ensure dimensional consistency
def validate_amp_observations(policy_obs, ref_obs):
"""Validate that policy and reference observations match"""
for policy_key, ref_key in amp_observation_pairs.items():
assert policy_obs[policy_key].shape == ref_obs[ref_key].shape, \
f"Dimension mismatch: {policy_key} vs {ref_key}"
return True
🎯 AMP Training Strategy:
- Data Preparation: Ensure high-quality reference motion data
- Observation Alignment: Perfect match between policy and reference observations
- Discriminator Training: Balance between discriminator and generator learning
- Reward Integration: Blend AMP rewards with task-specific rewards
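As a sketch of the reward-integration step, the blend below combines a discriminator-based style reward with the task reward; the `-log(1 - D)` form and the equal weighting are common AMP-style choices used here as assumptions, not GBC defaults.
# Sketch: blend a discriminator-derived style reward with the task reward
import torch

def blend_amp_reward(task_reward, discriminator_logits, style_weight=0.5):
    # D close to 1 means the motion looks like the reference data
    d = torch.sigmoid(discriminator_logits)
    style_reward = -torch.log(torch.clamp(1.0 - d, min=1e-4))
    return (1.0 - style_weight) * task_reward + style_weight * style_reward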
📚 Incremental Skill Development
🎓 Multi-Stage Learning Curriculum
Build complex behaviors by layering skills progressively, starting with fundamental locomotion and adding advanced maneuvers.
🚀 Recommended Learning Progression:
# Skill development timeline
skill_progression = {
"stage_1_foundation": {
"skills": ["standing", "weight_shifting", "basic_walking"],
"duration": "500k steps",
"success_criteria": "stable_locomotion"
},
"stage_2_locomotion": {
"skills": ["walking", "turning", "speed_control"],
"duration": "1M steps",
"checkpoint_save": True # Save foundation checkpoint
},
"stage_3_advanced": {
"skills": ["running", "jumping", "complex_gaits"],
"load_checkpoint": "stage_2_best.pt", # Load as DAgger base
"reward_weights": "adjusted_for_complexity"
},
"stage_4_specialized": {
"skills": ["boxing", "dancing", "manipulation"],
"task_specific_rewards": True,
"incremental_addition": True
}
}
🔧 Implementation Strategy:
# Progressive training implementation
def progressive_training_pipeline():
# Stage 1: Foundation
foundation_policy = train_basic_locomotion()
save_checkpoint(foundation_policy, "foundation.pt")
# Stage 2: Enhanced locomotion
enhanced_policy = train_with_foundation(
base_checkpoint="foundation.pt",
new_skills=["running", "jumping"]
)
save_checkpoint(enhanced_policy, "enhanced.pt")
# Stage 3: Specialized skills
    specialized_policy = train_specialized_tasks(
        base_checkpoint="enhanced.pt",
        tasks=["boxing", "dancing"],
        task_specific_weights=True
    )
    return specialized_policy
🎯 Task-Specific Reward Tuning
Different skills require different reward emphasis. Adjust reward weights based on the specific task being learned.
📊 Task-Specific Configurations:
# Reward weight profiles for different tasks
reward_profiles = {
"walking": {
"velocity_tracking": 2.0,
"stability": 1.5,
"energy_efficiency": 1.0,
"foot_clearance": 0.5
},
"running": {
"velocity_tracking": 3.0,
"air_time": 2.0,
"landing_stability": 1.5,
"energy_efficiency": 0.5 # Less important for running
},
"boxing": {
"arm_tracking": 3.0,
"balance_maintenance": 2.0,
"reaction_speed": 1.5,
"power_generation": 1.0
}
}
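One way to apply these profiles is sketched below, using a hypothetical helper that overlays the task-specific weights on a shared foundation (reusing `rewards_rl_only` from the progressive example earlier).
# Sketch: overlay a task profile on the shared foundation rewards
def apply_reward_profile(base_rewards, profile_name, profiles=reward_profiles):
    rewards = dict(base_rewards)            # Keep the shared foundation
    rewards.update(profiles[profile_name])  # Re-weight for the current task
    return rewards

running_rewards = apply_reward_profile(rewards_rl_only, "running")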
⚙️ Training Optimization
🚀 Environment and Performance Tuning
Balance training effectiveness with computational efficiency. More environments aren't always better.
📊 Environment Scaling Guidelines:
# Recommended environment counts
environment_recommendations = {
"lightweight_quadruped": {
"environments": 4096, # Can handle more parallel envs
"reasoning": "Fast simulation, simple dynamics"
},
"humanoid_robot": {
"environments": 2048, # Sweet spot for most GPUs
"reasoning": "Balance between speed and memory"
},
"heavy_humanoid": {
"environments": 1024, # Conservative for complex dynamics
"reasoning": "Complex physics, more computation per step"
}
}
🔧 Performance Optimization:
# Training acceleration techniques
optimization_config = {
"auto_mixed_precision": True, # NOT amp (Adversarial Motion Prior)
"gradient_accumulation": 2, # For large batch sizes
"environment_reset_distribution": "staggered",
"observation_normalization": "running_mean",
"action_smoothing": "exponential_moving_average"
}
# Memory management
memory_config = {
"replay_buffer_size": "adaptive", # Based on available memory
"observation_history_length": 4, # Balance context vs memory
"checkpoint_frequency": 100, # Regular saves without overhead
"tensorboard_logging_interval": 10 # Reduce logging overhead
}
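The two acceleration settings above can be sketched with PyTorch's torch.cuda.amp (the mixed-precision tool, not the Adversarial Motion Prior); `compute_loss` is passed in as a placeholder for your algorithm's loss, e.g. the PPO surrogate.
# Sketch: mixed-precision update with gradient accumulation
import torch

def train_epoch(model, optimizer, batches, compute_loss, accumulation_steps=2):
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        with torch.cuda.amp.autocast():
            # Divide so accumulated gradients average over the window
            loss = compute_loss(model, batch) / accumulation_steps
        scaler.scale(loss).backward()           # Accumulate scaled gradients
        if (i + 1) % accumulation_steps == 0:
            scaler.step(optimizer)              # Unscale and apply the update
            scaler.update()
            optimizer.zero_grad()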
📈 Curriculum Learning Best Practices
Implement adaptive curriculum learning that responds to robot performance rather than following fixed schedules.
🎯 Adaptive Curriculum Implementation:
# Performance-based curriculum adaptation
class AdaptiveCurriculum:
def __init__(self, initial_difficulty=0.1, target_success_rate=0.8):
self.difficulty = initial_difficulty
self.target_success_rate = target_success_rate
self.success_history = []
    def update_difficulty(self, recent_success_rate):
        """Adapt difficulty based on performance"""
        self.success_history.append(recent_success_rate)
        if recent_success_rate > self.target_success_rate + 0.1:
            self.difficulty = min(1.0, self.difficulty * 1.05)  # Increase difficulty
        elif recent_success_rate < self.target_success_rate - 0.1:
            self.difficulty = max(0.1, self.difficulty * 0.95)  # Decrease difficulty
        return self.difficulty
# Example: Terrain difficulty curriculum
terrain_curriculum = AdaptiveCurriculum()
current_terrain_difficulty = terrain_curriculum.update_difficulty(success_rate)
🎯 Implementation Checklist
✅ Training Progression Checklist
Ensure you have all components properly configured:
🏗️ Foundation Setup:
- Robot configuration validated and tested
- Environment variants created (flat, rough, DAgger)
- Agent configurations prepared (standard and reference)
- Reward functions designed and tested individually
🎮 Progressive Training Plan:
- Phase 1: Pure RL training strategy defined
- Phase 2: IL integration plan prepared
- Phase 3: Advanced skills roadmap created
- Checkpoint saving and loading tested
⚙️ Technical Configuration:
- Observation preprocessing validated
- AMP observation matching verified (if using)
- Curriculum learning parameters tuned
- Performance monitoring setup complete
🚀 Optimization Ready:
- Environment count optimized for hardware
- Mixed precision training configured
- Memory usage profiled and optimized
- Backup and recovery procedures tested
🔧 Troubleshooting Quick Reference
🚫 Training Not Converging:
- Reduce reward complexity
- Check observation normalization
- Verify environment stability
- Lower learning rates
🤖 Robot Falling/Unstable:
- Add physics modifier assistance
- Increase stability rewards
- Check initial pose configuration
- Verify joint limits
📊 Poor Sample Efficiency:
- Implement curriculum learning
- Use better exploration strategies
- Check reward signal quality
- Consider AMP integration
💾 Memory Issues:
- Reduce environment count
- Enable mixed precision training
- Optimize observation history length
- Use gradient accumulation
🎪 Final Words: Mastering the Art of Robot Training
Robot training is both an art and a science. These principles provide your foundation, but remember that each robot, each task, and each environment presents unique challenges and opportunities.
🔑 Key Takeaways:
- Start Simple: Build robust foundations before adding complexity
- Be Patient: Great policies take time to develop
- Iterate Thoughtfully: Change one thing at a time and measure impact
- Trust the Process: Follow proven progressions but adapt to your specific needs
🚀 The Path Forward: With these principles as your guide, you're equipped to train robots that don't just move—they move with purpose, grace, and intelligence. Whether you're creating the next generation of service robots, exploring new frontiers, or pushing the boundaries of what's possible, you now have the tools and knowledge to succeed.
The future of robotics awaits your contribution! 🤖✨
Happy Training, and may your robots learn swiftly and move beautifully! 🎯🚀