pose_transformer
Module: GBC.utils.data_preparation.pose_transformer
This module provides neural network architectures for mapping SMPL pose representations to humanoid robot joint actions. The core PoseTransformer class implements a Vision Transformer (ViT)-inspired architecture that, in the project's comparisons, outperforms more conventional networks on pose-to-action translation.
📚 Dependencies
This module requires the following Python packages:
- torch - PyTorch framework for neural networks
- torch.nn - Neural network layers and functions
- torch.nn.functional - Functional interface for operations
- typing - Type annotations and hints
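The standard import forms of these dependencies, as used throughout the examples on this page:
import torch
import torch.nn as nn
import torch.nn.functional as F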
🏗️ Core Architecture
🤖 PoseTransformer
Module Name: GBC.utils.data_preparation.pose_transformer.PoseTransformer
Definition:
class PoseTransformer(nn.Module):
    def __init__(self, 
                 load_hands: bool = False,
                 num_actions: int = 29,
                 embedding_dim: int = 64,
                 joint_embedding_dim: int = 256,
                 num_heads: int = 4,
                 num_layers: int = 4
                ):
📥 Input Parameters:
- load_hands (bool): Whether to include hand joints (21 vs. 51 joints). Default: False
- num_actions (int): Number of output robot DOF actions. Default: 29
- embedding_dim (int): Dimension of feature embeddings. Default: 64
- joint_embedding_dim (int): Dimension of the joint space embedding. Default: 256
- num_heads (int): Number of attention heads in the transformer. Default: 4
- num_layers (int): Number of transformer encoder layers. Default: 4
🔧 Functionality: Neural network for converting SMPL pose parameters (angle-axis representation) to humanoid robot joint actions through inverse kinematics learning.
🧠 Design Philosophy - ViT-Inspired Architecture
🎯 Core Innovation: Joint Tokenization Strategy
The PoseTransformer adapts ideas from Vision Transformers (ViT), treating pose conversion as a sequence modeling problem over joint tokens:
🔬 Key Design Principle: Instead of treating pose data as a monolithic vector, the model partitions joint motions into discrete tokens and projects them into a high-dimensional latent space to extract intrinsic joint relationships and movement patterns.
📊 Architecture Components
1. Joint Space Tokenization
x = x.view(N, self.num_joints, self.input_dim)  # Reshape to joint tokens
x = x.permute(0, 2, 1)                         # Prepare for embedding
x = self.joint_embedding(x)                    # Project to joint space
🔧 Functionality:
- Input Reshaping: (N, num_joints * 3) → (N, num_joints, 3)
- Joint Tokenization: the joint axis becomes the token axis, with each joint's 3D angle-axis as its features
- Spatial Projection: the joint axis is linearly projected from num_joints to joint_embedding_dim, producing a longer sequence of joint-space tokens (see the shape walk-through below)
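A standalone shape walk-through of this step for the body-only configuration (random placeholder tensors):
import torch

N = 8
pose = torch.randn(N, 21 * 3)            # flat angle-axis input, 21 body joints
tokens = pose.view(N, 21, 3)             # (N, num_joints, 3): one 3D token per joint
tokens = tokens.permute(0, 2, 1)         # (N, 3, num_joints): joint axis last, ready for the joint projection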
2. Feature Embedding & Positional Encoding
x = self.embedding(x)              # Map to embedding dimension
x += self.positional_encoding      # Add learnable position information
🔧 Functionality:
- Feature Mapping: Project joint tokens to transformer-compatible dimensions
- Positional Awareness: Learnable encoding preserves token ordering and relationships (a minimal illustration follows this list)
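A self-contained illustration of the learnable positional encoding; its shape follows from the (N, joint_embedding_dim, embedding_dim) transformer input described here, while the zero initialization is an assumption:
import torch
import torch.nn as nn

joint_embedding_dim, embedding_dim = 256, 64
tokens = torch.randn(8, joint_embedding_dim, embedding_dim)   # embedded joint-space tokens
positional_encoding = nn.Parameter(torch.zeros(1, joint_embedding_dim, embedding_dim))
tokens = tokens + positional_encoding                         # broadcasts over the batch dimension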
3. Transformer Processing
x = self.transformer(x)            # Multi-head self-attention processing
x = x.mean(dim=1)                 # Global feature aggregation
🔧 Functionality:
- Self-Attention: Captures complex inter-joint dependencies
- Multi-Head Processing: Parallel attention patterns for different motion aspects
- Global Aggregation: Mean pooling for final pose representation
4. Action Decoding
x = self.decoder(x)               # Map to robot action space
🔧 Functionality:
- Dimensionality Mapping: Transform global features to robot DOF actions
- Action Space Conversion: SMPL angle-axis → Robot joint commands (a minimal linear decoding head is sketched below)
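In the simplest case the decoder is a single linear head over the pooled feature (illustrative sketch; the actual decoder may be deeper):
import torch
import torch.nn as nn

embedding_dim, num_actions = 64, 29
decoder = nn.Linear(embedding_dim, num_actions)
pooled = torch.randn(8, embedding_dim)   # mean-pooled global pose feature
actions = decoder(pooled)                # (8, 29) robot joint commands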
⚡ Superior Performance Analysis
🏆 Benchmark Comparisons
In the project's comparisons, the ViT-inspired joint tokenization approach consistently outperforms more traditional architectures:
📈 Performance Hierarchy (Best to Worst):
- 🥇 PoseTransformer (ViT-inspired) - Joint tokenization with transformer processing
- 🥈 Standard Transformers - CLS token-based prediction transformers
- 🥉 Multi-Layer Perceptrons (MLPs) - Deep feedforward networks
- 🏃 Convolutional Neural Networks (CNNs) - Spatial feature extraction approaches
- 📉 Other Network Structures - Alternative pose processing architectures
🔬 Why ViT Design Excels
✨ Joint Relationship Modeling
Traditional Approach Problems:
- Treats pose as flat vector, losing joint structure
- Limited capacity for modeling complex joint interactions
- Poor generalization to novel pose configurations
ViT-Inspired Solution:
- Preserves joint structure through tokenization
- Captures complex dependencies via self-attention
- Learns intrinsic movement patterns across joint sequences
🧩 Attention Mechanism Advantages
# Multi-head attention captures diverse joint relationships
encoder = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=num_heads)
self.transformer = nn.TransformerEncoder(encoder, num_layers=num_layers)
🔧 Benefits:
- Parallel Processing: Multiple attention heads capture different interaction patterns
- Long-Range Dependencies: Direct connections between any joint pairs
- Adaptive Focus: Dynamic attention weights based on pose context
🎯 Latent Space Projection Benefits
self.joint_embedding = nn.Linear(self.num_joints, joint_embedding_dim)
self.embedding = nn.Linear(self.input_dim, embedding_dim)
🔧 Advantages:
- Rich Representation: High-dimensional space captures subtle joint features
- Learned Features: Automatic discovery of optimal joint representations
- Generalization: Better adaptation to unseen pose variations
🔧 Model Architecture Details
📊 Input/Output Specifications
Input Shape: (N, num_joints * 3)
- N: Batch size
- num_joints: 21 (body only) or 51 (with hands)
- 3: Angle-axis representation (3D rotation vector per joint)
Output Shape: (N, num_actions)
- N: Batch size
- num_actions: Robot DOF count (typically 29 for humanoids); a quick shape check follows below
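A quick shape check against these specifications, assuming the class is importable as documented:
import torch
from GBC.utils.data_preparation.pose_transformer import PoseTransformer

body_model = PoseTransformer(load_hands=False, num_actions=29)
full_model = PoseTransformer(load_hands=True, num_actions=29)

assert body_model(torch.randn(8, 21 * 3)).shape == (8, 29)   # body-only input
assert full_model(torch.randn(8, 51 * 3)).shape == (8, 29)   # body + hands input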
🔄 Forward Pass Pipeline
def forward(self, x: torch.Tensor) -> torch.Tensor:
    N = x.shape[0]

    # 1. Joint Tokenization: (N, num_joints * 3) -> (N, 3, num_joints)
    x = x.view(N, self.num_joints, self.input_dim)
    x = x.permute(0, 2, 1)
    x = self.joint_embedding(x)        # (N, 3, joint_embedding_dim)

    # 2. Feature Embedding + Positional Encoding
    x = x.permute(0, 2, 1)             # (N, joint_embedding_dim, 3)
    x = self.embedding(x)              # (N, joint_embedding_dim, embedding_dim)
    x += self.positional_encoding

    # 3. Transformer Processing
    x = self.transformer(x)            # self-attention over joint-space tokens

    # 4. Global Aggregation + Decoding
    x = x.mean(dim=1)                  # (N, embedding_dim)
    x = self.decoder(x)                # (N, num_actions)
    return x
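For reference, the pieces above can be assembled into a minimal, runnable sketch. Layer names follow the documented forward pass; the positional-encoding initialization, the batch_first setting, and the single-layer decoder are assumptions and may differ from the actual GBC implementation:
import torch
import torch.nn as nn

class PoseTransformerSketch(nn.Module):
    """Minimal re-implementation sketch assembled from the documented snippets."""

    def __init__(self, load_hands=False, num_actions=29, embedding_dim=64,
                 joint_embedding_dim=256, num_heads=4, num_layers=4):
        super().__init__()
        self.num_joints = 51 if load_hands else 21
        self.input_dim = 3                                      # angle-axis per joint
        self.joint_embedding = nn.Linear(self.num_joints, joint_embedding_dim)
        self.embedding = nn.Linear(self.input_dim, embedding_dim)
        self.positional_encoding = nn.Parameter(                # zero init is an assumption
            torch.zeros(1, joint_embedding_dim, embedding_dim))
        encoder = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder, num_layers=num_layers)
        self.decoder = nn.Linear(embedding_dim, num_actions)

    def forward(self, x):
        N = x.shape[0]
        x = x.view(N, self.num_joints, self.input_dim).permute(0, 2, 1)
        x = self.joint_embedding(x).permute(0, 2, 1)            # (N, joint_embedding_dim, 3)
        x = self.embedding(x) + self.positional_encoding        # (N, joint_embedding_dim, embedding_dim)
        x = self.transformer(x)
        return self.decoder(x.mean(dim=1))                      # (N, num_actions)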
⚙️ Configuration Parameters
Architecture Scaling:
- Small Model: embedding_dim=64, num_heads=4, num_layers=4
- Medium Model: embedding_dim=128, num_heads=8, num_layers=6
- Large Model: embedding_dim=256, num_heads=16, num_layers=8
Joint Configuration:
- Body Only: load_hands=False (21 joints)
- Full Body: load_hands=True (51 joints including fingers)
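These scaling presets can be collected into a small dictionary and expanded into the constructor (PRESETS is an illustrative helper, not part of the module):
from GBC.utils.data_preparation.pose_transformer import PoseTransformer

PRESETS = {
    "small":  dict(embedding_dim=64,  num_heads=4,  num_layers=4),
    "medium": dict(embedding_dim=128, num_heads=8,  num_layers=6),
    "large":  dict(embedding_dim=256, num_heads=16, num_layers=8),
}

model = PoseTransformer(load_hands=False, num_actions=29, **PRESETS["medium"])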
💡 Usage Examples
🚀 Basic Model Creation
from GBC.utils.data_preparation.pose_transformer import PoseTransformer
# Create transformer for body-only poses
model = PoseTransformer(
    load_hands=False,      # Body joints only (21 joints)
    num_actions=29,        # Target robot DOF count
    embedding_dim=64,      # Feature dimension
    joint_embedding_dim=256, # Joint space dimension
    num_heads=4,           # Attention heads
    num_layers=4           # Transformer layers
)
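A quick sanity check on the freshly created model (random input, inference mode):
import torch

model.eval()
with torch.no_grad():
    dummy_pose = torch.randn(1, 21 * 3)   # one body-only angle-axis pose
    actions = model(dummy_pose)           # -> (1, 29) robot joint targets
print(actions.shape)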
🔧 Advanced Configuration
# High-capacity model for complex robots
advanced_model = PoseTransformer(
    load_hands=True,       # Include hand joints (51 total)
    num_actions=42,        # More complex robot (arms + legs + torso)
    embedding_dim=128,     # Higher feature dimension
    joint_embedding_dim=512, # Richer joint representation
    num_heads=8,           # More attention patterns
    num_layers=6           # Deeper processing
)
💾 Model Persistence
# Save trained model
torch.save({
    'model_state_dict': model.state_dict(),
    'config': {
        'load_hands': False,
        'num_actions': 29,
        'embedding_dim': 64,
        # ... other config
    }
}, 'pose_transformer.pt')
# Load model
checkpoint = torch.load('pose_transformer.pt')
model = PoseTransformer(**checkpoint['config'])
model.load_state_dict(checkpoint['model_state_dict'])
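When restoring on a CPU-only machine, PyTorch's standard map_location argument applies; switch to eval mode before running inference:
checkpoint = torch.load('pose_transformer.pt', map_location='cpu')
model = PoseTransformer(**checkpoint['config'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()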
🔍 Technical Insights
🧠 Why Joint Tokenization Works
Traditional Vector Approach:
# Flattened pose vector loses structural information
batch_size = 8
mlp = nn.Sequential(nn.Linear(63, 256), nn.ReLU(), nn.Linear(256, 29))  # illustrative baseline
pose_vector = torch.randn(batch_size, 63)   # 21 joints * 3 dims
output = mlp(pose_vector)                   # limited joint interaction modeling
ViT-Inspired Tokenization:
# Preserve joint structure for better modeling
pose_tokens = pose_vector.view(batch_size, 21, 3)  # one 3D token per joint
output = transformer(pose_tokens)                  # token-based model: rich joint interaction modeling
                                                   # (PoseTransformer applies this reshape internally)
📈 Performance Scaling
Model Capacity vs. Performance:
- Embedding Dimension: Linear performance gains up to 256-512
- Attention Heads: Optimal range 4-16 depending on complexity
- Layer Depth: Diminishing returns after 6-8 layers for most tasks
🎯 Attention Pattern Analysis
The transformer learns meaningful attention patterns (one way to inspect them is sketched after this list):
- Kinematic Chains: Strong attention between connected joints
- Symmetry: Left-right joint correspondence
- Coordination: Complex multi-joint movement patterns
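One possible way to inspect these patterns is to repeat the documented embedding steps and query the first encoder layer's attention weights directly. This sketch assumes the attribute names from the forward pass above, PyTorch's standard TransformerEncoder internals, and batch-first layers; it also skips any normalization the real layer may apply before attention, so treat the result as approximate:
import torch

@torch.no_grad()
def first_layer_attention(model, pose):
    """Return (N, tokens, tokens) averaged attention weights from the first encoder layer."""
    N = pose.shape[0]
    x = pose.view(N, model.num_joints, model.input_dim).permute(0, 2, 1)
    x = model.joint_embedding(x).permute(0, 2, 1)
    x = model.embedding(x) + model.positional_encoding
    attn = model.transformer.layers[0].self_attn
    _, weights = attn(x, x, x, need_weights=True, average_attn_weights=True)
    return weights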
This ViT-inspired design offers a strong approach to pose-to-action translation, combining joint tokenization with transformer-based relationship modeling to achieve high accuracy.