amass_loader

Module: GBC.utils.data_preparation.amass_loader

This module provides a comprehensive suite of PyTorch Dataset classes for loading and processing AMASS (Archive of Motion Capture as Surface Shapes) motion capture data. The module offers multiple specialized loaders optimized for different GBC framework tasks, including sequence-based processing, frame rate interpolation, single-frame training, and high-performance optimized loading with caching and parallel processing capabilities.

📚 Dependencies

This module requires the following Python packages:

  • torch - PyTorch framework for dataset handling
  • numpy - Numerical computing for motion data processing
  • glob - File pattern matching for dataset discovery
  • pickle - Data serialization for caching
  • multiprocessing - Parallel processing for dataset indexing
  • tqdm - Progress bars for long-running operations
  • functools - LRU cache implementation
  • GBC.utils.base.math_utils - Mathematical utilities for interpolation
  • GBC.utils.data_preparation.data_preparation_cfg - Configuration management

🎯 Dataset Classes Overview

📊 AMASSDataset (Sequence Loader)

Module Name: GBC.utils.data_preparation.amass_loader.AMASSDataset

Definition:

class AMASSDataset(Dataset):
    def __init__(self,
                 root_dir: str,
                 num_betas: int,
                 num_dmpls: int,
                 load_hands: bool = False,
                 secondary_dir: Optional[str] = None
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • num_betas (int): Number of body shape parameters (typically 16)
  • num_dmpls (int): Number of DMPL soft-tissue dynamics parameters (typically 8)
  • load_hands (bool): Include hand pose data (default: False, loads only body)
  • secondary_dir (Optional[str]): Specific subdirectory to load (e.g., "ACCAD/Female1Walking_c3d")

🔧 Functionality: Loads complete motion sequences from AMASS dataset, preserving original frame rates and sequence structure. Ideal for sequence-based analysis and validation tasks.

📤 Output Format:

{
    "poses": torch.Tensor,  # [num_frames, 66] (body only) or [num_frames, 156] (with hands)
    "betas": torch.Tensor,  # [num_betas] - body shape parameters
    "fps":   torch.Tensor,  # [1] - original mocap frame rate
    "trans": torch.Tensor   # [num_frames, 3] - root translation
}

💡 Usage Example:

from GBC.utils.data_preparation.amass_loader import AMASSDataset
from torch.utils.data import DataLoader

# Initialize dataset for sequence loading
dataset = AMASSDataset(
    root_dir="/path/to/AMASS",
    num_betas=16,
    num_dmpls=8,
    load_hands=False,
    secondary_dir="ACCAD/Female1Walking_c3d"
)

# Create dataloader for batch processing
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch in dataloader:
    poses = batch['poses']  # [batch_size, num_frames, 66]
    fps = batch['fps']      # [batch_size]
    print(f"Loaded {len(poses)} sequences with {poses.shape[1]} frames each")

🔄 AMASSDatasetInterpolate (Frame Rate Standardization)

Module Name: GBC.utils.data_preparation.amass_loader.AMASSDatasetInterpolate

Definition:

class AMASSDatasetInterpolate(Dataset):
    def __init__(self,
                 root_dir: str,
                 num_betas: int,
                 num_dmpls: int,
                 load_hands: bool = False,
                 specialize_dir: Optional[Union[str, List[str]]] = None,
                 interpolate_fps: int = 1000,
                 seq_min_duration: Optional[int] = None
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • num_betas (int): Number of body shape parameters
  • num_dmpls (int): Number of DMPL soft-tissue dynamics parameters
  • load_hands (bool): Include hand pose data
  • specialize_dir (Union[str, List[str]]): Single directory or list of directories to process
  • interpolate_fps (int): Target frame rate for interpolation (default: 1000Hz)
  • seq_min_duration (Optional[int]): Minimum sequence duration in seconds (filters short sequences)

🔧 Functionality: Interpolates motion sequences to a standardized frame rate using advanced mathematical interpolation for both pose (angle-axis) and translation data. Essential for consistent temporal resolution across different mocap sources.

⚙️ Interpolation Process:

# Pose interpolation (angle-axis representation)
pose_interpolated = interpolate_angle_axis(pose, target_fps=interpolate_fps, source_fps=original_fps)

# Translation interpolation (linear)
trans_interpolated = interpolate_trans(trans, target_fps=interpolate_fps, source_fps=original_fps)
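
The actual interpolation routines live in GBC.utils.base.math_utils and are not reproduced here. As a mental model, angle-axis interpolation is typically done by converting each joint rotation to a quaternion and interpolating on the unit sphere rather than interpolating the raw axis-angle vectors. Below is a minimal sketch of that idea for two neighboring frames and a blend weight t; the helper name and the nlerp shortcut are illustrative assumptions, not the real math_utils API:

import torch
import torch.nn.functional as F

def interp_angle_axis_sketch(aa0: torch.Tensor, aa1: torch.Tensor, t: float) -> torch.Tensor:
    """Interpolate two angle-axis rotations [3] via normalized quaternion lerp (nlerp)."""
    def to_quat(aa):
        angle = aa.norm().clamp(min=1e-8)
        axis = aa / angle
        return torch.cat([torch.cos(angle / 2).view(1), axis * torch.sin(angle / 2)])

    q0, q1 = to_quat(aa0), to_quat(aa1)
    if torch.dot(q0, q1) < 0:
        q1 = -q1  # take the short path on the quaternion sphere
    q = F.normalize((1 - t) * q0 + t * q1, dim=0)

    # Convert back to angle-axis
    w, xyz = q[0].clamp(-1, 1), q[1:]
    angle = 2 * torch.acos(w)
    sin_half = torch.sqrt((1 - w * w).clamp(min=1e-8))
    return xyz / sin_half * angle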

📤 Output Format:

{
    "poses": torch.Tensor,  # [interpolated_frames, 66] at target FPS
    "betas": torch.Tensor,  # [num_betas] - body shape parameters
    "fps":   torch.Tensor,  # [1] - original mocap frame rate
    "trans": torch.Tensor,  # [interpolated_frames, 3] at target FPS
    "title": str,           # Motion sequence name
    "fpath": str            # Relative file path
}

💡 Usage Example:

# High-frequency interpolation for precise motion analysis
dataset = AMASSDatasetInterpolate(
    root_dir="/path/to/AMASS",
    num_betas=16,
    num_dmpls=8,
    specialize_dir=["ACCAD/Female1Walking_c3d", "BMLmovi/Subject_1_F_MoSh"],
    interpolate_fps=120,  # 120Hz for smooth robot control
    seq_min_duration=2    # Only sequences longer than 2 seconds
)

for data in dataset:
    poses = data['poses']  # Interpolated to 120Hz
    title = data['title']  # Motion name for identification
    print(f"Motion '{title}': {poses.shape[0]} frames at 120Hz")

🎯 AMASSDatasetSingleFrame (Training Loader)

Module Name: GBC.utils.data_preparation.amass_loader.AMASSDatasetSingleFrame

Definition:

class AMASSDatasetSingleFrame(Dataset):
    def __init__(self,
                 root_dir: str,
                 transform: Optional[callable] = None,
                 load_hands: bool = False,
                 secondary_dir: str = "ACCAD",
                 sample_steps: int = 1
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • transform (Optional[callable]): Data augmentation transform function
  • load_hands (bool): Include hand pose data
  • secondary_dir (str): Specific dataset subdirectory (default: "ACCAD")
  • sample_steps (int): Frame sampling interval (1=every frame, 2=every other frame)

🔧 Functionality: Provides individual pose frames for training neural networks. Implements basic caching for frequently accessed files and automatic cache management to prevent memory overflow.

🧠 Caching Strategy:

# Automatic file caching with progress tracking
self.cached_data = {} # LRU-style cache for .npz files
self.files_progress = {} # Track frame access per file

# Cache management during data access
if file_path not in self.cached_data:
    data = np.load(file_path)
    self.cached_data[file_path] = data

# Automatic cache cleanup when file is fully processed
if self.files_progress[file_path] >= data['poses'].shape[0]:
    del self.cached_data[file_path]
    del self.files_progress[file_path]

📤 Output Format:

torch.Tensor  # [66] or [156] - single pose frame

💡 Usage Example:

# Single-frame loader for pose transformer training
dataset = AMASSDatasetSingleFrame(
    root_dir="/path/to/AMASS",
    secondary_dir="ACCAD",
    sample_steps=5,  # Sample every 5th frame
    transform=None   # No augmentation
)

# Training dataloader
train_loader = DataLoader(
    dataset,
    batch_size=512,  # Large batches for single frames
    shuffle=True,
    num_workers=8
)

for poses in train_loader:
    # poses: [batch_size, 66] - ready for model training
    predictions = model(poses)

⚡ OptimizedAMASSDatasetSingleFrame (High-Performance Loader)

Module Name: GBC.utils.data_preparation.amass_loader.OptimizedAMASSDatasetSingleFrame

Definition:

class OptimizedAMASSDatasetSingleFrame(Dataset):
    def __init__(self,
                 root_dir: str,
                 load_hands: bool = False,
                 secondary_dir: str = "ACCAD",
                 sample_steps: int = 1,
                 transform: Optional[callable] = None,
                 cache_size: int = 128
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • load_hands (bool): Include hand pose data
  • secondary_dir (str): Dataset subdirectory to process
  • sample_steps (int): Frame sampling interval
  • transform (Optional[callable]): Data augmentation function
  • cache_size (int): Number of .npz files to keep in LRU cache

🔧 Advanced Features:

1. Parallel Dataset Indexing

def _build_index_parallel(self, secondary_dir):
    """One-time parallel indexing for fast startup"""
    pattern = os.path.join(self.root_dir, secondary_dir, "*", "*.npz")
    file_paths = glob.glob(pattern)

    # Use all CPU cores for maximum speed
    num_processes = multiprocessing.cpu_count()

    metadata = []
    with Pool(processes=num_processes) as pool:
        results = pool.imap_unordered(_read_metadata_worker, file_paths)
        for result in tqdm(results, total=len(file_paths), desc="Building index"):
            if result is not None:
                metadata.append(result)

    return metadata
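
The module-level _read_metadata_worker is referenced above but not shown. A plausible minimal sketch, assuming it only needs a per-file frame count for the index (the real worker may record additional fields):

import numpy as np

def _read_metadata_worker(file_path):
    """Sketch: return (file_path, num_frames), or None so the caller can skip the file."""
    try:
        with np.load(file_path) as data:
            if 'poses' not in data:
                return None  # e.g., shape-only .npz files without motion data
            return (file_path, data['poses'].shape[0])
    except Exception:
        return None  # corrupt or unreadable file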

2. Persistent Index Caching

# Automatic index file creation and loading
index_filename = f"amass_index_v2_{secondary_dir}_{'hands' if load_hands else 'nohands'}.pkl"
index_path = os.path.join(self.root_dir, index_filename)

if os.path.exists(index_path):
    # Load pre-computed index (instant startup)
    with open(index_path, 'rb') as f:
        self.file_metadata = pickle.load(f)
else:
    # Build index once and save it for future runs
    self.file_metadata = self._build_index_parallel(secondary_dir)
    with open(index_path, 'wb') as f:
        pickle.dump(self.file_metadata, f)

3. LRU Cache for File Access

# Dynamic LRU cache creation
self._get_data_from_file = lru_cache(maxsize=cache_size)(self._load_file)

def _load_file(self, file_path: str):
    """Cached file loading with automatic memory management"""
    return np.load(file_path)
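
Since the wrapper is a standard functools.lru_cache, its hit statistics can be inspected directly, which helps when tuning cache_size:

# After some iterations, check how well cache_size fits the access pattern
info = dataset._get_data_from_file.cache_info()
print(f"hits={info.hits}, misses={info.misses}, cached={info.currsize}/{info.maxsize}")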

💡 Performance Benefits:

  • First Run: Parallel indexing builds persistent cache for instant future startups
  • Subsequent Runs: Instant dataset initialization using pre-computed index
  • Memory Efficient: LRU cache automatically manages memory usage
  • Scalable: Handles datasets with millions of frames efficiently

💡 Usage Example:

# High-performance loader for large-scale training
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir="/path/to/AMASS",
    secondary_dir="ACCAD",
    cache_size=256,  # Cache up to 256 files
    sample_steps=1   # Use every frame
)

# First run: builds index (one-time cost)
# Subsequent runs: instant startup
print(f"Dataset ready with {len(dataset)} frames")

# High-throughput training
train_loader = DataLoader(
    dataset,
    batch_size=1024,  # Very large batches
    shuffle=True,
    num_workers=16,   # More workers for I/O
    pin_memory=True   # GPU optimization
)

🚀 InMemoryAMASSDatasetSingleFrame (Maximum Performance)

Module Name: GBC.utils.data_preparation.amass_loader.InMemoryAMASSDatasetSingleFrame

Definition:

class InMemoryAMASSDatasetSingleFrame(Dataset):
    def __init__(self,
                 root_dir: str,
                 load_hands: bool = False,
                 secondary_dir: str = "ACCAD",
                 sample_steps: int = 1,
                 transform: Optional[callable] = None
                 ):

🔧 Ultimate Performance Strategy:

1. Complete Dataset Preloading

def _preload_data_parallel(self, secondary_dir):
    """Load the entire dataset into RAM using parallel processing"""
    pattern = os.path.join(self.root_dir, secondary_dir, "*", "*.npz")
    file_paths = glob.glob(pattern)

    num_processes = max(1, multiprocessing.cpu_count() - 1)
    all_data = []

    with Pool(processes=num_processes) as pool:
        results = pool.imap_unordered(_load_data_worker, file_paths)
        for result in tqdm(results, total=len(file_paths), desc="Preloading"):
            if result is not None:
                all_data.append(result)

    return all_data
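
As with the indexing worker, _load_data_worker is referenced but not shown; a minimal sketch under the same assumptions, except that it returns the full pose array rather than just metadata:

import numpy as np

def _load_data_worker(file_path):
    """Sketch: return (file_path, poses array), or None on failure."""
    try:
        with np.load(file_path) as data:
            if 'poses' not in data:
                return None
            # Copy into a plain array so the file handle can close before pickling back
            return (file_path, np.asarray(data['poses'], dtype=np.float32))
    except Exception:
        return None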

2. In-Memory Frame Storage

# Preprocess and store all frames in memory
self.frames = []
for file_path, poses_array in all_data:
    if not self.load_hands:
        poses_array = poses_array[:, :66]  # Slice once for efficiency

    num_frames = poses_array.shape[0]
    for frame_idx in range(0, num_frames, self.sample_steps):
        # Store actual tensor data, not file references
        self.frames.append(torch.tensor(poses_array[frame_idx], dtype=torch.float32))

3. Zero I/O Data Access

def __getitem__(self, idx):
    """Lightning-fast data access - no I/O operations"""
    pose = self.frames[idx]  # Direct memory access

    if self.transform:
        pose = self.transform(pose)

    return pose

⚠️ Memory Requirements:

# Automatic memory usage calculation
total_size_gb = sum(t.nelement() * t.element_size() for t in self.frames) / (1024**3)
print(f"Dataset loaded: {len(self.frames)} frames")
print(f"RAM usage: {total_size_gb:.2f} GB")

💡 Usage Example:

# Maximum performance for subset training
dataset = InMemoryAMASSDatasetSingleFrame(
    root_dir="/path/to/AMASS",
    secondary_dir="ACCAD",  # Specific subset that fits in RAM
    sample_steps=1          # Use every frame
)

# Output: Dataset loaded: 2,450,000 frames, RAM usage: 4.2 GB

# Ultra-fast training with zero I/O latency
train_loader = DataLoader(
    dataset,
    batch_size=2048,  # Very large batches
    shuffle=True,
    num_workers=0     # No I/O workers needed
)

# Training loop with maximum throughput
for epoch in range(num_epochs):
    for poses in train_loader:  # Zero I/O latency
        loss = train_step(poses)

🔧 Configuration Integration

📋 Configuration-Based Initialization

All dataset classes support initialization from configuration objects:

from GBC.utils.data_preparation.data_preparation_cfg import BaseCfg

# Configuration-driven setup
cfg = BaseCfg(
    dataset_path="/path/to/AMASS",
    num_betas=16,
    num_dmpls=8,
    load_hands=False,
    secondary_dir="ACCAD"
)

# Initialize any dataset class from the config
dataset = AMASSDataset.from_cfg(cfg)
interpolated = AMASSDatasetInterpolate.from_cfg(cfg)
single_frame = AMASSDatasetSingleFrame.from_cfg(cfg)
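
The from_cfg constructors themselves are not shown in this module's excerpt; conceptually each is a thin classmethod that maps config fields onto __init__ arguments. A sketch of the idea (field names mirror the BaseCfg example above; the real mapping may differ):

@classmethod
def from_cfg(cls, cfg):
    """Illustrative only: forward matching config fields to the constructor."""
    return cls(
        root_dir=cfg.dataset_path,
        num_betas=cfg.num_betas,
        num_dmpls=cfg.num_dmpls,
        load_hands=cfg.load_hands,
        secondary_dir=cfg.secondary_dir,
    )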

💡 Use Case Guidelines

🎯 Dataset Selection Matrix

| Use Case             | Recommended Dataset              | Key Benefits                 |
| -------------------- | -------------------------------- | ---------------------------- |
| Sequence Analysis    | AMASSDataset                     | Preserves temporal structure |
| Robot Control        | AMASSDatasetInterpolate          | Consistent frame rates       |
| Model Training       | AMASSDatasetSingleFrame          | Basic single-frame access    |
| Large-Scale Training | OptimizedAMASSDatasetSingleFrame | Optimized I/O and caching    |
| Maximum Speed        | InMemoryAMASSDatasetSingleFrame  | Zero I/O latency             |

🚀 Performance Optimization Strategies

For Different System Configurations:

# Memory-constrained systems
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    cache_size=32,   # Smaller cache
    sample_steps=5   # Reduce data volume
)

# High-memory systems
dataset = InMemoryAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    secondary_dir="ACCAD"  # Load a specific subset
)

# Balanced approach
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    cache_size=128,  # Moderate cache
    sample_steps=2   # Some data reduction
)

🔄 Data Augmentation Integration

def pose_augmentation(pose):
    """Example augmentation function"""
    # Add noise
    noise = torch.randn_like(pose) * 0.01
    pose_augmented = pose + noise

    # Further options: random temporal shifts (for sequence data),
    # symmetry transformations, joint limit clamping
    return pose_augmented

# Apply to any dataset
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    transform=pose_augmentation
)
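
The comments above list joint limit clamping as one augmentation option; a sketch of that idea follows, with placeholder limits (real limits would come from the target robot or the SMPL-X specification, not the uniform bounds used here):

import torch

# Placeholder per-DoF limits; substitute the real joint limits of your model
POSE_MIN = torch.full((66,), -3.14)
POSE_MAX = torch.full((66,), 3.14)

def clamp_pose(pose: torch.Tensor) -> torch.Tensor:
    """Clamp each pose dimension into its (placeholder) joint limits."""
    return torch.clamp(pose, min=POSE_MIN, max=POSE_MAX)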

🚨 Best Practices

✅ Dataset Optimization Guidelines

  1. Index Persistence: Use OptimizedAMASSDatasetSingleFrame for repeated access to same dataset
  2. Memory Management: Choose appropriate cache sizes based on available RAM
  3. Parallel Processing: Leverage multiple CPU cores for initial dataset indexing
  4. Subset Selection: Use secondary_dir to focus on relevant motion types
  5. Frame Sampling: Use sample_steps to balance data volume with training quality (see the sketch below)
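
The sample_steps trade-off in guideline 5 is simple arithmetic: both the number of training samples and the effective temporal resolution shrink by the sampling factor. With illustrative numbers:

# Illustrative: a 120 Hz subset with 3.6M raw frames, sampled every 5th frame
raw_frames, mocap_fps, sample_steps = 3_600_000, 120, 5
print(raw_frames // sample_steps)  # 720000 training samples
print(mocap_fps / sample_steps)    # 24.0 Hz effective temporal resolution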

🔧 Performance Monitoring

import time
import psutil

def benchmark_dataset(dataset_class, **kwargs):
    """Benchmark dataset loading performance"""
    start_time = time.time()

    # Initialize dataset
    dataset = dataset_class(**kwargs)
    init_time = time.time() - start_time

    # Measure data access speed over up to 1000 frames
    num_samples = min(1000, len(dataset))
    start_time = time.time()
    for i in range(num_samples):
        _ = dataset[i]
    access_time = (time.time() - start_time) / num_samples

    # Memory usage of the current process
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024

    print(f"Init time: {init_time:.2f}s")
    print(f"Access time: {access_time * 1000:.2f}ms per frame")
    print(f"Memory usage: {memory_mb:.1f}MB")

# Compare different loaders
benchmark_dataset(AMASSDatasetSingleFrame, root_dir=dataset_path)
benchmark_dataset(OptimizedAMASSDatasetSingleFrame, root_dir=dataset_path)

🐛 Troubleshooting

Common Issues:

  1. Memory Overflow:

    # Reduce cache size or use frame sampling
    dataset = OptimizedAMASSDatasetSingleFrame(root_dir=dataset_path, cache_size=32, sample_steps=5)
  2. Slow Initialization:

    # Check for existing index files
    index_files = glob.glob(os.path.join(root_dir, "amass_index_*.pkl"))
    print(f"Found {len(index_files)} index files")
  3. Missing Data:

    # Verify dataset structure
    pattern = os.path.join(root_dir, "*", "*", "*.npz")
    files = glob.glob(pattern)
    print(f"Found {len(files)} .npz files")

This comprehensive loader suite provides optimal data access patterns for all stages of the GBC pipeline, from initial research and validation through large-scale production training.