amass_loader

Module: GBC.utils.data_preparation.amass_loader

This module provides a comprehensive suite of PyTorch Dataset classes for loading and processing AMASS (Archive of Motion Capture as Surface Shapes) motion capture data. The module offers multiple specialized loaders optimized for different GBC framework tasks, including sequence-based processing, frame rate interpolation, single-frame training, and high-performance optimized loading with caching and parallel processing capabilities.

📚 Dependencies

This module requires the following Python packages:

  • torch - PyTorch framework for dataset handling
  • numpy - Numerical computing for motion data processing
  • glob - File pattern matching for dataset discovery
  • pickle - Data serialization for caching
  • multiprocessing - Parallel processing for dataset indexing
  • tqdm - Progress bars for long-running operations
  • functools - LRU cache implementation
  • GBC.utils.base.math_utils - Mathematical utilities for interpolation
  • GBC.utils.data_preparation.data_preparation_cfg - Configuration management

🎯 Dataset Classes Overview

📊 AMASSDataset (Sequence Loader)

Module Name: GBC.utils.data_preparation.amass_loader.AMASSDataset

Definition:

class AMASSDataset(Dataset):
    def __init__(self,
                 root_dir: str,
                 num_betas: int,
                 num_dmpls: int,
                 load_hands: bool = False,
                 secondary_dir: Optional[str] = None
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • num_betas (int): Number of body shape parameters (typically 16)
  • num_dmpls (int): Number of DMPL soft-tissue dynamics parameters (typically 8)
  • load_hands (bool): Include hand pose data (default: False, loads only body)
  • secondary_dir (Optional[str]): Specific subdirectory to load (e.g., "ACCAD/Female1Walking_c3d")

🔧 Functionality: Loads complete motion sequences from AMASS dataset, preserving original frame rates and sequence structure. Ideal for sequence-based analysis and validation tasks.

📤 Output Format:

{
    "poses": torch.Tensor,  # [num_frames, 66] (body only) or [num_frames, 156] (with hands)
    "betas": torch.Tensor,  # [num_betas] - body shape parameters
    "fps":   torch.Tensor,  # [1] - original mocap frame rate
    "trans": torch.Tensor   # [num_frames, 3] - root translation
}

💡 Usage Example:

from GBC.utils.data_preparation.amass_loader import AMASSDataset
from torch.utils.data import DataLoader

# Initialize dataset for sequence loading
dataset = AMASSDataset(
    root_dir="/path/to/AMASS",
    num_betas=16,
    num_dmpls=8,
    load_hands=False,
    secondary_dir="ACCAD/Female1Walking_c3d"
)

# Create dataloader for batch processing
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch in dataloader:
    poses = batch['poses']  # [batch_size, num_frames, 66]
    fps = batch['fps']      # [batch_size]
    print(f"Loaded {len(poses)} sequences with {poses.shape[1]} frames each")

🔄 AMASSDatasetInterpolate (Frame Rate Standardization)

Module Name: GBC.utils.data_preparation.amass_loader.AMASSDatasetInterpolate

Definition:

class AMASSDatasetInterpolate(Dataset):
    def __init__(self,
                 root_dir: str,
                 num_betas: int,
                 num_dmpls: int,
                 load_hands: bool = False,
                 specialize_dir: Optional[Union[str, List[str]]] = None,
                 interpolate_fps: int = 1000,
                 seq_min_duration: Optional[int] = None
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • num_betas (int): Number of body shape parameters
  • num_dmpls (int): Number of DMPL soft-tissue dynamics parameters
  • load_hands (bool): Include hand pose data
  • specialize_dir (Union[str, List[str]]): Single directory or list of directories to process
  • interpolate_fps (int): Target frame rate for interpolation (default: 1000Hz)
  • seq_min_duration (Optional[int]): Minimum sequence duration in seconds (filters short sequences)

🔧 Functionality: Interpolates motion sequences to a standardized frame rate using advanced mathematical interpolation for both pose (angle-axis) and translation data. Essential for consistent temporal resolution across different mocap sources.

⚙️ Interpolation Process:

# Pose interpolation (angle-axis representation)
pose_interpolated = interpolate_angle_axis(pose, target_fps=interpolate_fps, source_fps=original_fps)

# Translation interpolation (linear)
trans_interpolated = interpolate_trans(trans, target_fps=interpolate_fps, source_fps=original_fps)
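
The actual interpolation routines live in GBC.utils.base.math_utils and are not reproduced here. As a mental model, angle-axis interpolation is typically done by converting each joint rotation to a quaternion and interpolating on the unit sphere rather than interpolating the raw axis-angle vectors. Below is a minimal sketch of that idea for two neighboring frames and a blend weight t; the helper name and the nlerp shortcut are illustrative assumptions, not the real math_utils API:

import torch
import torch.nn.functional as F

def interp_angle_axis_sketch(aa0: torch.Tensor, aa1: torch.Tensor, t: float) -> torch.Tensor:
    """Interpolate two angle-axis rotations [3] via normalized quaternion lerp (nlerp)."""
    def to_quat(aa):
        angle = aa.norm().clamp(min=1e-8)
        axis = aa / angle
        return torch.cat([torch.cos(angle / 2).view(1), axis * torch.sin(angle / 2)])

    q0, q1 = to_quat(aa0), to_quat(aa1)
    if torch.dot(q0, q1) < 0:
        q1 = -q1  # take the short path on the quaternion sphere
    q = F.normalize((1 - t) * q0 + t * q1, dim=0)

    # Convert back to angle-axis
    w, xyz = q[0].clamp(-1, 1), q[1:]
    angle = 2 * torch.acos(w)
    sin_half = torch.sqrt((1 - w * w).clamp(min=1e-8))
    return xyz / sin_half * angle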

📤 Output Format:

{
    "poses": torch.Tensor,  # [interpolated_frames, 66] at target FPS
    "betas": torch.Tensor,  # [num_betas] - body shape parameters
    "fps":   torch.Tensor,  # [1] - original mocap frame rate
    "trans": torch.Tensor,  # [interpolated_frames, 3] at target FPS
    "title": str,           # Motion sequence name
    "fpath": str            # Relative file path
}

💡 Usage Example:

# High-frequency interpolation for precise motion analysis
dataset = AMASSDatasetInterpolate(
    root_dir="/path/to/AMASS",
    num_betas=16,
    num_dmpls=8,
    specialize_dir=["ACCAD/Female1Walking_c3d", "BMLmovi/Subject_1_F_MoSh"],
    interpolate_fps=120,  # 120Hz for smooth robot control
    seq_min_duration=2    # Only sequences longer than 2 seconds
)

for data in dataset:
    poses = data['poses']  # Interpolated to 120Hz
    title = data['title']  # Motion name for identification
    print(f"Motion '{title}': {poses.shape[0]} frames at 120Hz")

🎯 AMASSDatasetSingleFrame (Training Loader)

Module Name: GBC.utils.data_preparation.amass_loader.AMASSDatasetSingleFrame

Definition:

class AMASSDatasetSingleFrame(Dataset):
    def __init__(self,
                 root_dir: str,
                 transform: Optional[callable] = None,
                 load_hands: bool = False,
                 secondary_dir: str = "ACCAD",
                 sample_steps: int = 1
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • transform (Optional[callable]): Data augmentation transform function
  • load_hands (bool): Include hand pose data
  • secondary_dir (str): Specific dataset subdirectory (default: "ACCAD")
  • sample_steps (int): Frame sampling interval (1=every frame, 2=every other frame)

🔧 Functionality: Provides individual pose frames for training neural networks. Implements basic caching for frequently accessed files and automatic cache management to prevent memory overflow.

🧠 Caching Strategy:

# Automatic file caching with progress tracking
self.cached_data = {} # LRU-style cache for .npz files
self.files_progress = {} # Track frame access per file

# Cache management during data access
if file_path not in self.cached_data:
    data = np.load(file_path)
    self.cached_data[file_path] = data

# Automatic cache cleanup when file is fully processed
if self.files_progress[file_path] >= data['poses'].shape[0]:
    del self.cached_data[file_path]
    del self.files_progress[file_path]

📤 Output Format:

torch.Tensor  # [66] or [156] - single pose frame

💡 Usage Example:

# Single-frame loader for pose transformer training
dataset = AMASSDatasetSingleFrame(
    root_dir="/path/to/AMASS",
    secondary_dir="ACCAD",
    sample_steps=5,  # Sample every 5th frame
    transform=None   # No augmentation
)

# Training dataloader
train_loader = DataLoader(
    dataset,
    batch_size=512,  # Large batches for single frames
    shuffle=True,
    num_workers=8
)

for poses in train_loader:
    # poses: [batch_size, 66] - ready for model training
    predictions = model(poses)

⚡ OptimizedAMASSDatasetSingleFrame (High-Performance Loader)

Module Name: GBC.utils.data_preparation.amass_loader.OptimizedAMASSDatasetSingleFrame

Definition:

class OptimizedAMASSDatasetSingleFrame(Dataset):
    def __init__(self,
                 root_dir: str,
                 load_hands: bool = False,
                 secondary_dir: str = "ACCAD",
                 sample_steps: int = 1,
                 transform: Optional[callable] = None,
                 cache_size: int = 128
                 ):

📥 Initialization Parameters:

  • root_dir (str): Root directory of AMASS dataset
  • load_hands (bool): Include hand pose data
  • secondary_dir (str): Dataset subdirectory to process
  • sample_steps (int): Frame sampling interval
  • transform (Optional[callable]): Data augmentation function
  • cache_size (int): Number of .npz files to keep in LRU cache

🔧 Advanced Features:

1. Parallel Dataset Indexing

def _build_index_parallel(self, secondary_dir):
    """One-time parallel indexing for fast startup"""
    pattern = os.path.join(self.root_dir, secondary_dir, "*", "*.npz")
    file_paths = glob.glob(pattern)

    # Use all CPU cores for maximum speed
    num_processes = multiprocessing.cpu_count()

    metadata = []
    with Pool(processes=num_processes) as pool:
        results = pool.imap_unordered(_read_metadata_worker, file_paths)
        for result in tqdm(results, total=len(file_paths), desc="Building index"):
            if result is not None:
                metadata.append(result)

    return metadata
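
The module-level _read_metadata_worker is referenced above but not shown. A plausible minimal sketch, assuming it only needs a per-file frame count for the index (the real worker may record additional fields):

import numpy as np

def _read_metadata_worker(file_path):
    """Sketch: return (file_path, num_frames), or None so the caller can skip the file."""
    try:
        with np.load(file_path) as data:
            if 'poses' not in data:
                return None  # e.g., shape-only .npz files without motion data
            return (file_path, data['poses'].shape[0])
    except Exception:
        return None  # corrupt or unreadable file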

2. Persistent Index Caching

# Automatic index file creation and loading
index_filename = f"amass_index_v2_{secondary_dir}_{'hands' if load_hands else 'nohands'}.pkl"
index_path = os.path.join(self.root_dir, index_filename)

if os.path.exists(index_path):
    # Load pre-computed index (instant startup)
    with open(index_path, 'rb') as f:
        self.file_metadata = pickle.load(f)
else:
    # Build index once and save it for future runs
    self.file_metadata = self._build_index_parallel(secondary_dir)
    with open(index_path, 'wb') as f:
        pickle.dump(self.file_metadata, f)

3. LRU Cache for File Access

# Dynamic LRU cache creation
self._get_data_from_file = lru_cache(maxsize=cache_size)(self._load_file)

def _load_file(self, file_path: str):
    """Cached file loading with automatic memory management"""
    return np.load(file_path)
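
Since the wrapper is a standard functools.lru_cache, its hit statistics can be inspected directly, which helps when tuning cache_size:

# After some iterations, check how well cache_size fits the access pattern
info = dataset._get_data_from_file.cache_info()
print(f"hits={info.hits}, misses={info.misses}, cached={info.currsize}/{info.maxsize}")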

💡 Performance Benefits:

  • First Run: Parallel indexing builds persistent cache for instant future startups
  • Subsequent Runs: Instant dataset initialization using pre-computed index
  • Memory Efficient: LRU cache automatically manages memory usage
  • Scalable: Handles datasets with millions of frames efficiently

💡 Usage Example:

# High-performance loader for large-scale training
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir="/path/to/AMASS",
    secondary_dir="ACCAD",
    cache_size=256,  # Cache up to 256 files
    sample_steps=1   # Use every frame
)

# First run: builds index (one-time cost)
# Subsequent runs: instant startup
print(f"Dataset ready with {len(dataset)} frames")

# High-throughput training
train_loader = DataLoader(
    dataset,
    batch_size=1024,  # Very large batches
    shuffle=True,
    num_workers=16,   # More workers for I/O
    pin_memory=True   # GPU optimization
)

🚀 InMemoryAMASSDatasetSingleFrame (Maximum Performance)

Module Name: GBC.utils.data_preparation.amass_loader.InMemoryAMASSDatasetSingleFrame

Definition:

class InMemoryAMASSDatasetSingleFrame(Dataset):
    def __init__(self,
                 root_dir: str,
                 load_hands: bool = False,
                 secondary_dir: str = "ACCAD",
                 sample_steps: int = 1,
                 transform: Optional[callable] = None
                 ):

🔧 Ultimate Performance Strategy:

1. Complete Dataset Preloading

def _preload_data_parallel(self, secondary_dir):
    """Load the entire dataset into RAM using parallel processing"""
    pattern = os.path.join(self.root_dir, secondary_dir, "*", "*.npz")
    file_paths = glob.glob(pattern)

    num_processes = max(1, multiprocessing.cpu_count() - 1)
    all_data = []

    with Pool(processes=num_processes) as pool:
        results = pool.imap_unordered(_load_data_worker, file_paths)
        for result in tqdm(results, total=len(file_paths), desc="Preloading"):
            if result is not None:
                all_data.append(result)

    return all_data
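
As with the indexing worker, _load_data_worker is referenced but not shown; a minimal sketch under the same assumptions, except that it returns the full pose array rather than just metadata:

import numpy as np

def _load_data_worker(file_path):
    """Sketch: return (file_path, poses array), or None on failure."""
    try:
        with np.load(file_path) as data:
            if 'poses' not in data:
                return None
            # Copy into a plain array so the file handle can close before pickling back
            return (file_path, np.asarray(data['poses'], dtype=np.float32))
    except Exception:
        return None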

2. In-Memory Frame Storage

# Preprocess and store all frames in memory
self.frames = []
for file_path, poses_array in all_data:
    if not self.load_hands:
        poses_array = poses_array[:, :66]  # Slice once for efficiency

    num_frames = poses_array.shape[0]
    for frame_idx in range(0, num_frames, self.sample_steps):
        # Store actual tensor data, not file references
        self.frames.append(torch.tensor(poses_array[frame_idx], dtype=torch.float32))

3. Zero I/O Data Access

def __getitem__(self, idx):
    """Lightning-fast data access - no I/O operations"""
    pose = self.frames[idx]  # Direct memory access

    if self.transform:
        pose = self.transform(pose)

    return pose

⚠️ Memory Requirements:

# Automatic memory usage calculation
total_size_gb = sum(t.nelement() * t.element_size() for t in self.frames) / (1024**3)
print(f"Dataset loaded: {len(self.frames)} frames")
print(f"RAM usage: {total_size_gb:.2f} GB")

💡 Usage Example:

# Maximum performance for subset training
dataset = InMemoryAMASSDatasetSingleFrame(
    root_dir="/path/to/AMASS",
    secondary_dir="ACCAD",  # Specific subset that fits in RAM
    sample_steps=1          # Use every frame
)

# Output: Dataset loaded: 2,450,000 frames, RAM usage: 4.2 GB

# Ultra-fast training with zero I/O latency
train_loader = DataLoader(
    dataset,
    batch_size=2048,  # Very large batches
    shuffle=True,
    num_workers=0     # No I/O workers needed
)

# Training loop with maximum throughput
for epoch in range(num_epochs):
    for poses in train_loader:  # Zero I/O latency
        loss = train_step(poses)

🔧 Configuration Integration

📋 Configuration-Based Initialization

All dataset classes support initialization from configuration objects:

from GBC.utils.data_preparation.data_preparation_cfg import BaseCfg

# Configuration-driven setup
cfg = BaseCfg(
    dataset_path="/path/to/AMASS",
    num_betas=16,
    num_dmpls=8,
    load_hands=False,
    secondary_dir="ACCAD"
)

# Initialize any dataset class from the config
dataset = AMASSDataset.from_cfg(cfg)
interpolated = AMASSDatasetInterpolate.from_cfg(cfg)
single_frame = AMASSDatasetSingleFrame.from_cfg(cfg)
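
The from_cfg constructors themselves are not shown in this module's excerpt; conceptually each is a thin classmethod that maps config fields onto __init__ arguments. A sketch of the idea (field names mirror the BaseCfg example above; the real mapping may differ):

@classmethod
def from_cfg(cls, cfg):
    """Illustrative only: forward matching config fields to the constructor."""
    return cls(
        root_dir=cfg.dataset_path,
        num_betas=cfg.num_betas,
        num_dmpls=cfg.num_dmpls,
        load_hands=cfg.load_hands,
        secondary_dir=cfg.secondary_dir,
    )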

💡 Use Case Guidelines

🎯 Dataset Selection Matrix

| Use Case             | Recommended Dataset              | Key Benefits                 |
| -------------------- | -------------------------------- | ---------------------------- |
| Sequence Analysis    | AMASSDataset                     | Preserves temporal structure |
| Robot Control        | AMASSDatasetInterpolate          | Consistent frame rates       |
| Model Training       | AMASSDatasetSingleFrame          | Basic single-frame access    |
| Large-Scale Training | OptimizedAMASSDatasetSingleFrame | Optimized I/O and caching    |
| Maximum Speed        | InMemoryAMASSDatasetSingleFrame  | Zero I/O latency             |

🚀 Performance Optimization Strategies

For Different System Configurations:

# Memory-constrained systems
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    cache_size=32,   # Smaller cache
    sample_steps=5   # Reduce data volume
)

# High-memory systems
dataset = InMemoryAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    secondary_dir="ACCAD"  # Load a specific subset
)

# Balanced approach
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    cache_size=128,  # Moderate cache
    sample_steps=2   # Some data reduction
)

🔄 Data Augmentation Integration

def pose_augmentation(pose):
    """Example augmentation function"""
    # Add noise
    noise = torch.randn_like(pose) * 0.01
    pose_augmented = pose + noise

    # Further options: random temporal shifts (for sequence data),
    # symmetry transformations, joint limit clamping
    return pose_augmented

# Apply to any dataset
dataset = OptimizedAMASSDatasetSingleFrame(
    root_dir=dataset_path,
    transform=pose_augmentation
)
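
The comments above list joint limit clamping as one augmentation option; a sketch of that idea follows, with placeholder limits (real limits would come from the target robot or the SMPL-X specification, not the uniform bounds used here):

import torch

# Placeholder per-DoF limits; substitute the real joint limits of your model
POSE_MIN = torch.full((66,), -3.14)
POSE_MAX = torch.full((66,), 3.14)

def clamp_pose(pose: torch.Tensor) -> torch.Tensor:
    """Clamp each pose dimension into its (placeholder) joint limits."""
    return torch.clamp(pose, min=POSE_MIN, max=POSE_MAX)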

🚨 Best Practices

✅ Dataset Optimization Guidelines

  1. Index Persistence: Use OptimizedAMASSDatasetSingleFrame for repeated access to same dataset
  2. Memory Management: Choose appropriate cache sizes based on available RAM
  3. Parallel Processing: Leverage multiple CPU cores for initial dataset indexing
  4. Subset Selection: Use secondary_dir to focus on relevant motion types
  5. Frame Sampling: Use sample_steps to balance data volume with training quality (see the sketch below)
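
The sample_steps trade-off in guideline 5 is simple arithmetic: both the number of training samples and the effective temporal resolution shrink by the sampling factor. With illustrative numbers:

# Illustrative: a 120 Hz subset with 3.6M raw frames, sampled every 5th frame
raw_frames, mocap_fps, sample_steps = 3_600_000, 120, 5
print(raw_frames // sample_steps)  # 720000 training samples
print(mocap_fps / sample_steps)    # 24.0 Hz effective temporal resolution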

🔧 Performance Monitoring

import time
import psutil

def benchmark_dataset(dataset_class, **kwargs):
    """Benchmark dataset loading performance"""
    start_time = time.time()

    # Initialize dataset
    dataset = dataset_class(**kwargs)
    init_time = time.time() - start_time

    # Measure data access speed over up to 1000 frames
    num_samples = min(1000, len(dataset))
    start_time = time.time()
    for i in range(num_samples):
        _ = dataset[i]
    access_time = (time.time() - start_time) / num_samples

    # Memory usage of the current process
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024

    print(f"Init time: {init_time:.2f}s")
    print(f"Access time: {access_time * 1000:.2f}ms per frame")
    print(f"Memory usage: {memory_mb:.1f}MB")

# Compare different loaders
benchmark_dataset(AMASSDatasetSingleFrame, root_dir=dataset_path)
benchmark_dataset(OptimizedAMASSDatasetSingleFrame, root_dir=dataset_path)

🐛 Troubleshooting

Common Issues:

  1. Memory Overflow:

    # Reduce cache size or use frame sampling
    dataset = OptimizedAMASSDatasetSingleFrame(root_dir=dataset_path, cache_size=32, sample_steps=5)
  2. Slow Initialization:

    # Check for existing index files
    index_files = glob.glob(os.path.join(root_dir, "amass_index_*.pkl"))
    print(f"Found {len(index_files)} index files")
  3. Missing Data:

    # Verify dataset structure
    pattern = os.path.join(root_dir, "*", "*", "*.npz")
    files = glob.glob(pattern)
    print(f"Found {len(files)} .npz files")

This comprehensive loader suite provides optimal data access patterns for all stages of the GBC pipeline, from initial research and validation through large-scale production training.