Overview

xpm-torch is a PyTorch training framework built on experimaestro and Lightning Fabric. It bridges experimaestro’s configuration and experiment management system with PyTorch model training, checkpointing, and HuggingFace Hub integration.

What’s in the package

Module system

The core abstraction is Module, which combines experimaestro.Config (declarative parameters, hashing, serialization) with torch.nn.Module (parameters, forward pass, device management).

Module — Base class for all models. Declares parameters with Param[T], initializes structure in __initialize__(), saves/loads weights with safetensors via save_model() / load_model(). Subclasses override loader_config() to control how the model is loaded from a checkpoint.
ModuleLoader — Lightweight task that initializes a model and loads its weights from a checkpoint directory. Produced by Module.loader_config(path).
ModuleContainer — A plain nn.Module container that auto-detects which children have state and wraps them with Lightning Fabric via setup_with_fabric().
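The split between declarative parameters and runtime state can be illustrated with a toy, standard-library-only sketch. It does not use experimaestro or torch: ToyLinearConfig, identifier() and initialize() are hypothetical names chosen for illustration, and the SHA-256-over-JSON hashing scheme is an assumption, not the one experimaestro uses.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyLinearConfig:
    """Declarative part: hyperparameters only (hypothetical stand-in for a Module)."""

    in_features: int
    out_features: int
    # Runtime state: built lazily and never part of the identifier
    weights: Optional[List[List[float]]] = field(
        default=None, init=False, compare=False, repr=False
    )

    def identifier(self) -> str:
        # Hash only the declared parameters, in a stable key order
        params = {"in_features": self.in_features, "out_features": self.out_features}
        payload = json.dumps(params, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def initialize(self) -> None:
        # Build the structure from the declared parameters (zero weights here)
        self.weights = [[0.0] * self.in_features for _ in range(self.out_features)]

    def forward(self, x: List[float]) -> List[float]:
        if self.weights is None:
            raise RuntimeError("call initialize() before forward()")
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]
```

In this sketch, two configurations with the same parameters share an identifier whether or not their weights have been built, which is the property that lets an experiment manager recognize an already-run configuration.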

See Module System for the full API reference.

Training

A complete training loop with checkpointing, validation, and distributed training support:

Learner — The main training task. Configures the model, optimizer(s), trainer, and validation listeners. The main loop runs in execute().
Trainer / LossTrainer — Defines how batches are produced and processed.
TrainState — Serializes epoch/step counters, model weights (safetensors), trainer state, and optimizer state via save() / load().
TrainerContext — Passed through the training loop; holds the current state, the TensorBoard writer, and the Fabric instance, and provides add_loss() / add_metric() for regularization and logging.
LearnerListener — Hook called after each epoch (e.g. for validation and early stopping). Produces ModuleLoader instances for the best checkpoint via init_task().
ModuleLoader also carries optional metadata settings (e.g. CheckpointSettings for the epoch, ValidationSettings for the validation key).
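The interplay between the main loop and per-epoch listeners can be sketched in plain Python. This is a toy under stated assumptions, not the Learner implementation: ToyState, BestValidationListener and run() are hypothetical names, and the stopping rule (stop once every listener has seen no improvement for `patience` epochs) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyState:
    """Epoch/step counters only (the real TrainState also holds weights and optimizer state)."""

    epoch: int = 0
    steps: int = 0


class BestValidationListener:
    """Tracks the epoch with the lowest validation loss and votes on early stopping."""

    def __init__(self, validate: Callable[[ToyState], float], patience: int):
        self.validate = validate
        self.patience = patience
        self.best_loss: Optional[float] = None
        self.best_epoch: Optional[int] = None

    def __call__(self, state: ToyState) -> bool:
        loss = self.validate(state)
        if self.best_loss is None or loss < self.best_loss:
            # A real listener would checkpoint the model at this point
            self.best_loss, self.best_epoch = loss, state.epoch
        # True means "this listener agrees to stop"
        return state.epoch - self.best_epoch >= self.patience


def run(max_epochs: int, steps_per_epoch: int,
        listeners: List[BestValidationListener]) -> ToyState:
    state = ToyState()
    for _ in range(max_epochs):
        for _ in range(steps_per_epoch):
            state.steps += 1  # placeholder for one optimization step
        state.epoch += 1
        # Call every listener (no short-circuit) so each one sees every epoch
        decisions = [listener(state) for listener in listeners]
        if decisions and all(decisions):
            break
    return state
```

After the loop, each listener's best_epoch plays the role that the ModuleLoader for the best checkpoint plays in the description above.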

See Training for the full API reference.

Optimization

Configurable optimizers and schedulers:

ParameterOptimizer — Associates an Optimizer (e.g. Adam, AdamW, SGD, Adafactor) with a Scheduler and optional ParameterFilter for per-group learning rates.
GradientClippingHook / GradientLogHook — Hooks for gradient management.
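Per-group learning rates driven by filters can be sketched as follows. The sketch works on parameter names only and does not touch torch; assign_groups is a hypothetical helper, and the first-match rule (a parameter goes to the first group whose filter accepts it, with a None filter accepting everything) is an assumption about how such filters are typically combined.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A filter decides, from the parameter name, whether a group owns the parameter
NameFilter = Callable[[str], bool]


def assign_groups(
    names: List[str],
    groups: List[Tuple[Optional[NameFilter], float]],
) -> Dict[str, float]:
    """Map each parameter name to the learning rate of the first matching group."""
    assigned: Dict[str, float] = {}
    for name in names:
        for name_filter, learning_rate in groups:
            if name_filter is None or name_filter(name):
                assigned[name] = learning_rate
                break
        else:
            # No group accepted this parameter: fail loudly rather than skip it
            raise ValueError(f"no group matches parameter {name!r}")
    return assigned
```

A typical use would give a freshly initialized head a larger learning rate than a pretrained encoder by listing the head filter first and a catch-all group last.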

See Optimization for the full API reference.

Export actions

After training, models can be exported to HuggingFace Hub or a local directory via experimaestro’s action system. The Learner automatically registers ExportAction instances during submission, which can be executed interactively after the experiment completes. Subclass ExportAction and override export_action() to customize the export behavior for your models.
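The register-now, execute-later pattern described above can be illustrated with a toy sketch. Nothing here is the experimaestro action API: ToyExportAction, ToyExperiment, register() and run_action() are hypothetical, and the export_action(target) signature is an assumption chosen for the example.

```python
import json
from pathlib import Path
from typing import Any, Dict


class ToyExportAction:
    """Hypothetical action that writes a model description to a directory."""

    def __init__(self, name: str, model_config: Dict[str, Any]):
        self.name = name
        self.model_config = model_config

    def export_action(self, target: Path) -> Path:
        # Subclasses would override this to change the export format
        target.mkdir(parents=True, exist_ok=True)
        out = target / "config.json"
        out.write_text(json.dumps(self.model_config, sort_keys=True))
        return out


class ToyExperiment:
    """Keeps actions registered at submission time so they can run later."""

    def __init__(self) -> None:
        self.actions: Dict[str, ToyExportAction] = {}

    def register(self, action: ToyExportAction) -> None:
        self.actions[action.name] = action

    def run_action(self, name: str, target: Path) -> Path:
        # Executed on demand, after training has finished
        return self.actions[name].export_action(target)
```

Registration costs nothing at submission time; the file is only written when run_action() is called with a destination.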

See HuggingFace Hub Integration for details on actions, pushing models, and customizing the checkpoint format.

HuggingFace Hub

Utility functions check the local cache and download files from HuggingFace Hub. Model upload and download go through TorchHFHub, which extends ExperimaestroHFHub from the experimaestro package to better integrate xpm-torch models and their loaders.

Experiment results

TrainingResults — A serializable Config holding trained model configs and TensorBoard log paths. Saved by experiments for later retrieval.

Batching

Batcher / PowerAdaptativeBatcher — Handles micro-batching with OOM recovery. Automatically reduces batch size on RecoverableOOMError and replays the failed batch.
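The reduce-and-replay behavior can be sketched without a GPU. This is a minimal illustration, not the Batcher implementation: RecoverableOOMError is redefined here as a stand-in for the class named above, process_with_recovery is a hypothetical helper, and halving the batch size on each failure is an assumption.

```python
from typing import Callable, List, Sequence, Tuple


class RecoverableOOMError(Exception):
    """Stand-in for the error named above, defined locally for this sketch."""


def process_with_recovery(
    items: Sequence[int],
    batch_size: int,
    process: Callable[[Sequence[int]], List[int]],
    min_batch_size: int = 1,
) -> Tuple[List[int], int]:
    """Process items in micro-batches, shrinking the batch and replaying it on OOM."""
    results: List[int] = []
    start = 0
    while start < len(items):
        chunk = items[start:start + batch_size]
        try:
            output = process(chunk)
        except RecoverableOOMError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further: propagate the error
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # replay from the same position with a smaller batch
        results.extend(output)
        start += len(chunk)
    # Return the batch size that worked so later calls can start from it
    return results, batch_size
```

Because `start` only advances after a successful call, the failed slice is never dropped or processed twice.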

Fabric configuration

FabricConfiguration — Wraps Lightning Fabric settings (precision, devices, strategy, accelerator) as an experimaestro Config for declarative experiment setup.
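The idea of holding Fabric settings as declarative data can be sketched with a frozen dataclass. ToyFabricSettings and fabric_kwargs() are hypothetical and are not the FabricConfiguration API; the four field names mirror the settings listed above, and the default values are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Union


@dataclass(frozen=True)
class ToyFabricSettings:
    """Immutable description of how training should be launched."""

    precision: str = "32-true"
    devices: Union[int, str] = "auto"
    strategy: str = "auto"
    accelerator: str = "auto"

    def fabric_kwargs(self) -> Dict[str, Any]:
        # Plain dict, so it can be logged, serialized, or expanded with **
        return asdict(self)


settings = ToyFabricSettings(precision="bf16-mixed", devices=2)
# With Lightning installed, these could be expanded into the constructor:
#   from lightning.fabric import Fabric
#   fabric = Fabric(**settings.fabric_kwargs())
```

Keeping the settings as data rather than as a live Fabric object means they can be stored with the experiment and only turned into a runtime object when training starts.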

How it fits together

YAML config
  → experimaestro deserializes Config objects
    → Learner task is submitted (ExportActions registered)
      → Fabric launches training
        → Trainer produces batches, Module computes forward/backward
          → TrainerContext collects losses and metrics
            → LearnerListeners validate and checkpoint
              → ModuleLoader configs point to best checkpoints
  → After completion, ExportActions execute interactively
    → TorchHFHub serializes for Hub upload or local save