Overview

xpm-torch is a PyTorch training framework built on experimaestro and Lightning Fabric. It bridges experimaestro’s configuration and experiment management system with PyTorch model training, checkpointing, and HuggingFace Hub integration.

What’s in the package

Module system

The core abstraction is Module, which combines experimaestro.Config (declarative parameters, hashing, serialization) with torch.nn.Module (parameters, forward pass, device management).

  • Module — Base class for all models. Declares parameters with Param[T], initializes structure in __initialize__(), saves/loads weights with safetensors via save_model() / load_model(). Subclasses override loader_config() to control how the model is loaded from a checkpoint.

  • ModuleLoader — Lightweight task that initializes a model and loads its weights from a checkpoint directory. Produced by Module.loader_config(path).

  • ModuleContainer — A plain nn.Module container that auto-detects which children have state and wraps them with Lightning Fabric via setup_with_fabric().
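
The pattern described above (declared parameters, deferred structure initialization, and a small loader that rebuilds a model from a checkpoint directory) can be sketched without any dependencies. This is only an illustration: `ToyModule` and `ToyLoader` are hypothetical stand-ins, not xpm-torch classes. The method names mirror those in the text, but their signatures are assumptions, and JSON replaces safetensors to keep the sketch self-contained.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class ToyModule:
    """Toy stand-in: declared parameters plus deferred structure creation."""

    hidden_size: int  # declared parameter (assumed name)
    weights: Dict[str, List[float]] = field(default_factory=dict)

    def initialize(self) -> None:
        # Build the structure from the declared parameters only.
        self.weights = {"linear": [0.0] * self.hidden_size}

    def save_model(self, directory: Path) -> None:
        # JSON is used instead of safetensors to stay dependency-free.
        (directory / "model.json").write_text(json.dumps(self.weights))

    def load_model(self, directory: Path) -> None:
        self.weights = json.loads((directory / "model.json").read_text())

    def loader_config(self, directory: Path) -> "ToyLoader":
        # The loader gets a fresh, uninitialized copy of the configuration.
        return ToyLoader(model=ToyModule(self.hidden_size), path=directory)


@dataclass
class ToyLoader:
    """Toy stand-in for a lightweight task that restores a model."""

    model: ToyModule
    path: Path

    def execute(self) -> ToyModule:
        self.model.initialize()           # create the structure first
        self.model.load_model(self.path)  # then restore the saved weights
        return self.model


with tempfile.TemporaryDirectory() as tmp:
    trained = ToyModule(hidden_size=3)
    trained.initialize()
    trained.weights["linear"] = [0.1, 0.2, 0.3]  # pretend training happened
    trained.save_model(Path(tmp))
    restored = trained.loader_config(Path(tmp)).execute()
```

Keeping the loader separate from the model means downstream code only needs a configuration and a path to obtain a ready-to-use model.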

See Module System for the full API reference.

Training

A complete training loop with checkpointing, validation, and distributed training support:

  • Learner — The main training task. Configures the model, optimizer(s), trainer, and validation listeners. The main loop runs in execute().

  • Trainer / LossTrainer — Defines how batches are produced and processed.

  • TrainState — Serializes epoch/step counters, model weights (safetensors), trainer state, and optimizer state via save() / load().

  • TrainerContext — Passed through the training loop; holds the current state, TensorBoard writer, Fabric instance, and provides add_loss() / add_metric() for regularization and logging.

  • LearnerListener — Hook called after each epoch (e.g. for validation and early stopping). Produces ModuleLoader instances for the best checkpoint via init_task().

  • ModuleLoader carries optional metadata settings (e.g. CheckpointSettings for the epoch, ValidationSettings for the validation key).
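
As a rough illustration of what a per-epoch hook for validation and early stopping does, here is a minimal, dependency-free sketch. `BestCheckpointTracker`, its `patience` parameter, and the `after_epoch()` method are hypothetical names chosen for this example. They are not the LearnerListener API, and the metric is assumed to be "higher is better".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestCheckpointTracker:
    """Hypothetical listener: remembers the best epoch, signals early stopping."""

    patience: int  # epochs without improvement tolerated before stopping
    best_metric: float = float("-inf")
    best_epoch: Optional[int] = None
    epochs_since_best: int = 0

    def after_epoch(self, epoch: int, metric: float) -> bool:
        """Record a validation metric; return True if training should stop."""
        if metric > self.best_metric:
            # New best: this is the epoch whose checkpoint would be kept.
            self.best_metric = metric
            self.best_epoch = epoch
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.patience


tracker = BestCheckpointTracker(patience=2)
stopped_at = None
# Made-up validation scores for epochs 1..5, for illustration only.
for epoch, score in enumerate([0.50, 0.62, 0.61, 0.60, 0.70], start=1):
    if tracker.after_epoch(epoch, score):
        stopped_at = epoch
        break
```

In this run the tracker stops after two epochs without improvement, so the last score is never seen and the best checkpoint is the one from epoch 2.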

See Training for the full API reference.

Optimization

Configurable optimizers and schedulers.

See Optimization for the full API reference.

Export actions

After training, models can be exported to HuggingFace Hub or a local directory via experimaestro’s action system. The Learner automatically registers ExportAction instances during submission, which can be executed interactively after the experiment completes. Subclass ExportAction and override export_action() to customize the export behavior for your models.

See HuggingFace Hub Integration for details on actions, pushing models, and customizing the checkpoint format.

HuggingFace Hub

Utility functions for cache checking and downloading from HuggingFace Hub. Model upload and download are handled by TorchHFHub, which extends ExperimaestroHFHub from the experimaestro package to integrate with xpm-torch models and their loaders.

Experiment results

  • TrainingResults — A serializable Config holding trained model configs and TensorBoard log paths. Saved by experiments for later retrieval.

Batching

  • Batcher / PowerAdaptativeBatcher — Handles micro-batching with out-of-memory (OOM) recovery. Automatically reduces the batch size on RecoverableOOMError and replays the failed batch.
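
The recover-and-replay idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Batcher implementation: the exception class below is a local stand-in that only shares its name with the one mentioned above, `run_batch()` and `fake_process()` are hypothetical helpers, and halving the micro-batch size is an assumed shrinking policy.

```python
from typing import Callable, List, Sequence, Tuple


class RecoverableOOMError(Exception):
    """Local stand-in for an out-of-memory error that is safe to retry."""


def run_batch(
    batch: Sequence[int],
    process: Callable[[Sequence[int]], int],
    micro_batch_size: int,
) -> Tuple[List[int], int]:
    """Process `batch` in micro-batches, halving their size after each OOM.

    Returns the per-micro-batch outputs and the size that finally worked.
    """
    size = micro_batch_size
    while True:
        outputs: List[int] = []  # partial results of a failed attempt are dropped
        try:
            # Replay the whole batch from the start with the current size.
            for start in range(0, len(batch), size):
                outputs.append(process(batch[start:start + size]))
            return outputs, size
        except RecoverableOOMError:
            if size == 1:
                raise  # cannot shrink any further
            size = max(1, size // 2)


def fake_process(chunk: Sequence[int]) -> int:
    # Pretend that more than two items do not fit in memory.
    if len(chunk) > 2:
        raise RecoverableOOMError(f"chunk of {len(chunk)} items is too large")
    return sum(chunk)


outputs, final_size = run_batch(list(range(8)), fake_process, micro_batch_size=8)
```

Here the attempts with sizes 8 and 4 fail, and the batch is replayed successfully with micro-batches of 2 items. A real trainer would also have to discard any gradients accumulated during the failed attempt before replaying.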

Fabric configuration

  • FabricConfiguration — Wraps Lightning Fabric settings (precision, devices, strategy, accelerator) as an experimaestro Config for declarative experiment setup.
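
To make the idea of declarative Fabric settings concrete, here is a small sketch using a frozen dataclass. `FabricSettings` and `to_kwargs()` are hypothetical names for this example and do not reflect the actual FabricConfiguration fields. The four keys correspond to keyword arguments accepted by the Lightning Fabric constructor, and the Lightning calls are left in comments so the snippet runs without Lightning installed.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Union


@dataclass(frozen=True)
class FabricSettings:
    """Hypothetical stand-in for a declarative Fabric configuration."""

    accelerator: str = "auto"
    devices: Union[int, str] = "auto"
    strategy: str = "auto"
    precision: str = "32-true"

    def to_kwargs(self) -> Dict[str, Any]:
        # Plain dict, ready to be unpacked into the Fabric constructor.
        return asdict(self)


settings = FabricSettings(accelerator="cpu", devices=1, precision="bf16-mixed")
kwargs = settings.to_kwargs()

# With Lightning installed, the settings could be applied like this:
#   from lightning.fabric import Fabric
#   fabric = Fabric(**kwargs)
#   fabric.launch()
```

Because the settings are plain data, they can be stored with the rest of the experiment configuration and compared or hashed like any other parameter.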

How it fits together

YAML config
  → experimaestro deserializes Config objects
    → Learner task is submitted (ExportActions registered)
      → Fabric launches training
        → Trainer produces batches, Module computes forward/backward
          → TrainerContext collects losses and metrics
            → LearnerListeners validate and checkpoint
              → ModuleLoader configs point to best checkpoints
  → After completion, ExportActions execute interactively
    → TorchHFHub serializes for Hub upload or local save