Overview
xpm-torch is a PyTorch training framework built on experimaestro and Lightning Fabric. It bridges experimaestro’s configuration and experiment management system with PyTorch model training, checkpointing, and HuggingFace Hub integration.
What’s in the package
Module system
The core abstraction is Module, which combines
experimaestro.Config (declarative parameters, hashing, serialization)
with torch.nn.Module (parameters, forward pass, device management).
Module: Base class for all models. Declares parameters with Param[T], initializes structure in __initialize__(), and saves/loads weights with safetensors via save_model()/load_model(). Subclasses override loader_config() to control how the model is loaded from a checkpoint.
ModuleLoader: Lightweight task that initializes a model and loads its weights from a checkpoint directory. Produced by Module.loader_config(path).
ModuleContainer: A plain nn.Module container that auto-detects which children have state and wraps them with Lightning Fabric via setup_with_fabric().
See Module System for the full API reference.
Training
A complete training loop with checkpointing, validation, and distributed training support:
Learner: The main training task. Configures the model, optimizer(s), trainer, and validation listeners. The main loop runs in execute().
Trainer/LossTrainer: Defines how batches are produced and processed.
TrainState: Serializes epoch/step counters, model weights (safetensors), trainer state, and optimizer state via save()/load().
TrainerContext: Passed through the training loop; holds the current state, TensorBoard writer, and Fabric instance, and provides add_loss()/add_metric() for regularization and logging.
LearnerListener: Hook called after each epoch (e.g. for validation and early stopping). Produces ModuleLoader instances for the best checkpoint via init_task(). ModuleLoader carries optional settings for metadata (e.g. CheckpointSettings for epoch, ValidationSettings for validation key).
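The listener idea above (a hook called after each epoch that tracks the best validation result and can stop training early) can be illustrated with a small sketch. This is plain Python, not the xpm-torch API: BestCheckpointListener, run_training, and the patience parameter are hypothetical names chosen for illustration, and the real LearnerListener interface may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestCheckpointListener:
    """Hypothetical stand-in for a validation listener (not the xpm-torch class)."""

    patience: int  # epochs without improvement tolerated before stopping
    best_metric: float = float("-inf")
    best_epoch: Optional[int] = None
    epochs_without_improvement: int = 0

    def __call__(self, epoch: int, metric: float) -> bool:
        """Record the validation metric; return True when training should stop."""
        if metric > self.best_metric:
            # New best: a real listener would save a checkpoint at this point
            self.best_metric = metric
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def run_training(validation_scores, listener):
    """Toy loop: one validation score per epoch, listener called after each epoch."""
    for epoch, score in enumerate(validation_scores, start=1):
        # ... the training steps for this epoch would run here ...
        if listener(epoch, score):
            return epoch  # stopped early
    return len(validation_scores)


listener = BestCheckpointListener(patience=2)
last_epoch = run_training([0.61, 0.70, 0.68, 0.69, 0.75], listener)
```

In this toy run the score peaks at epoch 2, so the listener stops the loop after two further epochs without improvement and the last score is never evaluated. A loader for the best checkpoint would then point at the epoch stored in best_epoch.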
See Training for the full API reference.
Optimization
Configurable optimizers and schedulers:
ParameterOptimizer: Associates an Optimizer (e.g. Adam, AdamW, SGD, Adafactor) with a Scheduler and an optional ParameterFilter for per-group learning rates.
GradientClippingHook/GradientLogHook: Hooks for gradient management.
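To make the idea of per-group learning rates concrete, here is a minimal sketch of how name-based filters could sort parameters into groups. It assumes nothing about the actual ParameterFilter interface: build_param_groups and its arguments are hypothetical. The output uses the list-of-dicts layout ("params" plus per-group options such as "lr") that PyTorch optimizers accept, with strings standing in for tensors so the example runs without torch.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, object]],
    rules: List[Tuple[Callable[[str], bool], float]],
    default_lr: float,
) -> List[Dict[str, object]]:
    """Assign each parameter to the first rule whose filter accepts its name."""
    groups = [{"params": [], "lr": lr} for _, lr in rules]
    default_group = {"params": [], "lr": default_lr}
    for name, param in named_params:
        for (accepts, _), group in zip(rules, groups):
            if accepts(name):
                group["params"].append(param)
                break
        else:
            # No filter matched: the parameter gets the default learning rate
            default_group["params"].append(param)
    # Drop groups that ended up empty
    return [g for g in groups + [default_group] if g["params"]]


# Strings stand in for tensors; with torch these would come from
# model.named_parameters().
named = [
    ("encoder.layer0.weight", "w0"),
    ("encoder.layer0.bias", "b0"),
    ("head.weight", "w1"),
]
rules = [(lambda name: name.startswith("encoder."), 1e-5)]
groups = build_param_groups(named, rules, default_lr=1e-3)
```

Here the encoder parameters land in a group with a small learning rate while the head keeps the default one, which is a common setup when fine-tuning a pretrained encoder.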
See Optimization for the full API reference.
Export actions
After training, models can be exported to HuggingFace Hub or a local directory
via experimaestro’s action system. The Learner
automatically registers ExportAction instances
during submission, which can be executed interactively after the experiment
completes. Subclass ExportAction and override
export_action() to customize the export
behavior for your models.
See HuggingFace Hub Integration for details on actions, pushing models, and customizing the checkpoint format.
HuggingFace Hub
Utility functions for cache checking and downloading from HuggingFace Hub.
Model upload/download is handled by TorchHFHub
(which extends ExperimaestroHFHub from the experimaestro package)
to provide better integration for xpm-torch models and their loaders.
Experiment results
TrainingResults: A serializable Config holding trained model configs and TensorBoard log paths. Saved by experiments for later retrieval.
Batching
Batcher/PowerAdaptativeBatcher: Handles micro-batching with OOM recovery. Automatically reduces the batch size on RecoverableOOMError and replays the failed batch.
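The recovery behavior described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the xpm-torch implementation: process_with_oom_recovery is a hypothetical helper, RecoverableOOMError is redefined locally so the example is self-contained, and halving the micro-batch size is an assumed reduction policy.

```python
from typing import Callable, List, Sequence


class RecoverableOOMError(Exception):
    """Local stand-in for the error named above, so the sketch runs on its own."""


def process_with_oom_recovery(
    batch: Sequence[int],
    process: Callable[[Sequence[int]], None],
    micro_batch_size: int,
) -> int:
    """Process `batch` in micro-batches, halving their size on each OOM.

    Returns the micro-batch size that finally succeeded.
    """
    while True:
        try:
            for start in range(0, len(batch), micro_batch_size):
                process(batch[start:start + micro_batch_size])
            return micro_batch_size
        except RecoverableOOMError:
            if micro_batch_size == 1:
                raise  # cannot shrink any further
            # Shrink, then replay the whole batch from its first item.
            # A real trainer would also discard partially accumulated gradients.
            micro_batch_size //= 2


seen: List[List[int]] = []


def fake_step(chunk: Sequence[int]) -> None:
    """Pretend that more than 2 items at once exhaust device memory."""
    if len(chunk) > 2:
        raise RecoverableOOMError
    seen.append(list(chunk))


final_size = process_with_oom_recovery(list(range(8)), fake_step, micro_batch_size=8)
```

The first two attempts (sizes 8 and 4) fail immediately, and the batch is then replayed successfully as four micro-batches of two items each.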
Fabric configuration
FabricConfiguration: Wraps Lightning Fabric settings (precision, devices, strategy, accelerator) as an experimaestro Config for declarative experiment setup.
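As a rough picture of what declaring Fabric settings as data looks like, the sketch below uses a frozen dataclass in place of an experimaestro Config. FabricSettings and fabric_kwargs are hypothetical names, and the defaults shown are assumptions; only the four field names come from the description above, and they match keyword arguments of Lightning's Fabric constructor.

```python
from dataclasses import asdict, dataclass
from typing import Union


@dataclass(frozen=True)
class FabricSettings:
    """Hypothetical plain-Python stand-in; the real class is an experimaestro Config."""

    precision: str = "32-true"
    devices: Union[int, str] = "auto"
    strategy: str = "auto"
    accelerator: str = "auto"

    def fabric_kwargs(self) -> dict:
        """Return the settings as keyword arguments for a Fabric constructor."""
        return asdict(self)


settings = FabricSettings(precision="bf16-mixed", devices=2, strategy="ddp")
kwargs = settings.fabric_kwargs()

# With Lightning installed, the settings could then be passed on as:
#   from lightning.fabric import Fabric
#   fabric = Fabric(**kwargs)
```

Keeping these settings in a declarative object means the hardware setup can be changed in the experiment configuration without touching the training code.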
How it fits together
YAML config
→ experimaestro deserializes Config objects
→ Learner task is submitted (ExportActions registered)
→ Fabric launches training
→ Trainer produces batches, Module computes forward/backward
→ TrainerContext collects losses and metrics
→ LearnerListeners validate and checkpoint
→ ModuleLoader configs point to best checkpoints
→ After completion, ExportActions execute interactively
→ TorchHFHub serializes for Hub upload or local save