Overview
========
**xpm-torch** is a PyTorch training framework built on
experimaestro and
Lightning Fabric. It bridges
experimaestro's configuration and experiment management system with
PyTorch model training, checkpointing, and HuggingFace Hub integration.
What's in the package
---------------------
Module system
~~~~~~~~~~~~~
The core abstraction is :class:`~xpm_torch.module.Module`, which combines
:class:`experimaestro.Config` (declarative parameters, hashing, serialization)
with ``torch.nn.Module`` (parameters, forward pass, device management).
- :class:`~xpm_torch.module.Module` — Base class for all models. Declares
parameters with ``Param[T]``, initializes structure in
:meth:`~xpm_torch.module.Module.__initialize__`, saves/loads weights with
safetensors via :meth:`~xpm_torch.module.Module.save_model` /
:meth:`~xpm_torch.module.Module.load_model`. Subclasses override
:meth:`~xpm_torch.module.Module.loader_config` to control how the model
is loaded from a checkpoint.
- :class:`~xpm_torch.module.ModuleLoader` — Lightweight task that initializes
a model and loads its weights from a checkpoint directory. Produced by
:meth:`Module.loader_config(path) <xpm_torch.module.Module.loader_config>`.
- :class:`~xpm_torch.module.ModuleContainer` — A plain ``nn.Module`` container
that auto-detects which children have state and wraps them with Lightning
Fabric via :meth:`~xpm_torch.module.ModuleContainer.setup_with_fabric`.
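
The split between declared parameters (hashed and serialized) and structure
built at initialization time can be illustrated without either library. This
is a minimal standalone sketch, not xpm-torch or experimaestro code:
``ToyConfig``, ``declared``, ``identifier`` and ``initialize`` are hypothetical
names, and hashing a JSON dump of the parameters is an assumption standing in
for experimaestro's own identifier computation.

.. code-block:: python

    import hashlib
    import json
    from dataclasses import asdict, dataclass, field


    @dataclass
    class ToyConfig:
        """Hypothetical stand-in for a config-backed model (not the xpm-torch API)."""

        hidden_size: int = 128
        num_layers: int = 2
        # Built lazily by initialize(); not part of the declared parameters.
        layers: list = field(default=None, init=False, repr=False, compare=False)

        def declared(self) -> dict:
            """Return only the declared parameters (what gets hashed)."""
            values = asdict(self)
            values.pop("layers")
            return values

        def identifier(self) -> str:
            """Hash the declared parameters; runtime state does not affect it."""
            payload = json.dumps(self.declared(), sort_keys=True)
            return hashlib.sha256(payload.encode("utf-8")).hexdigest()

        def initialize(self) -> None:
            """Build the structure from the declared parameters."""
            self.layers = [f"linear({self.hidden_size})" for _ in range(self.num_layers)]


    a = ToyConfig(hidden_size=64)
    b = ToyConfig(hidden_size=64)
    b.initialize()  # building the structure leaves the identifier unchanged
    assert a.identifier() == b.identifier()

Keeping the identifier independent of runtime state is what lets the same
configuration be recognized whether or not the model has been built.
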
See :doc:`module` for the full API reference.
Training
~~~~~~~~
A complete training loop with checkpointing, validation, and
distributed training support:
- :class:`~xpm_torch.learner.Learner` — The main training task. Configures the
model, optimizer(s), trainer, and validation listeners. The main loop runs in
:meth:`~xpm_torch.learner.Learner.execute`.
- :class:`~xpm_torch.trainers.Trainer` / :class:`~xpm_torch.trainers.LossTrainer`
— Defines how batches are produced and processed.
- :class:`~xpm_torch.trainers.context.TrainState` — Serializes epoch/step
counters, model weights (safetensors), trainer state, and optimizer state
via :meth:`~xpm_torch.trainers.context.TrainState.save` /
:meth:`~xpm_torch.trainers.context.TrainState.load`.
- :class:`~xpm_torch.trainers.context.TrainerContext` — Passed through the
training loop; holds the current state, TensorBoard writer, Fabric instance,
and provides :meth:`~xpm_torch.trainers.context.TrainerContext.add_loss` /
:meth:`~xpm_torch.trainers.context.TrainerContext.add_metric` for
regularization and logging.
- :class:`~xpm_torch.learner.LearnerListener` — Hook called after each epoch
(e.g. for validation and early stopping). Produces
:class:`~xpm_torch.module.ModuleLoader` instances for the best checkpoint
via :meth:`~xpm_torch.learner.LearnerListener.init_task`.
- :class:`~xpm_torch.module.ModuleLoader` carries optional
:attr:`~xpm_torch.module.ModuleLoader.settings` for metadata
(e.g. :class:`~xpm_torch.learner.CheckpointSettings` for epoch,
:class:`~xpm_torch.validation.ValidationSettings` for validation key).
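
The listener pattern described above (a hook called after each epoch that
tracks the best checkpoint and may stop training early) can be sketched in
plain Python. Everything here is hypothetical and independent of the xpm-torch
API: ``BestCheckpointListener``, ``on_epoch_end``, ``patience`` and ``run`` are
illustrative names, and the patience-based stopping rule is an assumption.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional, Sequence


    @dataclass
    class BestCheckpointListener:
        """Hypothetical listener: remembers the best epoch, stops after `patience`."""

        patience: int = 2
        best_score: Optional[float] = None
        best_epoch: Optional[int] = None

        def on_epoch_end(self, epoch: int, score: float) -> bool:
            """Record the validation score; return True when training should stop."""
            if self.best_score is None or score > self.best_score:
                self.best_score = score
                self.best_epoch = epoch
                return False
            return epoch - self.best_epoch >= self.patience


    def run(scores: Sequence[float], listener: BestCheckpointListener) -> int:
        """Toy loop: one validation score per epoch; returns the last epoch run."""
        for epoch, score in enumerate(scores, start=1):
            if listener.on_epoch_end(epoch, score):
                return epoch
        return len(scores)

In a real setup the listener would also save the weights whenever the score
improves, so that a loader can later point at that checkpoint.
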
See :doc:`training` for the full API reference.
Optimization
~~~~~~~~~~~~
Configurable optimizers and schedulers:
- :class:`~xpm_torch.optim.ParameterOptimizer` — Associates an
:class:`~xpm_torch.optim.Optimizer` (e.g.
:class:`~xpm_torch.optim.Adam`,
:class:`~xpm_torch.optim.AdamW`,
:class:`~xpm_torch.optim.SGD`,
:class:`~xpm_torch.optim.Adafactor`) with a
:class:`~xpm_torch.schedulers.Scheduler` and optional
:class:`~xpm_torch.optim.ParameterFilter` for per-group learning rates.
- :class:`~xpm_torch.optim.GradientClippingHook` /
:class:`~xpm_torch.optim.GradientLogHook` — Hooks for gradient management.
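
Per-group learning rates come down to partitioning named parameters with
filters. The sketch below is standalone and does not use the xpm-torch
classes: ``build_param_groups`` and its ``rules`` argument are hypothetical.
The returned list of dictionaries with ``params`` and ``lr`` keys follows the
per-parameter options format accepted by ``torch.optim`` optimizers.

.. code-block:: python

    from typing import Callable, Iterable, List, Tuple


    def build_param_groups(
        named_params: Iterable[Tuple[str, object]],
        rules: List[Tuple[Callable[[str], bool], float]],
        default_lr: float,
    ) -> List[dict]:
        """Assign each parameter to the first rule whose filter matches its name."""
        groups = [{"params": [], "lr": lr} for _, lr in rules]
        default_group = {"params": [], "lr": default_lr}
        for name, param in named_params:
            for (matches, _), group in zip(rules, groups):
                if matches(name):
                    group["params"].append(param)
                    break
            else:
                # No filter matched: fall back to the default learning rate.
                default_group["params"].append(param)
        # Optimizers reject empty groups, so drop them.
        return [g for g in groups + [default_group] if g["params"]]

With PyTorch installed, ``named_params`` would typically be
``model.named_parameters()`` and the result could be passed directly to an
optimizer such as ``torch.optim.AdamW``.
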
See :doc:`optimization` for the full API reference.
Export actions
~~~~~~~~~~~~~~
After training, models can be exported to HuggingFace Hub or a local directory
via experimaestro's action system. The :class:`~xpm_torch.learner.Learner`
automatically registers :class:`~xpm_torch.actions.ExportAction` instances
during submission, which can be executed interactively after the experiment
completes. Subclass :class:`~xpm_torch.actions.ExportAction` and override
:meth:`~xpm_torch.module.Module.export_action` to customize the export
behavior for your models.
See :doc:`huggingface` for details on actions, pushing models, and customizing
the checkpoint format.
HuggingFace Hub
~~~~~~~~~~~~~~~
Utility functions check the local cache and download files from the
HuggingFace Hub. Model upload and download are handled by
:class:`~xpm_torch.huggingface.TorchHFHub`, which extends
``ExperimaestroHFHub`` from the experimaestro package to better integrate
xpm-torch models and their loaders.
Experiment results
~~~~~~~~~~~~~~~~~~
- :class:`~xpm_torch.results.TrainingResults` — A serializable ``Config``
holding trained model configs and TensorBoard log paths. Saved by experiments
for later retrieval.
Batching
~~~~~~~~
- :class:`~xpm_torch.batchers.Batcher` /
:class:`~xpm_torch.batchers.PowerAdaptativeBatcher` — Handles micro-batching
with OOM recovery. Automatically reduces batch size on
``RecoverableOOMError`` and replays the failed batch.
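
The recover-and-replay behaviour can be sketched as a loop that shrinks the
batch size and retries from the same position. This is a standalone
illustration, not the xpm-torch implementation: the ``RecoverableOOMError``
class below is a local stand-in for the error named above,
``process_with_recovery`` is a hypothetical helper, and halving the batch size
is an assumption about the reduction policy.

.. code-block:: python

    class RecoverableOOMError(Exception):
        """Local stand-in for the out-of-memory error type named above."""


    def process_with_recovery(items, batch_size, process_batch):
        """Process items in micro-batches, halving the batch size after an OOM."""
        results = []
        start = 0
        while start < len(items):
            batch = items[start : start + batch_size]
            try:
                results.extend(process_batch(batch))
            except RecoverableOOMError:
                if batch_size == 1:
                    raise  # a single item does not fit: nothing left to shrink
                batch_size = max(1, batch_size // 2)
                continue  # replay the failed batch from the same start index
            start += len(batch)
        return results, batch_size

Returning the final batch size lets the caller reuse it for later calls
instead of rediscovering the limit each time.
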
Fabric configuration
~~~~~~~~~~~~~~~~~~~~
- :class:`~xpm_torch.configuration.FabricConfiguration` — Wraps Lightning
Fabric settings (precision, devices, strategy, accelerator) as an
experimaestro ``Config`` for declarative experiment setup.
How it fits together
--------------------
::

    YAML config
    → experimaestro deserializes Config objects
    → Learner task is submitted (ExportActions registered)
    → Fabric launches training
    → Trainer produces batches, Module computes forward/backward
    → TrainerContext collects losses and metrics
    → LearnerListeners validate and checkpoint
    → ModuleLoader configs point to best checkpoints
    → After completion, ExportActions execute interactively
    → TorchHFHub serializes for Hub upload or local save

Related packages
----------------
- experimaestro —
  Configuration framework, task scheduling, workspace management
- xpmir (experimaestro-IR) —
  Information Retrieval models and experiments built on xpm-torch