Overview
========

**xpm-torch** is a PyTorch training framework built on `experimaestro `_ and
`Lightning Fabric `_. It bridges experimaestro's configuration and experiment
management system with PyTorch model training, checkpointing, and HuggingFace
Hub integration.

What's in the package
---------------------

Module system
~~~~~~~~~~~~~

The core abstraction is :class:`~xpm_torch.module.Module`, which combines
:class:`experimaestro.Config` (declarative parameters, hashing, serialization)
with ``torch.nn.Module`` (parameters, forward pass, device management).

- :class:`~xpm_torch.module.Module` — Base class for all models. Declares
  parameters with ``Param[T]``, initializes structure in
  :meth:`~xpm_torch.module.Module.__initialize__`, saves/loads weights with
  safetensors via :meth:`~xpm_torch.module.Module.save_model` /
  :meth:`~xpm_torch.module.Module.load_model`. Subclasses override
  :meth:`~xpm_torch.module.Module.loader_config` to control how the model is
  loaded from a checkpoint.
- :class:`~xpm_torch.module.ModuleLoader` — Lightweight task that initializes
  a model and loads its weights from a checkpoint directory. Produced by
  :meth:`Module.loader_config(path) `.
- :class:`~xpm_torch.module.ModuleContainer` — A plain ``nn.Module`` container
  that auto-detects which children have state and wraps them with Lightning
  Fabric via :meth:`~xpm_torch.module.ModuleContainer.setup_with_fabric`.

See :doc:`module` for the full API reference.

Training
~~~~~~~~

A complete training loop with checkpointing, validation, and distributed
training support:

- :class:`~xpm_torch.learner.Learner` — The main training task. Configures the
  model, optimizer(s), trainer, and validation listeners. The main loop runs
  in :meth:`~xpm_torch.learner.Learner.execute`.
- :class:`~xpm_torch.trainers.Trainer` /
  :class:`~xpm_torch.trainers.LossTrainer` — Defines how batches are produced
  and processed.
- :class:`~xpm_torch.trainers.context.TrainState` — Serializes epoch/step
  counters, model weights (safetensors), trainer state, and optimizer state
  via :meth:`~xpm_torch.trainers.context.TrainState.save` /
  :meth:`~xpm_torch.trainers.context.TrainState.load`.
- :class:`~xpm_torch.trainers.context.TrainerContext` — Passed through the
  training loop; holds the current state, TensorBoard writer, Fabric instance,
  and provides :meth:`~xpm_torch.trainers.context.TrainerContext.add_loss` /
  :meth:`~xpm_torch.trainers.context.TrainerContext.add_metric` for
  regularization and logging.
- :class:`~xpm_torch.learner.LearnerListener` — Hook called after each epoch
  (e.g. for validation and early stopping). Produces
  :class:`~xpm_torch.module.ModuleLoader` instances for the best checkpoint
  via :meth:`~xpm_torch.learner.LearnerListener.init_task`.
- :class:`~xpm_torch.module.ModuleLoader` carries optional
  :attr:`~xpm_torch.module.ModuleLoader.settings` for metadata (e.g.
  :class:`~xpm_torch.learner.CheckpointSettings` for epoch,
  :class:`~xpm_torch.validation.ValidationSettings` for validation key).

See :doc:`training` for the full API reference.

Optimization
~~~~~~~~~~~~

Configurable optimizers and schedulers:

- :class:`~xpm_torch.optim.ParameterOptimizer` — Associates an
  :class:`~xpm_torch.optim.Optimizer` (e.g. :class:`~xpm_torch.optim.Adam`,
  :class:`~xpm_torch.optim.AdamW`, :class:`~xpm_torch.optim.SGD`,
  :class:`~xpm_torch.optim.Adafactor`) with a
  :class:`~xpm_torch.schedulers.Scheduler` and optional
  :class:`~xpm_torch.optim.ParameterFilter` for per-group learning rates.
- :class:`~xpm_torch.optim.GradientClippingHook` /
  :class:`~xpm_torch.optim.GradientLogHook` — Hooks for gradient management.

See :doc:`optimization` for the full API reference.

Export actions
~~~~~~~~~~~~~~

After training, models can be exported to HuggingFace Hub or a local directory
via experimaestro's action system.
The :class:`~xpm_torch.learner.Learner` automatically registers
:class:`~xpm_torch.actions.ExportAction` instances during submission, which
can be executed interactively after the experiment completes.

Subclass :class:`~xpm_torch.actions.ExportAction` and override
:meth:`~xpm_torch.module.Module.export_action` to customize the export
behavior for your models.

See :doc:`huggingface` for details on actions, pushing models, and customizing
the checkpoint format.

HuggingFace Hub
~~~~~~~~~~~~~~~

Utility functions for cache checking and downloading from HuggingFace Hub.
Model upload/download is handled by :class:`~xpm_torch.huggingface.TorchHFHub`
(which extends ``ExperimaestroHFHub`` from the experimaestro package) to
provide better integration for xpm-torch models and their loaders.

Experiment results
~~~~~~~~~~~~~~~~~~

- :class:`~xpm_torch.results.TrainingResults` — A serializable ``Config``
  holding trained model configs and TensorBoard log paths. Saved by
  experiments for later retrieval.

Batching
~~~~~~~~

- :class:`~xpm_torch.batchers.Batcher` /
  :class:`~xpm_torch.batchers.PowerAdaptativeBatcher` — Handles micro-batching
  with OOM recovery. Automatically reduces batch size on
  ``RecoverableOOMError`` and replays the failed batch.

Fabric configuration
~~~~~~~~~~~~~~~~~~~~

- :class:`~xpm_torch.configuration.FabricConfiguration` — Wraps Lightning
  Fabric settings (precision, devices, strategy, accelerator) as an
  experimaestro ``Config`` for declarative experiment setup.
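
The OOM-recovery behavior described under *Batching* (reduce the batch size on
``RecoverableOOMError``, then replay the failed batch) can be illustrated with
a minimal, framework-free sketch. This is not the xpm-torch implementation:
the local ``RecoverableOOMError`` class, the ``process_with_oom_recovery``
and ``fake_step`` functions, and the choice to halve the batch size are all
assumptions made for illustration only.

.. code-block:: python

    from typing import Callable, List, Sequence


    class RecoverableOOMError(Exception):
        """Hypothetical stand-in for an out-of-memory error that can be retried."""


    def process_with_oom_recovery(
        items: Sequence[int],
        process: Callable[[Sequence[int]], None],
        batch_size: int,
    ) -> List[int]:
        """Process ``items`` in micro-batches and return the sizes that succeeded.

        On a recoverable OOM the batch size is halved (an assumed policy) and
        the same slice of items is tried again, so no item is skipped.
        """
        sizes: List[int] = []
        start = 0
        while start < len(items):
            chunk = items[start:start + batch_size]
            try:
                process(chunk)
            except RecoverableOOMError:
                if batch_size == 1:
                    raise  # cannot shrink further, give up
                batch_size = max(1, batch_size // 2)
                continue  # replay from the same start index
            sizes.append(len(chunk))
            start += len(chunk)
        return sizes


    # Items that were actually processed, in order.
    processed: List[int] = []


    def fake_step(chunk: Sequence[int]) -> None:
        """Toy processing step that 'runs out of memory' above 3 items."""
        if len(chunk) > 3:
            raise RecoverableOOMError(f"batch of {len(chunk)} does not fit")
        processed.extend(chunk)


    if __name__ == "__main__":
        # Starts at 8, shrinks to 4, then to 2, and keeps the reduced size.
        print(process_with_oom_recovery(list(range(10)), fake_step, 8))

In this sketch the reduced batch size is kept for all later batches. Whether
a real batcher later tries to grow the size again is a separate design choice
that the sketch does not cover.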

How it fits together
--------------------

::

    YAML config
      → experimaestro deserializes Config objects
      → Learner task is submitted (ExportActions registered)
      → Fabric launches training
      → Trainer produces batches, Module computes forward/backward
      → TrainerContext collects losses and metrics
      → LearnerListeners validate and checkpoint
      → ModuleLoader configs point to best checkpoints
      → After completion, ExportActions execute interactively
      → TorchHFHub serializes for Hub upload or local save

Related packages
----------------

- `experimaestro `_ — Configuration framework, task scheduling, workspace
  management
- `xpmir (experimaestro-IR) `_ — Information Retrieval models and experiments
  built on xpm-torch