Phorm.ai
OpenAccess-AI-Collective/axolotl

    how do I resume from checkpoint when previously training a LoRA?

    To resume training from a checkpoint when you previously trained with LoRA (Low-Rank Adaptation), you would typically follow these steps, assuming you are using the ReLoRACallback from the OpenAccess-AI-Collective/axolotl codebase to manage checkpoints during training. The ReLoRACallback class saves and loads full-weight checkpoints at specified intervals during training.

    1. Initialize the Training Environment: Set up your model, data loaders, optimizer, and any other components required for training as you normally would.

    2. Configure ReLoRACallback: When setting up your training loop, instantiate the ReLoRACallback with the appropriate configuration and set its resume_from_checkpoint value to the path of the checkpoint you want to resume from. This path should point to the checkpoint directory that ReLoRACallback previously saved.

    3. Load the Checkpoint: You do not need to load weights manually. At the start of training, the callback's on_train_begin hook looks for a relora subfolder inside resume_from_checkpoint and, if it exists, loads the adjusted base weights from it (it logs a warning if no full-weight save is found).

    4. Start Training: Proceed with training as usual, passing the same checkpoint path to trainer.train(resume_from_checkpoint=...). The Trainer restores the optimizer, scheduler, and step counters, while the callback restores the merged base weights, so training resumes from where it left off.

    Here's a simplified example code snippet to illustrate these steps:

    from transformers import Trainer, TrainingArguments
    from axolotl.monkeypatch.relora import ReLoRACallback
    from axolotl.utils.dict import DictDefault

    # Assuming `model`, `train_dataset`, `eval_dataset`, and `tokenizer` are already defined
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir="./logs",
    )

    # Path to the checkpoint you want to resume from
    checkpoint_path = "./path/to/your/checkpoint"

    # ReLoRACallback takes the axolotl config (a DictDefault). Besides
    # resume_from_checkpoint it also reads relora_steps, relora_cpu_offload,
    # load_in_4bit/load_in_8bit, and base_model (which must be a local path),
    # so set these to match your run; the values below are placeholders.
    relora_callback = ReLoRACallback(
        DictDefault(
            {
                "resume_from_checkpoint": checkpoint_path,
                "relora_steps": 150,
                "relora_cpu_offload": False,
                "load_in_4bit": False,
                "load_in_8bit": False,
                "base_model": "./path/to/local/base_model",
            }
        )
    )

    # Initialize the Trainer with the ReLoRACallback
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        callbacks=[relora_callback],
        tokenizer=tokenizer,
    )

    # Start training; the Trainer and the callback restore state from the checkpoint
    trainer.train(resume_from_checkpoint=checkpoint_path)

    In this example, replace "./path/to/your/checkpoint" with the actual path to your checkpoint directory. The ReLoRACallback loads the merged full-weight state saved under that checkpoint's relora folder, while trainer.train() restores the rest of the training state, so make sure the resume_from_checkpoint value in the callback's configuration and in the trainer.train() call point to the same checkpoint path.

    This approach assumes you are using the Axolotl framework with Hugging Face's Transformers library for training a model with LoRA. Adjustments may be needed based on your specific setup and the version of the libraries you are using.
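
    If you are launching training through axolotl itself rather than wiring up a Trainer by hand, resuming is usually driven by the config: src/axolotl/train.py passes cfg.resume_from_checkpoint to trainer.train(), and when auto_resume_from_checkpoints is enabled it picks the newest checkpoint-* directory in output_dir automatically (see the train() snippet in the search results below). A minimal sketch of that discovery logic, assuming the standard checkpoint-<step> folder naming (the helper name here is illustrative, not part of the axolotl API):

    from pathlib import Path

    def find_latest_checkpoint(output_dir: str):
        """Return the highest-numbered checkpoint-* directory in output_dir, or None.

        Mirrors the auto-resume lookup done in axolotl's train(); this helper
        itself is hypothetical and only for illustration.
        """
        candidates = [str(cp) for cp in Path(output_dir).glob("checkpoint-*")]
        if not candidates:
            return None
        return sorted(candidates, key=lambda path: int(path.split("-")[-1]))[-1]

    # resume_from = find_latest_checkpoint("./results")
    # trainer.train(resume_from_checkpoint=resume_from)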

    Search results
    [huggingface/transformers] docs/source/en/run_scripts.md

    Resume training from checkpoint

    Another helpful option to enable is resuming training from a previous checkpoint. This will ensure you can pick up where you left off without starting over if your training gets interrupted. There are two methods to resume training from a checkpoint.

    The first method uses the --output_dir previous_output_dir argument to resume training from the latest checkpoint stored in output_dir. In this case, you should remove --overwrite_output_dir:

    python examples/pytorch/summarization/run_summarization.py \
        --model_name_or_path google-t5/t5-small \
        --do_train \
        --do_eval \
        --dataset_name cnn_dailymail \
        --dataset_config "3.0.0" \
        --source_prefix "summarize: " \
        --output_dir /tmp/tst-summarization \
        --per_device_train_batch_size=4 \
        --per_device_eval_batch_size=4 \
        --output_dir previous_output_dir \
        --predict_with_generate

    The second method uses the --resume_from_checkpoint path_to_specific_checkpoint argument to resume training from a specific checkpoint folder.

    python examples/pytorch/summarization/run_summarization.py \
        --model_name_or_path google-t5/t5-small \
        --do_train \
        --do_eval \
        --dataset_name cnn_dailymail \
        --dataset_config "3.0.0" \
        --source_prefix "summarize: " \
        --output_dir /tmp/tst-summarization \
        --per_device_train_batch_size=4 \
        --per_device_eval_batch_size=4 \
        --overwrite_output_dir \
        --resume_from_checkpoint path_to_specific_checkpoint \
        --predict_with_generate
    [huggingface/transformers] tests/trainer/test_trainer.py
    def test_resume_training_with_safe_checkpoint(self): # This test will fail for more than 2 GPUs since the batch size will get bigger and with the number of # save_steps, the checkpoint will resume training at epoch 2 or more (so the data seen by the model # won't be the same since the training dataloader is shuffled). for initial_safe in [False, True]: for loaded_safe in [False, True]: with tempfile.TemporaryDirectory() as tmpdir: trainer = get_regression_trainer( output_dir=tmpdir, train_len=128, save_steps=5, learning_rate=0.1, save_safetensors=initial_safe, ) trainer.train() (a, b) = trainer.model.a.item(), trainer.model.b.item() state = dataclasses.asdict(trainer.state) checkpoint = os.path.join(tmpdir, "checkpoint-5") self.convert_to_sharded_checkpoint(checkpoint, load_safe=initial_safe, save_safe=loaded_safe) # Reinitialize trainer trainer = get_regression_trainer( output_dir=tmpdir, train_len=128, save_steps=5, learning_rate=0.1, save_safetensors=loaded_safe ) trainer.train(resume_from_checkpoint=checkpoint) (a1, b1) = trainer.model.a.item(), trainer.model.b.item() state1 = dataclasses.asdict(trainer.state) self.assertEqual(a, a1) self.assertEqual(b, b1) self.check_trainer_state_are_the_same(state, state1)
    [huggingface/accelerate] tests/test_state_checkpointing.py
    def test_can_resume_training_with_folder(self): with tempfile.TemporaryDirectory() as tmpdir: set_seed(42) model = DummyModel() optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3) train_dataloader, valid_dataloader = dummy_dataloaders() # Train baseline accelerator = Accelerator() model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare( model, optimizer, train_dataloader, valid_dataloader ) # Save initial initial = os.path.join(tmpdir, "initial") accelerator.save_state(initial, safe_serialization=self.use_safetensors) (a, b) = model.a.item(), model.b.item() opt_state = optimizer.state_dict() ground_truth_rands = train(3, model, train_dataloader, optimizer, accelerator) (a1, b1) = model.a.item(), model.b.item() opt_state1 = optimizer.state_dict() # Train partially set_seed(42) model = DummyModel() optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3) train_dataloader, valid_dataloader = dummy_dataloaders() accelerator = Accelerator() model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare( model, optimizer, train_dataloader, valid_dataloader ) accelerator.load_state(initial) (a2, b2) = model.a.item(), model.b.item() opt_state2 = optimizer.state_dict() self.assertEqual(a, a2) self.assertEqual(b, b2) assert a == a2 assert b == b2 self.check_adam_state(opt_state, opt_state2, accelerator.distributed_type) test_rands = train(2, model, train_dataloader, optimizer, accelerator) # Save everything checkpoint = os.path.join(tmpdir, "checkpoint") accelerator.save_state(checkpoint, safe_serialization=self.use_safetensors) # Load everything back in and make sure all states work accelerator.load_state(checkpoint) test_rands += train(1, model, train_dataloader, optimizer, accelerator) (a3, b3) = model.a.item(), model.b.item() opt_state3 = optimizer.state_dict() assert a1 == a3 assert b1 == b3 self.check_adam_state(opt_state1, opt_state3, accelerator.distributed_type) assert ground_truth_rands == test_rands
    [huggingface/transformers] tests/trainer/test_trainer.py
    def test_can_resume_training(self): # This test will fail for more than 2 GPUs since the batch size will get bigger and with the number of # save_steps, the checkpoint will resume training at epoch 2 or more (so the data seen by the model # won't be the same since the training dataloader is shuffled). with tempfile.TemporaryDirectory() as tmpdir: kwargs = { "output_dir": tmpdir, "train_len": 128, "save_steps": 5, "learning_rate": 0.1, "logging_steps": 5, } trainer = get_regression_trainer(**kwargs) trainer.train() (a, b) = trainer.model.a.item(), trainer.model.b.item() state = dataclasses.asdict(trainer.state) checkpoint = os.path.join(tmpdir, "checkpoint-5") # Reinitialize trainer trainer = get_regression_trainer(**kwargs) trainer.train(resume_from_checkpoint=checkpoint) (a1, b1) = trainer.model.a.item(), trainer.model.b.item() state1 = dataclasses.asdict(trainer.state) self.assertEqual(a, a1) self.assertEqual(b, b1) self.check_trainer_state_are_the_same(state, state1) # Now check with a later checkpoint that it also works when we span over one epoch checkpoint = os.path.join(tmpdir, "checkpoint-15") # Reinitialize trainer and load model trainer = get_regression_trainer(**kwargs) trainer.train(resume_from_checkpoint=checkpoint) (a1, b1) = trainer.model.a.item(), trainer.model.b.item() state1 = dataclasses.asdict(trainer.state) self.assertEqual(a, a1) self.assertEqual(b, b1) self.check_trainer_state_are_the_same(state, state1) # With a regular model that is not a PreTrainedModel with tempfile.TemporaryDirectory() as tmpdir: kwargs = { "output_dir": tmpdir, "train_len": 128, "save_steps": 5, "learning_rate": 0.1, "pretrained": False, } trainer = get_regression_trainer(**kwargs) trainer.train() (a, b) = trainer.model.a.item(), trainer.model.b.item() state = dataclasses.asdict(trainer.state) checkpoint = os.path.join(tmpdir, "checkpoint-5") # Reinitialize trainer and load model trainer = get_regression_trainer(**kwargs) trainer.train(resume_from_checkpoint=checkpoint) (a1, b1) = trainer.model.a.item(), trainer.model.b.item() state1 = dataclasses.asdict(trainer.state) self.assertEqual(a, a1) self.assertEqual(b, b1) self.check_trainer_state_are_the_same(state, state1) # Now check with a later checkpoint that it also works when we span over one epoch checkpoint = os.path.join(tmpdir, "checkpoint-15") # Reinitialize trainer and load model trainer = get_regression_trainer(**kwargs) trainer.train(resume_from_checkpoint=checkpoint) (a1, b1) = trainer.model.a.item(), trainer.model.b.item() state1 = dataclasses.asdict(trainer.state) self.assertEqual(a, a1) self.assertEqual(b, b1) self.check_trainer_state_are_the_same(state, state1) # Now check failures # 1. fail to find a bogus checkpoint trainer = get_regression_trainer() with self.assertRaises(Exception) as context: trainer.train(resume_from_checkpoint=f"{checkpoint}-bogus") self.assertTrue("Can't find a valid checkpoint at" in str(context.exception)) # 2. fail to find any checkpoint - due a fresh output_dir output_dir2 = self.get_auto_remove_tmp_dir() trainer = get_regression_trainer(output_dir=output_dir2) with self.assertRaises(Exception) as context: trainer.train(resume_from_checkpoint=True) self.assertTrue("No valid checkpoint found in output directory" in str(context.exception))
    [huggingface/accelerate] tests/test_state_checkpointing.py
    def test_can_resume_training(self): with tempfile.TemporaryDirectory() as tmpdir: set_seed(42) model = DummyModel() optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3) train_dataloader, valid_dataloader = dummy_dataloaders() project_config = ProjectConfiguration(automatic_checkpoint_naming=True) # Train baseline accelerator = Accelerator(project_dir=tmpdir, project_config=project_config) model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare( model, optimizer, train_dataloader, valid_dataloader ) # Save initial accelerator.save_state(safe_serialization=self.use_safetensors) (a, b) = model.a.item(), model.b.item() opt_state = optimizer.state_dict() ground_truth_rands = train(3, model, train_dataloader, optimizer, accelerator) (a1, b1) = model.a.item(), model.b.item() opt_state1 = optimizer.state_dict() # Train partially set_seed(42) model = DummyModel() optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3) train_dataloader, valid_dataloader = dummy_dataloaders() project_config = ProjectConfiguration(iteration=1, automatic_checkpoint_naming=True) accelerator = Accelerator(project_dir=tmpdir, project_config=project_config) model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare( model, optimizer, train_dataloader, valid_dataloader ) accelerator.load_state(os.path.join(tmpdir, "checkpoints", "checkpoint_0")) (a2, b2) = model.a.item(), model.b.item() opt_state2 = optimizer.state_dict() assert a == a2 assert b == b2 self.check_adam_state(opt_state, opt_state2, accelerator.distributed_type) test_rands = train(2, model, train_dataloader, optimizer, accelerator) # Save everything accelerator.save_state(safe_serialization=self.use_safetensors) # Load everything back in and make sure all states work accelerator.load_state(os.path.join(tmpdir, "checkpoints", "checkpoint_1")) test_rands += train(1, model, train_dataloader, optimizer, accelerator) (a3, b3) = model.a.item(), model.b.item() opt_state3 = optimizer.state_dict() assert a1 == a3 assert b1 == b3 self.check_adam_state(opt_state1, opt_state3, accelerator.distributed_type) assert ground_truth_rands == test_rands
    [huggingface/accelerate] tests/test_state_checkpointing.py
    def test_can_resume_training_checkpoints_relative_path(self): # See #1983 # This test is like test_can_resume_training but uses a relative path for the checkpoint and automatically # infers the checkpoint path when loading. @contextmanager def temporary_relative_directory(): # This is equivalent to tempfile.TemporaryDirectory() except that it returns a relative path rand_dir = f"test_path_{uuid.uuid4()}" os.mkdir(rand_dir) try: yield rand_dir finally: shutil.rmtree(rand_dir) with temporary_relative_directory() as tmpdir: set_seed(42) model = DummyModel() optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3) train_dataloader, valid_dataloader = dummy_dataloaders() project_config = ProjectConfiguration(automatic_checkpoint_naming=True) # Train baseline accelerator = Accelerator(project_dir=tmpdir, project_config=project_config) model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare( model, optimizer, train_dataloader, valid_dataloader ) # Save initial accelerator.save_state(safe_serialization=self.use_safetensors) (a, b) = model.a.item(), model.b.item() opt_state = optimizer.state_dict() ground_truth_rands = train(3, model, train_dataloader, optimizer, accelerator) (a1, b1) = model.a.item(), model.b.item() opt_state1 = optimizer.state_dict() # Train partially set_seed(42) model = DummyModel() optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3) train_dataloader, valid_dataloader = dummy_dataloaders() project_config = ProjectConfiguration(iteration=1, automatic_checkpoint_naming=True) accelerator = Accelerator(project_dir=tmpdir, project_config=project_config) model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare( model, optimizer, train_dataloader, valid_dataloader ) accelerator.load_state() # <= infer the directory automatically (a2, b2) = model.a.item(), model.b.item() opt_state2 = optimizer.state_dict() assert a == a2 assert b == b2 self.check_adam_state(opt_state, opt_state2, accelerator.distributed_type) assert opt_state == opt_state2 test_rands = train(2, model, train_dataloader, optimizer, accelerator) # Save everything accelerator.save_state(safe_serialization=self.use_safetensors) # Load everything back in and make sure all states work accelerator.load_state(os.path.join(tmpdir, "checkpoints", "checkpoint_1")) test_rands += train(1, model, train_dataloader, optimizer, accelerator) (a3, b3) = model.a.item(), model.b.item() opt_state3 = optimizer.state_dict() assert a1 == a3 assert b1 == b3 self.check_adam_state(opt_state1, opt_state3, accelerator.distributed_type) assert ground_truth_rands == test_rands
    [openaccess-ai-collective/axolotl] src/axolotl/monkeypatch/relora.py
    def on_train_begin(
        self,
        _args: TrainingArguments,
        _state: TrainerState,
        control: TrainerControl,
        model: peft.LoraModel,
        **_kwargs,
    ):
        if self.resume_from_checkpoint:
            weight_path = os.path.join(self.resume_from_checkpoint, "relora")
            if not os.path.exists(weight_path):
                LOG.warning(
                    "Resuming ReLoRA from checkpoint, but no full-weight save found"
                )
            else:
                LOG.info(f"Loading adjusted base weights from {weight_path}")
                load_weight_checkpoint(model, weight_path)
        return control
    [openaccess-ai-collective/axolotl] src/axolotl/monkeypatch/relora.py
    def on_step_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        model: peft.LoraModel,
        optimizer: torch.optim.Optimizer,
        **_kwargs,
    ):
        if state.global_step > 0 and state.global_step % self.relora_steps == 0:
            checkpoint_folder = os.path.join(
                args.output_dir,
                f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}",
                "relora",
            )

            if "adam" in args.optim.lower():
                optimizer_state_keys = ["exp_avg", "exp_avg_sq"]
            else:
                raise ValueError(f"Optimizer {args.optim} not supported with ReLoRA")

            lora_params = [
                n
                for n, p in model.named_parameters()
                if p.requires_grad and "lora_" in n
            ]

            model.save_pretrained(
                os.path.join(
                    args.output_dir,
                    f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}",
                    "adapter",
                ),
                safe_serialization=True,
            )
            with torch.no_grad():
                merge_and_save(
                    model,
                    self.last_full_model,
                    checkpoint_folder,
                    reinit=True,
                    quantized=self.quantized,
                    actually_save=is_main_process(),
                    cpu_offload=self.cpu_offload,
                )
                reset_optimizer(
                    optimizer,
                    reset_params=lora_params,
                    optimizer_state_keys=optimizer_state_keys,
                    prune_ratio=args.relora_prune_ratio,
                )

            if self.quantized:
                self.last_full_model = checkpoint_folder
            self.num_lora_restarts += 1

        return control
    [huggingface/accelerate] docs/source/usage_guides/checkpoint.md

    Checkpointing

    When training a PyTorch model with 🤗 Accelerate, you may often want to save and continue a state of training. Doing so requires saving and loading the model, optimizer, RNG generators, and the GradScaler. Inside 🤗 Accelerate are two convenience functions to achieve this quickly:

    • Use [~Accelerator.save_state] for saving everything mentioned above to a folder location
    • Use [~Accelerator.load_state] for loading everything stored from an earlier save_state

    To further customize where and how states are saved through [~Accelerator.save_state], the [~utils.ProjectConfiguration] class can be used. For example, if automatic_checkpoint_naming is enabled, each saved checkpoint will be located at Accelerator.project_dir/checkpoints/checkpoint_{checkpoint_number}.
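
    For example, a minimal sketch of enabling automatic_checkpoint_naming (model, optimizer, and dataloader are assumed to be defined already):

    from accelerate import Accelerator
    from accelerate.utils import ProjectConfiguration

    # With automatic_checkpoint_naming, save_state() numbers checkpoints for you
    # under project_dir/checkpoints/checkpoint_{n}.
    project_config = ProjectConfiguration(automatic_checkpoint_naming=True)
    accelerator = Accelerator(project_dir="my/save/path", project_config=project_config)

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    accelerator.save_state()  # -> my/save/path/checkpoints/checkpoint_0
    # ... train for a while ...
    accelerator.save_state()  # -> my/save/path/checkpoints/checkpoint_1

    accelerator.load_state("my/save/path/checkpoints/checkpoint_1")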

    It should be noted that these states are expected to come from the same training script; they should not come from two separate scripts.

    • By using [~Accelerator.register_for_checkpointing], you can register custom objects to be automatically stored or loaded from the two prior functions, so long as the object has a state_dict and a load_state_dict functionality. This could include objects such as a learning rate scheduler.

    Below is a brief example using checkpointing to save and reload a state during training:

    from accelerate import Accelerator
    import torch

    accelerator = Accelerator(project_dir="my/save/path")

    my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
    my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)

    # Register the LR scheduler
    accelerator.register_for_checkpointing(my_scheduler)

    # Save the starting state
    accelerator.save_state()

    device = accelerator.device
    my_model.to(device)

    # Perform training
    for epoch in range(num_epochs):
        for batch in my_training_dataloader:
            my_optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = my_model(inputs)
            loss = my_loss_function(outputs, targets)
            accelerator.backward(loss)
            my_optimizer.step()
            my_scheduler.step()

    # Restore the previous state
    accelerator.load_state("my/save/path/checkpointing/checkpoint_0")

    Restoring the state of the DataLoader

    After resuming from a checkpoint, it may also be desirable to resume from a particular point in the active DataLoader if the state was saved during the middle of an epoch. You can use [~Accelerator.skip_first_batches] to do so.

    from accelerate import Accelerator

    accelerator = Accelerator(project_dir="my/save/path")

    train_dataloader = accelerator.prepare(train_dataloader)
    accelerator.load_state("my_state")

    # Assume the checkpoint was saved 100 steps into the epoch
    skipped_dataloader = accelerator.skip_first_batches(train_dataloader, 100)

    # After the first iteration, go back to `train_dataloader`

    # First epoch
    for batch in skipped_dataloader:
        # Do something
        pass

    # Second epoch
    for batch in train_dataloader:
        # Do something
        pass
    [openaccess-ai-collective/axolotl] src/axolotl/monkeypatch/relora.py
    class ReLoRACallback(TrainerCallback): """Callback to merge LoRA weights into the base model and save full-weight checkpoints""" def __init__(self, cfg: DictDefault): self.relora_steps = cfg.relora_steps self.cpu_offload = cfg.relora_cpu_offload self.quantized = cfg.load_in_4bit or cfg.load_in_8bit self.last_full_model = cfg.base_model self.resume_from_checkpoint = cfg.resume_from_checkpoint if not os.path.exists(self.last_full_model): self.last_full_model = str(Path(snapshot_download(cfg.base_model))) assert os.path.exists( self.last_full_model ), "for ReLORA base_model must be a local path" self.num_lora_restarts = 0 self.need_full_save = False def on_train_begin( self, _args: TrainingArguments, _state: TrainerState, control: TrainerControl, model: peft.LoraModel, **_kwargs, ): if self.resume_from_checkpoint: weight_path = os.path.join(self.resume_from_checkpoint, "relora") if not os.path.exists(weight_path): LOG.warning( "Resuming ReLoRA from checkpoint, but no full-weight save found" ) else: LOG.info(f"Loading adjusted base weights from {weight_path}") load_weight_checkpoint(model, weight_path) return control def on_step_begin( self, args: TrainingArguments, state: TrainerState, control: TrainerControl, model: peft.LoraModel, optimizer: torch.optim.Optimizer, **_kwargs, ): if state.global_step > 0 and state.global_step % self.relora_steps == 0: checkpoint_folder = os.path.join( args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}", "relora", ) if "adam" in args.optim.lower(): optimizer_state_keys = ["exp_avg", "exp_avg_sq"] else: raise ValueError(f"Optimizer {args.optim} not supported with ReLoRA") lora_params = [ n for n, p in model.named_parameters() if p.requires_grad and "lora_" in n ] model.save_pretrained( os.path.join( args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}", "adapter", ), safe_serialization=True, ) with torch.no_grad(): merge_and_save( model, self.last_full_model, checkpoint_folder, reinit=True, quantized=self.quantized, actually_save=is_main_process(), cpu_offload=self.cpu_offload, ) reset_optimizer( optimizer, reset_params=lora_params, optimizer_state_keys=optimizer_state_keys, prune_ratio=args.relora_prune_ratio, ) if self.quantized: self.last_full_model = checkpoint_folder self.num_lora_restarts += 1 return control def on_save( self, args: TrainingArguments, state: TrainerState, control: TrainerControl, model: peft.LoraModel, **_kwargs, ): checkpoint_folder = os.path.join( args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}", "relora" ) if ( state.global_step >= self.relora_steps and state.global_step % self.relora_steps != 0 ): if self.quantized: if is_main_process() and self.last_full_model != checkpoint_folder: # ensure the latest full parameter save is in the latest checkpoint # folder, so that automatic pruning of checkpoints does not remove it LOG.info(f"moving last full parameter save to {checkpoint_folder}") os.makedirs(checkpoint_folder, exist_ok=True) chunks = glob.glob( f"{self.last_full_model}/model*.safetensors" ) + glob.glob(f"{self.last_full_model}/model*.index.json") for path in chunks: new_path = os.path.abspath(shutil.move(path, checkpoint_folder)) try: os.symlink(new_path, path) except OSError: # probably on windows without permission to symlink pass self.last_full_model = checkpoint_folder else: model.model.save_pretrained(checkpoint_folder, safe_serialization=True) return control def on_log( self, _args: TrainingArguments, _state: TrainerState, control: TrainerControl, logs: Dict[str, float], **_kwargs, 
): logs["num_lora_restarts"] = self.num_lora_restarts return control def on_train_end( self, args: TrainingArguments, _state: TrainerState, control: TrainerControl, model: peft.LoraModel, **_kwargs, ): if self.quantized: # perform final merge and save with torch.no_grad(): merge_and_save( model, self.last_full_model, args.output_dir, reinit=False, quantized=self.quantized, actually_save=is_main_process(), cpu_offload=self.cpu_offload, ) # no need to save if unquantized, as finetune.py will call merge_and_unload() return control
    [openaccess-ai-collective/axolotl] src/axolotl/monkeypatch/relora.py
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        model: peft.LoraModel,
        **_kwargs,
    ):
        checkpoint_folder = os.path.join(
            args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}", "relora"
        )
        if (
            state.global_step >= self.relora_steps
            and state.global_step % self.relora_steps != 0
        ):
            if self.quantized:
                if is_main_process() and self.last_full_model != checkpoint_folder:
                    # ensure the latest full parameter save is in the latest checkpoint
                    # folder, so that automatic pruning of checkpoints does not remove it
                    LOG.info(f"moving last full parameter save to {checkpoint_folder}")
                    os.makedirs(checkpoint_folder, exist_ok=True)
                    chunks = glob.glob(
                        f"{self.last_full_model}/model*.safetensors"
                    ) + glob.glob(f"{self.last_full_model}/model*.index.json")
                    for path in chunks:
                        new_path = os.path.abspath(shutil.move(path, checkpoint_folder))
                        try:
                            os.symlink(new_path, path)
                        except OSError:
                            # probably on windows without permission to symlink
                            pass
                    self.last_full_model = checkpoint_folder
            else:
                model.model.save_pretrained(checkpoint_folder, safe_serialization=True)

        return control
    [openaccess-ai-collective/axolotl] src/axolotl/monkeypatch/relora.py
    def on_train_end(
        self,
        args: TrainingArguments,
        _state: TrainerState,
        control: TrainerControl,
        model: peft.LoraModel,
        **_kwargs,
    ):
        if self.quantized:
            # perform final merge and save
            with torch.no_grad():
                merge_and_save(
                    model,
                    self.last_full_model,
                    args.output_dir,
                    reinit=False,
                    quantized=self.quantized,
                    actually_save=is_main_process(),
                    cpu_offload=self.cpu_offload,
                )
        # no need to save if unquantized, as finetune.py will call merge_and_unload()
        return control
    [openaccess-ai-collective/axolotl] src/axolotl/core/trainers/trl.py
    def train( self, reward_pipe, resume_from_checkpoint=None, # pylint: disable=unused-argument ): generation_kwargs = { "min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": self.tokenizer.eos_token_id, "max_new_tokens": 32, } sent_kwargs = { "return_all_scores": True, "function_to_apply": "none", "batch_size": 16, } for epoch, batch in tqdm( # pylint: disable=unused-variable enumerate(self.dataloader) ): query_tensors = batch["input_ids"] # generate model response response_tensors, ref_response_tensors = self.generate( query_tensors, return_prompt=False, generate_ref_response=True, **generation_kwargs ) batch["response"] = self.tokenizer.batch_decode(response_tensors) batch["ref_response"] = self.tokenizer.batch_decode(ref_response_tensors) # Compute sentiment score texts = [q + r for q, r in zip(batch["query"], batch["response"])] pipe_outputs = reward_pipe(texts, **sent_kwargs) rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs] ref_texts = [q + r for q, r in zip(batch["query"], batch["ref_response"])] ref_pipe_outputs = reward_pipe(ref_texts, **sent_kwargs) ref_rewards = [ torch.tensor(output[1]["score"]) for output in ref_pipe_outputs ] batch["ref_rewards"] = ref_rewards # Run PPO step stats = self.step(query_tensors, response_tensors, rewards) self.log_stats( stats, batch, rewards, columns_to_log=["query", "response", "ref_response", "ref_rewards"], )
    [huggingface/peft] examples/boft_controlnet/train_controlnet.py
    if accelerator.sync_gradients: progress_bar.update(1) if args.report_to == "wandb": accelerator.print(progress_bar) global_step += 1 step_save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}") if accelerator.is_main_process: if global_step % args.validation_steps == 0 or global_step == 1: logger.info(f"Running validation... \n Generating {args.num_validation_images} images.") logger.info("Running validation... ") with torch.no_grad(): log_validation(val_dataset, text_encoder, unet, controlnet, args, accelerator) if global_step % args.checkpointing_steps == 0: save_adaptor(accelerator, step_save_path, {"controlnet": controlnet, "unet": unet}) # save text_encoder if any if args.train_text_encoder: save_adaptor(accelerator, step_save_path, {"text_encoder": text_encoder}) accelerator.save_state(step_save_path) logger.info(f"Saved {global_step} state to {step_save_path}") logger.info(f"Saved current state to {step_save_path}") logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]} progress_bar.set_postfix(**logs) accelerator.log(logs, step=global_step) if global_step >= args.max_train_steps: break # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the train (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the train (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the train (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) # Create the pipeline using using the trained modules and save it. accelerator.wait_for_everyone() accelerator.end_training()
    [huggingface/peft] examples/lora_dreambooth/train_dreambooth.py
    # Compute prior loss prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean") # Add the prior loss to the instance loss. loss = loss + args.prior_loss_weight * prior_loss else: loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean") accelerator.backward(loss) if accelerator.sync_gradients: params_to_clip = ( itertools.chain(unet.parameters(), text_encoder.parameters()) if args.train_text_encoder else unet.parameters() ) accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) if args.report_to == "wandb": accelerator.print(progress_bar) global_step += 1 # if global_step % args.checkpointing_steps == 0: # if accelerator.is_main_process: # save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}") # accelerator.save_state(save_path) # logger.info(f"Saved state to {save_path}") logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]} progress_bar.set_postfix(**logs) accelerator.log(logs, step=global_step) if ( args.validation_prompt is not None and (step + num_update_steps_per_epoch * epoch) % args.validation_steps == 0 ): logger.info( f"Running validation... \n Generating {args.num_validation_images} images with prompt:" f" {args.validation_prompt}." ) # create pipeline pipeline = DiffusionPipeline.from_pretrained( args.pretrained_model_name_or_path, safety_checker=None, revision=args.revision, ) # set `keep_fp32_wrapper` to True because we do not want to remove # mixed precision hooks while we are still training pipeline.unet = accelerator.unwrap_model(unet, keep_fp32_wrapper=True) pipeline.text_encoder = accelerator.unwrap_model(text_encoder, keep_fp32_wrapper=True) pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) pipeline = pipeline.to(accelerator.device) pipeline.set_progress_bar_config(disable=True) # run inference if args.seed is not None: generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) else: generator = None images = [] for _ in range(args.num_validation_images): image = pipeline(args.validation_prompt, num_inference_steps=25, generator=generator).images[0] images.append(image) for tracker in accelerator.trackers: if tracker.name == "tensorboard": np_images = np.stack([np.asarray(img) for img in images]) tracker.writer.add_images("validation", np_images, epoch, dataformats="NHWC") if tracker.name == "wandb": import wandb tracker.log( { "validation": [ wandb.Image(image, caption=f"{i}: {args.validation_prompt}") for i, image in enumerate(images) ] } ) del pipeline torch.cuda.empty_cache() if global_step >= args.max_train_steps: break # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage if not args.no_tracemalloc: accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the train (end-begin): 
{tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the train (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the train (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) # Create the pipeline using using the trained modules and save it. accelerator.wait_for_everyone() if accelerator.is_main_process: if args.use_lora: unwarpped_unet = accelerator.unwrap_model(unet) unwarpped_unet.save_pretrained( os.path.join(args.output_dir, "unet"), state_dict=accelerator.get_state_dict(unet) ) if args.train_text_encoder: unwarpped_text_encoder = accelerator.unwrap_model(text_encoder) unwarpped_text_encoder.save_pretrained( os.path.join(args.output_dir, "text_encoder"), state_dict=accelerator.get_state_dict(text_encoder), ) else: pipeline = DiffusionPipeline.from_pretrained( args.pretrained_model_name_or_path, unet=accelerator.unwrap_model(unet), text_encoder=accelerator.unwrap_model(text_encoder), revision=args.revision, ) pipeline.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", run_as_future=True, ) accelerator.end_training()
    [huggingface/peft] examples/loftq_finetuning/train_gsm8k_llama.py
    completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) resume_step -= starting_epoch * len(train_dataloader) completed_steps = resume_step // args.gradient_accumulation_steps # update the progress_bar if load from checkpoint progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: # We skip the first `n` batches in the dataloader when resuming from a checkpoint active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) else: active_dataloader = train_dataloader for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) if completed_steps % 50: accelerator.print(f"Epoch: {epoch} | Step: {completed_steps} | Loss: {loss}") optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() gen_kwargs = { "max_new_tokens": args.max_target_length, "temperature": args.temperature, "top_k": args.k, "top_p": args.p, "do_sample": True, } ans_pred_list = [] ans_gold_list = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): gen_kwargs["input_ids"] = batch["input_ids"] gen_kwargs["attention_mask"] = batch["attention_mask"] generated_tokens = accelerator.unwrap_model(model).generate(**gen_kwargs) pred_tokens = generated_tokens[:, args.max_source_length :] pred_tokens = accelerator.pad_across_processes(pred_tokens, dim=1, pad_index=tokenizer.pad_token_id) gold_tokens = batch["labels"] if not args.pad_to_max_length: # If we did not pad to max length, we need to pad the labels too gold_tokens = accelerator.pad_across_processes( batch["labels"], dim=1, pad_index=tokenizer.pad_token_id ) pred_tokens, gold_tokens = accelerator.gather_for_metrics((pred_tokens, gold_tokens)) pred_tokens, gold_tokens = pred_tokens.cpu().numpy(), gold_tokens.cpu().numpy() if isinstance(pred_tokens, tuple): pred_tokens = pred_tokens[0] decoded_pred = tokenizer.batch_decode(pred_tokens, skip_special_tokens=True) decoded_gold = tokenizer.batch_decode(gold_tokens, skip_special_tokens=True) # Extract the numbers in sentences accelerator.print(decoded_pred) ans_pred_list += [extract_answer_number(sentence_pred) for sentence_pred in decoded_pred] ans_gold_list += [extract_answer_number(sentence_gold) for sentence_gold in decoded_gold] accelerator.print(ans_pred_list) accelerator.print(ans_gold_list) accuracy = compute_accuracy(ans_gold_list, ans_pred_list) logger.info(f"epoch {epoch}: accuracy: {accuracy}") if args.with_tracking: accelerator.log( { "accuracy": accuracy, "train_loss": total_loss.item() / len(train_dataloader), "epoch": 
epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if args.with_tracking: accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", )
    [huggingface/peft] docs/source/accelerate/deepspeed.md

    Train

    Run the following command to launch the training script. Earlier, you saved the configuration file to ds_zero3_cpu.yaml, so you'll need to pass the path to the launcher with the --config_file argument like this:

    accelerate launch --config_file ds_zero3_cpu.yaml examples/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

    You'll see some output logs that track memory usage during training, and once it's completed, the script returns the accuracy and compares the predictions to the labels:

    GPU Memory before entering the train : 1916 GPU Memory consumed at the end of the train (end-begin): 66 GPU Peak Memory consumed during the train (max-begin): 7488 GPU Total Peak Memory consumed during the train (max): 9404 CPU Memory before entering the train : 19411 CPU Memory consumed at the end of the train (end-begin): 0 CPU Peak Memory consumed during the train (max-begin): 0 CPU Total Peak Memory consumed during the train (max): 19411 epoch=4: train_ppl=tensor(1.0705, device='cuda:0') train_epoch_loss=tensor(0.0681, device='cuda:0') 100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:27<00:00, 3.92s/it] GPU Memory before entering the eval : 1982 GPU Memory consumed at the end of the eval (end-begin): -66 GPU Peak Memory consumed during the eval (max-begin): 672 GPU Total Peak Memory consumed during the eval (max): 2654 CPU Memory before entering the eval : 19411 CPU Memory consumed at the end of the eval (end-begin): 0 CPU Peak Memory consumed during the eval (max-begin): 0 CPU Total Peak Memory consumed during the eval (max): 19411 accuracy=100.0 eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint'] dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
    [openaccess-ai-collective/axolotl] src/axolotl/utils/callbacks/__init__.py
    def on_step_end(
        self,
        _args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **_kwargs,
    ):
        if len(state.log_history) > 0 and "loss" in state.log_history[-1]:
            if state.log_history[-1]["loss"] > self.threshold:
                self.violations += 1
                if self.violations >= self.patience:
                    LOG.warning(
                        "Loss is too high, stopping training (loss_watchdog_threshold)"
                    )
                    control.should_training_stop = True
            else:
                self.violations = 0
        return control
    [openaccess-ai-collective/axolotl] src/axolotl/train.py
    def train( *, cfg: DictDefault, cli_args: TrainerCliArgs, dataset_meta: TrainDatasetMeta ) -> Tuple[Union[PeftModel, PreTrainedModel], PreTrainedTokenizer]: # load the tokenizer first LOG.debug( f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}", main_process_only=True, ) tokenizer = load_tokenizer(cfg) train_dataset = dataset_meta.train_dataset eval_dataset = dataset_meta.eval_dataset total_num_steps = dataset_meta.total_num_steps if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints: possible_checkpoints = [ str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*") ] if len(possible_checkpoints) > 0: sorted_paths = sorted( possible_checkpoints, key=lambda path: int(path.split("-")[-1]), ) cfg.resume_from_checkpoint = sorted_paths[-1] LOG.info( f"Using Auto-resume functionality to start with checkpoint at {cfg.resume_from_checkpoint}" ) resume_from_checkpoint = cfg.resume_from_checkpoint # Load the model and tokenizer msg = "loading model" if cfg.adapter: msg += " and peft_config..." LOG.debug(msg) # we wait unitl the last possible moment to setup Accelerator Accelerator() model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference) model.generation_config.do_sample = True model_ref = None if cfg.rl and cfg.rl != "orpo": if cfg.adapter and not cfg.rl_adapter_ref_model: # use built-in trl autounwrap LOG.debug("Passing model_ref: None to RL trainer") model_ref = None # explicit setting to None else: # load the model again for model_ref/baseline model_ref, _ = load_model( cfg, tokenizer, inference=cli_args.inference, reference_model=True ) safe_serialization = cfg.save_safetensors is True if cfg.unfrozen_parameters: freeze_layers_except(model, cfg.unfrozen_parameters) trainer = setup_trainer( cfg, train_dataset, eval_dataset, (model, model_ref, peft_config), tokenizer, total_num_steps, ) # go ahead and presave, so we have the adapter config available to inspect if peft_config: LOG.info(f"Pre-saving adapter config to {cfg.output_dir}") peft_config.save_pretrained(cfg.output_dir) # additionally presave the tokenizer and model configs if not Path(cfg.output_dir).is_dir(): os.makedirs(cfg.output_dir, exist_ok=True) tokenizer.save_pretrained(str(Path(cfg.output_dir))) if hasattr(model, "config"): model.config.save_pretrained(str(Path(cfg.output_dir))) # In case we want to stop early with ctrl+c, this is a nice to have to save the pretrained model if cfg.local_rank == 0: def terminate_handler(_, __, model_weakref): if model_weakref() is not None: _model = model_weakref() if cfg.flash_optimum and BetterTransformer: _model = BetterTransformer.reverse(_model) _model.save_pretrained( cfg.output_dir, safe_serialization=safe_serialization ) sys.exit(0) _model_weakref = weakref.ref(model) signal.signal( signal.SIGINT, lambda signum, frame: terminate_handler(signum, frame, _model_weakref), ) badge_markdown = """[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)""" transformers.modelcard.AUTOGENERATED_TRAINER_COMMENT += f"\n{badge_markdown}" if getattr(cfg, "axolotl_config_path"): raw_axolotl_cfg = Path(cfg.axolotl_config_path) version = get_distribution("axolotl").version if raw_axolotl_cfg.is_file(): transformers.modelcard.AUTOGENERATED_TRAINER_COMMENT += f"\n<details><summary>See axolotl config</summary>\n\naxolotl version: 
`{version}`\n```yaml\n{raw_axolotl_cfg.read_text(encoding='utf-8')}\n```\n\n</details><br>\n" LOG.info("Starting trainer...") if cfg.group_by_length: LOG.info("hang tight... sorting dataset for group_by_length") pretrain_hooks(cfg, trainer) if cfg.flash_optimum: with torch.backends.cuda.sdp_kernel( # TODO configure these from the YAML w/ sdp_kernel_kwargs: ... enable_flash=True, enable_math=True, enable_mem_efficient=True, ): trainer.train(resume_from_checkpoint=resume_from_checkpoint) else: trainer.train(resume_from_checkpoint=resume_from_checkpoint) post_train_hooks(cfg, trainer) LOG.info(f"Training Completed!!! Saving pre-trained model to {cfg.output_dir}") # post training for name, module in model.named_modules(): if hasattr(module, "_post_training"): module._post_training(model, name) # pylint: disable=protected-access if trainer.is_fsdp_enabled: trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT") LOG.info("Set FSDP state dict type to FULL_STATE_DICT for saving.") if cfg.relora_steps: if cfg.adapter == "lora" and not (cfg.load_in_4bit or cfg.load_in_8bit): model = model.merge_and_unload() else: # final model weights have already been saved by `ReLoRACallback.on_train_end` return model, tokenizer # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file if cfg.fsdp: trainer.save_model(cfg.output_dir) elif cfg.deepspeed and is_deepspeed_zero3_enabled(): # Copied over from: https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading trainer.accelerator.wait_for_everyone() unwrapped_model = trainer.accelerator.unwrap_model(trainer.model_wrapped) # Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if # `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or # `zero3_save_16bit_model` is True in DeepSpeed Plugin. # For Zero Stages 1 and 2, models are saved as usual in the output directory. # The model name saved is `pytorch_model.bin` unwrapped_model.save_pretrained( cfg.output_dir, is_main_process=trainer.accelerator.is_main_process, save_function=trainer.accelerator.save, state_dict=trainer.accelerator.get_state_dict(trainer.model_wrapped), ) elif cfg.local_rank == 0: if cfg.flash_optimum and BetterTransformer: model = BetterTransformer.reverse(model) model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization) if not cfg.hub_model_id: try: trainer.create_model_card(model_name=cfg.output_dir.lstrip("./")) except AttributeError: pass elif cfg.hub_model_id: # defensively push to the hub to ensure the model card is updated trainer.push_to_hub() return model, tokenizer
    [openaccess-ai-collective/axolotl] src/axolotl/monkeypatch/relora.py
    def on_log(
        self,
        _args: TrainingArguments,
        _state: TrainerState,
        control: TrainerControl,
        logs: Dict[str, float],
        **_kwargs,
    ):
        logs["num_lora_restarts"] = self.num_lora_restarts
        return control
    [openaccess-ai-collective/axolotl] src/axolotl/utils/gradient_checkpointing/unsloth.py
    """Unsloth checkpointing"""
    [openaccess-ai-collective/axolotl] README.md

    Advanced Setup

    Environment

    Docker

    docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest

    Or run on the current files for development:

    docker compose up -d

    [!Tip] If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.

    <details> <summary>Docker advanced</summary>

    A more powerful Docker command to run would be this:

    docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

    It additionally:

    • Prevents memory issues when running e.g. deepspeed (e.g. you could hit SIGBUS/signal 7 error) through --ipc and --ulimit args.
    • Persists the downloaded HF data (models etc.) and your modifications to axolotl code through --mount/-v args.
    • The --name argument simply makes it easier to refer to the container in vscode (Dev Containers: Attach to Running Container...) or in your terminal.
    • The --privileged flag gives all capabilities to the container.
    • The --shm-size 10g argument increases the shared memory size. Use this if you see exitcode: -7 errors using deepspeed.

    More information on nvidia website

    </details>

    Conda/Pip venv

    1. Install python >=3.10

    2. Install pytorch stable https://pytorch.org/get-started/locally/

    3. Install Axolotl along with python dependencies

      pip3 install packaging
      pip3 install -e '.[flash-attn,deepspeed]'
    4. (Optional) Login to Huggingface to use gated models/datasets.

      huggingface-cli login

      Get the token at huggingface.co/settings/tokens

    Cloud GPU

    For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest

    Bare Metal Cloud GPU

    LambdaLabs
    <details> <summary>Click to Expand</summary>
    1. Install python
      sudo apt update
      sudo apt install -y python3.10
      sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
      sudo update-alternatives --config python # pick 3.10 if given option
      python -V # should be 3.10
    2. Install pip
      wget https://bootstrap.pypa.io/get-pip.py
      python get-pip.py
    3. Install Pytorch https://pytorch.org/get-started/locally/
    4. Follow instructions on quickstart.
    5. Run
      pip3 install protobuf==3.20.3
      pip3 install -U --ignore-installed requests Pillow psutil scipy
    6. Set path
      export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
    </details>
    GCP
    <details> <summary>Click to Expand</summary>

    Use a Deep Learning Linux OS image with CUDA and PyTorch installed. Then follow the instructions on quickstart.

    Make sure to run the below to uninstall xla.

    pip uninstall -y torch_xla[tpu]
    </details>

    Windows

    Please use WSL or Docker!

    Mac

    Use the below instead of the install method in QuickStart.

    pip3 install -e '.'
    

    More info: mac.md

    Google Colab

    Please use this example notebook.

    Launching on public clouds via SkyPilot

    To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:

    pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
    sky check

    Get the example YAMLs of using Axolotl to finetune mistralai/Mistral-7B-v0.1:

    git clone https://github.com/skypilot-org/skypilot.git
    cd skypilot/llm/axolotl
    

    Use one command to launch:

    # On-demand
    HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

    # Managed spot (auto-recovery on preemption)
    HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET

    Dataset

    Axolotl supports a variety of dataset formats. It is recommended to use a JSONL file. The schema of the JSONL depends on the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
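    For illustration, a minimal config entry pointing at a local alpaca-style JSONL file could look like the sketch below; the file path and the sample record are hypothetical, and each line of the file is a single JSON object:

      datasets:
        # hypothetical local file; one JSON object per line, e.g.
        # {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
        - path: ./data/my_dataset.jsonl
          ds_type: json # jsonl files use the json loader
          type: alpaca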

    See these docs for more information on how to use different dataset formats.

    Config

    See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:

    • model

      base_model: ./llama-7b-hf # local or huggingface repo

      Note: The code will load the right architecture.

    • dataset

      datasets:
        # huggingface repo
        - path: vicgalle/alpaca-gpt4
          type: alpaca

        # huggingface repo with specific configuration/subset
        - path: EleutherAI/pile
          name: enron_emails
          type: completion # format from earlier
          field: text # Optional[str] default: text, field to use for completion data

        # huggingface repo with multiple named configurations/subsets
        - path: bigcode/commitpackft
          name:
            - ruby
            - python
            - typescript
          type: ... # unimplemented custom format

        # fastchat conversation
        # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
        - path: ...
          type: sharegpt
          conversation: chatml # default: vicuna_v1.1

        # local
        - path: data.jsonl # or json
          ds_type: json # see other options below
          type: alpaca

        # dataset with splits, but no train split
        - path: knowrohit07/know_sql
          type: context_qa.load_v2
          train_on_split: validation

        # loading from s3 or gcs
        # s3 creds will be loaded from the system default and gcs only supports public access
        - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
          ...

        # Loading Data From a Public URL
        # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
        - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
          ds_type: json # this is the default, see other options below.
    • loading

      load_in_4bit: true
      load_in_8bit: true
      bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
      fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
      tf32: true # require >=ampere
      bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
      float16: true # use instead of fp16 when you don't want AMP

      Note: Repo does not do 4-bit quantization.

    • lora

      adapter: lora # 'qlora' or leave blank for full finetune
      lora_r: 8
      lora_alpha: 16
      lora_dropout: 0.05
      lora_target_modules:
        - q_proj
        - v_proj

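    Building on the lora block above, a minimal QLoRA-style variant could look like the sketch below; the values are illustrative and the exact set of supported keys depends on your axolotl version:

      adapter: qlora
      load_in_4bit: true
      lora_r: 32
      lora_alpha: 16
      lora_dropout: 0.05
      lora_target_linear: true # target all linear layers rather than listing modules
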
    All Config Options

    See these docs for all config options.

    Train

    Run

    accelerate launch -m axolotl.cli.train your_config.yml

    [!TIP] You can also reference a config file that is hosted on a public URL, for example accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
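    If a run is interrupted, you can resume it by pointing the config at a saved checkpoint. The sketch below assumes the resume_from_checkpoint and auto_resume_from_checkpoints config keys; the checkpoint path is hypothetical:

      resume_from_checkpoint: ./outputs/checkpoint-1000 # resume from a specific checkpoint directory
      # or, instead, let axolotl pick up the most recent checkpoint found in output_dir:
      auto_resume_from_checkpoints: true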

    Preprocess dataset

    You can optionally pre-tokenize the dataset with the following command before finetuning. This is recommended for large datasets.

    • Set dataset_prepared_path: to a local folder for saving and loading pre-tokenized dataset.
    • (Optional): Set push_dataset_to_hub: hf_user/repo to push it to Huggingface.
    • (Optional): Use --debug to see preprocessed examples.
    python -m axolotl.cli.preprocess your_config.yml
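    For example, the corresponding config entries could look like this sketch (the folder and repo names are illustrative):

      dataset_prepared_path: ./last_run_prepared # pre-tokenized data is written here and reused on later runs
      push_dataset_to_hub: hf_user/repo # optional: also push the prepared dataset to the Hugging Face Hub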

    Multi-GPU

    Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.

    DeepSpeed

    Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated

    We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.

    deepspeed: deepspeed_configs/zero1.json
    accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
    FSDP
    • llama FSDP
    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
    FSDP + QLoRA

    Axolotl supports training with FSDP and QLoRA, see these docs for more information.

    Weights & Biases Logging

    Make sure your WANDB_API_KEY environment variable is set (recommended) or you login to wandb with wandb login.

    • wandb options
    wandb_mode:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
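    As an illustration, a filled-in sketch might look like this (project, entity, and run names are placeholders; consult the config docs for the values accepted by wandb_watch and wandb_log_model):

    wandb_mode: online # or "offline" / "disabled"
    wandb_project: my-project
    wandb_entity: my-team
    wandb_watch: # leave empty to skip watching gradients/parameters
    wandb_name: llama-7b-lora-run-1
    wandb_log_model: # leave empty to skip uploading checkpoints as artifacts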
    Special Tokens

    It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:

    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    tokens: # these are delimiters
      - "<|im_start|>"
      - "<|im_end|>"

    When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.

    Inference Playground

    Axolotl allows you to load your model in an interactive terminal playground for quick experimentation. The config file is the same config file used for training.

    Pass the appropriate flag to the inference command, depending upon what kind of model was trained:

    • Pretrained LORA:
      python -m axolotl.cli.inference examples/your_config.yml --lora_model_dir="./lora-output-dir"
    • Full weights finetune:
      python -m axolotl.cli.inference examples/your_config.yml --base_model="./completed-model"
    • Full weights finetune w/ a prompt from a text file:
      cat /tmp/prompt.txt | python -m axolotl.cli.inference examples/your_config.yml \
        --base_model="./completed-model" --prompter=None --load_in_8bit=True

    With gradio hosting:

    python -m axolotl.cli.inference examples/your_config.yml --gradio

    Please use --sample_packing False if you have it turned on and receive an error similar to the one below:

    RuntimeError: stack expects each tensor to be equal size, but got [1, 32, 1, 128] at entry 0 and [1, 32, 8, 128] at entry 1

    Merge LORA to base

    The following command will merge your LORA adapter with your base model. You can optionally pass the argument --lora_model_dir to specify the directory where your LORA adapter was saved; otherwise, this will be inferred from output_dir in your axolotl config file. The merged model is saved in the sub-directory {lora_model_dir}/merged.

    python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"

    You may need to use the gpu_memory_limit and/or lora_on_cpu config options to avoid running out of memory. If you still run out of CUDA memory, you can try to merge in system RAM with

    CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora ...

    although this will be very slow; using the config options above is recommended instead.
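    For reference, the corresponding config entries might look like this sketch (the memory limit value is illustrative):

    gpu_memory_limit: 20GiB # cap the GPU memory used when loading the model for the merge
    lora_on_cpu: true # keep the LoRA weights on CPU during the merge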

    [openaccess-ai-collective/axolotl] docs/nccl.qmd
    ---
    title: NCCL
    description: Troubleshooting NCCL issues
    ---
    
    NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is configured via several [environment variables](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html). A common NCCL-related problem occurs when a long-running operation times out causing the training process to abort:
    
    ```text
    Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
    ```

    Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. Nvidia recommends disabling PCI access control services (ACS) as a possible solution if this is available to you.

    Forcing cross-GPU communication via NVLink may help without increasing timeouts. To verify that your configuration is leveraging NVLink run the following command:

    nvidia-smi nvlink --status

    To force NCCL to use NVLink, simply set this in the environment:

    export NCCL_P2P_LEVEL=NVL

    If NVLink is not available in your environment there are other options for NCCL_P2P_LEVEL in the table below:

    | NCCL_P2P_LEVEL | Description |
    | -------------- | ----------- |
    | PIX | P2P data transfers through no more than a single PCIe bridge. Faster data transfer rates than paths involving multiple bridges, but slower compared to direct GPU-to-GPU communication. |
    | PXB | P2P data transfers through multiple PCIe bridges, but not through the PCIe Host Bridge; this path involves a complex routing process, potentially incurring a moderate level of latency. |
    | PHB | P2P data transfers over PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but might introduce additional latency compared to more direct paths (e.g. PIX, NVL). |

    To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:

    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

    It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:

    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=ALL
    export TORCH_DISTRIBUTED_DEBUG=INFO
    export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log

    Finally, if you believe your training job needs more time you can increase the timeout past 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See PyTorch init_process_group for documentation on this value.
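    For example, to double the default timeout you could set the following in your config (the value is in seconds and is illustrative):

    ddp_timeout: 3600 # seconds; the default is 1800 (30 minutes)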
