Phorm.ai
OpenAccess-AI-Collective/axolotl

    I trained a pre trained model on completion format data now I want to continue with an instruction training by using lora. How I am going to do this?

    To continue training a pre-trained model on instruction format data using LoRA (Low-Rank Adaptation), you will need to follow these steps:

    1. Load the Pre-Trained Model: First, load your pre-trained model. Ensure it is compatible with LoRA adaptation.

    2. Prepare the Instruction Format Data: Ensure your instruction format data is properly prepared and loaded. This typically involves creating a dataset that includes instructions and expected outputs.

    3. Configure LoRA: Define a LoraConfig with the desired parameters for the LoRA adaptation. This includes specifying the rank (r), the scaling factor (lora_alpha), and potentially which layers to adapt (target_modules).

    4. Apply LoRA to the Model: Use the get_peft_model function from the PEFT library to apply the LoRA configuration to your model. This step modifies your model to include LoRA parameters.

    5. Training Setup: Set up your training loop, including the optimizer, learning rate scheduler, and any other training configurations. If you're using Hugging Face's Trainer or Accelerate, ensure your model, data, and training process are compatible.

    6. Train the Model: Train your model on the instruction format data. Monitor the training process to ensure the model is learning as expected.

    Here's a simplified code example to illustrate these steps, assuming you're working within the Hugging Face ecosystem and have a compatible model:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer from peft import LoraConfig, get_peft_model from datasets import load_dataset from transformers import TrainingArguments, Trainer # Step 1: Load the pre-trained model and tokenizer model_name = "your-pretrained-model-name" model = AutoModelForSeq2SeqLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) # Step 2: Prepare your instruction format data dataset = load_dataset("your-dataset-name") # You might need to preprocess your dataset here # Step 3: Configure LoRA lora_config = LoraConfig( r=16, lora_alpha=16, target_modules=["query", "value"], # Adapt based on your model's architecture ) # Step 4: Apply LoRA to the model model, lora_config = get_peft_model(model, lora_config) # Step 5: Training setup training_args = TrainingArguments( output_dir="./results", evaluation_strategy="epoch", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=3, weight_decay=0.01, ) # Define your data collator, tokenizer, and other training components as needed # Step 6: Train the model trainer = Trainer( model=model, args=training_args, train_dataset=dataset["train"], eval_dataset=dataset["validation"], tokenizer=tokenizer, # data_collator=data_collator, ) trainer.train()

    This example is quite generic and might need adjustments based on your specific model, dataset, and training requirements. Ensure you have the PEFT library installed and properly configured in your environment.

    Search results
    [huggingface/peft] docs/source/task_guides/lora_based_methods.md

    Training

    For training, let's use the [~transformers.Trainer] class from Transformers. The [Trainer] contains a PyTorch training loop, and when you're ready, call [~transformers.Trainer.train] to start training. To customize the training run, configure the training hyperparameters in the [~transformers.TrainingArguments] class. With LoRA-like methods, you can afford to use a higher batch size and learning rate.

    [!WARNING] AdaLoRA has an [~AdaLoraModel.update_and_allocate] method that should be called at each training step to update the parameter budget and mask, otherwise the adaptation step is not performed. This requires writing a custom training loop or subclassing the [~transformers.Trainer] to incorporate this method. As an example, take a look at this custom training loop.

    from transformers import TrainingArguments, Trainer account = "stevhliu" peft_model_id = f"{account}/google/vit-base-patch16-224-in21k-lora" batch_size = 128 args = TrainingArguments( peft_model_id, remove_unused_columns=False, evaluation_strategy="epoch", save_strategy="epoch", learning_rate=5e-3, per_device_train_batch_size=batch_size, gradient_accumulation_steps=4, per_device_eval_batch_size=batch_size, fp16=True, num_train_epochs=5, logging_steps=10, load_best_model_at_end=True, label_names=["labels"], )

    Begin training with [~transformers.Trainer.train].

    trainer = Trainer( model, args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=image_processor, data_collator=collate_fn, ) trainer.train()
    [openaccess-ai-collective/axolotl] src/axolotl/train.py
    """Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
    [huggingface/peft] examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py
    def main(): accelerator = Accelerator() model_name_or_path = "bigscience/bloomz-7b1" dataset_name = "twitter_complaints" peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1) text_column = "Tweet text" label_column = "text_label" lr = 3e-3 num_epochs = 20 batch_size = 8 seed = 42 max_length = 64 do_test = False set_seed(seed) dataset = load_dataset("ought/raft", dataset_name) classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names] dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["Label"]]}, batched=True, num_proc=1, ) tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) def preprocess_function(examples): batch_size = len(examples[text_column]) inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]] targets = [str(x) for x in examples[label_column]] model_inputs = tokenizer(inputs) labels = tokenizer(targets, add_special_tokens=False) # don't add bos token because we concatenate with inputs for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id] model_inputs["input_ids"][i] = sample_input_ids + label_input_ids labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i]) for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * ( max_length - len(sample_input_ids) ) + sample_input_ids model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[ "attention_mask" ][i] labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length]) model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length]) labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length]) model_inputs["labels"] = labels["input_ids"] return model_inputs def test_preprocess_function(examples): batch_size = len(examples[text_column]) inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]] model_inputs = tokenizer(inputs) # print(model_inputs) for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * ( max_length - len(sample_input_ids) ) + sample_input_ids model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[ "attention_mask" ][i] model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length]) model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length]) return model_inputs with accelerator.main_process_first(): processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=True, desc="Running tokenizer on dataset", ) accelerator.wait_for_everyone() train_dataset = processed_datasets["train"] with accelerator.main_process_first(): processed_datasets = dataset.map( test_preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=False, desc="Running tokenizer on dataset", ) eval_dataset = processed_datasets["train"] test_dataset = processed_datasets["test"] train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader( eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) test_dataloader = DataLoader( test_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) print(next(iter(train_dataloader))) # creating model model = AutoModelForCausalLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() # optimizer optimizer = torch.optim.AdamW(model.parameters(), lr=lr) # lr scheduler lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), ) model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare( model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler ) accelerator.print(model) is_ds_zero_3 = False if getattr(accelerator.state, "deepspeed_plugin", None): is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3 for epoch in range(num_epochs): with TorchTracemalloc() as tracemalloc: model.train() total_loss = 0 for step, batch in enumerate(tqdm(train_dataloader)): outputs = model(**batch) loss = outputs.loss total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the train (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the train (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the train (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) train_epoch_loss = total_loss / len(train_dataloader) train_ppl = torch.exp(train_epoch_loss) accelerator.print(f"{epoch=}: {train_ppl=} {train_epoch_loss=}") model.eval() eval_preds = [] with TorchTracemalloc() as tracemalloc: for _, batch in enumerate(tqdm(eval_dataloader)): batch = {k: v for k, v in batch.items() if k != "labels"} with torch.no_grad(): outputs = accelerator.unwrap_model(model).generate( **batch, synced_gpus=is_ds_zero_3, max_new_tokens=10 ) # synced_gpus=True for DS-stage 3 outputs = accelerator.pad_across_processes(outputs, dim=1, pad_index=tokenizer.pad_token_id) preds = accelerator.gather_for_metrics(outputs) preds = preds[:, max_length:].detach().cpu().numpy() eval_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True)) # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the eval : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the eval (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the eval (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the eval (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the eval : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the eval (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the eval (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the eval (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) correct = 0 total = 0 assert len(eval_preds) == len( dataset["train"][label_column] ), f"{len(eval_preds)} != {len(dataset['train'][label_column])}" for pred, true in zip(eval_preds, dataset["train"][label_column]): if pred.strip() == true.strip(): correct += 1 total += 1 accuracy = correct / total * 100 accelerator.print(f"{accuracy=}") accelerator.print(f"{eval_preds[:10]=}") accelerator.print(f"{dataset['train'][label_column][:10]=}") if do_test: model.eval() test_preds = [] for _, batch in enumerate(tqdm(test_dataloader)): batch = {k: v for k, v in batch.items() if k != "labels"} with torch.no_grad(): outputs = accelerator.unwrap_model(model).generate( **batch, synced_gpus=is_ds_zero_3, max_new_tokens=10 ) # synced_gpus=True for DS-stage 3 outputs = accelerator.pad_across_processes(outputs, dim=1, pad_index=tokenizer.pad_token_id) preds = accelerator.gather(outputs) preds = preds[:, max_length:].detach().cpu().numpy() test_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True)) test_preds_cleaned = [] for _, pred in enumerate(test_preds): test_preds_cleaned.append(get_closest_label(pred, classes)) test_df = dataset["test"].to_pandas() assert len(test_preds_cleaned) == len(test_df), f"{len(test_preds_cleaned)} != {len(test_df)}" test_df[label_column] = test_preds_cleaned test_df["text_labels_orig"] = test_preds accelerator.print(test_df[[text_column, label_column]].sample(20)) pred_df = test_df[["ID", label_column]] pred_df.columns = ["ID", "Label"] os.makedirs(f"data/{dataset_name}", exist_ok=True) pred_df.to_csv(f"data/{dataset_name}/predictions.csv", index=False) accelerator.wait_for_everyone() # Option1: Pushing the model to Hugging Face Hub # model.push_to_hub( # f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}".replace("/", "_"), # token = "hf_..." # ) # token (`bool` or `str`, *optional*): # `token` is to be used for HTTP Bearer authorization when accessing remote files. If `True`, will use the token generated # when running `huggingface-cli login` (stored in `~/.huggingface`). Will default to `True` if `repo_url` # is not specified. # Or you can get your token from https://huggingface.co/settings/token # Option2: Saving the model locally peft_model_id = f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}".replace( "/", "_" ) model.save_pretrained(peft_model_id) accelerator.wait_for_everyone()
    [huggingface/peft] docs/source/task_guides/lora_based_methods.md

    Model

    Now let's load a pretrained model to use as the base model. This guide uses the google/vit-base-patch16-224-in21k model, but you can use any image classification model you want. Pass the label2id and id2label dictionaries to the model so it knows how to map the integer labels to their class labels, and you can optionally pass the ignore_mismatched_sizes=True parameter if you're finetuning a checkpoint that has already been finetuned.

    from transformers import AutoModelForImageClassification, TrainingArguments, Trainer model = AutoModelForImageClassification.from_pretrained( "google/vit-base-patch16-224-in21k", label2id=label2id, id2label=id2label, ignore_mismatched_sizes=True, )

    PEFT configuration and model

    Every PEFT method requires a configuration that holds all the parameters specifying how the PEFT method should be applied. Once the configuration is setup, pass it to the [~peft.get_peft_model] function along with the base model to create a trainable [PeftModel].

    <Tip>

    Call the [~PeftModel.print_trainable_parameters] method to compare the number of parameters of [PeftModel] versus the number of parameters in the base model!

    </Tip> <hfoptions id="loras"> <hfoption id="LoRA">

    LoRA decomposes the weight update matrix into two smaller matrices. The size of these low-rank matrices is determined by its rank or r. A higher rank means the model has more parameters to train, but it also means the model has more learning capacity. You'll also want to specify the target_modules which determine where the smaller matrices are inserted. For this guide, you'll target the query and value matrices of the attention blocks. Other important parameters to set are lora_alpha (scaling factor), bias (whether none, all or only the LoRA bias parameters should be trained), and modules_to_save (the modules apart from the LoRA layers to be trained and saved). All of these parameters - and more - are found in the [LoraConfig].

    from peft import LoraConfig, get_peft_model config = LoraConfig( r=16, lora_alpha=16, target_modules=["query", "value"], lora_dropout=0.1, bias="none", modules_to_save=["classifier"], ) model = get_peft_model(model, config) model.print_trainable_parameters() "trainable params: 667,493 || all params: 86,543,818 || trainable%: 0.7712775047664294"
    </hfoption> <hfoption id="LoHa">

    LoHa decomposes the weight update matrix into four smaller matrices and each pair of smaller matrices is combined with the Hadamard product. This allows the weight update matrix to keep the same number of trainable parameters when compared to LoRA, but with a higher rank (r^2 for LoHA when compared to 2*r for LoRA). The size of the smaller matrices is determined by its rank or r. You'll also want to specify the target_modules which determines where the smaller matrices are inserted. For this guide, you'll target the query and value matrices of the attention blocks. Other important parameters to set are alpha (scaling factor), and modules_to_save (the modules apart from the LoHa layers to be trained and saved). All of these parameters - and more - are found in the [LoHaConfig].

    from peft import LoHaConfig, get_peft_model config = LoHaConfig( r=16, alpha=16, target_modules=["query", "value"], module_dropout=0.1, modules_to_save=["classifier"], ) model = get_peft_model(model, config) model.print_trainable_parameters() "trainable params: 1,257,317 || all params: 87,133,642 || trainable%: 1.4429753779831676"
    </hfoption> <hfoption id="LoKr">

    LoKr expresses the weight update matrix as a decomposition of a Kronecker product, creating a block matrix that is able to preserve the rank of the original weight matrix. The size of the smaller matrices are determined by its rank or r. You'll also want to specify the target_modules which determines where the smaller matrices are inserted. For this guide, you'll target the query and value matrices of the attention blocks. Other important parameters to set are alpha (scaling factor), and modules_to_save (the modules apart from the LoKr layers to be trained and saved). All of these parameters - and more - are found in the [LoKrConfig].

    from peft import LoKrConfig, get_peft_model config = LoKrConfig( r=16, alpha=16, target_modules=["query", "value"], module_dropout=0.1, modules_to_save=["classifier"], ) model = get_peft_model(model, config) model.print_trainable_parameters() "trainable params: 116,069 || all params: 87,172,042 || trainable%: 0.13314934162033282"
    </hfoption> <hfoption id="AdaLoRA">

    AdaLoRA efficiently manages the LoRA parameter budget by assigning important weight matrices more parameters and pruning less important ones. In contrast, LoRA evenly distributes parameters across all modules. You can control the average desired rank or r of the matrices, and which modules to apply AdaLoRA to with target_modules. Other important parameters to set are lora_alpha (scaling factor), and modules_to_save (the modules apart from the AdaLoRA layers to be trained and saved). All of these parameters - and more - are found in the [AdaLoraConfig].

    from peft import AdaLoraConfig, get_peft_model config = AdaLoraConfig( r=8, init_r=12, tinit=200, tfinal=1000, deltaT=10, target_modules=["query", "value"], modules_to_save=["classifier"], ) model = get_peft_model(model, config) model.print_trainable_parameters() "trainable params: 520,325 || all params: 87,614,722 || trainable%: 0.5938785036606062"
    </hfoption> </hfoptions>

    Training

    For training, let's use the [~transformers.Trainer] class from Transformers. The [Trainer] contains a PyTorch training loop, and when you're ready, call [~transformers.Trainer.train] to start training. To customize the training run, configure the training hyperparameters in the [~transformers.TrainingArguments] class. With LoRA-like methods, you can afford to use a higher batch size and learning rate.

    [!WARNING] AdaLoRA has an [~AdaLoraModel.update_and_allocate] method that should be called at each training step to update the parameter budget and mask, otherwise the adaptation step is not performed. This requires writing a custom training loop or subclassing the [~transformers.Trainer] to incorporate this method. As an example, take a look at this custom training loop.

    from transformers import TrainingArguments, Trainer account = "stevhliu" peft_model_id = f"{account}/google/vit-base-patch16-224-in21k-lora" batch_size = 128 args = TrainingArguments( peft_model_id, remove_unused_columns=False, evaluation_strategy="epoch", save_strategy="epoch", learning_rate=5e-3, per_device_train_batch_size=batch_size, gradient_accumulation_steps=4, per_device_eval_batch_size=batch_size, fp16=True, num_train_epochs=5, logging_steps=10, load_best_model_at_end=True, label_names=["labels"], )

    Begin training with [~transformers.Trainer.train].

    trainer = Trainer( model, args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=image_processor, data_collator=collate_fn, ) trainer.train()
    [openaccess-ai-collective/axolotl] scripts/finetune.py
    """Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
    [huggingface/peft] examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
    def main(): accelerator = Accelerator() # model_name_or_path = "bigscience/T0_3B" model_name_or_path = "facebook/bart-large" dataset_name = "twitter_complaints" peft_config = LoraConfig( task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1 ) text_column = "Tweet text" label_column = "text_label" lr = 3e-3 num_epochs = 5 batch_size = 8 seed = 42 do_test = False set_seed(seed) dataset = load_dataset("ought/raft", dataset_name) classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names] dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["Label"]]}, batched=True, num_proc=1, ) tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes]) def preprocess_function(examples): inputs = examples[text_column] targets = examples[label_column] model_inputs = tokenizer(inputs, truncation=True) labels = tokenizer( targets, max_length=target_max_length, padding="max_length", truncation=True, return_tensors="pt" ) labels = labels["input_ids"] labels[labels == tokenizer.pad_token_id] = -100 model_inputs["labels"] = labels return model_inputs with accelerator.main_process_first(): processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=True, desc="Running tokenizer on dataset", ) accelerator.wait_for_everyone() train_dataset = processed_datasets["train"] eval_dataset = processed_datasets["train"] test_dataset = processed_datasets["test"] def collate_fn(examples): return tokenizer.pad(examples, padding="longest", return_tensors="pt") train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader(eval_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True) test_dataloader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True) # creating model model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() # optimizer optimizer = torch.optim.AdamW(model.parameters(), lr=lr) # lr scheduler lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), ) model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare( model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler ) accelerator.print(model) is_ds_zero_3 = False if getattr(accelerator.state, "deepspeed_plugin", None): is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3 for epoch in range(num_epochs): with TorchTracemalloc() as tracemalloc: model.train() total_loss = 0 for step, batch in enumerate(tqdm(train_dataloader)): outputs = model(**batch) loss = outputs.loss total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the train (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the train (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the train (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) train_epoch_loss = total_loss / len(train_dataloader) train_ppl = torch.exp(train_epoch_loss) accelerator.print(f"{epoch=}: {train_ppl=} {train_epoch_loss=}") model.eval() eval_preds = [] with TorchTracemalloc() as tracemalloc: for _, batch in enumerate(tqdm(eval_dataloader)): batch = {k: v for k, v in batch.items() if k != "labels"} with torch.no_grad(): outputs = accelerator.unwrap_model(model).generate( **batch, synced_gpus=is_ds_zero_3 ) # synced_gpus=True for DS-stage 3 outputs = accelerator.pad_across_processes(outputs, dim=1, pad_index=tokenizer.pad_token_id) preds = accelerator.gather_for_metrics(outputs).detach().cpu().numpy() eval_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True)) # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the eval : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the eval (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the eval (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the eval (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the eval : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the eval (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the eval (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the eval (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) correct = 0 total = 0 assert len(eval_preds) == len( dataset["train"][label_column] ), f"{len(eval_preds)} != {len(dataset['train'][label_column])}" for pred, true in zip(eval_preds, dataset["train"][label_column]): if pred.strip() == true.strip(): correct += 1 total += 1 accuracy = correct / total * 100 accelerator.print(f"{accuracy=}") accelerator.print(f"{eval_preds[:10]=}") accelerator.print(f"{dataset['train'][label_column][:10]=}") if do_test: model.eval() test_preds = [] for _, batch in enumerate(tqdm(test_dataloader)): batch = {k: v for k, v in batch.items() if k != "labels"} with torch.no_grad(): outputs = accelerator.unwrap_model(model).generate( **batch, synced_gpus=is_ds_zero_3 ) # synced_gpus=True for DS-stage 3 outputs = accelerator.pad_across_processes(outputs, dim=1, pad_index=tokenizer.pad_token_id) preds = accelerator.gather(outputs).detach().cpu().numpy() test_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True)) test_preds_cleaned = [] for _, pred in enumerate(test_preds): test_preds_cleaned.append(get_closest_label(pred, classes)) test_df = dataset["test"].to_pandas() assert len(test_preds_cleaned) == len(test_df), f"{len(test_preds_cleaned)} != {len(test_df)}" test_df[label_column] = test_preds_cleaned test_df["text_labels_orig"] = test_preds accelerator.print(test_df[[text_column, label_column]].sample(20)) pred_df = test_df[["ID", label_column]] pred_df.columns = ["ID", "Label"] os.makedirs(f"data/{dataset_name}", exist_ok=True) pred_df.to_csv(f"data/{dataset_name}/predictions.csv", index=False) accelerator.wait_for_everyone() # Option1: Pushing the model to Hugging Face Hub # model.push_to_hub( # f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}".replace("/", "_"), # token = "hf_..." # ) # token (`bool` or `str`, *optional*): # `token` is to be used for HTTP Bearer authorization when accessing remote files. If `True`, will use the token generated # when running `huggingface-cli login` (stored in `~/.huggingface`). Will default to `True` if `repo_url` # is not specified. # Or you can get your token from https://huggingface.co/settings/token # Option2: Saving the model locally peft_model_id = f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}".replace( "/", "_" ) model.save_pretrained(peft_model_id) accelerator.wait_for_everyone()
    [openaccess-ai-collective/axolotl] src/axolotl/cli/__init__.py
    """Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
    [openaccess-ai-collective/axolotl] examples/llama-2/lora.yml
    base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: true load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./lora-out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true s2_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/code-llama/34b/lora.yml
    base_model: codellama/CodeLlama-34b-hf model_type: LlamaForCausalLM tokenizer_type: CodeLlamaTokenizer load_in_8bit: true load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./lora-out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true s2_attention: warmup_steps: 10 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
    [huggingface/transformers] examples/pytorch/language-modeling/run_mlm_no_trainer.py
    completed_steps = 0 starting_epoch = 0 # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last checkpoint_path = path path = os.path.basename(checkpoint_path) accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) # update the progress_bar if load from checkpoint progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: # We skip the first `n` batches in the dataloader when resuming from a checkpoint active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) else: active_dataloader = train_dataloader for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) losses = torch.cat(losses) try: eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( commit_message=f"Training in progress epoch {epoch}", folder_path=args.output_dir, repo_id=repo_id, repo_type="model", token=args.hub_token, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if args.with_tracking: accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( commit_message="End of training", folder_path=args.output_dir, repo_id=repo_id, repo_type="model", token=args.hub_token, ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
    [openaccess-ai-collective/axolotl] examples/pythia/lora.yml
    base_model: EleutherAI/pythia-1.4b-deduped load_in_8bit: true datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: lora lora_model_dir: sequence_len: 512 lora_r: 16 lora_alpha: 32 lora_dropout: 0.05 lora_target_modules: - query_key_value lora_target_linear: lora_fan_in_fan_out: true # pythia/GPTNeoX lora specific wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./lora-alpaca-pythia gradient_accumulation_steps: 1 micro_batch_size: 4 num_epochs: 4 learning_rate: 0.00001 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: weight_decay: 0.1 evals_per_epoch: 4 logging_steps: 1
    [openaccess-ai-collective/axolotl] examples/code-llama/13b/lora.yml
    base_model: codellama/CodeLlama-13b-hf model_type: LlamaForCausalLM tokenizer_type: CodeLlamaTokenizer load_in_8bit: true load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./lora-out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true s2_attention: warmup_steps: 10 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
    [openaccess-ai-collective/axolotl] examples/tiny-llama/lora.yml
    base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: true load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./lora-out sequence_len: 4096 sample_packing: true eval_sample_packing: false pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 10 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/code-llama/7b/lora.yml
    base_model: codellama/CodeLlama-7b-hf model_type: LlamaForCausalLM tokenizer_type: CodeLlamaTokenizer load_in_8bit: true load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./lora-out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true s2_attention: warmup_steps: 10 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
    [openaccess-ai-collective/axolotl] docs/dataset-formats/pretraining.qmd
    ---
    title: Pre-training
    description: Data format for a pre-training completion task.
    order: 1
    ---
    
    For pretraining, there is no prompt template or roles.  The only required field is `text`:
    
    ```{.json filename="data.jsonl"}
    {"text": "first row"}
    {"text": "second row"}
    ...
    

    :::{.callout-note}

    Streaming is recommended for large datasets

    Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:

    pretraining_dataset: # hf path only
    ...
    

    :::

    [openaccess-ai-collective/axolotl] src/axolotl/utils/models.py
    def load_lora(model, cfg, inference=False, config_only=False): # type: (PreTrainedModel, DictDefault, bool, bool) -> Tuple[Optional[PreTrainedModel], Optional[PeftConfig]] from peft import LoraConfig, get_peft_model lora_target_modules = list(cfg.lora_target_modules or []) if cfg.lora_target_linear: linear_names = find_all_linear_names(model) LOG.info(f"found linear modules: {repr(linear_names)}") lora_target_modules = list(set(lora_target_modules + linear_names)) lora_config_kwargs = {} loftq_bits = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits if loftq_bits: lora_config_kwargs["loftq_config"] = LoftQConfig(loftq_bits=loftq_bits) lora_config_kwargs["init_lora_weights"] = "loftq" if cfg.peft_use_dora: lora_config_kwargs["use_dora"] = cfg.peft_use_dora if cfg.peft_use_rslora: lora_config_kwargs["use_rslora"] = cfg.peft_use_rslora if cfg.peft_layer_replication: lora_config_kwargs["layer_replication"] = cfg.peft_layer_replication lora_config = LoraConfig( r=cfg.lora_r, lora_alpha=cfg.lora_alpha, target_modules=lora_target_modules, layers_to_transform=cfg.peft_layers_to_transform, lora_dropout=cfg.lora_dropout, fan_in_fan_out=cfg.lora_fan_in_fan_out, modules_to_save=cfg.lora_modules_to_save if cfg.lora_modules_to_save else None, bias="none", task_type="CAUSAL_LM", **lora_config_kwargs, ) if config_only: return None, lora_config rank = int(os.environ.get("LOCAL_RANK", 0)) if ( cfg.fsdp and cfg.adapter and cfg.fsdp_config.fsdp_cpu_ram_efficient_loading and rank != 0 ): setup_quantized_meta_for_peft(model) if cfg.lora_model_dir: LOG.debug("Loading pretrained PEFT - LoRA") model_kwargs: Any = {} if cfg.lora_on_cpu: model_kwargs["max_memory"] = {"cpu": "256GiB"} model_kwargs["device_map"] = {"": "cpu"} model = PeftModel.from_pretrained( model, cfg.lora_model_dir, is_trainable=(not inference), **model_kwargs, ) else: model = get_peft_model(model, lora_config) if rank == 0: try: model.print_trainable_parameters() except AttributeError as exc: LOG.warning( "Exception caught during model.print_trainable_parameters(): %s", exc ) elif ( cfg.fsdp and cfg.adapter and cfg.fsdp_config.fsdp_cpu_ram_efficient_loading and rank != 0 ): setup_quantized_peft_meta_for_training(model) return model, lora_config
    [huggingface/transformers] examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py
    data_collator = DataCollatorCTCWithPadding(processor=processor) # Initialize Trainer trainer = Trainer( model=model, data_collator=data_collator, args=training_args, compute_metrics=compute_metrics, train_dataset=vectorized_datasets["train"] if training_args.do_train else None, eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None, tokenizer=processor, ) # 8. Finally, we can start training # Training if training_args.do_train: # use last checkpoint if exist if last_checkpoint is not None: checkpoint = last_checkpoint elif os.path.isdir(model_args.model_name_or_path): checkpoint = model_args.model_name_or_path else: checkpoint = None train_result = trainer.train(resume_from_checkpoint=checkpoint) trainer.save_model() metrics = train_result.metrics max_train_samples = ( data_args.max_train_samples if data_args.max_train_samples is not None else len(vectorized_datasets["train"]) ) metrics["train_samples"] = min(max_train_samples, len(vectorized_datasets["train"])) trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() # Evaluation results = {} if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate() max_eval_samples = ( data_args.max_eval_samples if data_args.max_eval_samples is not None else len(vectorized_datasets["eval"]) ) metrics["eval_samples"] = min(max_eval_samples, len(vectorized_datasets["eval"])) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Write model card and (optionally) push to hub config_name = data_args.dataset_config_name if data_args.dataset_config_name is not None else "na" kwargs = { "finetuned_from": model_args.model_name_or_path, "tasks": "automatic-speech-recognition", "tags": ["automatic-speech-recognition", data_args.dataset_name, "mms"], "dataset_args": ( f"Config: {config_name}, Training split: {data_args.train_split_name}, Eval split:" f" {data_args.eval_split_name}" ), "dataset": f"{data_args.dataset_name.upper()} - {config_name.upper()}", } if "common_voice" in data_args.dataset_name: kwargs["language"] = config_name # make sure that adapter weights are saved seperately adapter_file = WAV2VEC2_ADAPTER_SAFE_FILE.format(data_args.target_language) adapter_file = os.path.join(training_args.output_dir, adapter_file) logger.info(f"Saving adapter weights under {adapter_file}...") safe_save_file(model._get_adapters(), adapter_file, metadata={"format": "pt"}) if training_args.push_to_hub: trainer.push_to_hub(**kwargs) else: trainer.create_model_card(**kwargs) return results
    [huggingface/transformers] examples/pytorch/language-modeling/run_clm_no_trainer.py
    # We skip the first `n` batches in the dataloader when resuming from a checkpoint active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) else: active_dataloader = train_dataloader for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) losses = torch.cat(losses) try: eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( commit_message=f"Training in progress epoch {epoch}", folder_path=args.output_dir, repo_id=repo_id, repo_type="model", token=args.hub_token, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if args.with_tracking: accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( commit_message="End of training", folder_path=args.output_dir, repo_id=repo_id, repo_type="model", token=args.hub_token, ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
    [huggingface/accelerate] examples/complete_cv_example.py
    def training_function(config, args): # Initialize accelerator if args.with_tracking: accelerator = Accelerator( cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", project_dir=args.project_dir ) else: accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision) # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs lr = config["lr"] num_epochs = int(config["num_epochs"]) seed = int(config["seed"]) batch_size = int(config["batch_size"]) image_size = config["image_size"] if not isinstance(image_size, (list, tuple)): image_size = (image_size, image_size) # Parse out whether we are saving every epoch or after a certain number of batches if hasattr(args.checkpointing_steps, "isdigit"): if args.checkpointing_steps == "epoch": checkpointing_steps = args.checkpointing_steps elif args.checkpointing_steps.isdigit(): checkpointing_steps = int(args.checkpointing_steps) else: raise ValueError( f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed." ) else: checkpointing_steps = None # We need to initialize the trackers we use, and also store our configuration if args.with_tracking: run = os.path.split(__file__)[-1].split(".")[0] accelerator.init_trackers(run, config) # Grab all the image filenames file_names = [os.path.join(args.data_dir, fname) for fname in os.listdir(args.data_dir) if fname.endswith(".jpg")] # Build the label correspondences all_labels = [extract_label(fname) for fname in file_names] id_to_label = list(set(all_labels)) id_to_label.sort() label_to_id = {lbl: i for i, lbl in enumerate(id_to_label)} # Set the seed before splitting the data. np.random.seed(seed) torch.manual_seed(seed) torch.cuda.manual_seed_all(seed) # Split our filenames between train and validation random_perm = np.random.permutation(len(file_names)) cut = int(0.8 * len(file_names)) train_split = random_perm[:cut] eval_split = random_perm[cut:] # For training we use a simple RandomResizedCrop train_tfm = Compose([RandomResizedCrop(image_size, scale=(0.5, 1.0)), ToTensor()]) train_dataset = PetsDataset( [file_names[i] for i in train_split], image_transform=train_tfm, label_to_id=label_to_id ) # For evaluation, we use a deterministic Resize eval_tfm = Compose([Resize(image_size), ToTensor()]) eval_dataset = PetsDataset([file_names[i] for i in eval_split], image_transform=eval_tfm, label_to_id=label_to_id) # Instantiate dataloaders. train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4) eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4) # Instantiate the model (we build the model here so that the seed also control new weights initialization) model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id)) # We could avoid this line since the accelerator is set with `device_placement=True` (default value). # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that). model = model.to(accelerator.device) # Freezing the base model for param in model.parameters(): param.requires_grad = False for param in model.get_classifier().parameters(): param.requires_grad = True # We normalize the batches of images to be a bit faster. mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None].to(accelerator.device) std = torch.tensor(model.default_cfg["std"])[None, :, None, None].to(accelerator.device) # Instantiate optimizer optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25) # Instantiate learning rate scheduler lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader)) # Prepare everything # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the # prepare method. model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler ) # We need to keep track of how many total steps we have iterated over overall_step = 0 # We also need to keep track of the starting epoch so files are named properly starting_epoch = 0 # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") accelerator.load_state(args.resume_from_checkpoint) path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None else: resume_step = int(training_difference.replace("step_", "")) starting_epoch = resume_step // len(train_dataloader) resume_step -= starting_epoch * len(train_dataloader) # Now we train the model for epoch in range(starting_epoch, num_epochs): model.train() if args.with_tracking: total_loss = 0 if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: # We need to skip steps until we reach the resumed step active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) overall_step += resume_step else: # After the first iteration though, we need to go back to the original dataloader active_dataloader = train_dataloader for batch in active_dataloader: # We could avoid this line since we set the accelerator with `device_placement=True`. batch = {k: v.to(accelerator.device) for k, v in batch.items()} inputs = (batch["image"] - mean) / std outputs = model(inputs) loss = torch.nn.functional.cross_entropy(outputs, batch["label"]) # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() overall_step += 1 if isinstance(checkpointing_steps, int): output_dir = f"step_{overall_step}" if overall_step % checkpointing_steps == 0: if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) model.eval() accurate = 0 num_elems = 0 for step, batch in enumerate(eval_dataloader): # We could avoid this line since we set the accelerator with `device_placement=True`. batch = {k: v.to(accelerator.device) for k, v in batch.items()} inputs = (batch["image"] - mean) / std with torch.no_grad(): outputs = model(inputs) predictions = outputs.argmax(dim=-1) predictions, references = accelerator.gather_for_metrics((predictions, batch["label"])) accurate_preds = predictions == references num_elems += accurate_preds.shape[0] accurate += accurate_preds.long().sum() eval_metric = accurate.item() / num_elems # Use accelerator.print to print only on the main process. accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}") if args.with_tracking: accelerator.log( { "accuracy": 100 * eval_metric, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, }, step=overall_step, ) if checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if args.with_tracking: accelerator.end_training()
    [huggingface/accelerate] docs/source/basic_tutorials/notebook.md

    Writing the Training Function

    Now you can build the training loop. [notebook_launcher] works by passing in a function to call that will be ran across the distributed system.

    Here is a basic training loop for the animal classification problem:

    <Tip>
    The code has been split up to allow for explanations on each section. A full version that can be copy and pasted will be available at the end
    
    </Tip>
    def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64): set_seed(seed) accelerator = Accelerator(mixed_precision=mixed_precision)

    First you should set the seed and create an [Accelerator] object as early in the training loop as possible.

    <Tip warning={true}>
    If training on the TPU, your training loop should take in the model as a parameter and it should be instantiated 
    outside of the training loop function. See the [TPU best practices](../concept_guides/training_tpu) 
    to learn why
    
    </Tip>

    Next you should build your dataloaders and create your model:

    train_dataloader, eval_dataloader = get_dataloaders(batch_size) model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))
    <Tip>
    You build the model here so that the seed also controls the new weight initialization
    
    </Tip>

    As you are performing transfer learning in this example, the encoder of the model starts out frozen so the head of the model can be trained only initially:

    for param in model.parameters(): param.requires_grad = False for param in model.get_classifier().parameters(): param.requires_grad = True

    Normalizing the batches of images will make training a little faster:

    mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None] std = torch.tensor(model.default_cfg["std"])[None, :, None, None]

    To make these constants available on the active device, you should set it to the Accelerator's device:

    mean = mean.to(accelerator.device) std = std.to(accelerator.device)

    Next instantiate the rest of the PyTorch classes used for training:

    optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25) lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader))

    Before passing everything to [~Accelerator.prepare].

    <Tip>
    There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the prepare method.
    
    </Tip>
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler )

    Now train the model:

    for epoch in range(5): model.train() for batch in train_dataloader: inputs = (batch["image"] - mean) / std outputs = model(inputs) loss = torch.nn.functional.cross_entropy(outputs, batch["label"]) accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad()

    The evaluation loop will look slightly different compared to the training loop. The number of elements passed as well as the overall total accuracy of each batch will be added to two constants:

    model.eval() accurate = 0 num_elems = 0

    Next you have the rest of your standard PyTorch loop:

    for batch in eval_dataloader: inputs = (batch["image"] - mean) / std with torch.no_grad(): outputs = model(inputs) predictions = outputs.argmax(dim=-1)

    Before finally the last major difference.

    When performing distributed evaluation, the predictions and labels need to be passed through [~Accelerator.gather] so that all of the data is available on the current device and a properly calculated metric can be achieved:

    accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"]) num_elems += accurate_preds.shape[0] accurate += accurate_preds.long().sum()

    Now you just need to calculate the actual metric for this problem, and you can print it on the main process using [~Accelerator.print]:

    eval_metric = accurate.item() / num_elems accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")

    A full version of this training loop is available below:

    def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64): set_seed(seed) # Initialize accelerator accelerator = Accelerator(mixed_precision=mixed_precision) # Build dataloaders train_dataloader, eval_dataloader = get_dataloaders(batch_size) # Instantiate the model (you build the model here so that the seed also controls new weight initaliziations) model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id)) # Freeze the base model for param in model.parameters(): param.requires_grad = False for param in model.get_classifier().parameters(): param.requires_grad = True # You can normalize the batches of images to be a bit faster mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None] std = torch.tensor(model.default_cfg["std"])[None, :, None, None] # To make these constants available on the active device, set it to the accelerator device mean = mean.to(accelerator.device) std = std.to(accelerator.device) # Instantiate the optimizer optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25) # Instantiate the learning rate scheduler lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader)) # Prepare everything # There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the # prepare method. model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler ) # Now you train the model for epoch in range(5): model.train() for batch in train_dataloader: inputs = (batch["image"] - mean) / std outputs = model(inputs) loss = torch.nn.functional.cross_entropy(outputs, batch["label"]) accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() model.eval() accurate = 0 num_elems = 0 for batch in eval_dataloader: inputs = (batch["image"] - mean) / std with torch.no_grad(): outputs = model(inputs) predictions = outputs.argmax(dim=-1) accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"]) num_elems += accurate_preds.shape[0] accurate += accurate_preds.long().sum() eval_metric = accurate.item() / num_elems # Use accelerator.print to print only on the main process. accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")
    [huggingface/accelerate] examples/by_feature/megatron_lm_gpt_pretraining.py
    continue with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps }" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss # New Code # For Megatron-LM, the losses are already averaged across the data parallel group if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses.append(loss) else: losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) try: if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses = torch.tensor(losses) else: losses = torch.cat(losses) eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) # this is causing some issue with Megatron-LM when using `wandb` at the end of the main function. # Everything works fine inspite of commenting this out. (wandb finishes/closes the run without error) # if args.with_tracking: # accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() # New Code # For Megatron-LM, we need to save the model using `accelerator.save_state` if accelerator.distributed_type == DistributedType.MEGATRON_LM: accelerator.save_state(args.output_dir) else: unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
OpenAccess-AI-Collective/axolotl
huggingface/transformers
huggingface/peft
huggingface/accelerate