Phorm.ai
OpenAccess-AI-Collective/axolotl

    how to fine-tune vision models ?

    To fine-tune vision models, you can follow a general approach that involves loading a pre-trained model, preparing your dataset, modifying the model for your specific task (if necessary), and then training the model on your dataset. Below is a step-by-step guide to fine-tune a vision model, specifically using a ResNet-50 model for a classification task as an example. This process can be adapted for other vision models and tasks.

    Step 1: Install Required Libraries

    First, ensure you have the necessary libraries installed. For this example, you'll need PyTorch, torchvision, and timm (for easily accessing pre-trained models).

    pip install torch torchvision timm

    Step 2: Prepare Your Dataset

    Prepare your dataset for training and validation. For classification tasks, your dataset should be organized into directories for each class. For example, using the Oxford-IIIT Pet Dataset:

    wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
    tar -xzf images.tar.gz

    You might need to organize this dataset into a structure that is compatible with torchvision.datasets.ImageFolder, which expects the following directory structure:

    path/to/data/
        train/
            class1/
                img1.jpg
                img2.jpg
                ...
            class2/
                img1.jpg
                img2.jpg
                ...
        val/
            class1/
                img1.jpg
                img2.jpg
                ...
            class2/
                img1.jpg
                img2.jpg
                ...
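
    One way to produce the layout above from the extracted Pet images is a small helper script along the following lines. This is only a sketch: it assumes the class name is encoded in each filename before the last underscore (e.g. "Abyssinian_1.jpg") and uses an arbitrary 90/10 train/val split.

    import random
    import shutil
    from pathlib import Path

    src = Path("images")        # folder extracted from images.tar.gz
    dst = Path("path/to/data")  # ImageFolder root used in the steps below
    random.seed(0)

    for img in src.glob("*.jpg"):
        cls = img.stem.rsplit("_", 1)[0]                    # class name parsed from the filename
        split = "train" if random.random() < 0.9 else "val"
        target = dst / split / cls
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(img, target / img.name)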
    

    Step 3: Load a Pre-trained Model

    Load a pre-trained ResNet-50 model using torchvision or timm. Modify the final layer to match the number of classes in your dataset if necessary.

    import torchvision.models as models
    import torch.nn as nn

    num_classes = 37  # For Oxford-IIIT Pet Dataset
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    Step 4: Define Data Transforms

    Define the appropriate transformations for your dataset. Commonly, you would normalize the images and apply data augmentation techniques.

    from torchvision import transforms

    data_transforms = {
        'train': transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        'val': transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }

    Step 5: Load Data

    Use torchvision.datasets.ImageFolder to load the dataset and torch.utils.data.DataLoader to create data loaders for training and validation.

    from torchvision.datasets import ImageFolder
    from torch.utils.data import DataLoader

    data_dir = 'path/to/data'
    batch_size = 32

    train_dataset = ImageFolder(root=f'{data_dir}/train', transform=data_transforms['train'])
    val_dataset = ImageFolder(root=f'{data_dir}/val', transform=data_transforms['val'])

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)

    Step 6: Train the Model

    Define the loss function and optimizer, then train the model. You can use the torch.optim package to define an optimizer. A common choice is the Adam optimizer.

    import torch.optim as optim
    import torch

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    num_epochs = 10
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
        # Validation loop can be added here
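
    The comment above mentions a validation loop without showing it. Here is a minimal sketch of what it could look like, reusing the model, criterion, device, and val_loader defined earlier; adapt the metric to your task.

    # Minimal validation pass (sketch)
    model.eval()
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs, labels).item()
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    print(f"Val loss: {val_loss/len(val_loader):.4f}, Val accuracy: {correct/total:.4f}")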

    This is a simplified example to get you started with fine-tuning vision models. Depending on your specific task, you might need to adjust the data preprocessing, model architecture, and training loop.

    Search results
    [huggingface/peft] examples/image_classification/image_classification_peft_lora.ipynb
    ## Introduction
    
    In this notebook, we will learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from šŸ¤— PEFT to fine-tune an image classification model by ONLY using **0.77%** of the original trainable parameters of the model. 
    
    LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685). 
    
    Let's get started by installing the dependencies. 
    
    __*Note that this notebook builds on top of the [official image classification example notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb).*__
    ## Install dependencies
    
    Here we're installing `peft` from source to ensure we have access to all the bleeding edge features of `peft`. 
    

    !pip install transformers accelerate evaluate datasets git+https://github.com/huggingface/peft -q

    ## Authentication
    
    We will share our fine-tuned model at the end of training. So, to do that we just authenticate using our šŸ¤— token. This token is available from [here](https://huggingface.co/settings/tokens). If you don't have a šŸ¤— account already, we highly encourage you to do so; it's free!
    

    from huggingface_hub import notebook_login

    notebook_login()

    ## Check the library versions
    

    import transformers
    import accelerate
    import peft

    print(f"Transformers version: {transformers.version}") print(f"Accelerate version: {accelerate.version}") print(f"PEFT version: {peft.version}")

    ## Select a model checkpoint to fine-tune
    

    model_checkpoint = "google/vit-base-patch16-224-in21k" # pre-trained model from which to fine-tune

    ## Load a dataset
    
    We're only loading the first 5000 instances from the training set of the [Food-101 dataset](https://huggingface.co/datasets/food101) to keep this example runtime short. 
    

    from datasets import load_dataset

    dataset = load_dataset("food101", split="train[:5000]")

    ## Prepare datasets for training and evaluation
    1. Prepare `label2id` and `id2label` dictionaries. This will come in handy when performing inference and for metadata information. 
    

    labels = dataset.features["label"].names
    label2id, id2label = dict(), dict()
    for i, label in enumerate(labels):
        label2id[label] = i
        id2label[i] = label

    id2label[2]

    2. We load the image processor of the model we're fine-tuning.
    

    from transformers import AutoImageProcessor

    image_processor = AutoImageProcessor.from_pretrained(model_checkpoint)
    image_processor

    As one might notice, the `image_processor` carries useful information, such as the size to which training and evaluation images should be resized and the stats that should be used to normalize the pixel values. 
    3. Using the image processor we prepare transformation functions for the datasets. These functions will include augmentation and pixel scaling.  
    

    from torchvision.transforms import (
        CenterCrop,
        Compose,
        Normalize,
        RandomHorizontalFlip,
        RandomResizedCrop,
        Resize,
        ToTensor,
    )

    normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
    train_transforms = Compose(
        [
            RandomResizedCrop(image_processor.size["height"]),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
        ]
    )

    val_transforms = Compose(
        [
            Resize(image_processor.size["height"]),
            CenterCrop(image_processor.size["height"]),
            ToTensor(),
            normalize,
        ]
    )

    def preprocess_train(example_batch):
        """Apply train_transforms across a batch."""
        example_batch["pixel_values"] = [train_transforms(image.convert("RGB")) for image in example_batch["image"]]
        return example_batch

    def preprocess_val(example_batch):
        """Apply val_transforms across a batch."""
        example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]]
        return example_batch

    4. We split our mini dataset into training and validation. 
    

    # split up training into training + validation

    splits = dataset.train_test_split(test_size=0.1)
    train_ds = splits["train"]
    val_ds = splits["test"]

    5. We set the transformation functions to the datasets accordingly. 
    

    train_ds.set_transform(preprocess_train)
    val_ds.set_transform(preprocess_val)

    ## Load and prepare a model 
    
    In this section, we first load the model we want to fine-tune. 
    

    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
        )

    The `get_peft_model()` method that we will use in a moment wraps the original model to be fine-tuned as a `PeftModel`. So, it's important for us to initialize the original model correctly. As such, we initialize it by specifying the `label2id` and `id2label` mappings so that `AutoModelForImageClassification` can append a classification head to the underlying model, adapted for our dataset. We can confirm this from the warning below:
    
    

    Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.weight', 'classifier.bias']

    from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

    model = AutoModelForImageClassification.from_pretrained(
        model_checkpoint,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
    )
    print_trainable_parameters(model)

    Also, take note of the number of total trainable parameters of `model`: it's 100%! We'll compare this number to that of the LoRA model.
    
    We now use the `PeftModel` to wrap `model` so that the "update" matrices are added to the respective places. 
    

    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["query", "value"],
        lora_dropout=0.1,
        bias="none",
        modules_to_save=["classifier"],
    )
    lora_model = get_peft_model(model, config)
    print_trainable_parameters(lora_model)

    Let's unpack what's going on here. 
    
    In order for LoRA to take effect, we need to specify the target modules to `LoraConfig` so that `get_peft_model()` knows which modules inside our model need to be amended with LoRA matrices. In this case, we're only interested in targeting the query and value matrices of the attention blocks of the base model. Since the parameters corresponding to these matrices are "named" `query` and `value` respectively, we specify them accordingly in the `target_modules` argument of `LoraConfig`. 
    
    We also specify `modules_to_save`. After we wrap our base model `model` with `get_peft_model()` along with the `config`, we get a new model where only the LoRA parameters (the so-called "update matrices") are trainable while the pre-trained parameters are kept frozen. The frozen parameters include those of the randomly initialized classifier too, which is NOT what we want when fine-tuning the base model on our custom dataset. To ensure that the classifier parameters are also trained, we specify `modules_to_save`. This also ensures that these modules are serialized alongside the LoRA trainable parameters when using utilities like `save_pretrained()` and `push_to_hub()`.  
    
    Regarding the other parameters:
    
    * `r`: The dimension used by the LoRA update matrices.
    * `alpha`: Scaling factor.
    * `bias`: Specifying if the `bias` parameters should be trained. `None` denotes none of the `bias` parameters will be trained. 
    
    `r` and `alpha` together control the total number of final trainable parameters when using LoRA, giving us the flexibility to balance the trade-off between end performance and compute efficiency.
    
    We can also check how many parameters we're actually training. Since we're interested in performing **parameter-efficient fine-tuning**, we should expect to see fewer trainable parameters in the `lora_model` than in the original `model`, which is indeed the case here. 
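
    As a rough sanity check on those numbers, the LoRA parameter count can be estimated by hand: each targeted weight matrix of shape d_out x d_in gains two update matrices with r * (d_in + d_out) parameters. A back-of-the-envelope sketch for a ViT-Base-like model (assuming hidden size 768, 12 layers, and query/value targeted as in the config above; the classifier head is not counted):

    hidden_size, num_layers, r = 768, 12, 16
    per_matrix = r * (hidden_size + hidden_size)   # A (r x d_in) plus B (d_out x r)
    targeted_matrices = 2 * num_layers             # query and value in every layer
    lora_params = per_matrix * targeted_matrices
    print(f"{lora_params:,} LoRA parameters")      # ~589,824, well under 1% of ViT-Base's ~86M parameters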
    ## Training arguments
    
    We will leverage [šŸ¤— Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) for fine-tuning. It accepts several arguments which we wrap using [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). 
    

    from transformers import TrainingArguments, Trainer

    model_name = model_checkpoint.split("/")[-1]
    batch_size = 128

    args = TrainingArguments(
        f"{model_name}-finetuned-lora-food101",
        remove_unused_columns=False,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=5e-3,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        num_train_epochs=5,
        logging_steps=10,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        push_to_hub=True,
        label_names=["labels"],
    )

    Some things to note here:
    
    * We're using a larger batch size since there are only a handful of parameters to train. 
    * We're using a larger learning rate than normal (a normal value would be 1e-5, for example). 
    
    All of these things are a byproduct of the fact that we're training only a small number of parameters. This can potentially also reduce the need to conduct expensive hyperparameter tuning experiments. 
    ## Prepare evaluation metric
    

    import numpy as np
    import evaluate

    metric = evaluate.load("accuracy")

    The `compute_metrics` function takes a named tuple as input: `predictions`, which are the logits of the model as NumPy arrays, and `label_ids`, which are the ground-truth labels as NumPy arrays.

    def compute_metrics(eval_pred):
        """Computes accuracy on a batch of predictions"""
        predictions = np.argmax(eval_pred.predictions, axis=1)
        return metric.compute(predictions=predictions, references=eval_pred.label_ids)

    ## Collation function
    
    This is used by `Trainer` to gather a batch of training and evaluation examples and prepare them in a format that is acceptable by the underlying model. 
    

    import torch

    def collate_fn(examples):
        pixel_values = torch.stack([example["pixel_values"] for example in examples])
        labels = torch.tensor([example["label"] for example in examples])
        return {"pixel_values": pixel_values, "labels": labels}

    ## Train and evaluate
    

    trainer = Trainer(
        lora_model,
        args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=image_processor,
        compute_metrics=compute_metrics,
        data_collator=collate_fn,
    )
    train_results = trainer.train()

    In just a few minutes, we have a fine-tuned model with 96% validation accuracy. Also, note that we used a very small subset of the training dataset which is definitely impacting the results. 
    

    trainer.evaluate(val_ds)

    ## Sharing your model and inference 
    
    Once the fine-tuning is done, we can share the LoRA parameters with the community like so: 
    

    repo_name = f"sayakpaul/{model_name}-finetuned-lora-food101"
    lora_model.push_to_hub(repo_name)

    When we call `push_to_hub()` on the `lora_model`, only the LoRA parameters along with any modules specified in `modules_to_save` are saved. If we take a look at the [trained LoRA parameters](https://huggingface.co/sayakpaul/vit-base-patch16-224-in21k-finetuned-lora-food101/blob/main/adapter_model.bin), we see that it's only **2.6 MB**! This greatly helps with portability especially when we're using a very large model to fine-tune (such as [BLOOM](https://huggingface.co/bigscience/bloom)). 
    Next, we see how to load the LoRA-updated parameters along with our base model for inference. When we wrap a base model with `PeftModel`, the modifications are done in place. To mitigate any concerns that might stem from these in-place modifications, we initialize a fresh base model just like we did earlier and construct our inference model. 
    

    from peft import PeftConfig, PeftModel

    config = PeftConfig.from_pretrained(repo_name)
    model = AutoModelForImageClassification.from_pretrained(
        config.base_model_name_or_path,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
    )

    # Load the LoRA model

    inference_model = PeftModel.from_pretrained(model, repo_name)
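
    If you prefer to avoid the adapter indirection entirely at inference time, LoRA-wrapped PEFT models can also fold the update matrices back into the base weights. This is optional and not part of the original notebook; a one-line sketch:

    # Merge the LoRA updates into the base weights and drop the PEFT wrapper (optional)
    merged_model = inference_model.merge_and_unload()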

    Don't worry about the warnings, they're harmless. 
    Let's now fetch a sample for inference.
    

    from PIL import Image
    import requests

    url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/beignets.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)
    image

    We first instantiate an `image_processor` from the underlying model repo. 
    

    image_processor = AutoImageProcessor.from_pretrained(repo_name)

    We then prepare the sample for inference.
    

    # prepare image for the model

    encoding = image_processor(image.convert("RGB"), return_tensors="pt")
    print(encoding.pixel_values.shape)

    And run inference!
    

    import torch

    # forward pass

    with torch.no_grad():
        outputs = inference_model(**encoding)
        logits = outputs.logits

    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", inference_model.config.id2label[predicted_class_idx])

    [huggingface/peft] examples/int8_training/fine_tune_blip2_int8.py
    # Let's define the LoraConfig
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
    )

    # We load our model and processor using `transformers`
    model = AutoModelForVision2Seq.from_pretrained(
        "Salesforce/blip2-opt-2.7b", quantization_config=BitsAndBytesConfig(load_in_8bit=True)
    )
    processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")

    # Get our peft model and print the number of trainable parameters
    model = get_peft_model(model, config)
    [huggingface/transformers] docs/source/en/tasks/visual_question_answering.md

    Fine-tuning ViLT

    ViLT model incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP). This model can be used for several downstream tasks. For the VQA task, a classifier head is placed on top (a linear layer on top of the final hidden state of the [CLS] token) and randomly initialized. Visual Question Answering is thus treated as a classification problem.

    More recent models, such as BLIP, BLIP-2, and InstructBLIP, treat VQA as a generative task. Later in this guide we illustrate how to use them for zero-shot VQA inference.

    Before you begin, make sure you have all the necessary libraries installed.

    pip install -q transformers datasets

    We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the šŸ¤— Hub. When prompted, enter your token to log in:

    >>> from huggingface_hub import notebook_login

    >>> notebook_login()

    Let's define the model checkpoint as a global variable.

    >>> model_checkpoint = "dandelin/vilt-b32-mlm"
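
    To make the classification framing above concrete, here is a minimal sketch of loading ViLT with a randomly initialized VQA head. The label2id/id2label mappings are assumed to have been built from the dataset's answer vocabulary beforehand.

    from transformers import ViltForQuestionAnswering

    # label2id / id2label are assumed mappings from answer strings to class indices
    model = ViltForQuestionAnswering.from_pretrained(
        model_checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )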
    [huggingface/transformers] examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md

    FAQ

    • Can a participant fine-tune models for more than one language? Yes! A participant can fine-tune models in as many languages as they like.
    • Can a participant use extra data (apart from the common voice data)? Yes! All data except the official common voice test data can be used for training. If a participant wants to train a model on a language that is not part of Common Voice (which is very much encouraged!), the participant should make sure that some test data is held out to make sure the model is not overfitting.
    • Can we fine-tune for high-resource languages? Yes! We do not really recommend fine-tuning models in English, since there are already so many fine-tuned speech recognition models in English. However, it is very much appreciated if participants want to fine-tune models in other "high-resource" languages, such as French, Spanish, or German. For such cases, one probably needs to train locally and might have to apply tricks such as lazy data loading (check the "Lazy data loading" section for more details).
    [huggingface/transformers] tests/models/align/test_modeling_align.py
    class AlignVisionModelTester: def __init__( self, parent, batch_size=12, image_size=32, num_channels=3, kernel_sizes=[3, 3, 5], in_channels=[32, 16, 24], out_channels=[16, 24, 30], hidden_dim=64, strides=[1, 1, 2], num_block_repeats=[1, 1, 2], expand_ratios=[1, 6, 6], is_training=True, hidden_act="gelu", ): self.parent = parent self.batch_size = batch_size self.image_size = image_size self.num_channels = num_channels self.kernel_sizes = kernel_sizes self.in_channels = in_channels self.out_channels = out_channels self.hidden_dim = hidden_dim self.strides = strides self.num_block_repeats = num_block_repeats self.expand_ratios = expand_ratios self.is_training = is_training self.hidden_act = hidden_act def prepare_config_and_inputs(self): pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) config = self.get_config() return config, pixel_values def get_config(self): return AlignVisionConfig( num_channels=self.num_channels, kernel_sizes=self.kernel_sizes, in_channels=self.in_channels, out_channels=self.out_channels, hidden_dim=self.hidden_dim, strides=self.strides, num_block_repeats=self.num_block_repeats, expand_ratios=self.expand_ratios, hidden_act=self.hidden_act, ) def create_and_check_model(self, config, pixel_values): model = AlignVisionModel(config=config) model.to(torch_device) model.eval() with torch.no_grad(): result = model(pixel_values) patch_size = self.image_size // 4 self.parent.assertEqual( result.last_hidden_state.shape, (self.batch_size, config.hidden_dim, patch_size, patch_size) ) self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, config.hidden_dim)) def prepare_config_and_inputs_for_common(self): config_and_inputs = self.prepare_config_and_inputs() config, pixel_values = config_and_inputs inputs_dict = {"pixel_values": pixel_values} return config, inputs_dict
    [huggingface/transformers] tests/models/blip_2/test_modeling_blip_2.py
    class Blip2VisionModelTester: def __init__( self, parent, batch_size=12, image_size=30, patch_size=2, num_channels=3, is_training=True, hidden_size=32, projection_dim=32, num_hidden_layers=2, num_attention_heads=4, intermediate_size=37, dropout=0.1, attention_dropout=0.1, initializer_range=1e-10, scope=None, ): self.parent = parent self.batch_size = batch_size self.image_size = image_size self.patch_size = patch_size self.num_channels = num_channels self.is_training = is_training self.hidden_size = hidden_size self.projection_dim = projection_dim self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads self.intermediate_size = intermediate_size self.dropout = dropout self.attention_dropout = attention_dropout self.initializer_range = initializer_range self.scope = scope # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) num_patches = (image_size // patch_size) ** 2 self.seq_length = num_patches + 1 def prepare_config_and_inputs(self): pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) config = self.get_config() return config, pixel_values def get_config(self): return Blip2VisionConfig( image_size=self.image_size, patch_size=self.patch_size, num_channels=self.num_channels, hidden_size=self.hidden_size, projection_dim=self.projection_dim, num_hidden_layers=self.num_hidden_layers, num_attention_heads=self.num_attention_heads, intermediate_size=self.intermediate_size, dropout=self.dropout, attention_dropout=self.attention_dropout, initializer_range=self.initializer_range, ) def create_and_check_model(self, config, pixel_values): model = Blip2VisionModel(config=config) model.to(torch_device) model.eval() with torch.no_grad(): result = model(pixel_values) # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) image_size = (self.image_size, self.image_size) patch_size = (self.patch_size, self.patch_size) num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) def prepare_config_and_inputs_for_common(self): config_and_inputs = self.prepare_config_and_inputs() config, pixel_values = config_and_inputs inputs_dict = {"pixel_values": pixel_values} return config, inputs_dict
    [huggingface/peft] examples/boft_controlnet/boft_controlnet.md

    Fine-tuning for controllable generation with BOFT (ControlNet)

    This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fine-tune Stable Diffusion with either stabilityai/stable-diffusion-2-1 or runwayml/stable-diffusion-v1-5 model for controllable generation.

    By using BOFT from šŸ¤— PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.

    As a member of the orthogonal finetuning class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the PEFT's GitHub repo's concept guide OFT, the original BOFT paper and the original OFT paper.

    In this guide we provide a controllable generation (ControlNet) fine-tuning script that is available in PEFT's GitHub repo examples. This implementation is adapted from diffusers's ControlNet and Hecong Wu's ControlLoRA. You can try it out and finetune on your custom images.
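
    Outside of the provided training script, the same idea can be expressed directly with PEFT's BOFTConfig. The sketch below is illustrative only: the target module names and block settings are assumptions, and `unet` stands for a diffusers UNet you have already loaded, not an object created by the script.

    from peft import BOFTConfig, get_peft_model

    config = BOFTConfig(
        boft_block_size=0,                # let the block size be derived from each layer's width
        boft_block_num=8,
        boft_n_butterfly_factor=1,
        target_modules=["to_q", "to_v"],  # assumed attention projections in the UNet
        boft_dropout=0.1,
        bias="boft_only",
    )
    peft_unet = get_peft_model(unet, config)  # `unet` is an assumed, already-loaded diffusers UNet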

    Set up your environment

    Start by cloning the PEFT repository:

    git clone https://github.com/huggingface/peft

    Navigate to the directory containing the training scripts for fine-tuning Dreambooth with BOFT:

    cd peft/examples/boft_controlnet

    Set up your environment: install PEFT, and all the required libraries. At the time of writing this guide we recommend installing PEFT from source.

    conda create --name peft python=3.10
    conda activate peft
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
    conda install xformers -c xformers
    pip install -r requirements.txt
    pip install git+https://github.com/huggingface/peft

    Data

    We use the control-celeba-hq dataset for landmark-to-face controllable generation. We also provide evaluation scripts to evaluate the controllable generation performance. This task can be used to quantitatively compare different fine-tuning techniques.

    export DATASET_NAME="oftverse/control-celeba-hq"

    Train controllable generation (ControlNet) with BOFT

    Start by setting some hyperparameters for BOFT:

    PEFT_TYPE="boft"
    BLOCK_NUM=8
    BLOCK_SIZE=0
    N_BUTTERFLY_FACTOR=0

    Here:

    Navigate to the directory containing the training scripts for fine-tuning Stable Diffusion with BOFT for controllable generation:

    ./train_controlnet.sh

    or

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export DATASET_NAME="oftverse/control-celeba-hq"
    export PROJECT_NAME="controlnet_${PEFT_TYPE}"
    export RUN_NAME="${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export CONTROLNET_PATH=""
    export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}"

    accelerate launch train_controlnet.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --resume_from_checkpoint=$RESUME_PATH \
      --controlnet_model_name_or_path=$CONTROLNET_PATH \
      --output_dir=$OUTPUT_DIR \
      --report_to="wandb" \
      --dataset_name=$DATASET_NAME \
      --resolution=512 \
      --learning_rate=1e-5 \
      --checkpointing_steps=5000 \
      --max_train_steps=50000 \
      --validation_steps=2000 \
      --num_validation_images=12 \
      --train_batch_size=4 \
      --dataloader_num_workers=2 \
      --seed="0" \
      --lr_scheduler="constant" \
      --lr_warmup_steps=0 \
      --wandb_project_name=$PROJECT_NAME \
      --wandb_run_name=$RUN_NAME \
      --enable_xformers_memory_efficient_attention \
      --use_boft \
      --boft_block_num=$BLOCK_NUM \
      --boft_block_size=$BLOCK_SIZE \
      --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \
      --boft_dropout=0.1 \
      --boft_bias="boft_only" \
      --report_to="wandb" \

    Run inference on the saved model to sample new images from the validation set:

    ./test_controlnet.sh

    or

    ITER_NUM=50000

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export RUN_NAME="${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export DATASET_NAME="oftverse/control-celeba-hq"
    export CKPT_NAME="checkpoint-${ITER_NUM}"
    export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}/${CKPT_NAME}"
    export CONTROLNET_PATH="${OUTPUT_DIR}/controlnet/model.safetensors"
    export UNET_PATH="${OUTPUT_DIR}/unet/${RUN_NAME}"
    export RESULTS_PATH="${OUTPUT_DIR}/results"

    accelerate launch test_controlnet.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --dataset_name=$DATASET_NAME \
      --controlnet_path=$CONTROLNET_PATH \
      --unet_path=$UNET_PATH \
      --adapter_name=$RUN_NAME \
      --output_dir=$RESULTS_PATH \
      --dataset_name=$DATASET_NAME \

    Run evaluation on the sampled images to evaluate the landmark reprojection error:

    ./eval.sh

    or

    ITER_NUM=50000

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export RUN_NAME="${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export DATASET_NAME="oftverse/control-celeba-hq"
    export CKPT_NAME="checkpoint-${ITER_NUM}"
    export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}/${CKPT_NAME}"
    export CONTROLNET_PATH="${OUTPUT_DIR}/controlnet/model.safetensors"
    export UNET_PATH="${OUTPUT_DIR}/unet/${RUN_NAME}"

    accelerate launch eval.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --dataset_name=$DATASET_NAME \
      --controlnet_path=$CONTROLNET_PATH \
      --unet_path=$UNET_PATH \
      --adapter_name=$RUN_NAME \
      --output_dir=$OUTPUT_DIR \
      --dataset_name=$DATASET_NAME \
      --vis_overlays \
    [huggingface/peft] examples/semantic_segmentation/semantic_segmentation_peft_lora.ipynb
                [20, 0, 255],
                [255, 255, 0],
                [0, 153, 255],
                [0, 41, 255],
                [0, 255, 204],
                [41, 0, 255],
                [41, 255, 0],
                [173, 0, 255],
                [0, 245, 255],
                [71, 0, 255],
                [122, 0, 255],
                [0, 255, 184],
                [0, 92, 255],
                [184, 255, 0],
                [0, 133, 255],
                [255, 214, 0],
                [25, 194, 194],
                [102, 255, 0],
                [92, 0, 255],
            ]
        )
    
    import matplotlib.pyplot as plt
    
    color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8)
    palette = np.array(ade_palette())
    
    for label, color in enumerate(palette):
        color_seg[pred_seg == label, :] = color
    color_seg = color_seg[..., ::-1]  # convert to BGR
    
    img = np.array(image) * 0.5 + color_seg * 0.5  # plot the image with the segmentation map
    img = img.astype(np.uint8)
    
    plt.figure(figsize=(15, 10))
    plt.imshow(img)
    plt.show()
    

    The results are definitely not as expected and as mentioned above, this example is not meant to provide a state-of-the-art model. It exists to familiarize you with the end-to-end workflow.

    On the other hand, if you perform full fine-tuning on the same setup (same model variant, same dataset, same training schedule, etc.), the results would not have been any different. This is a crucial aspect of parameter-efficient fine-tuning -- to be able to match up to the results of the full fine-tuning but with a fraction of total trainable parameters.

    Here are some things that you can try to get better results:

    • Increase the number of training samples.
    • Try a larger SegFormer model variant (learn more about the available model variants here).
    • Try different values for the arguments available in LoraConfig (see the sketch after this list).
    • Tune the learning rate and batch size.
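
    As a starting point for the LoraConfig experiments suggested above, here is an illustrative variant. None of these values come from the notebook; the target module and head names are assumptions about the SegFormer implementation.

    from peft import LoraConfig

    config = LoraConfig(
        r=32,                               # larger rank -> more trainable parameters
        lora_alpha=32,
        target_modules=["query", "value"],  # assumed attention projections in SegFormer
        lora_dropout=0.05,
        bias="lora_only",
        modules_to_save=["decode_head"],    # assumed name of the segmentation head
    )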
    [huggingface/peft] examples/boft_dreambooth/boft_dreambooth.md

    DreamBooth fine-tuning with BOFT

    This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fine-tune Dreambooth with either stabilityai/stable-diffusion-2-1 or runwayml/stable-diffusion-v1-5 model.

    By using BOFT from šŸ¤— PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.

    As a member of the orthogonal finetuning class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the PEFT's GitHub repo's concept guide OFT, the original BOFT paper and the original OFT paper.

    In this guide we provide a Dreambooth fine-tuning script that is available in PEFT's GitHub repo examples. This implementation is adapted from peft's lora_dreambooth. You can try it out and finetune on your custom images.

    Set up your environment

    Start by cloning the PEFT repository:

    git clone --recursive https://github.com/huggingface/peft

    Navigate to the directory containing the training scripts for fine-tuning Dreambooth with BOFT:

    cd peft/examples/boft_dreambooth

    Set up your environment: install PEFT, and all the required libraries. At the time of writing this guide we recommend installing PEFT from source. The following environment setup should work on A100 and H100:

    conda create --name peft python=3.10
    conda activate peft
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
    conda install xformers -c xformers
    pip install -r requirements.txt
    pip install git+https://github.com/huggingface/peft

    Download the data

    The dreambooth dataset should be automatically cloned into the following structure when running the training script.

    boft_dreambooth
    ā”œā”€ā”€ data
    ā”‚   ā”œā”€ā”€ data_dir
    ā”‚   ā””ā”€ā”€ dreambooth
    ā”‚       ā””ā”€ā”€ data
    ā”‚           ā”œā”€ā”€ backpack
    ā”‚           ā””ā”€ā”€ backpack_dog
    ā”‚           ...
    

    You can also put your custom images into boft_dreambooth/data/dreambooth.

    Finetune Dreambooth with BOFT

    ./train_dreambooth.sh

    or using the following script arguments:

    export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export INSTANCE_DIR="path-to-instance-images"
    export CLASS_DIR="path-to-class-images"
    export OUTPUT_DIR="path-to-save-model"

    Here:

    • INSTANCE_DIR: The directory containing the images that you intend to use for training your model.
    • CLASS_DIR: The directory containing class-specific images. In this example, we use prior preservation to avoid overfitting and language-drift. For prior preservation, you need other images of the same class as part of the training process. However, these images can be generated and the training script will save them to a local path you specify here.
    • OUTPUT_DIR: The destination folder for storing the trained model's weights.

    To learn more about DreamBooth fine-tuning with prior-preserving loss, check out the Diffusers documentation.

    Launch the training script with accelerate and pass hyperparameters, as well as BOFT-specific arguments, such as:

    • use_boft: Enables BOFT in the training script.
    • boft_block_size: the BOFT matrix block size across different layers, expressed as an int. A smaller block size results in sparser update matrices with fewer trainable parameters. Note: please choose a value that divides most layers' in_features dimension, e.g., 4, 8, 16. Also, you can only specify either boft_block_size or boft_block_num, but not both simultaneously, because boft_block_size x boft_block_num = layer dimension.
    • boft_block_num: the number of BOFT matrix blocks across different layers, expressed as an int. Fewer blocks result in sparser update matrices with fewer trainable parameters. Note: please choose a value that divides most layers' in_features dimension, e.g., 4, 8, 16. Also, you can only specify either boft_block_size or boft_block_num, but not both simultaneously, because boft_block_size x boft_block_num = layer dimension (see the worked example after this list).
    • boft_n_butterfly_factor: the number of butterfly factors. Note: for boft_n_butterfly_factor=1, BOFT is the same as vanilla OFT; for boft_n_butterfly_factor=2, the effective block size of OFT becomes twice as big and the number of blocks becomes half.
    • bias: specify whether the bias parameters should be trained. Can be none, all or boft_only.
    • boft_dropout: specify the probability of multiplicative dropout.
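
    As a worked example of the block-size arithmetic above (using an assumed layer width, not a value from the script):

    in_features = 768                                  # assumed layer width
    boft_block_num = 8
    boft_block_size = in_features // boft_block_num    # -> 96, since block_size * block_num == layer dimension
    # With boft_n_butterfly_factor=2, the effective block size doubles to 192
    # and the effective number of blocks halves to 4.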

    Here's what the full set of script arguments may look like:

    PEFT_TYPE="boft"
    BLOCK_NUM=8
    BLOCK_SIZE=0
    N_BUTTERFLY_FACTOR=1

    VALIDATION_PROMPT=${PROMPT_LIST[@]}
    INSTANCE_PROMPT="a photo of ${UNIQUE_TOKEN} ${CLASS_TOKEN}"
    CLASS_PROMPT="a photo of ${CLASS_TOKEN}"

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export PROJECT_NAME="dreambooth_${PEFT_TYPE}"
    export RUN_NAME="${SELECTED_SUBJECT}_${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export INSTANCE_DIR="./data/dreambooth/dataset/${SELECTED_SUBJECT}"
    export CLASS_DIR="./data/class_data/${CLASS_TOKEN}"
    export OUTPUT_DIR="./data/output/${PEFT_TYPE}"

    accelerate launch train_dreambooth.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --instance_data_dir=$INSTANCE_DIR \
      --class_data_dir="$CLASS_DIR" \
      --output_dir=$OUTPUT_DIR \
      --wandb_project_name=$PROJECT_NAME \
      --wandb_run_name=$RUN_NAME \
      --with_prior_preservation --prior_loss_weight=1.0 \
      --instance_prompt="$INSTANCE_PROMPT" \
      --validation_prompt="$VALIDATION_PROMPT" \
      --class_prompt="$CLASS_PROMPT" \
      --resolution=512 \
      --train_batch_size=1 \
      --num_dataloader_workers=2 \
      --lr_scheduler="constant" \
      --lr_warmup_steps=0 \
      --num_class_images=200 \
      --use_boft \
      --boft_block_num=$BLOCK_NUM \
      --boft_block_size=$BLOCK_SIZE \
      --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \
      --boft_dropout=0.1 \
      --boft_bias="boft_only" \
      --learning_rate=3e-5 \
      --max_train_steps=1010 \
      --checkpointing_steps=200 \
      --validation_steps=200 \
      --enable_xformers_memory_efficient_attention \
      --report_to="wandb" \

    or use this training script:

    ./train_dreambooth.sh $idx

    where $idx corresponds to different subjects.

    If you are running this script on Windows, you may need to set the --num_dataloader_workers to 0.

    Inference with a single adapter

    To run inference with the fine-tuned model, simply run the Jupyter notebook dreambooth_inference.ipynb under ./examples/boft_dreambooth for visualization.

    [openaccess-ai-collective/axolotl] scripts/finetune.py
    """Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
    [huggingface/accelerate] examples/README.md

    Simple vision example

    The cv_example.py script is a simple example to fine-tune a ResNet-50 on a classification task (Oxford-IIIT Pet Dataset).

    The same script can be run in any of the following configurations:

    • single CPU or single GPU
    • multi CPUs
    • multi GPUs (using PyTorch distributed mode)
    • (multi) TPUs
    • fp16 (mixed-precision) or fp32 (normal precision)

    Prior to running it you should install timm and torchvision:

    pip install timm torchvision

    and you should download the data with the following commands:

    wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
    tar -xzf images.tar.gz

    To run it in each of these various modes, use the following commands:

    • single CPU:
      • from a server without GPU
        python ./cv_example.py --data_dir path_to_data
      • from any server by passing cpu=True to the Accelerator.
        python ./cv_example.py --data_dir path_to_data --cpu
      • from any server with Accelerate launcher
        accelerate launch --cpu ./cv_example.py --data_dir path_to_data
    • single GPU:
      python ./cv_example.py # from a server with a GPU
    • with fp16 (mixed-precision)
      • from any server by passing mixed_precision=fp16 to the Accelerator.
        python ./cv_example.py --data_dir path_to_data --mixed_precision fp16
      • from any server with Accelerate launcher
        accelerate launch --mixed_precision fp16 ./cv_example.py --data_dir path_to_data
    • multi CPUs (requires Open MPI, Intel MPI, or MVAPICH)
      • With Accelerate config and launcher, run the following from node 0:
        accelerate config --config_file config.yaml  # Select to have accelerate launch mpirun
        accelerate launch ./cv_example.py --data_dir path_to_data  # This will run the script on each server
      • With Intel MPI, execute mpirun from node 0:
        export CCL_WORKER_COUNT=1
        export MASTER_ADDR=xxx.xxx.xxx.xxx  # node0 ip
        mpirun -f hostfile -n 16 -ppn 4 python ./cv_example.py --data_dir path_to_data
    • multi GPUs (using PyTorch distributed mode)
      • With Accelerate config and launcher
        accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
        accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on your server
      • With traditional PyTorch launcher (python -m torch.distributed.run can be used instead of torchrun)
        torchrun --nproc_per_node 2 ./cv_example.py --data_dir path_to_data
    • multi GPUs, multi node (several machines, using PyTorch distributed mode)
      • With Accelerate config and launcher, on each machine:
        accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
        accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on each server
      • With PyTorch launcher only (python -m torch.distributed.run can be used instead of torchrun). Run this command on each node:
        torchrun \  # python -m torch.distributed.run
          --nproc_per_node 2 \
          --nnodes 2 \
          --rdzv_id 2299 \  # A unique job id
          --rdzv_backend c10d \
          --rdzv_endpoint master_node_ip_address:29500 \
          ./cv_example.py --data_dir path_to_data
    • (multi) TPUs
      • With Accelerate config and launcher
        accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
        accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on each server
      • In PyTorch: Add an xmp.spawn line in your script as you usually do (see the sketch below).
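
    For reference, the xmp.spawn pattern mentioned in the last bullet looks roughly like this. It is a sketch assuming torch_xla is installed and that main() is your existing training entry point.

        import torch_xla.distributed.xla_multiprocessing as xmp

        def _mp_fn(index):
            # index is the local ordinal of the TPU process; call your training entry point here
            main()

        if __name__ == "__main__":
            xmp.spawn(_mp_fn, args=())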

    Simple vision example (GANs)

    Using AWS SageMaker integration

    [huggingface/accelerate] examples/by_feature/megatron_lm_gpt_pretraining.py
    #!/usr/bin/env python # Copyright 2021 The HuggingFace Inc. team. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset without using HuggingFace Trainer. Here is the full list of checkpoints on the hub that can be fine-tuned by this script: https://huggingface.co/models?filter=text-generation """
    [openaccess-ai-collective/axolotl] examples/stablelm-2/1.6b/fft.yml
    base_model: stabilityai/stablelm-2-1_6b model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: true load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/llama-2/fft_optimized.yml
    base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/phi/phi2-ft.yml
    base_model: microsoft/phi-2 model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: garage-bAInd/Open-Platypus type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./phi-sft-out sequence_len: 2048 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_torch adam_beta2: 0.95 adam_epsilon: 0.00001 max_grad_norm: 1.0 lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: True early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 100 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: resize_token_embeddings_to_32x: true special_tokens: pad_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/phi/phi-ft.yml
    base_model: microsoft/phi-1_5 model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: garage-bAInd/Open-Platypus type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./phi-sft-out sequence_len: 2048 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_torch adam_beta2: 0.95 adam_epsilon: 0.00001 max_grad_norm: 1.0 lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: True early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 100 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: resize_token_embeddings_to_32x: true special_tokens: pad_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/xgen-7b/xgen-7b-8k-qlora.yml
    # An example finetuning Saleforce's XGen-7b model with 8k context using qlora # on Tim Dettmer's Guanaco dataset. base_model: Salesforce/xgen-7b-8k-base trust_remote_code: true model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false # enable 4bit for QLoRA load_in_4bit: true gptq: false strict: false push_dataset_to_hub: datasets: - path: timdettmers/openassistant-guanaco data_files: - openassistant_best_replies_train.jsonl type: "completion" dataset_prepared_path: val_set_size: 0.05 # enable QLoRA adapter: qlora lora_model_dir: sequence_len: 8192 max_packed_sequence_len: # hyperparameters from QLoRA paper Appendix B.2 # "We find hyperparameters to be largely robust across datasets" lora_r: 64 lora_alpha: 16 # 0.1 for models up to 13B # 0.05 for 33B and 65B models lora_dropout: 0.05 # add LoRA modules on all linear layers of the base model lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./qlora-out # QLoRA paper Table 9 # - 16 for 7b & 13b # - 32 for 33b, 64 for 64b # Max size tested on A6000 # - 7b: 40 # - 40b: 4 # decrease if OOM, increase for max VRAM utilization micro_batch_size: 1 gradient_accumulation_steps: 1 num_epochs: 4 # Optimizer for QLoRA optimizer: paged_adamw_32bit torchdistx_path: lr_scheduler: cosine # QLoRA paper Table 9 # - 2e-4 for 7b & 13b # - 1e-4 for 33b & 64b learning_rate: 0.00002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true # stop training after this many evaluation losses have increased in a row # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback early_stopping_patience: 3 resume_from_checkpoint: auto_resume_from_checkpoints: true local_rank: logging_steps: 1 xformers_attention: true flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 10 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 special_tokens: eos_token: "<|endoftext|>" bos_token: "<|endoftext|>" unk_token: "<|endoftext|>" pad_token: "<|endoftext|>"
    [huggingface/accelerate] tests/test_big_modeling.py
    quantization_config = BitsAndBytesConfig(load_in_4bit=True) model = replace_with_bnb_linear( model, modules_to_not_convert=["lm_head"], quantization_config=quantization_config ) model_path = hf_hub_download("bigscience/bloom-560m", "pytorch_model.bin") # test with auto model = load_checkpoint_and_dispatch( model, checkpoint=model_path, device_map="auto", ) assert model.h[0].self_attention.query_key_value.weight.dtype == torch.uint8 assert model.h[0].self_attention.query_key_value.weight.device.index == 0 with init_empty_weights(): model = AutoModel.from_config(AutoConfig.from_pretrained("bigscience/bloom-560m")) model = replace_with_bnb_linear( model, modules_to_not_convert=["lm_head"], quantization_config=quantization_config ) # test with str device map model = load_checkpoint_and_dispatch( model, checkpoint=model_path, device_map={"": torch.device("cuda:0")}, ) assert model.h[0].self_attention.query_key_value.weight.dtype == torch.uint8 assert model.h[0].self_attention.query_key_value.weight.device.index == 0 with init_empty_weights(): model = AutoModel.from_config(AutoConfig.from_pretrained("bigscience/bloom-560m")) model = replace_with_bnb_linear( model, modules_to_not_convert=["lm_head"], quantization_config=quantization_config ) # test with torch.device device map model = load_checkpoint_and_dispatch( model, checkpoint=model_path, device_map={"": "cuda:0"}, ) assert model.h[0].self_attention.query_key_value.weight.dtype == torch.uint8 assert model.h[0].self_attention.query_key_value.weight.device.index == 0
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/__init__.py
def legacy_validate_config(cfg):
    """
    This is a "pre-validation" step that handles the yaml configuration before we have any
    information about the model architecture
    """
    if is_torch_bf16_gpu_available():
        if not cfg.bf16 and not cfg.bfloat16:
            LOG.info("bf16 support detected, but not enabled for this configuration.")
    else:
        if (
            not cfg.merge_lora
            and not cfg.is_preprocess
            and (cfg.bf16 is True or cfg.bfloat16 is True)
        ):
            raise ValueError(
                "bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above."
            )
    if (
        # pylint: disable=too-many-boolean-expressions
        not (cfg.bf16 or cfg.bfloat16)
        and (cfg.fp16 or cfg.float16)
        and not cfg.adapter
        and not cfg.flash_attention
        and cfg.sample_packing
    ):
        LOG.warning(
            "Full fine tune w/o FA2 w/ sample packing and fp16/float16 is likely to raise errors. Try LoRA."
        )
        # ValueError: Attempting to unscale FP16 gradients.
        # OR
        # RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
    if cfg.max_packed_sequence_len:
        raise DeprecationWarning("`max_packed_sequence_len` is no longer supported")

    if cfg.sample_packing and cfg.rl:
        raise ValueError("`sample_packing: true` does not work with RLHF training")

    if cfg.sample_packing and not cfg.pad_to_sequence_len:
        LOG.warning(
            "`pad_to_sequence_len: true` is recommended when using sample_packing"
        )

    if cfg.gradient_accumulation_steps and cfg.batch_size:
        raise ValueError(
            "please set only one of gradient_accumulation_steps or batch_size"
        )
    if cfg.batch_size:
        LOG.warning(
            "%s\n%s",
            "batch_size is not recommended. Please use gradient_accumulation_steps instead.",
            "To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
        )
    if (
        cfg.eval_batch_size
        and cfg.micro_batch_size
        and cfg.eval_batch_size != cfg.micro_batch_size
    ):
        LOG.warning(
            "eval_batch_size != micro_batch_size. This can lead to VRAM instability."
        )

    if cfg.adapter == "qlora":
        if cfg.merge_lora:
            # can't merge qlora if loaded in 8bit or 4bit
            if cfg.load_in_8bit:
                raise ValueError("Can't merge qlora if loaded in 8bit")
            if cfg.gptq:
                raise ValueError("Can't merge qlora if gptq")
            if cfg.load_in_4bit:
                raise ValueError("Can't merge qlora if loaded in 4bit")
        else:
            if cfg.load_in_8bit:
                raise ValueError("Can't load qlora in 8bit")
            if cfg.gptq:
                raise ValueError("Can't load qlora if gptq")
            if not cfg.load_in_4bit:
                raise ValueError("Require cfg.load_in_4bit to be True for qlora")

        if cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp:
            raise ValueError("Fused modules are not supported with QLoRA")

    loftq = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits
    if not cfg.load_in_8bit and cfg.adapter == "lora" and not loftq:
        LOG.warning("We recommend setting `load_in_8bit: true` for LORA finetuning")

    if cfg.adapter == "lora" and (cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp):
        raise ValueError("Fused modules are not supported with LoRA")

    if cfg.adapter and cfg.peft_layers_to_transform and cfg.unfrozen_parameters:
        raise ValueError(
            "`unfrozen_parameters` used with `peft_layers_to_transform` can have unexpected behavior."
        )

    if cfg.relora_steps:
        if cfg.adapter not in ("lora", "qlora"):
            raise ValueError("cfg.adapter must be lora or qlora to use ReLoRA")

        if cfg.fsdp:
            raise ValueError("fsdp not supported with ReLoRA")

        if cfg.deepspeed:
            raise ValueError("deepspeed not supported with ReLoRA")

        if cfg.lr_scheduler == "one_cycle":
            raise ValueError("ReLoRA is not compatible with the one_cycle scheduler")

        if cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp:
            raise ValueError("Fused modules are not supported with ReLoRA")

    if cfg.trust_remote_code:
        LOG.warning(
            "`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model."
        )

    if cfg.push_dataset_to_hub and cfg.hf_use_auth_token is not True:
        raise ValueError(
            "Require cfg.hf_use_auth_token to be True for push_dataset_to_hub"
        )

    if (cfg.base_model and "falcon" in cfg.base_model.lower()) and cfg.fsdp:
        raise ValueError("FSDP is not supported for falcon models")

    if (
        cfg.base_model and "mpt" in cfg.base_model.lower()
    ) and cfg.gradient_checkpointing:
        raise ValueError("gradient_checkpointing is not supported for MPT models")

    if cfg.flash_optimum is True:
        if cfg.adapter:
            LOG.warning("BetterTransformers probably doesn't work with PEFT adapters")
        if cfg.fp16 or cfg.bf16:
            raise ValueError("AMP is not supported with BetterTransformer")
        if cfg.float16 is not True and cfg.bfloat16 is not True:
            LOG.warning(
                "You should probably set bfloat16 or float16 to true to "
                "load the model in float16 for BetterTransformers"
            )
        if int(torch.__version__.split(".", maxsplit=1)[0]) < 2:
            LOG.warning("torch>=2.0.0 required")
            raise ValueError(
                f"flash_optimum for BetterTransformers may not be used with {torch.__version__}"
            )

    if cfg.pretraining_dataset and cfg.group_by_length:
        LOG.warning(
            "You probably want to disable group_by_length as it will force a streamed dataset to download completely."
        )
    if cfg.pretraining_dataset and not cfg.max_steps:
        raise ValueError(
            "max_steps must be set when using iterable pretraining_dataset, Trainer can't infer length and schedule optimizer/learning rate without it!"
        )

    if any([cfg.adam_beta1, cfg.adam_beta2, cfg.adam_epsilon]) and (
        not cfg.optimizer or "adamw" not in cfg.optimizer
    ):
        LOG.warning("adamw hyperparameters found, but no adamw optimizer set")

    if cfg.push_to_hub_model_id:
        raise ValueError(
            "push_to_hub_model_id is deprecated. Please use hub_model_id instead."
        )

    if cfg.hub_model_id and cfg.save_strategy not in ["steps", "epoch", None]:
        LOG.warning(
            "hub_model_id is set without any models being saved. To save a model, set save_strategy to steps, epochs or leave empty."
        )

    if cfg.gptq and cfg.revision_of_model:
        raise ValueError(
            "revision_of_model is not supported for GPTQ models. "
            + "Please download the model from HuggingFace Hub manually for correct branch, "
            + "point to its path, and remove revision_of_model from the config."
        )

    # if cfg.sample_packing and cfg.sdp_attention:
    #     # incompatible due to bug w/ accelerate causing 0.0 loss when using llama2
    #     raise ValueError(
    #         "sample_packing not compatible with sdp_attention. Use flash_attention"
    #     )

    if cfg.sample_packing and cfg.xformers_attention:
        raise ValueError(
            "sample_packing not compatible with xformers_attention. Use flash_attention"
        )

    if cfg.sample_packing and cfg.sdp_attention and (cfg.bfloat16 or cfg.bf16):
        # https://github.com/pytorch/pytorch/blob/1b03423526536b5f3d35bdfa95ccc6197556cf9b/test/test_transformers.py#L2440-L2450
        LOG.warning(
            "sample_packing & torch sdpa with bf16 is unsupported and may result in 0.0 loss. "
            "This may work on H100s."
        )

    if cfg.early_stopping_patience:
        if not cfg.save_steps or not cfg.eval_steps:
            raise ValueError(
                "`early_stopping_patience` requires save_steps and eval_steps to be set. eval_steps should evenly divide save_steps."
            )
        if cfg.save_steps % cfg.eval_steps != 0:
            raise ValueError(
                "`early_stopping_patience` requires that eval_steps should evenly divide save_steps."
            )

    if cfg.datasets:
        for idx, ds_cfg in enumerate(cfg.datasets):
            if not ds_cfg.type:
                continue
            if ds_cfg.type == "sharegpt:chat":
                LOG.warning(
                    PendingDeprecationWarning(
                        "`type: sharegpt:chat` will soon be deprecated. simply use `type: sharegpt` instead."
                    )
                )
                cfg.datasets[idx].type = "sharegpt"
            if "sharegpt_simple" in ds_cfg.type:
                LOG.warning(
                    PendingDeprecationWarning(
                        "`type: sharegpt_simple` will soon be deprecated. simply use `type: sharegpt` instead."
                    )
                )
                cfg.datasets[idx].type = cfg.datasets[idx].type.replace(
                    "sharegpt_simple", "sharegpt"
                )

    if cfg.saves_per_epoch and cfg.save_steps:
        raise ValueError(
            "save_steps and saves_per_epoch are mutually exclusive and cannot be used together."
        )
    if cfg.save_strategy and cfg.saves_per_epoch and cfg.save_strategy != "steps":
        raise ValueError(
            "save_strategy must be empty or set to `steps` when used with saves_per_epoch."
        )
    if cfg.save_strategy and cfg.save_steps and cfg.save_strategy != "steps":
        raise ValueError(
            "save_strategy and save_steps mismatch. Please set save_strategy to 'steps' or remove save_steps."
        )
    if cfg.evals_per_epoch and cfg.eval_steps:
        raise ValueError(
            "eval_steps and evals_per_epoch are mutually exclusive and cannot be used together."
        )
    if (
        cfg.evals_per_epoch
        and cfg.evaluation_strategy
        and cfg.evaluation_strategy != "steps"
    ):
        raise ValueError(
            "evaluation_strategy must be empty or set to `steps` when used with evals_per_epoch."
        )
    if (
        cfg.evaluation_strategy
        and cfg.eval_steps
        and cfg.evaluation_strategy != "steps"
    ):
        raise ValueError(
            "evaluation_strategy and eval_steps mismatch. Please set evaluation_strategy to 'steps' or remove eval_steps."
        )

    if (
        cfg.val_set_size == 0
        and (cfg.eval_steps or cfg.evaluation_strategy)
        and not cfg.test_datasets
    ):
        raise ValueError(
            "eval_steps and evaluation_strategy are not supported with val_set_size == 0"
        )

    if (
        cfg.sample_packing
        and cfg.eval_table_size
        and cfg.eval_sample_packing is not False
    ):
        raise ValueError(
            "eval_table_size and eval_sample_packing are not supported together with sample_packing. Please set 'eval_sample_packing' to false."
        )

    if not cfg.adapter and (cfg.load_in_8bit or cfg.load_in_4bit):
        raise ValueError(
            "load_in_8bit and load_in_4bit are not supported without setting an adapter."
            "If you want to full finetune, please turn off load_in_8bit and load_in_4bit."
        )

    if cfg.rope_scaling:
        LOG.warning("`rope_scaling` should now be a key under `model_config`")

    if cfg.wandb_run_id and not cfg.wandb_name:
        cfg.wandb_name = cfg.wandb_run_id

        LOG.warning(
            "wandb_run_id sets the ID of the run. If you would like to set the name, please use wandb_name instead."
        )

    if cfg.noisy_embedding_alpha is not None:
        # Deprecated, use neftune_noise_alpha
        LOG.warning("noisy_embedding_alpha is deprecated, use neftune_noise_alpha")
        if cfg.neftune_noise_alpha is None:
            cfg.neftune_noise_alpha = cfg.noisy_embedding_alpha
        else:
            # User is providing both; bail and have them sort out their settings
            raise ValueError(
                "noisy_embedding_alpha is deprecated, use neftune_noise_alpha; both are set, please remove the deprecated noisy_embedding_alpha setting"
            )

    if cfg.neftune_noise_alpha is not None and cfg.neftune_noise_alpha <= 0.0:
        raise ValueError("neftune_noise_alpha must be > 0.0")

    if cfg.max_memory is not None and cfg.gpu_memory_limit is not None:
        raise ValueError(
            "max_memory and gpu_memory_limit are mutually exclusive and cannot be used together."
        )

    if (
        cfg.unfrozen_parameters
        and cfg.gradient_checkpointing_kwargs
        and cfg.gradient_checkpointing_kwargs.use_reentrant is True
    ):
        # https://github.com/huggingface/transformers/issues/21381
        raise ValueError(
            "`use_reentrant` must be false when used with partially frozen model."
        )

    if cfg.deepspeed and Path(cfg.deepspeed).is_file():
        with open(cfg.deepspeed, encoding="utf-8") as file:
            contents = file.read()
            deepspeed_cfg: DictDefault = DictDefault(json.loads(contents))
            if cfg.flash_attention:
                if (
                    deepspeed_cfg.zero_optimization
                    and deepspeed_cfg.zero_optimization.stage == 3
                ):
                    if not (
                        (
                            deepspeed_cfg.bf16
                            and deepspeed_cfg.bf16.enabled  # pylint: disable=no-member
                            is True
                        )
                        or (
                            deepspeed_cfg.fp16
                            and deepspeed_cfg.fp16.enabled  # pylint: disable=no-member
                            is True
                        )
                    ):
                        raise ValueError(
                            "bf16.enabled or fp16.enabled must be set to true when using ZeRO-3 with flash-attention"
                        )
            if "8bit" in cfg.optimizer and deepspeed_cfg.optimizer:
                LOG.warning(
                    f"conflicting optimizer: {cfg.optimizer} used alongside deepspeed optimizer."
                )

    if cfg.test_datasets and cfg.val_set_size:
        raise ValueError(
            "non-zero val_set_size should not be used with test_datasets configuration"
        )

    if cfg.fsdp and "bnb" in cfg.optimizer:
        raise ValueError(f"FSDP not compatible with {cfg.optimizer}")

    if cfg.do_causal_lm_eval and cfg.eval_sample_packing:
        raise ValueError(
            "do_causal_lm_eval is enabled, eval_sample_packing must be set to False"
        )

    if cfg.eval_causal_lm_metrics:
        supported_metrics = ["sacrebleu", "comet", "ter", "chrf"]
        if not isinstance(cfg.eval_causal_lm_metrics, list):
            raise ValueError("eval_causal_lm_metrics must be a list")
        # only ["sacrebleu", "comet", "ter", "chrf"] supported
        if set(cfg.eval_causal_lm_metrics) - set(supported_metrics):
            raise ValueError(
                f"eval_causal_lm_metrics must be one of {supported_metrics}"
            )

    # TODO
    # MPT 7b
    # https://github.com/facebookresearch/bitsandbytes/issues/25
    # no 8bit adamw w bf16

    # GPT-NeoX
    # evals broken when extending context len
    # File ".../transformers/models/gpt_neox/modeling_gpt_neox.py", line 162, in forward
    #     attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
    # File ".../optimum/bettertransformer/models/attention.py", line 74, in gpt2_wrapped_scaled_dot_product
    #     attention_mask = causal_mask + attention_mask
    # RuntimeError: The size of tensor a (2048) must match the size of tensor b (8132) at non-singleton dimension 3
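One of the warnings near the top of this validator spells out how to convert a legacy `batch_size` into `gradient_accumulation_steps` (divide by `micro_batch_size` and the number of GPUs). A tiny illustration with made-up numbers:

# Illustrative only: the conversion described by the batch_size warning above,
# using hypothetical values.
batch_size = 32        # legacy-style total batch size
micro_batch_size = 2   # per-GPU batch size actually sent through the model
num_gpus = 4

gradient_accumulation_steps = batch_size // micro_batch_size // num_gpus
print(gradient_accumulation_steps)  # -> 4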
    [openaccess-ai-collective/axolotl] examples/mistral/mixtral.yml
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./qlora-out

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
#  - ^lm_head.weight$
#  - ^model.embed_tokens.weight$[:32000]
#  - model.layers.2[0-9]+.block_sparse_moe.gate
#  - model.layers.2[0-9]+.block_sparse_moe.experts
#  - model.layers.3[0-9]+.block_sparse_moe.gate
#  - model.layers.3[0-9]+.block_sparse_moe.experts

model_config:
  output_router_logits: true

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
#lora_target_modules:
#  - gate
#  - q_proj
#  - k_proj
#  - v_proj
#  - o_proj
#  - w1
#  - w2
#  - w3

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
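The commented-out `unfrozen_parameters` entries above are regular expressions matched against parameter names (for example, the MoE gates and experts of later layers). As a rough mental model, the effect is "freeze everything, then re-enable gradients for parameters whose names match a pattern". The sketch below illustrates that idea on a stand-in module; it is not axolotl's internal implementation, and in practice the patterns would be the ones from the config and `model` would be the loaded Mixtral checkpoint.

import re
import torch.nn as nn

# Stand-in module for illustration; a real run would use the loaded base model.
model = nn.ModuleDict({"lm_head": nn.Linear(8, 8), "gate": nn.Linear(8, 2)})

patterns = [r"^lm_head\.weight$", r"gate"]  # hypothetical patterns

for name, param in model.named_parameters():
    # unfreeze only the parameters whose names match at least one pattern
    param.requires_grad = any(re.search(p, name) for p in patterns)

print([n for n, p in model.named_parameters() if p.requires_grad])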
    [openaccess-ai-collective/axolotl] examples/falcon/config-7b.yml
base_model: tiiuae/falcon-7b
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
gptq: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca:chat
dataset_prepared_path:
val_set_size: 0.05
adapter:
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:
lora_r: 64
lora_alpha: 32
lora_dropout: 0.0
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./falcon-7b
batch_size: 2
micro_batch_size: 1
num_epochs: 4
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.00003
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 40
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
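The `special_tokens` block at the end sets Falcon's pad/bos/eos tokens to "<|endoftext|>". On the tokenizer side this corresponds roughly to the following; a minimal sketch for orientation, not axolotl's internal code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "pad_token": "<|endoftext|>",
        "bos_token": "<|endoftext|>",
        "eos_token": "<|endoftext|>",
    }
)
print(tokenizer.pad_token, tokenizer.eos_token)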
    [openaccess-ai-collective/axolotl] examples/falcon/config-7b-qlora.yml
# 1b: tiiuae/falcon-rw-1b
# 40b: tiiuae/falcon-40b
base_model: tiiuae/falcon-7b
# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
# enable 4bit for QLoRA
load_in_4bit: true
gptq: false
strict: false
push_dataset_to_hub:
datasets:
  - path: QingyiSi/Alpaca-CoT
    data_files:
      - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
    type: "alpaca:chat"
dataset_prepared_path:
val_set_size: 0.05
# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:

# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./qlora-out

# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 4
# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
gradient_checkpointing: true
# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.000001
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
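As the validation code earlier enforces, a QLoRA adapter cannot be merged while the base model is loaded in 4-bit. A common pattern after training is to reload the base model at higher precision, attach the trained adapter, and merge it; the sketch below illustrates that with peft, where "./qlora-out" matches this config's output_dir and "./qlora-merged" is a hypothetical destination.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in fp16 (not 4-bit) so the LoRA weights can be folded in.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
merged = PeftModel.from_pretrained(base, "./qlora-out").merge_and_unload()
merged.save_pretrained("./qlora-merged")  # hypothetical output path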
OpenAccess-AI-Collective/axolotl
huggingface/transformers
huggingface/peft
huggingface/accelerate