Phorm.ai
OpenAccess-AI-Collective/axolotl

    how to fine-tune vision models ?

    To fine-tune vision models, you can follow a general approach that involves loading a pre-trained model, preparing your dataset, modifying the model for your specific task (if necessary), and then training the model on your dataset. Below is a step-by-step guide to fine-tune a vision model, specifically using a ResNet-50 model for a classification task as an example. This process can be adapted for other vision models and tasks.

    Step 1: Install Required Libraries

    First, ensure you have the necessary libraries installed. For this example, you'll need PyTorch, torchvision, and timm (for easily accessing pre-trained models).

    pip install torch torchvision timm

    Step 2: Prepare Your Dataset

    Prepare your dataset for training and validation. For classification tasks, your dataset should be organized into directories for each class. For example, using the Oxford-IIIT Pet Dataset:

    wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
    tar -xzf images.tar.gz

    You might need to organize this dataset into a structure that is compatible with torchvision.datasets.ImageFolder, which expects the following directory structure:

    path/to/data/
        train/
            class1/
                img1.jpg
                img2.jpg
                ...
            class2/
                img1.jpg
                img2.jpg
                ...
        val/
            class1/
                img1.jpg
                img2.jpg
                ...
            class2/
                img1.jpg
                img2.jpg
                ...
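
    One way to produce the layout above from the extracted Pet images is a small helper script along the following lines. This is only a sketch: it assumes the class name is encoded in each filename before the last underscore (e.g. "Abyssinian_1.jpg") and uses an arbitrary 90/10 train/val split.

    import random
    import shutil
    from pathlib import Path

    src = Path("images")        # folder extracted from images.tar.gz
    dst = Path("path/to/data")  # ImageFolder root used in the steps below
    random.seed(0)

    for img in src.glob("*.jpg"):
        cls = img.stem.rsplit("_", 1)[0]                    # class name parsed from the filename
        split = "train" if random.random() < 0.9 else "val"
        target = dst / split / cls
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(img, target / img.name)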
    

    Step 3: Load a Pre-trained Model

    Load a pre-trained ResNet-50 model using torchvision or timm. Modify the final layer to match the number of classes in your dataset if necessary.

    import torchvision.models as models
    import torch.nn as nn

    num_classes = 37  # For Oxford-IIIT Pet Dataset
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    Step 4: Define Data Transforms

    Define the appropriate transformations for your dataset. Commonly, you would normalize the images and apply data augmentation techniques.

    from torchvision import transforms

    data_transforms = {
        'train': transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        'val': transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }

    Step 5: Load Data

    Use torchvision.datasets.ImageFolder to load the dataset and torch.utils.data.DataLoader to create data loaders for training and validation.

    from torchvision.datasets import ImageFolder
    from torch.utils.data import DataLoader

    data_dir = 'path/to/data'
    batch_size = 32

    train_dataset = ImageFolder(root=f'{data_dir}/train', transform=data_transforms['train'])
    val_dataset = ImageFolder(root=f'{data_dir}/val', transform=data_transforms['val'])

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)

    Step 6: Train the Model

    Define the loss function and optimizer, then train the model. You can use the torch.optim package to define an optimizer. A common choice is the Adam optimizer.

    import torch.optim as optim
    import torch

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    num_epochs = 10
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
        # Validation loop can be added here
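
    The comment above mentions a validation loop without showing it. Here is a minimal sketch of what it could look like, reusing the model, criterion, device, and val_loader defined earlier; adapt the metric to your task.

    # Minimal validation pass (sketch)
    model.eval()
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs, labels).item()
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    print(f"Val loss: {val_loss/len(val_loader):.4f}, Val accuracy: {correct/total:.4f}")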

    This is a simplified example to get you started with fine-tuning vision models. Depending on your specific task, you might need to adjust the data preprocessing, model architecture, and training loop.

    Search results
    [huggingface/peft] examples/image_classification/image_classification_peft_lora.ipynb
    ## Introduction
    
    In this notebook, we will learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from šŸ¤— PEFT to fine-tune an image classification model by ONLY using **0.77%** of the original trainable parameters of the model. 
    
    LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685). 
    
    Let's get started by installing the dependencies. 
    
    __*Note that this notebook builds on top of the [official image classification example notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb).*__
    ## Install dependencies
    
    Here we're installing `peft` from source to ensure we have access to all the bleeding edge features of `peft`. 
    

    !pip install transformers accelerate evaluate datasets git+https://github.com/huggingface/peft -q

    ## Authentication
    
    We will share our fine-tuned model at the end of training. So, to do that we just authenticate using our šŸ¤— token. This token is available from [here](https://huggingface.co/settings/tokens). If you don't have a šŸ¤— account already, we highly encourage you to do so; it's free!
    

    from huggingface_hub import notebook_login

    notebook_login()

    ## Check the library versions
    

    import transformers
    import accelerate
    import peft

    print(f"Transformers version: {transformers.version}") print(f"Accelerate version: {accelerate.version}") print(f"PEFT version: {peft.version}")

    ## Select a model checkpoint to fine-tune
    

    model_checkpoint = "google/vit-base-patch16-224-in21k" # pre-trained model from which to fine-tune

    ## Load a dataset
    
    We're only loading the first 5000 instances from the training set of the [Food-101 dataset](https://huggingface.co/datasets/food101) to keep this example runtime short. 
    

    from datasets import load_dataset

    dataset = load_dataset("food101", split="train[:5000]")

    ## Prepare datasets for training and evaluation
    1. Prepare `label2id` and `id2label` dictionaries. This will come in handy when performing inference and for metadata information. 
    

    labels = dataset.features["label"].names
    label2id, id2label = dict(), dict()
    for i, label in enumerate(labels):
        label2id[label] = i
        id2label[i] = label

    id2label[2]

    2. We load the image processor of the model we're fine-tuning.
    

    from transformers import AutoImageProcessor

    image_processor = AutoImageProcessor.from_pretrained(model_checkpoint)
    image_processor

    As one might notice, the `image_processor` carries useful information, such as the size to which training and evaluation images should be resized and the stats that should be used to normalize the pixel values. 
    3. Using the image processor we prepare transformation functions for the datasets. These functions will include augmentation and pixel scaling.  
    

    from torchvision.transforms import (
        CenterCrop,
        Compose,
        Normalize,
        RandomHorizontalFlip,
        RandomResizedCrop,
        Resize,
        ToTensor,
    )

    normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
    train_transforms = Compose(
        [
            RandomResizedCrop(image_processor.size["height"]),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
        ]
    )

    val_transforms = Compose(
        [
            Resize(image_processor.size["height"]),
            CenterCrop(image_processor.size["height"]),
            ToTensor(),
            normalize,
        ]
    )

    def preprocess_train(example_batch):
        """Apply train_transforms across a batch."""
        example_batch["pixel_values"] = [train_transforms(image.convert("RGB")) for image in example_batch["image"]]
        return example_batch

    def preprocess_val(example_batch):
        """Apply val_transforms across a batch."""
        example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]]
        return example_batch

    4. We split our mini dataset into training and validation. 
    

    # split up training into training + validation

    splits = dataset.train_test_split(test_size=0.1)
    train_ds = splits["train"]
    val_ds = splits["test"]

    5. We set the transformation functions to the datasets accordingly. 
    

    train_ds.set_transform(preprocess_train)
    val_ds.set_transform(preprocess_val)

    ## Load and prepare a model 
    
    In this section, we first load the model we want to fine-tune. 
    

    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
        )

    The `get_peft_model()` method that we will use in a moment wraps the original model to be fine-tuned as a `PeftModel`. So, it's important for us to initialize the original model correctly. As such, we initialize it by specifying the `label2id` and `id2label` mappings so that `AutoModelForImageClassification` can append a classification head to the underlying model, adapted for our dataset. We can confirm this from the warning below:
    
    

    Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.weight', 'classifier.bias']

    from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

    model = AutoModelForImageClassification.from_pretrained(
        model_checkpoint,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
    )
    print_trainable_parameters(model)

    Also, take note of the number of total trainable parameters of `model`: it's 100%! We'll compare this number to that of the LoRA model.
    
    We now use the `PeftModel` to wrap `model` so that the "update" matrices are added to the respective places. 
    

    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["query", "value"],
        lora_dropout=0.1,
        bias="none",
        modules_to_save=["classifier"],
    )
    lora_model = get_peft_model(model, config)
    print_trainable_parameters(lora_model)

    Let's unpack what's going on here. 
    
    In order for LoRA to take effect, we need to specify the target modules to `LoraConfig` so that `get_peft_model()` knows which modules inside our model need to be amended with LoRA matrices. In this case, we're only interested in targeting the query and value matrices of the attention blocks of the base model. Since the parameters corresponding to these matrices are "named" `query` and `value` respectively, we specify them accordingly in the `target_modules` argument of `LoraConfig`. 
    
    We also specify `modules_to_save`. After we wrap our base model `model` with `get_peft_model()` along with the `config`, we get a new model where only the LoRA parameters (the so-called "update matrices") are trainable while the pre-trained parameters are kept frozen. The frozen parameters include those of the randomly initialized classifier too, which is NOT what we want when fine-tuning the base model on our custom dataset. To ensure that the classifier parameters are also trained, we specify `modules_to_save`. This also ensures that these modules are serialized alongside the LoRA trainable parameters when using utilities like `save_pretrained()` and `push_to_hub()`.  
    
    Regarding the other parameters:
    
    * `r`: The dimension used by the LoRA update matrices.
    * `alpha`: Scaling factor.
    * `bias`: Specifying if the `bias` parameters should be trained. `None` denotes none of the `bias` parameters will be trained. 
    
    `r` and `alpha` together control the total number of final trainable parameters when using LoRA, giving us the flexibility to balance the trade-off between end performance and compute efficiency.
    
    We can also check how many parameters we're actually training. Since we're interested in performing **parameter-efficient fine-tuning**, we should expect to see fewer trainable parameters in the `lora_model` than in the original `model`, which is indeed the case here. 
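
    As a rough sanity check on those numbers, the LoRA parameter count can be estimated by hand: each targeted weight matrix of shape d_out x d_in gains two update matrices with r * (d_in + d_out) parameters. A back-of-the-envelope sketch for a ViT-Base-like model (assuming hidden size 768, 12 layers, and query/value targeted as in the config above; the classifier head is not counted):

    hidden_size, num_layers, r = 768, 12, 16
    per_matrix = r * (hidden_size + hidden_size)   # A (r x d_in) plus B (d_out x r)
    targeted_matrices = 2 * num_layers             # query and value in every layer
    lora_params = per_matrix * targeted_matrices
    print(f"{lora_params:,} LoRA parameters")      # ~589,824, well under 1% of ViT-Base's ~86M parameters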
    ## Training arguments
    
    We will leverage [šŸ¤— Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) for fine-tuning. It accepts several arguments which we wrap using [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). 
    

    from transformers import TrainingArguments, Trainer

    model_name = model_checkpoint.split("/")[-1]
    batch_size = 128

    args = TrainingArguments(
        f"{model_name}-finetuned-lora-food101",
        remove_unused_columns=False,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=5e-3,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        num_train_epochs=5,
        logging_steps=10,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        push_to_hub=True,
        label_names=["labels"],
    )

    Some things to note here:
    
    * We're using a larger batch size since there are only a handful of parameters to train. 
    * We're using a larger learning rate than normal (a normal value would be 1e-5, for example). 
    
    All of these things are a byproduct of the fact that we're training only a small number of parameters. This can potentially also reduce the need to conduct expensive hyperparameter tuning experiments. 
    ## Prepare evaluation metric
    

    import numpy as np
    import evaluate

    metric = evaluate.load("accuracy")

    The `compute_metrics` function takes a named tuple as input: `predictions`, which are the logits of the model as NumPy arrays, and `label_ids`, which are the ground-truth labels as NumPy arrays.

    def compute_metrics(eval_pred):
        """Computes accuracy on a batch of predictions"""
        predictions = np.argmax(eval_pred.predictions, axis=1)
        return metric.compute(predictions=predictions, references=eval_pred.label_ids)

    ## Collation function
    
    This is used by `Trainer` to gather a batch of training and evaluation examples and prepare them in a format that is acceptable by the underlying model. 
    

    import torch

    def collate_fn(examples):
        pixel_values = torch.stack([example["pixel_values"] for example in examples])
        labels = torch.tensor([example["label"] for example in examples])
        return {"pixel_values": pixel_values, "labels": labels}

    ## Train and evaluate
    

    trainer = Trainer(
        lora_model,
        args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=image_processor,
        compute_metrics=compute_metrics,
        data_collator=collate_fn,
    )
    train_results = trainer.train()

    In just a few minutes, we have a fine-tuned model with 96% validation accuracy. Also, note that we used a very small subset of the training dataset which is definitely impacting the results. 
    

    trainer.evaluate(val_ds)

    ## Sharing your model and inference 
    
    Once the fine-tuning is done, we can share the LoRA parameters with the community like so: 
    

    repo_name = f"sayakpaul/{model_name}-finetuned-lora-food101"
    lora_model.push_to_hub(repo_name)

    When we call `push_to_hub()` on the `lora_model`, only the LoRA parameters along with any modules specified in `modules_to_save` are saved. If we take a look at the [trained LoRA parameters](https://huggingface.co/sayakpaul/vit-base-patch16-224-in21k-finetuned-lora-food101/blob/main/adapter_model.bin), we see that it's only **2.6 MB**! This greatly helps with portability especially when we're using a very large model to fine-tune (such as [BLOOM](https://huggingface.co/bigscience/bloom)). 
    Next, we see how to load the LoRA-updated parameters along with our base model for inference. When we wrap a base model with `PeftModel`, the modifications are done in place. To mitigate any concerns that might stem from these in-place modifications, we initialize a fresh base model just like we did earlier and construct our inference model. 
    

    from peft import PeftConfig, PeftModel

    config = PeftConfig.from_pretrained(repo_name)
    model = AutoModelForImageClassification.from_pretrained(
        config.base_model_name_or_path,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
    )

    # Load the LoRA model

    inference_model = PeftModel.from_pretrained(model, repo_name)
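
    If you prefer to avoid the adapter indirection entirely at inference time, LoRA-wrapped PEFT models can also fold the update matrices back into the base weights. This is optional and not part of the original notebook; a one-line sketch:

    # Merge the LoRA updates into the base weights and drop the PEFT wrapper (optional)
    merged_model = inference_model.merge_and_unload()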

    Don't worry about the warnings, they're harmless. 
    Let's now fetch a sample for inference.
    

    from PIL import Image
    import requests

    url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/beignets.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)
    image

    We first instantiate an `image_processor` from the underlying model repo. 
    

    image_processor = AutoImageProcessor.from_pretrained(repo_name)

    We then prepare the sample for inference.
    

    # prepare image for the model

    encoding = image_processor(image.convert("RGB"), return_tensors="pt")
    print(encoding.pixel_values.shape)

    And run inference!
    

    import torch

    # forward pass

    with torch.no_grad():
        outputs = inference_model(**encoding)
        logits = outputs.logits

    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", inference_model.config.id2label[predicted_class_idx])

    [huggingface/peft] examples/int8_training/fine_tune_blip2_int8.py
    # Let's define the LoraConfig
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
    )

    # We load our model and processor using `transformers`
    model = AutoModelForVision2Seq.from_pretrained(
        "Salesforce/blip2-opt-2.7b", quantization_config=BitsAndBytesConfig(load_in_8bit=True)
    )
    processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")

    # Get our peft model and print the number of trainable parameters
    model = get_peft_model(model, config)
    [huggingface/transformers] docs/source/en/tasks/visual_question_answering.md

    Fine-tuning ViLT

    ViLT model incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP). This model can be used for several downstream tasks. For the VQA task, a classifier head is placed on top (a linear layer on top of the final hidden state of the [CLS] token) and randomly initialized. Visual Question Answering is thus treated as a classification problem.

    More recent models, such as BLIP, BLIP-2, and InstructBLIP, treat VQA as a generative task. Later in this guide we illustrate how to use them for zero-shot VQA inference.

    Before you begin, make sure you have all the necessary libraries installed.

    pip install -q transformers datasets

    We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the šŸ¤— Hub. When prompted, enter your token to log in:

    >>> from huggingface_hub import notebook_login

    >>> notebook_login()

    Let's define the model checkpoint as a global variable.

    >>> model_checkpoint = "dandelin/vilt-b32-mlm"
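
    To make the classification framing above concrete, here is a minimal sketch of loading ViLT with a randomly initialized VQA head. The label2id/id2label mappings are assumed to have been built from the dataset's answer vocabulary beforehand.

    from transformers import ViltForQuestionAnswering

    # label2id / id2label are assumed mappings from answer strings to class indices
    model = ViltForQuestionAnswering.from_pretrained(
        model_checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )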
    [huggingface/transformers] examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md

    FAQ

    • Can a participant fine-tune models for more than one language? Yes! A participant can fine-tune models in as many languages as they like.
    • Can a participant use extra data (apart from the common voice data)? Yes! All data except the official common voice test data can be used for training. If a participant wants to train a model on a language that is not part of Common Voice (which is very much encouraged!), the participant should make sure that some test data is held out to make sure the model is not overfitting.
    • Can we fine-tune for high-resource languages? Yes! We do not really recommend fine-tuning models in English, since there are already so many fine-tuned speech recognition models in English. However, it is very much appreciated if participants want to fine-tune models in other "high-resource" languages, such as French, Spanish, or German. For such cases, one probably needs to train locally and might have to apply tricks such as lazy data loading (check the "Lazy data loading" section for more details).
    [huggingface/transformers] tests/models/align/test_modeling_align.py
    class AlignVisionModelTester: def __init__( self, parent, batch_size=12, image_size=32, num_channels=3, kernel_sizes=[3, 3, 5], in_channels=[32, 16, 24], out_channels=[16, 24, 30], hidden_dim=64, strides=[1, 1, 2], num_block_repeats=[1, 1, 2], expand_ratios=[1, 6, 6], is_training=True, hidden_act="gelu", ): self.parent = parent self.batch_size = batch_size self.image_size = image_size self.num_channels = num_channels self.kernel_sizes = kernel_sizes self.in_channels = in_channels self.out_channels = out_channels self.hidden_dim = hidden_dim self.strides = strides self.num_block_repeats = num_block_repeats self.expand_ratios = expand_ratios self.is_training = is_training self.hidden_act = hidden_act def prepare_config_and_inputs(self): pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) config = self.get_config() return config, pixel_values def get_config(self): return AlignVisionConfig( num_channels=self.num_channels, kernel_sizes=self.kernel_sizes, in_channels=self.in_channels, out_channels=self.out_channels, hidden_dim=self.hidden_dim, strides=self.strides, num_block_repeats=self.num_block_repeats, expand_ratios=self.expand_ratios, hidden_act=self.hidden_act, ) def create_and_check_model(self, config, pixel_values): model = AlignVisionModel(config=config) model.to(torch_device) model.eval() with torch.no_grad(): result = model(pixel_values) patch_size = self.image_size // 4 self.parent.assertEqual( result.last_hidden_state.shape, (self.batch_size, config.hidden_dim, patch_size, patch_size) ) self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, config.hidden_dim)) def prepare_config_and_inputs_for_common(self): config_and_inputs = self.prepare_config_and_inputs() config, pixel_values = config_and_inputs inputs_dict = {"pixel_values": pixel_values} return config, inputs_dict
    [huggingface/transformers] tests/models/blip_2/test_modeling_blip_2.py
    class Blip2VisionModelTester: def __init__( self, parent, batch_size=12, image_size=30, patch_size=2, num_channels=3, is_training=True, hidden_size=32, projection_dim=32, num_hidden_layers=2, num_attention_heads=4, intermediate_size=37, dropout=0.1, attention_dropout=0.1, initializer_range=1e-10, scope=None, ): self.parent = parent self.batch_size = batch_size self.image_size = image_size self.patch_size = patch_size self.num_channels = num_channels self.is_training = is_training self.hidden_size = hidden_size self.projection_dim = projection_dim self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads self.intermediate_size = intermediate_size self.dropout = dropout self.attention_dropout = attention_dropout self.initializer_range = initializer_range self.scope = scope # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) num_patches = (image_size // patch_size) ** 2 self.seq_length = num_patches + 1 def prepare_config_and_inputs(self): pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) config = self.get_config() return config, pixel_values def get_config(self): return Blip2VisionConfig( image_size=self.image_size, patch_size=self.patch_size, num_channels=self.num_channels, hidden_size=self.hidden_size, projection_dim=self.projection_dim, num_hidden_layers=self.num_hidden_layers, num_attention_heads=self.num_attention_heads, intermediate_size=self.intermediate_size, dropout=self.dropout, attention_dropout=self.attention_dropout, initializer_range=self.initializer_range, ) def create_and_check_model(self, config, pixel_values): model = Blip2VisionModel(config=config) model.to(torch_device) model.eval() with torch.no_grad(): result = model(pixel_values) # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) image_size = (self.image_size, self.image_size) patch_size = (self.patch_size, self.patch_size) num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) def prepare_config_and_inputs_for_common(self): config_and_inputs = self.prepare_config_and_inputs() config, pixel_values = config_and_inputs inputs_dict = {"pixel_values": pixel_values} return config, inputs_dict
    [huggingface/peft] examples/boft_controlnet/boft_controlnet.md

    Fine-tuning for controllable generation with BOFT (ControlNet)

    This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fine-tune Stable Diffusion with either stabilityai/stable-diffusion-2-1 or runwayml/stable-diffusion-v1-5 model for controllable generation.

    By using BOFT from šŸ¤— PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.

    As a member of the orthogonal finetuning class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the PEFT's GitHub repo's concept guide OFT, the original BOFT paper and the original OFT paper.

    In this guide we provide a controllable generation (ControlNet) fine-tuning script that is available in PEFT's GitHub repo examples. This implementation is adapted from diffusers's ControlNet and Hecong Wu's ControlLoRA. You can try it out and finetune on your custom images.
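
    Outside of the provided training script, the same idea can be expressed directly with PEFT's BOFTConfig. The sketch below is illustrative only: the target module names and block settings are assumptions, and `unet` stands for a diffusers UNet you have already loaded, not an object created by the script.

    from peft import BOFTConfig, get_peft_model

    config = BOFTConfig(
        boft_block_size=0,                # let the block size be derived from each layer's width
        boft_block_num=8,
        boft_n_butterfly_factor=1,
        target_modules=["to_q", "to_v"],  # assumed attention projections in the UNet
        boft_dropout=0.1,
        bias="boft_only",
    )
    peft_unet = get_peft_model(unet, config)  # `unet` is an assumed, already-loaded diffusers UNet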

    Set up your environment

    Start by cloning the PEFT repository:

    git clone https://github.com/huggingface/peft

    Navigate to the directory containing the training scripts for fine-tuning Dreambooth with BOFT:

    cd peft/examples/boft_controlnet

    Set up your environment: install PEFT, and all the required libraries. At the time of writing this guide we recommend installing PEFT from source.

    conda create --name peft python=3.10
    conda activate peft
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
    conda install xformers -c xformers
    pip install -r requirements.txt
    pip install git+https://github.com/huggingface/peft

    Data

    We use the control-celeba-hq dataset for landmark-to-face controllable generation. We also provide evaluation scripts to evaluate the controllable generation performance. This task can be used to quantitatively compare different fine-tuning techniques.

    export DATASET_NAME="oftverse/control-celeba-hq"

    Train controllable generation (ControlNet) with BOFT

    Start by setting some hyperparameters for BOFT:

    PEFT_TYPE="boft"
    BLOCK_NUM=8
    BLOCK_SIZE=0
    N_BUTTERFLY_FACTOR=0

    Here:

    Navigate to the directory containing the training scripts for fine-tuning Stable Diffusion with BOFT for controllable generation:

    ./train_controlnet.sh

    or

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export DATASET_NAME="oftverse/control-celeba-hq"
    export PROJECT_NAME="controlnet_${PEFT_TYPE}"
    export RUN_NAME="${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export CONTROLNET_PATH=""
    export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}"

    accelerate launch train_controlnet.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --resume_from_checkpoint=$RESUME_PATH \
      --controlnet_model_name_or_path=$CONTROLNET_PATH \
      --output_dir=$OUTPUT_DIR \
      --report_to="wandb" \
      --dataset_name=$DATASET_NAME \
      --resolution=512 \
      --learning_rate=1e-5 \
      --checkpointing_steps=5000 \
      --max_train_steps=50000 \
      --validation_steps=2000 \
      --num_validation_images=12 \
      --train_batch_size=4 \
      --dataloader_num_workers=2 \
      --seed="0" \
      --lr_scheduler="constant" \
      --lr_warmup_steps=0 \
      --wandb_project_name=$PROJECT_NAME \
      --wandb_run_name=$RUN_NAME \
      --enable_xformers_memory_efficient_attention \
      --use_boft \
      --boft_block_num=$BLOCK_NUM \
      --boft_block_size=$BLOCK_SIZE \
      --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \
      --boft_dropout=0.1 \
      --boft_bias="boft_only" \
      --report_to="wandb" \

    Run inference on the saved model to sample new images from the validation set:

    ./test_controlnet.sh

    or

    ITER_NUM=50000

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export RUN_NAME="${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export DATASET_NAME="oftverse/control-celeba-hq"
    export CKPT_NAME="checkpoint-${ITER_NUM}"
    export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}/${CKPT_NAME}"
    export CONTROLNET_PATH="${OUTPUT_DIR}/controlnet/model.safetensors"
    export UNET_PATH="${OUTPUT_DIR}/unet/${RUN_NAME}"
    export RESULTS_PATH="${OUTPUT_DIR}/results"

    accelerate launch test_controlnet.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --dataset_name=$DATASET_NAME \
      --controlnet_path=$CONTROLNET_PATH \
      --unet_path=$UNET_PATH \
      --adapter_name=$RUN_NAME \
      --output_dir=$RESULTS_PATH \
      --dataset_name=$DATASET_NAME \

    Run evaluation on the sampled images to evaluate the landmark reprojection error:

    ./eval.sh

    or

    ITER_NUM=50000

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export RUN_NAME="${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export DATASET_NAME="oftverse/control-celeba-hq"
    export CKPT_NAME="checkpoint-${ITER_NUM}"
    export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}/${CKPT_NAME}"
    export CONTROLNET_PATH="${OUTPUT_DIR}/controlnet/model.safetensors"
    export UNET_PATH="${OUTPUT_DIR}/unet/${RUN_NAME}"

    accelerate launch eval.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --dataset_name=$DATASET_NAME \
      --controlnet_path=$CONTROLNET_PATH \
      --unet_path=$UNET_PATH \
      --adapter_name=$RUN_NAME \
      --output_dir=$OUTPUT_DIR \
      --dataset_name=$DATASET_NAME \
      --vis_overlays \
    [huggingface/peft] examples/semantic_segmentation/semantic_segmentation_peft_lora.ipynb
                [20, 0, 255],
                [255, 255, 0],
                [0, 153, 255],
                [0, 41, 255],
                [0, 255, 204],
                [41, 0, 255],
                [41, 255, 0],
                [173, 0, 255],
                [0, 245, 255],
                [71, 0, 255],
                [122, 0, 255],
                [0, 255, 184],
                [0, 92, 255],
                [184, 255, 0],
                [0, 133, 255],
                [255, 214, 0],
                [25, 194, 194],
                [102, 255, 0],
                [92, 0, 255],
            ]
        )
    
    import matplotlib.pyplot as plt
    
    color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8)
    palette = np.array(ade_palette())
    
    for label, color in enumerate(palette):
        color_seg[pred_seg == label, :] = color
    color_seg = color_seg[..., ::-1]  # convert to BGR
    
    img = np.array(image) * 0.5 + color_seg * 0.5  # plot the image with the segmentation map
    img = img.astype(np.uint8)
    
    plt.figure(figsize=(15, 10))
    plt.imshow(img)
    plt.show()
    

    The results are definitely not as expected and as mentioned above, this example is not meant to provide a state-of-the-art model. It exists to familiarize you with the end-to-end workflow.

    On the other hand, if you perform full fine-tuning on the same setup (same model variant, same dataset, same training schedule, etc.), the results would not have been any different. This is a crucial aspect of parameter-efficient fine-tuning -- to be able to match up to the results of the full fine-tuning but with a fraction of total trainable parameters.

    Here are some things that you can try to get better results:

    • Increase the number of training samples.
    • Try a larger SegFormer model variant (learn more about the available model variants here).
    • Try different values for the arguments available in LoraConfig (see the sketch after this list).
    • Tune the learning rate and batch size.
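
    As a starting point for the LoraConfig experiments suggested above, here is an illustrative variant. None of these values come from the notebook; the target module and head names are assumptions about the SegFormer implementation.

    from peft import LoraConfig

    config = LoraConfig(
        r=32,                               # larger rank -> more trainable parameters
        lora_alpha=32,
        target_modules=["query", "value"],  # assumed attention projections in SegFormer
        lora_dropout=0.05,
        bias="lora_only",
        modules_to_save=["decode_head"],    # assumed name of the segmentation head
    )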
    [huggingface/peft] examples/boft_dreambooth/boft_dreambooth.md

    DreamBooth fine-tuning with BOFT

    This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fine-tune Dreambooth with either stabilityai/stable-diffusion-2-1 or runwayml/stable-diffusion-v1-5 model.

    By using BOFT from šŸ¤— PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.

    As a member of the orthogonal finetuning class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the PEFT's GitHub repo's concept guide OFT, the original BOFT paper and the original OFT paper.

    In this guide we provide a Dreambooth fine-tuning script that is available in PEFT's GitHub repo examples. This implementation is adapted from peft's lora_dreambooth. You can try it out and finetune on your custom images.

    Set up your environment

    Start by cloning the PEFT repository:

    git clone --recursive https://github.com/huggingface/peft

    Navigate to the directory containing the training scripts for fine-tuning Dreambooth with BOFT:

    cd peft/examples/boft_dreambooth

    Set up your environment: install PEFT, and all the required libraries. At the time of writing this guide we recommend installing PEFT from source. The following environment setup should work on A100 and H100:

    conda create --name peft python=3.10
    conda activate peft
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
    conda install xformers -c xformers
    pip install -r requirements.txt
    pip install git+https://github.com/huggingface/peft

    Download the data

    The dreambooth dataset should be automatically cloned into the following structure when running the training script.

    boft_dreambooth
    ā”œā”€ā”€ data
    ā”‚   ā”œā”€ā”€ data_dir
    ā”‚   ā””ā”€ā”€ dreambooth
    ā”‚       ā””ā”€ā”€ data
    ā”‚           ā”œā”€ā”€ backpack
    ā”‚           ā””ā”€ā”€ backpack_dog
    ā”‚           ...
    

    You can also put your custom images into boft_dreambooth/data/dreambooth.

    Finetune Dreambooth with BOFT

    ./train_dreambooth.sh

    or using the following script arguments:

    export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export INSTANCE_DIR="path-to-instance-images"
    export CLASS_DIR="path-to-class-images"
    export OUTPUT_DIR="path-to-save-model"

    Here:

    • INSTANCE_DIR: The directory containing the images that you intend to use for training your model.
    • CLASS_DIR: The directory containing class-specific images. In this example, we use prior preservation to avoid overfitting and language-drift. For prior preservation, you need other images of the same class as part of the training process. However, these images can be generated and the training script will save them to a local path you specify here.
    • OUTPUT_DIR: The destination folder for storing the trained model's weights.

    To learn more about DreamBooth fine-tuning with prior-preserving loss, check out the Diffusers documentation.

    Launch the training script with accelerate and pass hyperparameters, as well as BOFT-specific arguments, such as:

    • use_boft: Enables BOFT in the training script.
    • boft_block_size: the BOFT matrix block size across different layers, expressed as an int. A smaller block size results in sparser update matrices with fewer trainable parameters. Note: please choose a value that divides most layers' in_features dimension, e.g., 4, 8, 16. Also, you can only specify either boft_block_size or boft_block_num, but not both simultaneously, because boft_block_size x boft_block_num = layer dimension.
    • boft_block_num: the number of BOFT matrix blocks across different layers, expressed as an int. Fewer blocks result in sparser update matrices with fewer trainable parameters. Note: please choose a value that divides most layers' in_features dimension, e.g., 4, 8, 16. Also, you can only specify either boft_block_size or boft_block_num, but not both simultaneously, because boft_block_size x boft_block_num = layer dimension (see the worked example after this list).
    • boft_n_butterfly_factor: the number of butterfly factors. Note: for boft_n_butterfly_factor=1, BOFT is the same as vanilla OFT; for boft_n_butterfly_factor=2, the effective block size of OFT becomes twice as big and the number of blocks becomes half.
    • bias: specify whether the bias parameters should be trained. Can be none, all or boft_only.
    • boft_dropout: specify the probability of multiplicative dropout.
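
    As a worked example of the block-size arithmetic above (using an assumed layer width, not a value from the script):

    in_features = 768                                  # assumed layer width
    boft_block_num = 8
    boft_block_size = in_features // boft_block_num    # -> 96, since block_size * block_num == layer dimension
    # With boft_n_butterfly_factor=2, the effective block size doubles to 192
    # and the effective number of blocks halves to 4.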

    Here's what the full set of script arguments may look like:

    PEFT_TYPE="boft"
    BLOCK_NUM=8
    BLOCK_SIZE=0
    N_BUTTERFLY_FACTOR=1

    VALIDATION_PROMPT=${PROMPT_LIST[@]}
    INSTANCE_PROMPT="a photo of ${UNIQUE_TOKEN} ${CLASS_TOKEN}"
    CLASS_PROMPT="a photo of ${CLASS_TOKEN}"

    export MODEL_NAME="stabilityai/stable-diffusion-2-1"
    # export MODEL_NAME="runwayml/stable-diffusion-v1-5"
    export PROJECT_NAME="dreambooth_${PEFT_TYPE}"
    export RUN_NAME="${SELECTED_SUBJECT}_${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}"
    export INSTANCE_DIR="./data/dreambooth/dataset/${SELECTED_SUBJECT}"
    export CLASS_DIR="./data/class_data/${CLASS_TOKEN}"
    export OUTPUT_DIR="./data/output/${PEFT_TYPE}"

    accelerate launch train_dreambooth.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --instance_data_dir=$INSTANCE_DIR \
      --class_data_dir="$CLASS_DIR" \
      --output_dir=$OUTPUT_DIR \
      --wandb_project_name=$PROJECT_NAME \
      --wandb_run_name=$RUN_NAME \
      --with_prior_preservation --prior_loss_weight=1.0 \
      --instance_prompt="$INSTANCE_PROMPT" \
      --validation_prompt="$VALIDATION_PROMPT" \
      --class_prompt="$CLASS_PROMPT" \
      --resolution=512 \
      --train_batch_size=1 \
      --num_dataloader_workers=2 \
      --lr_scheduler="constant" \
      --lr_warmup_steps=0 \
      --num_class_images=200 \
      --use_boft \
      --boft_block_num=$BLOCK_NUM \
      --boft_block_size=$BLOCK_SIZE \
      --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \
      --boft_dropout=0.1 \
      --boft_bias="boft_only" \
      --learning_rate=3e-5 \
      --max_train_steps=1010 \
      --checkpointing_steps=200 \
      --validation_steps=200 \
      --enable_xformers_memory_efficient_attention \
      --report_to="wandb" \

    or use this training script:

    ./train_dreambooth.sh $idx

    where $idx corresponds to different subjects.

    If you are running this script on Windows, you may need to set the --num_dataloader_workers to 0.

    Inference with a single adapter

    To run inference with the fine-tuned model, simply run the Jupyter notebook dreambooth_inference.ipynb under ./examples/boft_dreambooth for visualization.

    [openaccess-ai-collective/axolotl] scripts/finetune.py
    """Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
    [huggingface/accelerate] examples/README.md

    Simple vision example

    The cv_example.py script is a simple example to fine-tune a ResNet-50 on a classification task (Oxford-IIIT Pet Dataset).

    The same script can be run in any of the following configurations:

    • single CPU or single GPU
    • multi CPUs
    • multi GPUs (using PyTorch distributed mode)
    • (multi) TPUs
    • fp16 (mixed-precision) or fp32 (normal precision)

    Prior to running it you should install timm and torchvision:

    pip install timm torchvision

    and you should download the data with the following commands:

    wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
    tar -xzf images.tar.gz

    To run it in each of these various modes, use the following commands:

    • single CPU:
      • from a server without GPU
        python ./cv_example.py --data_dir path_to_data
      • from any server by passing cpu=True to the Accelerator.
        python ./cv_example.py --data_dir path_to_data --cpu
      • from any server with Accelerate launcher
        accelerate launch --cpu ./cv_example.py --data_dir path_to_data
    • single GPU:
      python ./cv_example.py # from a server with a GPU
    • with fp16 (mixed-precision)
      • from any server by passing mixed_precision=fp16 to the Accelerator.
        python ./cv_example.py --data_dir path_to_data --mixed_precision fp16
      • from any server with Accelerate launcher
        accelerate launch --mixed_precision fp16 ./cv_example.py --data_dir path_to_data
    • multi CPUs (requires Open MPI, Intel MPI, or MVAPICH)
      • With Accelerate config and launcher, run the following from node 0:
        accelerate config --config_file config.yaml  # Select to have accelerate launch mpirun
        accelerate launch ./cv_example.py --data_dir path_to_data  # This will run the script on each server
      • With Intel MPI, execute mpirun from node 0:
        export CCL_WORKER_COUNT=1
        export MASTER_ADDR=xxx.xxx.xxx.xxx  # node0 ip
        mpirun -f hostfile -n 16 -ppn 4 python ./cv_example.py --data_dir path_to_data
    • multi GPUs (using PyTorch distributed mode)
      • With Accelerate config and launcher
        accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
        accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on your server
      • With traditional PyTorch launcher (python -m torch.distributed.run can be used instead of torchrun)
        torchrun --nproc_per_node 2 ./cv_example.py --data_dir path_to_data
    • multi GPUs, multi node (several machines, using PyTorch distributed mode)
      • With Accelerate config and launcher, on each machine:
        accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
        accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on each server
      • With PyTorch launcher only (python -m torch.distributed.run can be used instead of torchrun). Run this command on each node:
        torchrun \  # python -m torch.distributed.run
          --nproc_per_node 2 \
          --nnodes 2 \
          --rdzv_id 2299 \  # A unique job id
          --rdzv_backend c10d \
          --rdzv_endpoint master_node_ip_address:29500 \
          ./cv_example.py --data_dir path_to_data
    • (multi) TPUs
      • With Accelerate config and launcher
        accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
        accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on each server
      • In PyTorch: Add an xmp.spawn line in your script as you usually do (see the sketch below).
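
    For reference, the xmp.spawn pattern mentioned in the last bullet looks roughly like this. It is a sketch assuming torch_xla is installed and that main() is your existing training entry point.

        import torch_xla.distributed.xla_multiprocessing as xmp

        def _mp_fn(index):
            # index is the local ordinal of the TPU process; call your training entry point here
            main()

        if __name__ == "__main__":
            xmp.spawn(_mp_fn, args=())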

    Simple vision example (GANs)

    Using AWS SageMaker integration

    [huggingface/accelerate] examples/by_feature/megatron_lm_gpt_pretraining.py
    #!/usr/bin/env python # Copyright 2021 The HuggingFace Inc. team. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset without using HuggingFace Trainer. Here is the full list of checkpoints on the hub that can be fine-tuned by this script: https://huggingface.co/models?filter=text-generation """
    [openaccess-ai-collective/axolotl] examples/stablelm-2/1.6b/fft.yml
    base_model: stabilityai/stablelm-2-1_6b model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: true load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/llama-2/fft_optimized.yml
    base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/phi/phi2-ft.yml
    base_model: microsoft/phi-2 model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: garage-bAInd/Open-Platypus type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./phi-sft-out sequence_len: 2048 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_torch adam_beta2: 0.95 adam_epsilon: 0.00001 max_grad_norm: 1.0 lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: True early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 100 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: resize_token_embeddings_to_32x: true special_tokens: pad_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/phi/phi-ft.yml
    base_model: microsoft/phi-1_5 model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: garage-bAInd/Open-Platypus type: alpaca dataset_prepared_path: val_set_size: 0.05 output_dir: ./phi-sft-out sequence_len: 2048 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_torch adam_beta2: 0.95 adam_epsilon: 0.00001 max_grad_norm: 1.0 lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: True early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 100 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: resize_token_embeddings_to_32x: true special_tokens: pad_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/xgen-7b/xgen-7b-8k-qlora.yml
    # An example finetuning Saleforce's XGen-7b model with 8k context using qlora # on Tim Dettmer's Guanaco dataset. base_model: Salesforce/xgen-7b-8k-base trust_remote_code: true model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false # enable 4bit for QLoRA load_in_4bit: true gptq: false strict: false push_dataset_to_hub: datasets: - path: timdettmers/openassistant-guanaco data_files: - openassistant_best_replies_train.jsonl type: "completion" dataset_prepared_path: val_set_size: 0.05 # enable QLoRA adapter: qlora lora_model_dir: sequence_len: 8192 max_packed_sequence_len: # hyperparameters from QLoRA paper Appendix B.2 # "We find hyperparameters to be largely robust across datasets" lora_r: 64 lora_alpha: 16 # 0.1 for models up to 13B # 0.05 for 33B and 65B models lora_dropout: 0.05 # add LoRA modules on all linear layers of the base model lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./qlora-out # QLoRA paper Table 9 # - 16 for 7b & 13b # - 32 for 33b, 64 for 64b # Max size tested on A6000 # - 7b: 40 # - 40b: 4 # decrease if OOM, increase for max VRAM utilization micro_batch_size: 1 gradient_accumulation_steps: 1 num_epochs: 4 # Optimizer for QLoRA optimizer: paged_adamw_32bit torchdistx_path: lr_scheduler: cosine # QLoRA paper Table 9 # - 2e-4 for 7b & 13b # - 1e-4 for 33b & 64b learning_rate: 0.00002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true # stop training after this many evaluation losses have increased in a row # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback early_stopping_patience: 3 resume_from_checkpoint: auto_resume_from_checkpoints: true local_rank: logging_steps: 1 xformers_attention: true flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 10 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 special_tokens: eos_token: "<|endoftext|>" bos_token: "<|endoftext|>" unk_token: "<|endoftext|>" pad_token: "<|endoftext|>"
    [huggingface/accelerate] tests/test_big_modeling.py
    quantization_config = BitsAndBytesConfig(load_in_4bit=True) model = replace_with_bnb_linear( model, modules_to_not_convert=["lm_head"], quantization_config=quantization_config ) model_path = hf_hub_download("bigscience/bloom-560m", "pytorch_model.bin") # test with auto model = load_checkpoint_and_dispatch( model, checkpoint=model_path, device_map="auto", ) assert model.h[0].self_attention.query_key_value.weight.dtype == torch.uint8 assert model.h[0].self_attention.query_key_value.weight.device.index == 0 with init_empty_weights(): model = AutoModel.from_config(AutoConfig.from_pretrained("bigscience/bloom-560m")) model = replace_with_bnb_linear( model, modules_to_not_convert=["lm_head"], quantization_config=quantization_config ) # test with str device map model = load_checkpoint_and_dispatch( model, checkpoint=model_path, device_map={"": torch.device("cuda:0")}, ) assert model.h[0].self_attention.query_key_value.weight.dtype == torch.uint8 assert model.h[0].self_attention.query_key_value.weight.device.index == 0 with init_empty_weights(): model = AutoModel.from_config(AutoConfig.from_pretrained("bigscience/bloom-560m")) model = replace_with_bnb_linear( model, modules_to_not_convert=["lm_head"], quantization_config=quantization_config ) # test with torch.device device map model = load_checkpoint_and_dispatch( model, checkpoint=model_path, device_map={"": "cuda:0"}, ) assert model.h[0].self_attention.query_key_value.weight.dtype == torch.uint8 assert model.h[0].self_attention.query_key_value.weight.device.index == 0
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/__init__.py
def legacy_validate_config(cfg):
    """
    This is a "pre-validation" step that handles the yaml configuration before we have any
    information about the model architecture
    """
    if is_torch_bf16_gpu_available():
        if not cfg.bf16 and not cfg.bfloat16:
            LOG.info("bf16 support detected, but not enabled for this configuration.")
    else:
        if (
            not cfg.merge_lora
            and not cfg.is_preprocess
            and (cfg.bf16 is True or cfg.bfloat16 is True)
        ):
            raise ValueError(
                "bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above."
            )
    if (
        # pylint: disable=too-many-boolean-expressions
        not (cfg.bf16 or cfg.bfloat16)
        and (cfg.fp16 or cfg.float16)
        and not cfg.adapter
        and not cfg.flash_attention
        and cfg.sample_packing
    ):
        LOG.warning(
            "Full fine tune w/o FA2 w/ sample packing and fp16/float16 is likely to raise errors. Try LoRA."
        )
        # ValueError: Attempting to unscale FP16 gradients.
        # OR
        # RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
    if cfg.max_packed_sequence_len:
        raise DeprecationWarning("`max_packed_sequence_len` is no longer supported")

    if cfg.sample_packing and cfg.rl:
        raise ValueError("`sample_packing: true` does not work with RLHF training")

    if cfg.sample_packing and not cfg.pad_to_sequence_len:
        LOG.warning(
            "`pad_to_sequence_len: true` is recommended when using sample_packing"
        )

    if cfg.gradient_accumulation_steps and cfg.batch_size:
        raise ValueError(
            "please set only one of gradient_accumulation_steps or batch_size"
        )
    if cfg.batch_size:
        LOG.warning(
            "%s\n%s",
            "batch_size is not recommended. Please use gradient_accumulation_steps instead.",
            "To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
        )
    if (
        cfg.eval_batch_size
        and cfg.micro_batch_size
        and cfg.eval_batch_size != cfg.micro_batch_size
    ):
        LOG.warning(
            "eval_batch_size != micro_batch_size. This can lead to VRAM instability."
        )

    if cfg.adapter == "qlora":
        if cfg.merge_lora:
            # can't merge qlora if loaded in 8bit or 4bit
            if cfg.load_in_8bit:
                raise ValueError("Can't merge qlora if loaded in 8bit")
            if cfg.gptq:
                raise ValueError("Can't merge qlora if gptq")
            if cfg.load_in_4bit:
                raise ValueError("Can't merge qlora if loaded in 4bit")
        else:
            if cfg.load_in_8bit:
                raise ValueError("Can't load qlora in 8bit")
            if cfg.gptq:
                raise ValueError("Can't load qlora if gptq")
            if not cfg.load_in_4bit:
                raise ValueError("Require cfg.load_in_4bit to be True for qlora")

        if cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp:
            raise ValueError("Fused modules are not supported with QLoRA")

    loftq = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits
    if not cfg.load_in_8bit and cfg.adapter == "lora" and not loftq:
        LOG.warning("We recommend setting `load_in_8bit: true` for LORA finetuning")

    if cfg.adapter == "lora" and (cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp):
        raise ValueError("Fused modules are not supported with LoRA")

    if cfg.adapter and cfg.peft_layers_to_transform and cfg.unfrozen_parameters:
        raise ValueError(
            "`unfrozen_parameters` used with `peft_layers_to_transform` can have unexpected behavior."
        )

    if cfg.relora_steps:
        if cfg.adapter not in ("lora", "qlora"):
            raise ValueError("cfg.adapter must be lora or qlora to use ReLoRA")

        if cfg.fsdp:
            raise ValueError("fsdp not supported with ReLoRA")

        if cfg.deepspeed:
            raise ValueError("deepspeed not supported with ReLoRA")

        if cfg.lr_scheduler == "one_cycle":
            raise ValueError("ReLoRA is not compatible with the one_cycle scheduler")

        if cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp:
            raise ValueError("Fused modules are not supported with ReLoRA")

    if cfg.trust_remote_code:
        LOG.warning(
            "`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model."
        )

    if cfg.push_dataset_to_hub and cfg.hf_use_auth_token is not True:
        raise ValueError(
            "Require cfg.hf_use_auth_token to be True for push_dataset_to_hub"
        )

    if (cfg.base_model and "falcon" in cfg.base_model.lower()) and cfg.fsdp:
        raise ValueError("FSDP is not supported for falcon models")

    if (
        cfg.base_model and "mpt" in cfg.base_model.lower()
    ) and cfg.gradient_checkpointing:
        raise ValueError("gradient_checkpointing is not supported for MPT models")

    if cfg.flash_optimum is True:
        if cfg.adapter:
            LOG.warning("BetterTransformers probably doesn't work with PEFT adapters")
        if cfg.fp16 or cfg.bf16:
            raise ValueError("AMP is not supported with BetterTransformer")
        if cfg.float16 is not True and cfg.bfloat16 is not True:
            LOG.warning(
                "You should probably set bfloat16 or float16 to true to "
                "load the model in float16 for BetterTransformers"
            )
        if int(torch.__version__.split(".", maxsplit=1)[0]) < 2:
            LOG.warning("torch>=2.0.0 required")
            raise ValueError(
                f"flash_optimum for BetterTransformers may not be used with {torch.__version__}"
            )

    if cfg.pretraining_dataset and cfg.group_by_length:
        LOG.warning(
            "You probably want to disable group_by_length as it will force a streamed dataset to download completely."
        )
    if cfg.pretraining_dataset and not cfg.max_steps:
        raise ValueError(
            "max_steps must be set when using iterable pretraining_dataset, Trainer can't infer length and schedule optimizer/learning rate without it!"
        )

    if any([cfg.adam_beta1, cfg.adam_beta2, cfg.adam_epsilon]) and (
        not cfg.optimizer or "adamw" not in cfg.optimizer
    ):
        LOG.warning("adamw hyperparameters found, but no adamw optimizer set")

    if cfg.push_to_hub_model_id:
        raise ValueError(
            "push_to_hub_model_id is deprecated. Please use hub_model_id instead."
        )

    if cfg.hub_model_id and cfg.save_strategy not in ["steps", "epoch", None]:
        LOG.warning(
            "hub_model_id is set without any models being saved. To save a model, set save_strategy to steps, epochs or leave empty."
        )

    if cfg.gptq and cfg.revision_of_model:
        raise ValueError(
            "revision_of_model is not supported for GPTQ models. "
            + "Please download the model from HuggingFace Hub manually for correct branch, "
            + "point to its path, and remove revision_of_model from the config."
        )

    # if cfg.sample_packing and cfg.sdp_attention:
    #     # incompatible due to bug w/ accelerate causing 0.0 loss when using llama2
    #     raise ValueError(
    #         "sample_packing not compatible with sdp_attention. Use flash_attention"
    #     )

    if cfg.sample_packing and cfg.xformers_attention:
        raise ValueError(
            "sample_packing not compatible with xformers_attention. Use flash_attention"
        )

    if cfg.sample_packing and cfg.sdp_attention and (cfg.bfloat16 or cfg.bf16):
        # https://github.com/pytorch/pytorch/blob/1b03423526536b5f3d35bdfa95ccc6197556cf9b/test/test_transformers.py#L2440-L2450
        LOG.warning(
            "sample_packing & torch sdpa with bf16 is unsupported and may result in 0.0 loss. "
            "This may work on H100s."
        )

    if cfg.early_stopping_patience:
        if not cfg.save_steps or not cfg.eval_steps:
            raise ValueError(
                "`early_stopping_patience` requires save_steps and eval_steps to be set. eval_steps should evenly divide save_steps."
            )
        if cfg.save_steps % cfg.eval_steps != 0:
            raise ValueError(
                "`early_stopping_patience` requires that eval_steps should evenly divide save_steps."
            )

    if cfg.datasets:
        for idx, ds_cfg in enumerate(cfg.datasets):
            if not ds_cfg.type:
                continue
            if ds_cfg.type == "sharegpt:chat":
                LOG.warning(
                    PendingDeprecationWarning(
                        "`type: sharegpt:chat` will soon be deprecated. simply use `type: sharegpt` instead."
                    )
                )
                cfg.datasets[idx].type = "sharegpt"
            if "sharegpt_simple" in ds_cfg.type:
                LOG.warning(
                    PendingDeprecationWarning(
                        "`type: sharegpt_simple` will soon be deprecated. simply use `type: sharegpt` instead."
                    )
                )
                cfg.datasets[idx].type = cfg.datasets[idx].type.replace(
                    "sharegpt_simple", "sharegpt"
                )

    if cfg.saves_per_epoch and cfg.save_steps:
        raise ValueError(
            "save_steps and saves_per_epoch are mutually exclusive and cannot be used together."
        )
    if cfg.save_strategy and cfg.saves_per_epoch and cfg.save_strategy != "steps":
        raise ValueError(
            "save_strategy must be empty or set to `steps` when used with saves_per_epoch."
        )
    if cfg.save_strategy and cfg.save_steps and cfg.save_strategy != "steps":
        raise ValueError(
            "save_strategy and save_steps mismatch. Please set save_strategy to 'steps' or remove save_steps."
        )
    if cfg.evals_per_epoch and cfg.eval_steps:
        raise ValueError(
            "eval_steps and evals_per_epoch are mutually exclusive and cannot be used together."
        )
    if (
        cfg.evals_per_epoch
        and cfg.evaluation_strategy
        and cfg.evaluation_strategy != "steps"
    ):
        raise ValueError(
            "evaluation_strategy must be empty or set to `steps` when used with evals_per_epoch."
        )
    if (
        cfg.evaluation_strategy
        and cfg.eval_steps
        and cfg.evaluation_strategy != "steps"
    ):
        raise ValueError(
            "evaluation_strategy and eval_steps mismatch. Please set evaluation_strategy to 'steps' or remove eval_steps."
        )

    if (
        cfg.val_set_size == 0
        and (cfg.eval_steps or cfg.evaluation_strategy)
        and not cfg.test_datasets
    ):
        raise ValueError(
            "eval_steps and evaluation_strategy are not supported with val_set_size == 0"
        )

    if (
        cfg.sample_packing
        and cfg.eval_table_size
        and cfg.eval_sample_packing is not False
    ):
        raise ValueError(
            "eval_table_size and eval_sample_packing are not supported together with sample_packing. Please set 'eval_sample_packing' to false."
        )

    if not cfg.adapter and (cfg.load_in_8bit or cfg.load_in_4bit):
        raise ValueError(
            "load_in_8bit and load_in_4bit are not supported without setting an adapter."
            "If you want to full finetune, please turn off load_in_8bit and load_in_4bit."
        )

    if cfg.rope_scaling:
        LOG.warning("`rope_scaling` should now be a key under `model_config`")

    if cfg.wandb_run_id and not cfg.wandb_name:
        cfg.wandb_name = cfg.wandb_run_id

        LOG.warning(
            "wandb_run_id sets the ID of the run. If you would like to set the name, please use wandb_name instead."
        )

    if cfg.noisy_embedding_alpha is not None:
        # Deprecated, use neftune_noise_alpha
        LOG.warning("noisy_embedding_alpha is deprecated, use neftune_noise_alpha")
        if cfg.neftune_noise_alpha is None:
            cfg.neftune_noise_alpha = cfg.noisy_embedding_alpha
        else:
            # User is providing both; bail and have them sort out their settings
            raise ValueError(
                "noisy_embedding_alpha is deprecated, use neftune_noise_alpha; both are set, please remove the deprecated noisy_embedding_alpha setting"
            )

    if cfg.neftune_noise_alpha is not None and cfg.neftune_noise_alpha <= 0.0:
        raise ValueError("neftune_noise_alpha must be > 0.0")

    if cfg.max_memory is not None and cfg.gpu_memory_limit is not None:
        raise ValueError(
            "max_memory and gpu_memory_limit are mutually exclusive and cannot be used together."
        )

    if (
        cfg.unfrozen_parameters
        and cfg.gradient_checkpointing_kwargs
        and cfg.gradient_checkpointing_kwargs.use_reentrant is True
    ):
        # https://github.com/huggingface/transformers/issues/21381
        raise ValueError(
            "`use_reentrant` must be false when used with partially frozen model."
        )

    if cfg.deepspeed and Path(cfg.deepspeed).is_file():
        with open(cfg.deepspeed, encoding="utf-8") as file:
            contents = file.read()
            deepspeed_cfg: DictDefault = DictDefault(json.loads(contents))
            if cfg.flash_attention:
                if (
                    deepspeed_cfg.zero_optimization
                    and deepspeed_cfg.zero_optimization.stage == 3
                ):
                    if not (
                        (
                            deepspeed_cfg.bf16
                            and deepspeed_cfg.bf16.enabled  # pylint: disable=no-member
                            is True
                        )
                        or (
                            deepspeed_cfg.fp16
                            and deepspeed_cfg.fp16.enabled  # pylint: disable=no-member
                            is True
                        )
                    ):
                        raise ValueError(
                            "bf16.enabled or fp16.enabled must be set to true when using ZeRO-3 with flash-attention"
                        )
            if "8bit" in cfg.optimizer and deepspeed_cfg.optimizer:
                LOG.warning(
                    f"conflicting optimizer: {cfg.optimizer} used alongside deepspeed optimizer."
                )

    if cfg.test_datasets and cfg.val_set_size:
        raise ValueError(
            "non-zero val_set_size should not be used with test_datasets configuration"
        )

    if cfg.fsdp and "bnb" in cfg.optimizer:
        raise ValueError(f"FSDP not compatible with {cfg.optimizer}")

    if cfg.do_causal_lm_eval and cfg.eval_sample_packing:
        raise ValueError(
            "do_causal_lm_eval is enabled, eval_sample_packing must be set to False"
        )

    if cfg.eval_causal_lm_metrics:
        supported_metrics = ["sacrebleu", "comet", "ter", "chrf"]
        if not isinstance(cfg.eval_causal_lm_metrics, list):
            raise ValueError("eval_causal_lm_metrics must be a list")
        # only ["sacrebleu", "comet", "ter", "chrf"] supported
        if set(cfg.eval_causal_lm_metrics) - set(supported_metrics):
            raise ValueError(
                f"eval_causal_lm_metrics must be one of {supported_metrics}"
            )

    # TODO
    # MPT 7b
    # https://github.com/facebookresearch/bitsandbytes/issues/25
    # no 8bit adamw w bf16

    # GPT-NeoX
    # evals broken when extending context len
    # File ".../transformers/models/gpt_neox/modeling_gpt_neox.py", line 162, in forward
    #     attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
    # File ".../optimum/bettertransformer/models/attention.py", line 74, in gpt2_wrapped_scaled_dot_product
    #     attention_mask = causal_mask + attention_mask
    # RuntimeError: The size of tensor a (2048) must match the size of tensor b (8132) at non-singleton dimension 3
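One of the warnings near the top of this validator spells out how to convert a legacy `batch_size` into `gradient_accumulation_steps` (divide by `micro_batch_size` and the number of GPUs). A tiny illustration with made-up numbers:

# Illustrative only: the conversion described by the batch_size warning above,
# using hypothetical values.
batch_size = 32        # legacy-style total batch size
micro_batch_size = 2   # per-GPU batch size actually sent through the model
num_gpus = 4

gradient_accumulation_steps = batch_size // micro_batch_size // num_gpus
print(gradient_accumulation_steps)  # -> 4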
    [openaccess-ai-collective/axolotl] examples/mistral/mixtral.yml
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./qlora-out

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
#  - ^lm_head.weight$
#  - ^model.embed_tokens.weight$[:32000]
#  - model.layers.2[0-9]+.block_sparse_moe.gate
#  - model.layers.2[0-9]+.block_sparse_moe.experts
#  - model.layers.3[0-9]+.block_sparse_moe.gate
#  - model.layers.3[0-9]+.block_sparse_moe.experts

model_config:
  output_router_logits: true

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
#lora_target_modules:
#  - gate
#  - q_proj
#  - k_proj
#  - v_proj
#  - o_proj
#  - w1
#  - w2
#  - w3

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
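The commented-out `unfrozen_parameters` entries above are regular expressions matched against parameter names (for example, the MoE gates and experts of later layers). As a rough mental model, the effect is "freeze everything, then re-enable gradients for parameters whose names match a pattern". The sketch below illustrates that idea on a stand-in module; it is not axolotl's internal implementation, and in practice the patterns would be the ones from the config and `model` would be the loaded Mixtral checkpoint.

import re
import torch.nn as nn

# Stand-in module for illustration; a real run would use the loaded base model.
model = nn.ModuleDict({"lm_head": nn.Linear(8, 8), "gate": nn.Linear(8, 2)})

patterns = [r"^lm_head\.weight$", r"gate"]  # hypothetical patterns

for name, param in model.named_parameters():
    # unfreeze only the parameters whose names match at least one pattern
    param.requires_grad = any(re.search(p, name) for p in patterns)

print([n for n, p in model.named_parameters() if p.requires_grad])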
    [openaccess-ai-collective/axolotl] examples/falcon/config-7b.yml
base_model: tiiuae/falcon-7b
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
gptq: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca:chat
dataset_prepared_path:
val_set_size: 0.05
adapter:
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:
lora_r: 64
lora_alpha: 32
lora_dropout: 0.0
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./falcon-7b
batch_size: 2
micro_batch_size: 1
num_epochs: 4
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.00003
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 40
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
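The `special_tokens` block at the end sets Falcon's pad/bos/eos tokens to "<|endoftext|>". On the tokenizer side this corresponds roughly to the following; a minimal sketch for orientation, not axolotl's internal code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "pad_token": "<|endoftext|>",
        "bos_token": "<|endoftext|>",
        "eos_token": "<|endoftext|>",
    }
)
print(tokenizer.pad_token, tokenizer.eos_token)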
    [openaccess-ai-collective/axolotl] examples/falcon/config-7b-qlora.yml
# 1b: tiiuae/falcon-rw-1b
# 40b: tiiuae/falcon-40b
base_model: tiiuae/falcon-7b
# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
# enable 4bit for QLoRA
load_in_4bit: true
gptq: false
strict: false
push_dataset_to_hub:
datasets:
  - path: QingyiSi/Alpaca-CoT
    data_files:
      - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
    type: "alpaca:chat"
dataset_prepared_path:
val_set_size: 0.05
# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:

# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./qlora-out

# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 4
# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
gradient_checkpointing: true
# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.000001
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
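As the validation code earlier enforces, a QLoRA adapter cannot be merged while the base model is loaded in 4-bit. A common pattern after training is to reload the base model at higher precision, attach the trained adapter, and merge it; the sketch below illustrates that with peft, where "./qlora-out" matches this config's output_dir and "./qlora-merged" is a hypothetical destination.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in fp16 (not 4-bit) so the LoRA weights can be folded in.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
merged = PeftModel.from_pretrained(base, "./qlora-out").merge_and_unload()
merged.save_pretrained("./qlora-merged")  # hypothetical output path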
OpenAccess-AI-Collective/axolotl
huggingface/transformers
huggingface/peft
huggingface/accelerate