Phorm.ai
OpenAccess-AI-Collective/axolotl

    If I change the number of epochs by X, but I want to keep the amount of learning per data sample the same, I should reduce the learning rate by X.

    If I change the micro_batch_size by X, should I change the learning rate? If I change the gradient_accumulation_steps by X, should I change the learning rate? If I change the number of GPUs by X, should I change the learning rate? Assume that I want to do roughly the same amount of learning per data sample.

    When adjusting the number of epochs, micro_batch_size, gradient_accumulation_steps, or the number of GPUs, and aiming to keep the amount of learning per data sample roughly the same, here's how you should consider adjusting the learning rate:

    1. Changing the number of epochs by X: If you increase the number of epochs by X, to keep the amount of learning per data sample the same, you should reduce the learning rate by X. This is because more epochs mean more passes through the data, so reducing the learning rate compensates for the increased number of updates.

    2. Changing the micro_batch_size by X: Generally, when you change the micro_batch_size, it's not strictly necessary to adjust the learning rate if the effective batch size (micro_batch_size * gradient_accumulation_steps * number of GPUs) remains constant. However, if the effective batch size changes, you might need to adjust the learning rate. The common practice is to scale the learning rate linearly with the batch size (source).

    3. Changing the gradient_accumulation_steps by X: If you change the gradient_accumulation_steps by X, you do not need to adjust the learning rate directly because gradient accumulation is a technique to simulate larger batch sizes without increasing the memory consumption. The key is to ensure the effective batch size remains the same. If the effective batch size changes due to the adjustment in gradient accumulation steps, then consider adjusting the learning rate accordingly.

    4. Changing the number of GPUs by X: When you increase the number of GPUs by X, it's recommended to scale the learning rate linearly with the number of GPUs to keep the amount of learning per data sample the same. This is because increasing the number of GPUs effectively increases the batch size, allowing for a higher learning rate without the risk of instability (source).

    In summary, the key factor to consider when deciding whether to adjust the learning rate is the effective batch size (micro_batch_size * gradient_accumulation_steps * number of GPUs). If the effective batch size changes, then adjusting the learning rate may be necessary to maintain the same amount of learning per data sample. The common guideline is to scale the learning rate linearly with the effective batch size.
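
    To make the arithmetic concrete, here is a minimal sketch of the linear-scaling heuristic with made-up numbers (the baseline values are assumptions, and linear scaling is a rule of thumb rather than a guarantee):

    ```python
    # Hypothetical baseline and new settings; the linear-scaling rule is a heuristic, not a guarantee.
    base_lr = 2e-4
    base_effective_batch = 2 * 4 * 1   # micro_batch_size * gradient_accumulation_steps * num_gpus
    new_effective_batch = 2 * 4 * 4    # e.g. moving from 1 GPU to 4 GPUs

    scaled_lr = base_lr * (new_effective_batch / base_effective_batch)
    print(scaled_lr)  # 0.0008 -- 4x the GPUs, 4x the effective batch, 4x the learning rate
    ```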

    Search results
    [openaccess-ai-collective/axolotl] docs/batch_vs_grad.qmd
    ---
    title: Batch size vs Gradient accumulation
    description: Understanding of batch size and gradient accumulation steps
    ---
    
    Gradient accumulation means accumulating gradients over several mini-batches and updating the model weights afterward. When the samples in each batch are diverse, this technique doesn't significantly impact learning.
    
    This method allows for effective training with larger effective batch sizes without needing proportionally larger memory. Here's why:
    
    1. **Memory Consumption with Batch Size**: The primary reason increasing the batch size impacts memory is due to the storage requirements for intermediate activations. When you forward propagate a batch through a network, you have to store the activations at each layer for each sample in the batch, because these activations are used during backpropagation to compute gradients. Therefore, larger batches mean more activations, leading to greater GPU memory consumption.
    
    2. **Gradient Accumulation**: With gradient accumulation, you're effectively simulating a larger batch size by accumulating gradients over several smaller batches (or micro-batches). However, at any given time, you're only forward and backward propagating a micro-batch. This means you only store activations for the micro-batch, not the full accumulated batch. As a result, you can simulate the effect of a larger batch size without the memory cost of storing activations for a large batch, as sketched below.
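
    A minimal PyTorch sketch of this accumulation pattern (the tiny linear model, random data, and accumulation factor are invented for illustration; axolotl and the 🤗 Trainer handle this for you):

    ```python
    import torch
    from torch import nn

    # Hypothetical tiny setup for illustration only
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()
    micro_batches = [(torch.randn(3, 10), torch.randn(3, 1)) for _ in range(4)]

    accumulation_steps = 2
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(micro_batches):
        # Scale the loss so the accumulated gradient equals the average over the full simulated batch
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        loss.backward()  # gradients add up in .grad; only this micro-batch's activations are alive
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one update per simulated batch of 3 * 2 = 6 samples
            optimizer.zero_grad()
    ```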
    
    **Example 1:**
    Micro batch size: 3
    Gradient accumulation steps: 2
    Number of GPUs: 3
    Total batch size = 3 * 2 * 3 = 18
    
    

    | GPU 1          | GPU 2          | GPU 3          |
    |----------------|----------------|----------------|
    | S1, S2, S3     | S4, S5, S6     | S7, S8, S9     |
    | e1, e2, e3     | e4, e5, e6     | e7, e8, e9     |
    |----------------|----------------|----------------|
    | → (accumulate) | → (accumulate) | → (accumulate) |
    |----------------|----------------|----------------|
    | S10, S11, S12  | S13, S14, S15  | S16, S17, S18  |
    | e10, e11, e12  | e13, e14, e15  | e16, e17, e18  |
    |----------------|----------------|----------------|
    | → (apply)      | → (apply)      | → (apply)      |

    Accumulated gradient for the weight w1 after the second iteration (considering all GPUs): Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6 + e7 + e8 + e9 + e10 + e11 + e12 + e13 + e14 + e15 + e16 + e17 + e18

    Weight update for w1: w1_new = w1_old - learning rate × (Total gradient for w1 / 18)

    
    **Example 2:**
    Micro batch size: 2
    Gradient accumulation steps: 1
    Number of GPUs: 3
    Total batch size = 2 * 1 * 3 = 6
    
    

    | GPU 1     | GPU 2     | GPU 3     |
    |-----------|-----------|-----------|
    | S1, S2    | S3, S4    | S5, S6    |
    | e1, e2    | e3, e4    | e5, e6    |
    |-----------|-----------|-----------|
    | → (apply) | → (apply) | → (apply) |

    Accumulated gradient for the weight w1 (considering all GPUs): Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6

    Weight update for w1: w1_new = w1_old - learning rate × (Total gradient for w1 / 6)

    [huggingface/accelerate] docs/source/concept_guides/performance.md

    Comparing performance between different device setups

    Evaluating and comparing the performance from different setups can be quite tricky if you don't know what to look for. For example, you cannot run the same script with the same batch size across TPU, multi-GPU, and single-GPU with Accelerate and expect your results to line up.

    But why?

    There are three reasons for this that this tutorial will cover:

    1. Setting the right seeds
    2. Observed Batch Sizes
    3. Learning Rates

    Setting the Seed

    While this issue has not come up as much, make sure to use [utils.set_seed] to fully set the seed in all distributed cases so training will be reproducible:

    from accelerate.utils import set_seed

    set_seed(42)

    Why is this important? Under the hood this will set 5 different seed settings:

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # ^^ safe to call this function even if cuda is not available
    if is_torch_xla_available():
        xm.set_rng_state(seed)

    This sets Python's random state, NumPy's state, torch's state, torch's CUDA state, and, if TPUs are available, torch_xla's state.

    Observed Batch Sizes

    When training with Accelerate, the batch size passed to the dataloader is the batch size per GPU. This means that a batch size of 64 on two GPUs is really a total batch size of 128. As a result, this needs to be accounted for when testing on a single GPU, and similarly for TPUs.

    The below table can be used as a quick reference to try out different batch sizes:

    <Tip>

    In this example, there are two GPUs for "Multi-GPU" and a TPU pod with 8 workers

    </Tip>

    | Single GPU Batch Size | Multi-GPU Equivalent Batch Size | TPU Equivalent Batch Size |
    |-----------------------|---------------------------------|---------------------------|
    | 256                   | 128                             | 32                        |
    | 128                   | 64                              | 16                        |
    | 64                    | 32                              | 8                         |
    | 32                    | 16                              | 4                         |

    Learning Rates

    As noted in multiple sources[1][2], the learning rate should be scaled linearly based on the number of devices present. The below snippet shows doing so with Accelerate:

    <Tip>

    Since users can have their own learning rate schedulers defined, we leave this up to the user to decide if they wish to scale their learning rate or not.

    </Tip>
    learning_rate = 1e-3
    accelerator = Accelerator()
    learning_rate *= accelerator.num_processes
    optimizer = AdamW(params=model.parameters(), lr=learning_rate)

    You will also find that accelerate will step the learning rate based on the number of processes being trained on. This is because of the observed batch size noted earlier. So in the case of 2 GPUs, the learning rate will be stepped twice as often as a single GPU to account for the batch size being twice as large (if no changes to the batch size on the single GPU instance are made).
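
    A hedged sketch of how this usually fits together, assuming `model` and `dataloader` already exist and that the scheduler is passed through accelerator.prepare so Accelerate can handle the stepping; the step count of 1000 is arbitrary:

    ```python
    from accelerate import Accelerator
    from torch.optim import AdamW
    from transformers import get_linear_schedule_with_warmup

    accelerator = Accelerator()
    learning_rate = 1e-3 * accelerator.num_processes  # linear scaling, as above
    optimizer = AdamW(params=model.parameters(), lr=learning_rate)  # `model` assumed to exist
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=1000)

    # Once prepared, the scheduler is stepped on every process, so with 2 GPUs it
    # advances twice as often as it would on a single GPU.
    model, optimizer, dataloader, scheduler = accelerator.prepare(model, optimizer, dataloader, scheduler)
    ```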

    Gradient Accumulation and Mixed Precision

    When using gradient accumulation and mixed precision, due to how gradient averaging works (accumulation) and the precision loss (mixed precision), some degradation in performance is expected. This will be explicitly seen when comparing the batch-wise loss between different compute setups. However, the overall loss, metric, and general performance at the end of training should be roughly the same.

    [huggingface/transformers] docs/source/en/perf_train_gpu_one.md

    Gradient Accumulation

    The gradient accumulation method aims to calculate gradients in smaller increments instead of computing them for the entire batch at once. This approach involves iteratively calculating gradients in smaller batches by performing forward and backward passes through the model and accumulating the gradients during the process. Once a sufficient number of gradients have been accumulated, the model's optimization step is executed. By employing gradient accumulation, it becomes possible to increase the effective batch size beyond the limitations imposed by the GPU's memory capacity. However, it is important to note that the additional forward and backward passes introduced by gradient accumulation can slow down the training process.

    You can enable gradient accumulation by adding the gradient_accumulation_steps argument to [TrainingArguments]:

    training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)

    In the above example, your effective batch size becomes 4.

    Alternatively, use 🤗 Accelerate to gain full control over the training loop. Find the 🤗 Accelerate example further down in this guide.

    While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let's say, the per_device_train_batch_size=4 without gradient accumulation hits the GPU's limit. If you would like to train with batches of size 64, do not set the per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources.
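
    A hedged sketch of that recommendation, reusing the `default_args` placeholder from the earlier snippet (assumed to be defined elsewhere):

    ```python
    from transformers import TrainingArguments

    # Effective batch size of 64 via 4 * 16, rather than 1 * 64
    training_args = TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        **default_args,  # assumed to exist, as in the earlier example
    )
    ```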

    For additional information, please refer to batch size and gradient accumulation benchmarks for RTX-3090 and A100.

    [huggingface/transformers] templates/adding_a_new_example_script/{{cookiecutter.directory_name}}/run_{{cookiecutter.example_shortcut}}.py
    # Scheduler and math around the number of training steps.
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    [huggingface/accelerate] examples/by_feature/gradient_accumulation.py
    MAX_GPU_BATCH_SIZE = 16
    EVAL_BATCH_SIZE = 32
    [huggingface/transformers] docs/source/en/perf_train_gpu_one.md

    Batch size choice

    To achieve optimal performance, start by identifying the appropriate batch size. It is recommended to use batch sizes and input/output neuron counts that are of size 2^N. Often it's a multiple of 8, but it can be higher depending on the hardware being used and the model's dtype.

    For reference, check out NVIDIA's recommendation for input/output neuron counts and batch size for fully connected layers (which are involved in GEMMs (General Matrix Multiplications)).

    Tensor Core Requirements define the multiplier based on the dtype and the hardware. For instance, for fp16 data type a multiple of 8 is recommended, unless it's an A100 GPU, in which case use multiples of 64.

    For parameters that are small, consider also Dimension Quantization Effects. This is where tiling happens and the right multiplier can have a significant speedup.
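
    As a rough illustration of the rounding rule above (a sketch, not an official utility):

    ```python
    import math

    def pad_to_multiple(n: int, multiple: int = 8) -> int:
        """Round a batch size or neuron count up to the next hardware-friendly multiple."""
        return multiple * math.ceil(n / multiple)

    print(pad_to_multiple(50))      # 56 -- fp16 on most GPUs: multiples of 8
    print(pad_to_multiple(50, 64))  # 64 -- fp16 on an A100: multiples of 64
    ```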

    [huggingface/accelerate] docs/source/concept_guides/gradient_synchronization.md

    Gradient Synchronization

    PyTorch's distributed module operates by communicating back and forth between all of the GPUs in your system. This communication takes time, and ensuring all processes know the states of each other happens at particular trigger points when using the ddp module.

    These triggerpoints are added to the PyTorch model, specifically their forward() and backward() methods. This happens when the model is wrapped with DistributedDataParallel:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    model = nn.Linear(10, 10)
    ddp_model = DistributedDataParallel(model)

    In 🤗 Accelerate this conversion happens automatically when calling [~Accelerator.prepare] and passing in your model.

    + from accelerate import Accelerator
    + accelerator = Accelerator()
      import torch.nn as nn
    - from torch.nn.parallel import DistributedDataParallel
      model = nn.Linear(10,10)
    + model = accelerator.prepare(model)

    The slowdown in gradient accumulation

    You now understand that PyTorch adds hooks to the forward and backward method of your PyTorch model when training in a distributed setup. But how does this risk slowing down your code?

    In DDP (distributed data parallel), processes are expected to reach specific synchronization points in a specific order, and they must do so at roughly the same time before moving on.

    The most direct example is when you update model parameters through optimizer.step(). Without gradient accumulation, all instances of the model need to have updated their gradients computed, collated, and updated before moving on to the next batch of data. When performing gradient accumulation, you accumulate n loss gradients and skip optimizer.step() until n batches have been reached. As all training processes only need to synchronize by the time optimizer.step() is called, without any modification to your training step, this needless inter-process communication can cause a significant slowdown.

    How can you avoid this overhead?

    Solving the slowdown problem

    Since you are skipping model parameter updates when training on these batches, their gradients do not need to be synchronized until the point where optimizer.step() is actually called. PyTorch cannot automagically tell when you need to do this, but they do provide a tool to help through the no_sync context manager that is added to your model after converting it to DDP.

    Under this context manager, PyTorch will skip synchronizing the gradients when .backward() is called, and the first call to .backward() outside this context manager will trigger the synchronization. See an example below:

    ddp_model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

    for index, batch in enumerate(dataloader):
        inputs, targets = batch
        # Trigger gradient synchronization on the last batch
        if index != (len(dataloader) - 1):
            with ddp_model.no_sync():
                # Gradients only accumulate
                outputs = ddp_model(inputs)
                loss = loss_func(outputs)
                accelerator.backward(loss)
        else:
            # Gradients finally sync
            outputs = ddp_model(inputs)
            loss = loss_func(outputs)
            accelerator.backward(loss)
            optimizer.step()

    To make this an API that can be called regardless of the training device (though it may not do anything if you are not in a distributed system!), 🤗 Accelerate replaces ddp_model.no_sync with [~Accelerator.no_sync], which operates the same way:

    ddp_model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

    for index, batch in enumerate(dataloader):
        inputs, targets = batch
        # Trigger gradient synchronization on the last batch
        if index != (len(dataloader) - 1):
    -       with ddp_model.no_sync():
    +       with accelerator.no_sync(model):
                # Gradients only accumulate
                outputs = ddp_model(inputs)
                loss = loss_func(outputs, targets)
                accelerator.backward(loss)
        else:
            # Gradients finally sync
            outputs = ddp_model(inputs)
            loss = loss_func(outputs)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

    As you may expect, the [~Accelerator.accumulate] function wraps around this conditional check by keeping track of the current batch number, leaving you with the final gradient accumulation API:

    ddp_model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

    for batch in dataloader:
        with accelerator.accumulate(model):
            optimizer.zero_grad()
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

    As a result, you should either use accelerator.accumulate or accelerator.no_sync when it comes to API choice.

    Just how much of a slowdown is there, and easy mistakes you can make

    To set up a realistic example, consider the following setup:

    • Two single-GPU T4 nodes and one node with two GPUs
    • Each GPU is a T4, and are hosted on GCP
    • The script used is a modification of the NLP Example script
    • Batch size per GPU is 16, and gradients are accumulated every 4 steps

    All scripts are available in this repository.

    If not careful about gradient synchronization and GPU communication, a large amount of time can be wasted from when these GPUs communicate to each other during unnecessary periods.

    By how much?

    Reference:

    • Baseline: uses no synchronization practices discussed here
    • no_sync improperly: no_sync only around the backward call, not the forward
    • no_sync: using the no_sync pattern properly
    • accumulate: using [~Accelerator.accumulate] properly

    Below are the average seconds per batch iterating over 29 batches of data for each setup on both a single node and on the dual-node setup:

    |             | Baseline   | no_sync improperly | no_sync     | accumulate  |
    | :---------: | :--------: | :----------------: | :---------: | :---------: |
    | Multi-Node  | 2±0.01s    | 2.13±0.08s         | 0.91±0.11s  | 0.91±0.11s  |
    | Single Node | 0.50±0.01s | 0.50±0.01s         | 0.41±0.015s | 0.41±0.015s |

    As you can see, if you are not careful about how you set up your gradient synchronization, you can get more than a 2x slowdown during training!

    If you are worried about making sure everything is done properly, we highly recommend utilizing the [~Accelerator.accumulate] function and passing in gradient_accumulation_steps or gradient_accumulation_plugin to the [Accelerator] object so Accelerate can handle this for you.
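
    A minimal sketch of that recommended route (assuming `model`, `optimizer`, `dataloader`, and `loss_function` are defined as in the earlier snippets):

    ```python
    from accelerate import Accelerator

    accelerator = Accelerator(gradient_accumulation_steps=4)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        with accelerator.accumulate(model):  # Accelerate tracks when to sync and step
            inputs, targets = batch
            loss = loss_function(model(inputs), targets)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
    ```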

    no_sync requires additional GPU memory when using FSDP

    Be aware that not syncing gradients can have adverse effects while performing FSDP training. As warned in torch, the no_sync context manager for FSDP will require additional memory.

    Therefore, in memory-intensive situations while using FSDP, we recommend setting sync_each_batch to True in the [~utils.GradientAccumulationPlugin] to disable no_sync.
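
    A hedged sketch of what that configuration might look like; the parameter names `num_steps` and `sync_each_batch` are my reading of the plugin's API and worth double-checking against your Accelerate version:

    ```python
    from accelerate import Accelerator
    from accelerate.utils import GradientAccumulationPlugin

    # Trade some speed for memory under FSDP by syncing every batch instead of using no_sync
    plugin = GradientAccumulationPlugin(num_steps=16, sync_each_batch=True)
    accelerator = Accelerator(gradient_accumulation_plugin=plugin)
    ```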

    See the example below where we fine-tune Mixtral (47B parameters) on 8 A100-80GB GPUs. We see that even for a modest gradient_accumulation_steps=2 we quickly go out-of-memory (OOM) if no_sync is enabled. Again, this is due to the additional memory overhead of FSDP's no_sync. However, if no_sync is disabled via sync_each_batch=True, then the memory consumption for gradient_accumulation_steps=16 reverts to that of gradient_accumulation_steps=1.

    | Model        | no_sync (accum=1) | no_sync (accum=2) | no_sync disabled (accum=16) |
    | :----------: | :---------------: | :---------------: | :-------------------------: |
    | mixtral 8x7B | 69G               | OOM               | 69G                         |

    [!WARNING] Disabling no_sync means there will be a slowdown due to the extra data syncs, as explained in the earlier sections of this guide.

    [huggingface/accelerate] examples/by_feature/schedule_free.py
    MAX_GPU_BATCH_SIZE = 16
    EVAL_BATCH_SIZE = 32
    [huggingface/transformers] examples/research_projects/rag/lightning_base.py
    def total_steps(self) -> int:
        """The number of total training steps that will be run. Used for lr scheduler purposes."""
        num_devices = max(1, self.hparams.gpus)  # TODO: consider num_tpu_cores
        effective_batch_size = self.hparams.train_batch_size * self.hparams.accumulate_grad_batches * num_devices
        return (self.dataset_size / effective_batch_size) * self.hparams.max_epochs
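
    Plugging hypothetical numbers into that formula:

    ```python
    # Hypothetical values, just to make the arithmetic concrete
    train_batch_size = 2
    accumulate_grad_batches = 2
    num_devices = 4
    dataset_size = 10_000
    max_epochs = 3

    effective_batch_size = train_batch_size * accumulate_grad_batches * num_devices  # 16
    total_steps = (dataset_size / effective_batch_size) * max_epochs
    print(total_steps)  # 1875.0 optimizer steps for the lr scheduler
    ```
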
    [huggingface/transformers] docs/source/en/main_classes/pipelines.md

    Pipeline batching

    All pipelines can use batching. This will work whenever the pipeline uses its streaming ability (so when passing lists or Dataset or generator).

    from transformers import pipeline
    from transformers.pipelines.pt_utils import KeyDataset
    import datasets

    dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
    pipe = pipeline("text-classification", device=0)
    for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
        print(out)
        # [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
        # Exactly the same output as before, but the content are passed
        # as batches to the model
    <Tip warning={true}>

    However, this is not automatically a win for performance. It can be either a 10x speedup or 5x slowdown depending on hardware, data and the actual model being used.

    Example where it's mostly a speedup:

    </Tip>
    from transformers import pipeline
    from torch.utils.data import Dataset
    from tqdm.auto import tqdm

    pipe = pipeline("text-classification", device=0)


    class MyDataset(Dataset):
        def __len__(self):
            return 5000

        def __getitem__(self, i):
            return "This is a test"


    dataset = MyDataset()

    for batch_size in [1, 8, 64, 256]:
        print("-" * 30)
        print(f"Streaming batch_size={batch_size}")
        for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
            pass
    # On GTX 970
    ------------------------------
    Streaming no batching
    100%|██████████████████████████████████████████████████████████████████████| 5000/5000 [00:26<00:00, 187.52it/s]
    ------------------------------
    Streaming batch_size=8
    100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:04<00:00, 1205.95it/s]
    ------------------------------
    Streaming batch_size=64
    100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2478.24it/s]
    ------------------------------
    Streaming batch_size=256
    100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 2554.43it/s]
    (diminishing returns, saturated the GPU)
    

    Example where it's mostly a slowdown:

    class MyDataset(Dataset):
        def __len__(self):
            return 5000

        def __getitem__(self, i):
            if i % 64 == 0:
                n = 100
            else:
                n = 1
            return "This is a test" * n

    There is an occasional very long sentence compared to the others. In that case, the whole batch needs to be padded to 400 tokens, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on bigger batches, the program simply crashes.
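
    Rough arithmetic behind that blow-up, with hypothetical token counts:

    ```python
    batch_size = 64
    typical_len = 4    # "This is a test" is about 4 tokens
    outlier_len = 400  # the occasional 100x-longer sentence

    unpadded_tokens = (batch_size - 1) * typical_len + outlier_len  # 652 tokens of real content
    padded_tokens = batch_size * outlier_len                        # 25600 tokens once padded to the longest
    print(padded_tokens / unpadded_tokens)                          # ~39x more tokens (and memory) per batch
    ```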

    ------------------------------
    Streaming no batching
    100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]
    ------------------------------
    Streaming batch_size=8
    100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 265.74it/s]
    ------------------------------
    Streaming batch_size=64
    100%|██████████████████████████████████████████████████████████████████████| 1000/1000 [00:26<00:00, 37.80it/s]
    ------------------------------
    Streaming batch_size=256
      0%|                                                                                 | 0/1000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/nicolas/src/transformers/test.py", line 42, in <module>
        for out in tqdm(pipe(dataset, batch_size=256), total=len(dataset)):
    ....
        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)
    RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 0; 3.95 GiB total capacity; 1.72 GiB already allocated; 354.88 MiB free; 2.46 GiB reserved in total by PyTorch)
    

    There are no good (general) solutions for this problem, and your mileage may vary depending on your use case. For users, a rule of thumb is:

    • Measure performance on your load, with your hardware. Measure, measure, and keep measuring. Real numbers are the only way to go.

    • If you are latency constrained (live product doing inference), don't batch.

    • If you are using CPU, don't batch.

    • If you are using throughput (you want to run your model on a bunch of static data), on GPU, then:

      • If you have no clue about the size of the sequence_length ("natural" data), by default don't batch; measure and tentatively try to add it, and add OOM checks to recover when it fails (and it will fail at some point if you don't control the sequence_length).
      • If your sequence_length is super regular, then batching is more likely to be VERY interesting, measure and push it until you get OOMs.
      • The larger the GPU, the more likely batching is to be interesting
    • As soon as you enable batching, make sure you can handle OOMs nicely, for example with a retry-and-back-off loop like the sketch below.
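
    One possible shape for such an OOM check, assuming a transformers pipeline as in the examples above (the halve-and-retry strategy is only an illustration):

    ```python
    import torch

    def run_with_oom_backoff(pipe, dataset, batch_size=64):
        """Halve the batch size and retry whenever the GPU runs out of memory."""
        while batch_size >= 1:
            try:
                return list(pipe(dataset, batch_size=batch_size))
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                batch_size //= 2
                print(f"OOM -- retrying with batch_size={batch_size}")
        raise RuntimeError("Even batch_size=1 does not fit on this GPU")
    ```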

    [huggingface/peft] examples/stable_diffusion/train_dreambooth.py
    if accelerator.sync_gradients: params_to_clip = ( itertools.chain(unet.parameters(), text_encoder.parameters()) if args.train_text_encoder else unet.parameters() ) accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) if args.report_to == "wandb": accelerator.print(progress_bar) global_step += 1 # if global_step % args.checkpointing_steps == 0: # if accelerator.is_main_process: # save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}") # accelerator.save_state(save_path) # logger.info(f"Saved state to {save_path}") logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]} progress_bar.set_postfix(**logs) accelerator.log(logs, step=global_step) if ( args.validation_prompt is not None and (step + num_update_steps_per_epoch * epoch) % args.validation_steps == 0 ): logger.info( f"Running validation... \n Generating {args.num_validation_images} images with prompt:" f" {args.validation_prompt}." ) # create pipeline pipeline = DiffusionPipeline.from_pretrained( args.pretrained_model_name_or_path, safety_checker=None, revision=args.revision, ) # set `keep_fp32_wrapper` to True because we do not want to remove # mixed precision hooks while we are still training pipeline.unet = accelerator.unwrap_model(unet, keep_fp32_wrapper=True) pipeline.text_encoder = accelerator.unwrap_model(text_encoder, keep_fp32_wrapper=True) pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) pipeline = pipeline.to(accelerator.device) pipeline.set_progress_bar_config(disable=True) # Set evaliation mode pipeline.unet.eval() pipeline.text_encoder.eval() # run inference if args.seed is not None: generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) else: generator = None images = [] for _ in range(args.num_validation_images): image = pipeline(args.validation_prompt, num_inference_steps=25, generator=generator).images[0] images.append(image) for tracker in accelerator.trackers: if tracker.name == "tensorboard": np_images = np.stack([np.asarray(img) for img in images]) tracker.writer.add_images("validation", np_images, epoch, dataformats="NHWC") if tracker.name == "wandb": import wandb tracker.log( { "validation": [ wandb.Image(image, caption=f"{i}: {args.validation_prompt}") for i, image in enumerate(images) ] } ) # Set evaliation mode pipeline.unet.train() pipeline.text_encoder.train() del pipeline torch.cuda.empty_cache() if global_step >= args.max_train_steps: break # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the train (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the train (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the train (max): 
{tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) # Create the pipeline using using the trained modules and save it. accelerator.wait_for_everyone() if accelerator.is_main_process: if args.adapter != "full": unwarpped_unet = accelerator.unwrap_model(unet) unwarpped_unet.save_pretrained( os.path.join(args.output_dir, "unet"), state_dict=accelerator.get_state_dict(unet) ) if args.train_text_encoder: unwarpped_text_encoder = accelerator.unwrap_model(text_encoder) unwarpped_text_encoder.save_pretrained( os.path.join(args.output_dir, "text_encoder"), state_dict=accelerator.get_state_dict(text_encoder), ) else: pipeline = DiffusionPipeline.from_pretrained( args.pretrained_model_name_or_path, unet=accelerator.unwrap_model(unet), text_encoder=accelerator.unwrap_model(text_encoder), revision=args.revision, ) pipeline.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", run_as_future=True, ) accelerator.end_training()
    [huggingface/peft] docs/source/accelerate/deepspeed.md

    Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs

    In this section, we will look at how to use QLoRA and DeepSpeed Stage-3 for finetuning a 70B Llama model on 2x40GB GPUs. For this, we first need bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11 and peft>0.9.0. We need to set zero3_init_flag to true when using the Accelerate config. Below is the config, which can be found at deepspeed_config_z3_qlora.yaml:

    compute_environment: LOCAL_MACHINE
    debug: false
    deepspeed_config:
      deepspeed_multinode_launcher: standard
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false

    The launch command is given below; it is available at run_peft_qlora_deepspeed_stage3.sh:

    accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
    --seed 100 \
    --model_name_or_path "meta-llama/Llama-2-70b-hf" \
    --dataset_name "smangrul/ultrachat-10k-chatml" \
    --chat_template_format "chatml" \
    --add_special_tokens False \
    --append_concat_token False \
    --splits "train,test" \
    --max_seq_len 2048 \
    --num_train_epochs 1 \
    --logging_steps 5 \
    --log_level "info" \
    --logging_strategy "steps" \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --push_to_hub \
    --hub_private_repo True \
    --hub_strategy "every_save" \
    --bf16 True \
    --packing True \
    --learning_rate 1e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --max_grad_norm 1.0 \
    --output_dir "llama-sft-qlora-dsz3" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing True \
    --use_reentrant True \
    --dataset_text_field "content" \
    --use_flash_attn True \
    --use_peft_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target_modules "all-linear" \
    --use_4bit_quantization True \
    --use_nested_quant True \
    --bnb_4bit_compute_dtype "bfloat16" \
    --bnb_4bit_quant_storage_dtype "bfloat16"
    

    Notice the new argument being passed, bnb_4bit_quant_storage_dtype, which denotes the data type for packing the 4-bit parameters. For example, when it is set to bfloat16, 32/4 = 8 4-bit params are packed together post quantization.

    In terms of training code, the important code changes are:

    ...
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit_quantization,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    +   bnb_4bit_quant_storage=quant_storage_dtype,
    )
    ...
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True,
        attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
    +   torch_dtype=quant_storage_dtype or torch.float32,
    )

    Notice that torch_dtype for AutoModelForCausalLM is the same as the bnb_4bit_quant_storage data type. That's it; everything else is handled by Trainer and TRL.

    Memory usage

    In the above example, the memory consumed per GPU is 36.6 GB. Therefore, what took 8X80GB GPUs with DeepSpeed Stage 3+LoRA and a couple of 80GB GPUs with DDP+QLoRA now requires 2X40GB GPUs. This makes finetuning of large models more accessible.

    [huggingface/peft] docs/source/task_guides/ia3.md

    Training

    Set up an optimizer and learning rate scheduler.

    import torch
    from transformers import get_linear_schedule_with_warmup

    lr = 8e-3
    num_epochs = 3

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(len(train_dataloader) * num_epochs),
    )

    Move the model to the GPU and create a training loop that reports the loss and perplexity for each epoch.

    from tqdm import tqdm

    device = "cuda"
    model = model.to(device)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for step, batch in enumerate(tqdm(train_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.detach().float()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        eval_loss = 0
        eval_preds = []
        for step, batch in enumerate(tqdm(eval_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)
            loss = outputs.loss
            eval_loss += loss.detach().float()
            eval_preds.extend(
                tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
            )

        eval_epoch_loss = eval_loss / len(eval_dataloader)
        eval_ppl = torch.exp(eval_epoch_loss)
        train_epoch_loss = total_loss / len(train_dataloader)
        train_ppl = torch.exp(train_epoch_loss)
        print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/input/v0_4_1/__init__.py
    def hint_batch_size_set(cls, batch_size):
        if batch_size:
            LOG.warning(
                "%s\n%s",
                "batch_size is not recommended. Please use gradient_accumulation_steps instead.",
                "To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
            )
        return batch_size
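
    Plugging hypothetical numbers into that conversion:

    ```python
    # Hypothetical legacy config being migrated away from `batch_size`
    batch_size = 32        # total batch size across all GPUs
    micro_batch_size = 2   # per-GPU batch size per step
    num_gpus = 4

    gradient_accumulation_steps = batch_size // (micro_batch_size * num_gpus)
    print(gradient_accumulation_steps)  # 4
    ```
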
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/input/v0_4_1/__init__.py
    class HyperparametersConfig(BaseModel):
        """training hyperparams configuration subset"""

        gradient_accumulation_steps: Optional[int] = Field(default=1)
        micro_batch_size: Optional[int] = Field(
            default=1,
            metadata={"help": "per gpu micro batch size for training"},
        )
        batch_size: Optional[int] = Field(
            default=None,
            metadata={
                "help": "Total batch size, we do not recommended setting this manually"
            },
        )
        eval_batch_size: Optional[int] = Field(
            default=None,
            metadata={
                "help": "per gpu micro batch size for evals, defaults to value of micro_batch_size"
            },
        )

        train_on_inputs: Optional[bool] = False
        group_by_length: Optional[bool] = None

        learning_rate: Union[str, float]
        weight_decay: Optional[float] = 0.0
        optimizer: Optional[
            Union[OptimizerNames, Literal["lion_pytorch"]]
        ] = OptimizerNames.ADAMW_HF.value
        optim_args: Optional[Union[str, Dict[str, Any]]] = Field(
            default=None, metadata={"help": "Optional arguments to supply to optimizer."}
        )
        optim_target_modules: Optional[Union[List[str], Literal["all_linear"]]] = Field(
            default=None,
            metadata={
                "help": "The target modules to optimize, i.e. the module names that you would like to train."
            },
        )
        torchdistx_path: Optional[str] = None
        lr_scheduler: Optional[SchedulerType] = "cosine"
        lr_scheduler_kwargs: Optional[Dict[str, Any]] = None
        lr_quadratic_warmup: Optional[bool] = None
        cosine_min_lr_ratio: Optional[float] = None
        cosine_constant_lr_ratio: Optional[float] = None
        lr_div_factor: Optional[float] = None

        adam_epsilon: Optional[float] = None
        adam_beta1: Optional[float] = None
        adam_beta2: Optional[float] = None
        max_grad_norm: Optional[float] = None
        num_epochs: int = Field(default=1)

        @field_validator("batch_size")
        @classmethod
        def hint_batch_size_set(cls, batch_size):
            if batch_size:
                LOG.warning(
                    "%s\n%s",
                    "batch_size is not recommended. Please use gradient_accumulation_steps instead.",
                    "To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
                )
            return batch_size

        @field_validator("learning_rate")
        @classmethod
        def convert_learning_rate(cls, learning_rate):
            if learning_rate and isinstance(learning_rate, str):
                learning_rate = float(learning_rate)
            return learning_rate
    [huggingface/peft] examples/loftq_finetuning/train_gsm8k_llama.py
    completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) resume_step -= starting_epoch * len(train_dataloader) completed_steps = resume_step // args.gradient_accumulation_steps # update the progress_bar if load from checkpoint progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: # We skip the first `n` batches in the dataloader when resuming from a checkpoint active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) else: active_dataloader = train_dataloader for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) if completed_steps % 50: accelerator.print(f"Epoch: {epoch} | Step: {completed_steps} | Loss: {loss}") optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() gen_kwargs = { "max_new_tokens": args.max_target_length, "temperature": args.temperature, "top_k": args.k, "top_p": args.p, "do_sample": True, } ans_pred_list = [] ans_gold_list = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): gen_kwargs["input_ids"] = batch["input_ids"] gen_kwargs["attention_mask"] = batch["attention_mask"] generated_tokens = accelerator.unwrap_model(model).generate(**gen_kwargs) pred_tokens = generated_tokens[:, args.max_source_length :] pred_tokens = accelerator.pad_across_processes(pred_tokens, dim=1, pad_index=tokenizer.pad_token_id) gold_tokens = batch["labels"] if not args.pad_to_max_length: # If we did not pad to max length, we need to pad the labels too gold_tokens = accelerator.pad_across_processes( batch["labels"], dim=1, pad_index=tokenizer.pad_token_id ) pred_tokens, gold_tokens = accelerator.gather_for_metrics((pred_tokens, gold_tokens)) pred_tokens, gold_tokens = pred_tokens.cpu().numpy(), gold_tokens.cpu().numpy() if isinstance(pred_tokens, tuple): pred_tokens = pred_tokens[0] decoded_pred = tokenizer.batch_decode(pred_tokens, skip_special_tokens=True) decoded_gold = tokenizer.batch_decode(gold_tokens, skip_special_tokens=True) # Extract the numbers in sentences accelerator.print(decoded_pred) ans_pred_list += [extract_answer_number(sentence_pred) for sentence_pred in decoded_pred] ans_gold_list += [extract_answer_number(sentence_gold) for sentence_gold in decoded_gold] accelerator.print(ans_pred_list) accelerator.print(ans_gold_list) accuracy = compute_accuracy(ans_gold_list, ans_pred_list) logger.info(f"epoch {epoch}: accuracy: {accuracy}") if args.with_tracking: accelerator.log( { "accuracy": accuracy, "train_loss": total_loss.item() / len(train_dataloader), "epoch": 
epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if args.with_tracking: accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", )
    [openaccess-ai-collective/axolotl] docs/config.qmd
    ---
    title: Config options
    description: A complete list of all configuration options.
    ---
    
    ```yaml
    # This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
    # This can also be a relative path to a model on disk
    base_model: ./llama-7b-hf
    # You can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
    base_model_ignore_patterns:
    # If the base_model repo on hf hub doesn't include configuration .json files,
    # You can set that here, or leave this empty to default to base_model
    base_model_config: ./llama-7b-hf
    # You can specify to choose a specific model revision from huggingface hub
    revision_of_model:
    # Optional tokenizer configuration path in case you want to use a different tokenizer
    # than the one defined in the base model
    tokenizer_config:
    # If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too
    model_type: AutoModelForCausalLM
    # Corresponding tokenizer for the model AutoTokenizer is a good choice
    tokenizer_type: AutoTokenizer
    # Trust remote code for untrusted source
    trust_remote_code:
    # use_fast option for tokenizer loading from_pretrained, default to True
    tokenizer_use_fast:
    # Whether to use the legacy tokenizer setting, defaults to True
    tokenizer_legacy:
    # Resize the model embeddings when new tokens are added to multiples of 32
    # This is reported to improve training speed on some models
    resize_token_embeddings_to_32x:
    
    # (Internal use only)
    # Used to identify what the model is based on
    is_falcon_derived_model:
    is_llama_derived_model:
    is_qwen_derived_model:
    # Please note that if you set this to true, `padding_side` will be set to "left" by default
    is_mistral_derived_model:
    
    # optional overrides to the base model configuration
    overrides_of_model_config:
      # RoPE Scaling https://github.com/huggingface/transformers/pull/24653
      rope_scaling:
        type: # linear | dynamic
        factor: # float
    
    # optional overrides to the bnb 4bit quantization configuration
    # https://huggingface.co/docs/transformers/main/main_classes/quantization#transformers.BitsAndBytesConfig
    bnb_config_kwargs:
      # These are default values
      llm_int8_has_fp16_weight: false
      bnb_4bit_quant_type: nf4
      bnb_4bit_use_double_quant: true
    
    
    # Whether you are training a 4-bit GPTQ quantized model
    gptq: true
    
    # This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
    load_in_8bit: true
    # Use bitsandbytes 4 bit
    load_in_4bit:
    
    # Use CUDA bf16
    bf16: true # bool or 'full' for `bf16_full_eval`. require >=ampere
    # Use CUDA fp16
    fp16: true
    # Use CUDA tf32
    tf32: true # require >=ampere
    
    # No AMP (automatic mixed precision)
    bfloat16: true # require >=ampere
    float16: true
    
    # Limit the memory for all available GPUs to this amount (if an integer, expressed in gigabytes); default: unset
    gpu_memory_limit: 20GiB
    # Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
    lora_on_cpu: true
    
    # A list of one or more datasets to finetune the model with
    datasets:
      # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
      - path: vicgalle/alpaca-gpt4
      # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
        type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
        ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
        data_files: # Optional[str] path to source data files
        shards: # Optional[int] number of shards to split data into
        name: # Optional[str] name of dataset configuration to load
        train_on_split: train # Optional[str] name of dataset split to load from
    
        # Optional[str] fastchat conversation type, only used with type: sharegpt
        conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
        field_human: # Optional[str]. Human key to use for conversation.
        field_model: # Optional[str]. Assistant key to use for conversation.
        # Add additional keys from your dataset as input or output roles
        roles:
          input: # Optional[List[str]]. These will be masked based on train_on_input
          output: # Optional[List[str]].
    
      # Custom user instruction prompt
      - path: repo
        type:
          # The below are defaults. only set what's needed if you use a different column name.
          system_prompt: ""
          system_format: "{system}"
          field_system: system
          field_instruction: instruction
          field_input: input
          field_output: output
    
          # Customizable to be single line or multi-line
          # Use {instruction}/{input} as key to be replaced
          # 'format' can include {input}
          format: |-
            User: {instruction} {input}
            Assistant:
          # 'no_input_format' cannot include {input}
          no_input_format: "{instruction} "
    
          # For `completion` datasets only, uses the provided field instead of `text` column
          field:
    
    # If false, the datasets will not be shuffled and will keep their original order in `datasets`.
    # The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
    shuffle_merged_datasets: true
    
    # A list of one or more datasets to eval the model with.
    # You can use either test_datasets, or val_set_size, but not both.
    test_datasets:
      - path: /workspace/data/eval.jsonl
        ds_type: json
        # You need to specify a split. For "json" datasets the default split is called "train".
        split: train
        type: completion
        data_files:
          - /workspace/data/eval.jsonl
    
    # use RL training: 'dpo', 'ipo', 'kto_pair'
    rl:
    
    # Saves the desired chat template to the tokenizer_config.json for easier inferencing
    # Currently supports chatml and inst (mistral/mixtral)
    chat_template: chatml
    # Changes the default system message
    default_system_message: You are a helpful assistant. Please give a long and detailed answer. # Currently only supports chatml.
    # Axolotl attempts to save the dataset as an arrow after packing the data together so
    # subsequent training attempts load faster, relative path
    dataset_prepared_path: data/last_run_prepared
    # Push prepared dataset to hub
    push_dataset_to_hub: # repo path
    # The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
    # if not set.
    dataset_processes: # defaults to os.cpu_count() if not set
    # Keep dataset in memory while preprocessing
    # Only needed if cached dataset is taking too much storage
    dataset_keep_in_memory:
    # push checkpoints to hub
    hub_model_id: # private repo path to push finetuned model
    # how to push checkpoints to hub
    # https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
    hub_strategy:
    # Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
    # Required to be true when used in combination with `push_dataset_to_hub`
    hf_use_auth_token: # boolean
    # How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
    val_set_size: 0.04
    # Num shards for whole dataset
    dataset_shard_num:
    # Index of shard to use for whole dataset
    dataset_shard_idx:
    
    # The maximum length of an input to train with, this should typically be less than 2048
    # as most models have a token/context limit of 2048
    sequence_len: 2048
    # Pad inputs so each step uses constant sized buffers
    # This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
    pad_to_sequence_len:
    # Use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommend set to 'true'
    sample_packing:
    # Set to 'false' if getting errors during eval with sample_packing on.
    eval_sample_packing:
    # You can set these packing optimizations AFTER starting a training at least once.
    # The trainer will provide recommended values for these values.
    sample_packing_eff_est:
    total_num_tokens:
    
    # Passed through to transformers when loading the model when launched without accelerate
    # Use `sequential` when training w/ model parallelism to limit memory
    device_map:
    # Defines the max memory usage per gpu on the system. Passed through to transformers when loading the model.
    max_memory:
    
    # If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
    adapter: lora
    # If you already have a lora model trained that you want to load, put that here.
    # This means after training, if you want to test the model, you should set this to the value of `output_dir`.
    # Note that if you merge an adapter to the base model, a new subdirectory `merged` will be created under the `output_dir`.
    lora_model_dir:
    
    # LoRA hyperparameters
    # For more details about the following options, see:
    # https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
    lora_r: 8
    lora_alpha: 16
    lora_dropout: 0.05
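    # Note: PEFT scales the LoRA update by lora_alpha / lora_r, so the values above give a
    # scaling factor of 16 / 8 = 2; raising lora_r without also raising lora_alpha shrinks
    # the effective update.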
    lora_target_modules:
      - q_proj
      - v_proj
    #  - k_proj
    #  - o_proj
    #  - gate_proj
    #  - down_proj
    #  - up_proj
    lora_target_linear: # If true, will target all linear modules
    peft_layers_to_transform: # The layer indices to transform, otherwise, apply to all layers
    
    # If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
    # For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
    # `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
    # https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
    lora_modules_to_save:
    #  - embed_tokens
    #  - lm_head
    
    lora_fan_in_fan_out: false
    
    # LoRA+ hyperparameters
    # For more details about the following options, see:
    # https://arxiv.org/abs/2402.12354  and `src/axolotl/core/train_builder.py`
    loraplus_lr_ratio: # loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4.
    loraplus_lr_embedding: #  loraplus learning rate for lora embedding layers. Default value is 1e-6.
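    # Example: with learning_rate 0.00003 and loraplus_lr_ratio 16, the LoRA B matrices train
    # at 0.00003 * 16 = 0.00048 while the A matrices keep the base learning rate.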
    
    peft:
      # Configuration options for loftq initialization for LoRA
      # https://huggingface.co/docs/peft/developer_guides/quantization#loftq-initialization
      loftq_config:
        loftq_bits:  # typically 4 bits
    
    # ReLoRA configuration
    # Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
    relora_steps: # Number of steps per ReLoRA restart
    relora_warmup_steps: # Number of per-restart warmup steps
    relora_anneal_steps: # Number of anneal steps for each relora cycle
    relora_prune_ratio: # threshold for optimizer magnitude when pruning
    relora_cpu_offload: # True to perform lora weight merges on cpu during restarts, for modest gpu memory savings
    
    # wandb configuration if you're using it
    # Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
    wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
    wandb_project: # Your wandb project name
    wandb_entity: # A wandb Team name if using a Team
    wandb_watch:
    wandb_name: # Set the name of your wandb run
    wandb_run_id: # Set the ID of your wandb run
    wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
    
    # mlflow configuration if you're using it
    mlflow_tracking_uri: # URI to mlflow
    mlflow_experiment_name: # Your experiment name
    hf_mlflow_log_artifacts:  # set to true to copy each saved checkpoint on each save to mlflow artifact registry
    
    # Where to save the full-finetuned model to
    output_dir: ./completed-model
    
    # Whether to use torch.compile and which backend to use
    torch_compile:  # bool
    torch_compile_backend:  # Optional[str]
    
    # Training hyperparameters
    
    # If greater than 1, the optimizer step is deferred and gradients are accumulated over the given number of micro-batches before the weights are updated.
    gradient_accumulation_steps: 1
    # The number of samples sent to each GPU in a single forward/backward pass.
    # Batch size per GPU per optimizer step = micro_batch_size * gradient_accumulation_steps
    micro_batch_size: 2
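    # Example: with micro_batch_size 2 and gradient_accumulation_steps 1, each optimizer step
    # sees 2 * 1 = 2 samples per GPU; on 4 GPUs that is 2 * 1 * 4 = 8 samples globally.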
    eval_batch_size:
    num_epochs: 4
    warmup_steps: 100  # cannot use with warmup_ratio
    warmup_ratio: 0.05  # cannot use with warmup_steps
    learning_rate: 0.00003
    lr_quadratic_warmup:
    logging_steps:
    eval_steps: # Leave empty to eval at each epoch; an integer evals every N steps, a decimal evals at that fraction of total steps
    evals_per_epoch: # number of times per epoch to run evals, mutually exclusive with eval_steps
    save_strategy: # Set to `no` to skip checkpoint saves
    save_steps: # Leave empty to save at each epoch
    saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
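    # Example: per `normalize_config` (see the axolotl source excerpt further below), saves_per_epoch 2
    # with num_epochs 4 becomes save_steps = 1 / (2 * 4) = 0.125, i.e. a checkpoint every 12.5% of
    # total steps; evals_per_epoch is converted to eval_steps the same way.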
    save_total_limit: # Checkpoints saved at a time
    # Maximum number of iterations to train for. It takes precedence over num_epochs, which means that
    # if both are set, num_epochs is not guaranteed to complete.
    # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
    max_steps:
    
    eval_table_size: # Approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
    eval_max_new_tokens: # Total number of tokens generated for predictions sent to wandb. Default is 128
    eval_causal_lm_metrics: # HF evaluate metrics used during evaluation. Default is ["sacrebleu", "comet", "ter", "chrf"]
    
    loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
    loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)
    
    # Save model as safetensors (require safetensors package)
    save_safetensors:
    
    # Whether to mask out or include the human's prompt from the training labels
    train_on_inputs: false
    # Group similarly sized data to minimize padding.
    # May be slower to start, as it must download and sort the entire dataset.
    # Note that training loss may have an oscillating pattern with this enabled.
    group_by_length: false
    
    # Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
    gradient_checkpointing: false
    # additional kwargs to pass to the trainer for gradient checkpointing
    # gradient_checkpointing_kwargs:
    #   use_reentrant: true
    
    # Stop training after this many evaluation losses have increased in a row
    # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
    early_stopping_patience: 3
    
    # Specify a scheduler and kwargs to use with the optimizer
    lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
    lr_scheduler_kwargs:
    cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
    cosine_constant_lr_ratio: # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means start cosine_min_lr at 80% of training step (https://arxiv.org/pdf/2308.04014.pdf)
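    # Example: cosine_min_lr_ratio 0.1 with cosine_constant_lr_ratio 0.8 decays the lr to 10% of its
    # peak by 80% of the training steps and holds it there for the remaining 20%.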
    
    # For one_cycle optim
    lr_div_factor: # Learning rate div factor
    
    # Specify optimizer
    # Valid values are driven by the Transformers OptimizerNames class, see:
    # https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
    #
    # Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
    # torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
    # in the examples/ for your model and fine-tuning use case.
    #
    # Valid values for 'optimizer' include:
    # - adamw_hf
    # - adamw_torch
    # - adamw_torch_fused
    # - adamw_torch_xla
    # - adamw_apex_fused
    # - adafactor
    # - adamw_anyprecision
    # - sgd
    # - adagrad
    # - adamw_bnb_8bit
    # - lion_8bit
    # - lion_32bit
    # - paged_adamw_32bit
    # - paged_adamw_8bit
    # - paged_lion_32bit
    # - paged_lion_8bit
    # - galore_adamw
    # - galore_adamw_8bit
    # - galore_adafactor
    # - galore_adamw_layerwise
    # - galore_adamw_8bit_layerwise
    # - galore_adafactor_layerwise
    optimizer:
    # Dictionary of arguments to pass to the optimizer
    optim_args:
    # For Galore Optimizers the following optim_args are available
    # rank:  # type: int
    # update_proj_gap  # type: int
    # scale  # type: float
    # proj_type:  # type: str, default = std
    
    # The target modules to optimize, i.e. the module names that you would like to train; right now this is used only for the GaLore algorithm
    optim_target_modules:
    # - self_attn  # for llama
    # - mlp
    
    # Specify weight decay
    weight_decay:
    # adamw hyperparams
    adam_beta1:
    adam_beta2:
    adam_epsilon:
    # Gradient clipping max norm
    max_grad_norm:
    
    # Augmentation techniques
    # NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
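
    To make the batch-size arithmetic above concrete, here is a minimal sketch (the helper name and the
    GPU count are illustrative, not part of axolotl) that mirrors the derivation `normalize_config`
    performs in the axolotl source excerpted further below:

    def effective_batch_size(micro_batch_size: int,
                             gradient_accumulation_steps: int,
                             world_size: int) -> int:
        """Samples contributing to a single optimizer step across all GPUs."""
        # per-GPU batch per optimizer step, scaled by the number of GPUs (WORLD_SIZE)
        return micro_batch_size * gradient_accumulation_steps * world_size

    # With the example values from the config reference above, on a single 4-GPU node:
    print(effective_batch_size(micro_batch_size=2,
                               gradient_accumulation_steps=1,
                               world_size=4))  # -> 8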
    
    [huggingface/peft] docs/source/accelerate/fsdp.md

    Use PEFT QLoRA and FSDP for finetuning large models on multiple GPUs

    In this section, we will look at how to use QLoRA and FSDP for finetuning a 70B Llama model on 2x 24GB GPUs. Answer.AI, in collaboration with bitsandbytes and Hugging Face 🤗, open-sourced code enabling the use of FSDP+QLoRA and explained the whole process in their insightful blog post "You can now train a 70b language model at home". This is now integrated into the Hugging Face ecosystem.

    For this, we first need bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11 and peft>0.9.0. We need to set fsdp_cpu_ram_efficient_loading=true, fsdp_use_orig_params=false and fsdp_offload_params=true (CPU offloading) when using the Accelerate config. When not using the accelerate launcher, you can alternatively set the environment variable export FSDP_CPU_RAM_EFFICIENT_LOADING=true. Here, we will be using the Accelerate config below, which can be found at fsdp_config_qlora.yaml:

    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: FSDP
    downcast_bf16: 'no'
    fsdp_config:
      fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
      fsdp_backward_prefetch: BACKWARD_PRE
      fsdp_cpu_ram_efficient_loading: true
      fsdp_forward_prefetch: false
      fsdp_offload_params: true
      fsdp_sharding_strategy: FULL_SHARD
      fsdp_state_dict_type: SHARDED_STATE_DICT
      fsdp_sync_module_states: true
      fsdp_use_orig_params: false
    machine_rank: 0
    main_training_function: main
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false

    The launch command is given below and is available at run_peft_qlora_fsdp.sh:

    accelerate launch --config_file "configs/fsdp_config_qlora.yaml"  train.py \
    --seed 100 \
    --model_name_or_path "meta-llama/Llama-2-70b-hf" \
    --dataset_name "smangrul/ultrachat-10k-chatml" \
    --chat_template_format "chatml" \
    --add_special_tokens False \
    --append_concat_token False \
    --splits "train,test" \
    --max_seq_len 2048 \
    --num_train_epochs 1 \
    --logging_steps 5 \
    --log_level "info" \
    --logging_strategy "steps" \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --push_to_hub \
    --hub_private_repo True \
    --hub_strategy "every_save" \
    --bf16 True \
    --packing True \
    --learning_rate 1e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --max_grad_norm 1.0 \
    --output_dir "llama-sft-qlora-fsdp" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing True \
    --use_reentrant True \
    --dataset_text_field "content" \
    --use_flash_attn True \
    --use_peft_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target_modules "all-linear" \
    --use_4bit_quantization True \
    --use_nested_quant True \
    --bnb_4bit_compute_dtype "bfloat16" \
    --bnb_4bit_quant_storage_dtype "bfloat16"
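
    For reference, with per_device_train_batch_size 2, gradient_accumulation_steps 2 and num_processes: 2
    in the Accelerate config above, each optimizer step covers 2 * 2 * 2 = 8 samples in total.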
    

    Notice the new argument being passed, bnb_4bit_quant_storage_dtype, which denotes the data type for packing the 4-bit parameters. For example, when it is set to bfloat16, 32/4 = 8 4-bit params are packed together post quantization. When using mixed precision training with bfloat16, bnb_4bit_quant_storage_dtype can be either bfloat16 for pure bfloat16 finetuning, or float32 for automatic mixed precision (this consumes more GPU memory). When using mixed precision training with float16, bnb_4bit_quant_storage_dtype should be set to float32 for stable automatic mixed precision training.

    In terms of training code, the important code changes are:

    ...
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit_quantization,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    +   bnb_4bit_quant_storage=quant_storage_dtype,
    )
    ...
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True,
        attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
    +   torch_dtype=quant_storage_dtype or torch.float32,
    )

    Notice that torch_dtype for AutoModelForCausalLM is the same as the bnb_4bit_quant_storage data type. That's it. Everything else is handled by Trainer and TRL.
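
    As a self-contained illustration of the same idea (a sketch only, not the actual training script; the nf4
    quant type and double quantization are assumptions here, while the model id and bf16 settings follow the
    launch command above):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_storage_dtype = torch.bfloat16  # match the bf16 mixed-precision / pure-bf16 setting

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=quant_storage_dtype,  # packs 8 x 4-bit params into one bf16 value
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",
        quantization_config=bnb_config,
        torch_dtype=quant_storage_dtype,  # keep torch_dtype equal to the storage dtype, as noted above
    )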

    Memory usage

    In the above example, the memory consumed per GPU is 19.6 GB, while CPU RAM usage is around 107 GB. When CPU offloading is disabled, GPU memory usage is 35.6 GB per GPU. Therefore, what took 16x 80GB GPUs for full finetuning, 8x 80GB GPUs with FSDP+LoRA, and a couple of 80GB GPUs with DDP+QLoRA now requires 2x 24GB GPUs. This makes finetuning of large models more accessible.

    More resources

    You can also refer to the llama-recipes repo and the Getting started with Llama guide on how to finetune using FSDP and PEFT.

    Caveats

    1. Merging when using PEFT and FSDP is currently unsupported and will raise an error.
    2. Passing the modules_to_save config parameter is untested at present.
    3. GPU Memory saving when using CPU Offloading is untested at present.
    4. When using FSDP+QLoRA, paged_adamw_8bit currently results in an error when saving a checkpoint.
    [openaccess-ai-collective/axolotl] src/axolotl/utils/trainer.py
    def calculate_total_num_steps(cfg, train_dataset, update=True):
        if not cfg.total_num_tokens:
            total_num_tokens = np.sum(
                train_dataset.data.column("input_ids")
                .to_pandas()
                .apply(lambda x: len(x))  # pylint: disable=unnecessary-lambda
                .values
            )
            LOG.debug(f"total_num_tokens: {total_num_tokens:_}", main_process_only=True)
            if update:
                cfg.total_num_tokens = total_num_tokens

        skip_estimates = cfg.model_config_type == "mamba"

        if not skip_estimates and not cfg.total_supervised_tokens:
            total_supervised_tokens = (
                train_dataset.data.column("labels")
                .to_pandas()
                .apply(lambda x: np.sum(np.array(x) != -100))
                .sum()
            )
            LOG.debug(
                f"`total_supervised_tokens: {total_supervised_tokens:_}`",
                main_process_only=True,
            )
            if update:
                cfg.total_supervised_tokens = total_supervised_tokens

        if not skip_estimates and cfg.sample_packing:
            # we have to drop anything longer then sequence len otherwise
            # flash attention with position ids fails
            if cfg.sample_packing_eff_est:
                total_num_steps = (
                    # match count to len est in dataloader
                    (
                        math.floor(
                            0.99
                            * cfg.total_num_tokens
                            / cfg.sample_packing_eff_est
                            / cfg.sequence_len
                            // cfg.batch_size
                            // int(os.environ.get("WORLD_SIZE", 1))
                        )
                        - 1
                    )
                    * cfg.num_epochs
                )
                LOG.debug(
                    f"total_num_tokens: {cfg.total_num_tokens:_}, total_num_steps: {total_num_steps:_}",
                    main_process_only=True,
                )
            else:
                if cfg.flash_attention:
                    batch_size = 1
                    batch_max_len = cfg.micro_batch_size * cfg.sequence_len
                else:
                    batch_size = cfg.micro_batch_size
                    batch_max_len = cfg.sequence_len
                sampler = MultipackBatchSampler(
                    sampler=RandomSampler(train_dataset),
                    batch_size=batch_size,
                    drop_last=True,
                    batch_max_len=batch_max_len,
                    lengths=get_dataset_lengths(train_dataset),
                )
                data_loader = DataLoader(
                    train_dataset.remove_columns(["length"]),
                    batch_sampler=sampler,
                )
                data_loader_len = len(data_loader) // cfg.batch_size
                actual_eff = sampler.efficiency()
                LOG.debug(f"data_loader_len: {data_loader_len}", main_process_only=True)
                # FIXME: is there a bug here somewhere? the total num steps depends
                # on the agreed on value for sample_packing_eff_est
                total_num_steps = int(
                    math.floor(
                        data_loader_len
                        * cfg.num_epochs
                        / int(os.environ.get("WORLD_SIZE", 1))
                    )
                )

                def calc_sample_packing_eff_est(estimates: List[float]):
                    LOG.info(f"sample_packing_eff_est across ranks: {repr(estimates)}")
                    return max(estimates)

                sample_packing_actual_eff_all = reduce_and_broadcast(
                    lambda: actual_eff,
                    calc_sample_packing_eff_est,
                )
                sample_packing_eff_est = (
                    math.ceil(sample_packing_actual_eff_all * 100.0) / 100.0
                )
                if update:
                    cfg.sample_packing_eff_est = sample_packing_eff_est
                    LOG.debug(
                        f"sample_packing_eff_est: {cfg.sample_packing_eff_est}",
                        main_process_only=True,
                    )
        else:
            total_num_steps = int(
                math.ceil(
                    len(train_dataset)
                    * cfg.num_epochs
                    / int(os.environ.get("WORLD_SIZE", 1))
                    / cfg.batch_size
                )
            )
        LOG.debug(f"total_num_steps: {total_num_steps}", main_process_only=True)
        return total_num_steps
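
    For intuition, the non-packing branch at the end is a single ceiling division. A worked example with
    made-up numbers (treating cfg.batch_size as micro_batch_size * gradient_accumulation_steps per GPU):

    import math

    num_samples = 10_000   # len(train_dataset), made up for illustration
    num_epochs = 4
    world_size = 2         # number of GPUs / processes (WORLD_SIZE)
    batch_size = 2 * 4     # micro_batch_size * gradient_accumulation_steps

    total_num_steps = math.ceil(num_samples * num_epochs / world_size / batch_size)
    print(total_num_steps)  # -> 2500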
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/input/v0_4_1/__init__.py
    def convert_learning_rate(cls, learning_rate):
        if learning_rate and isinstance(learning_rate, str):
            learning_rate = float(learning_rate)
        return learning_rate
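
    In practice this just means the learning rate may be written either as a YAML float or as a string:
    float("3e-5") == 0.00003, so learning_rate: "3e-5" and learning_rate: 0.00003 are equivalent.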
    [openaccess-ai-collective/axolotl] examples/code-llama/README.md

    Overview

    This is an example of CodeLLaMA configuration for 7b, 13b and 34b.

    The 7b variant fits on any 24GB VRAM GPU and will take up about 17 GB of VRAM during training if using qlora and 20 GB if using lora. On an RTX 4090 it trains 3 epochs of the default dataset in about 15 minutes.

    The 13b variant will fit if you change these settings to these values:

        gradient_accumulation_steps: 2
        micro_batch_size: 1

    The 34b variant does not fit on 24GB of VRAM - you will need something with 40+ GB of VRAM that also supports flash attention v2 - A6000 or A100 GPUs are good choices.

    accelerate launch scripts/finetune.py examples/code-llama/[MODEL_SIZE]/qlora.yml

    or

    accelerate launch scripts/finetune.py examples/code-llama/[MODEL_SIZE]/lora.yml
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/__init__.py
    def normalize_config(cfg):
        # setup some derived config / hyperparams
        cfg.gradient_accumulation_steps = cfg.gradient_accumulation_steps or (
            cfg.batch_size // cfg.micro_batch_size
        )
        cfg.batch_size = (
            cfg.batch_size or cfg.micro_batch_size * cfg.gradient_accumulation_steps
        )
        if cfg.eval_batch_size is None:
            cfg.eval_batch_size = cfg.micro_batch_size
        cfg.world_size = int(os.environ.get("WORLD_SIZE", 1))
        cfg.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        cfg.eval_table_size = cfg.eval_table_size or 0
        cfg.eval_max_new_tokens = cfg.eval_max_new_tokens or 128
        cfg.eval_causal_lm_metrics = cfg.eval_causal_lm_metrics or [
            "sacrebleu",
            "comet",
            "ter",
            "chrf",
        ]
        choose_device(cfg)
        cfg.ddp = cfg.ddp if cfg.ddp is not None else cfg.world_size != 1
        if cfg.ddp:
            cfg.device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
            cfg.batch_size = cfg.batch_size * cfg.world_size

        if cfg.bf16 == "auto":
            if is_torch_bf16_gpu_available():
                LOG.debug("bf16 support detected, enabling for this configuration.")
                cfg.bf16 = True
            else:
                LOG.debug("bf16 support not detected, disabling for this configuration.")
                cfg.bf16 = False
                if cfg.fp16 is None:
                    cfg.fp16 = True

        if cfg.device == "mps":
            cfg.load_in_8bit = False
            cfg.tf32 = False
            if cfg.bf16:
                cfg.fp16 = True
            cfg.bf16 = False
        else:
            torch.backends.cuda.matmul.allow_tf32 = cfg.tf32 or False
            if cfg.bf16:
                cfg.fp16 = False

        if cfg.bf16 or cfg.bfloat16:
            cfg.torch_dtype = torch.bfloat16
        elif cfg.load_in_8bit or cfg.fp16 or cfg.float16:
            cfg.torch_dtype = torch.float16
        else:
            cfg.torch_dtype = torch.float32

        if cfg.saves_per_epoch:
            save_steps = 1.0 / (cfg.saves_per_epoch * cfg.num_epochs)
            if save_steps < 1.0:  # prevent saves on every step
                cfg.save_steps = save_steps
        if (cfg.val_set_size or cfg.test_datasets) and cfg.evals_per_epoch:
            eval_steps = 1.0 / (cfg.evals_per_epoch * cfg.num_epochs)
            if eval_steps < 1.0:  # prevent evals on every step
                cfg.eval_steps = eval_steps

        cfg.dataset_processes = cfg.dataset_processes or os.cpu_count()

        if not cfg.base_model_config:
            cfg.base_model_config = cfg.base_model

        model_config = load_model_config(cfg)
        cfg.model_config_type = model_config.model_type

        cfg.tokenizer_config = (
            cfg.tokenizer_config or cfg.base_model_config or cfg.base_model
        )

        # figure out if the model is llama
        cfg.is_llama_derived_model = (
            (hasattr(model_config, "model_type") and model_config.model_type == "llama")
            or cfg.is_llama_derived_model
            or "llama" in cfg.base_model.lower()
            or (cfg.type_of_model and "llama" in cfg.type_of_model.lower())
        )

        # figure out if the model is falcon
        cfg.is_falcon_derived_model = (
            (
                hasattr(model_config, "model_type")
                and model_config.model_type
                in [
                    "falcon",
                    "RefinedWebModel",
                    "RefinedWeb",
                ]
            )
            or cfg.is_falcon_derived_model
            or "falcon" in cfg.base_model.lower()
            or (cfg.type_of_model and "rwforcausallm" in cfg.type_of_model.lower())
        )

        cfg.is_mistral_derived_model = (
            (
                hasattr(model_config, "model_type")
                and model_config.model_type
                in [
                    "mistral",
                ]
            )
            or cfg.is_mistral_derived_model
            or "mistral" in cfg.base_model.lower().split("/")[-1]
            or (cfg.type_of_model and "mistral" in cfg.type_of_model.lower())
        )

        cfg.is_qwen_derived_model = (
            hasattr(model_config, "model_type")
            and model_config.model_type
            in [
                "qwen",
            ]
        ) or cfg.is_qwen_derived_model

        if isinstance(cfg.pretraining_dataset, dict):
            cfg.pretraining_dataset = [cfg.pretraining_dataset]

        if (
            cfg.gradient_checkpointing
            and cfg.unfrozen_parameters is None
            and cfg.gradient_checkpointing_kwargs is None
            and cfg.rl is None
        ):
            cfg.gradient_checkpointing_kwargs = {"use_reentrant": True}

        log_gpu_memory_usage(LOG, "baseline", cfg.device)
    [openaccess-ai-collective/axolotl] examples/llama-2/fft_optimized.yml
    base_model: NousResearch/Llama-2-7b-hf
    model_type: LlamaForCausalLM
    tokenizer_type: LlamaTokenizer
    load_in_8bit: false
    load_in_4bit: false
    strict: false
    datasets:
      - path: mhenrichsen/alpaca_2k_test
        type: alpaca
    dataset_prepared_path: last_run_prepared
    val_set_size: 0.05
    output_dir: ./out
    sequence_len: 4096
    sample_packing: true
    pad_to_sequence_len: true
    adapter:
    lora_model_dir:
    lora_r:
    lora_alpha:
    lora_dropout:
    lora_target_linear:
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    gradient_accumulation_steps: 1
    micro_batch_size: 1
    num_epochs: 1
    optimizer: adamw_bnb_8bit
    lr_scheduler: cosine
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: false
    gradient_checkpointing: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    flash_attn_cross_entropy: false
    flash_attn_rms_norm: true
    flash_attn_fuse_qkv: false
    flash_attn_fuse_mlp: true
    warmup_steps: 100
    evals_per_epoch: 4
    eval_table_size:
    saves_per_epoch: 1
    debug:
    deepspeed: #deepspeed_configs/zero2.json # multi-gpu only
    weight_decay: 0.1
    fsdp:
    fsdp_config:
    special_tokens:
Sources: OpenAccess-AI-Collective/axolotl, huggingface/transformers, huggingface/peft, huggingface/accelerate