
    Here is my config. I have tried changing various settings that prevent OOM, with no luck. I should have more than enough VRAM, as I have 24 GB.

    Out-of-Memory (OOM) errors can occur even on a GPU with 24GB of VRAM, because several settings influence memory usage during training. The following steps can reduce memory usage and may prevent the OOM errors:

    1. Gradient Accumulation: Gradient accumulation does not reduce memory by itself. It lets you use a smaller micro-batch size, which does reduce memory, while keeping the same effective batch size. This is particularly useful when training large models or on GPUs with limited memory.

      gradient_accumulation_steps: 4
    2. Mixed Precision Training: Utilizing mixed precision training can significantly reduce memory usage and also speed up training. If you're not already using it, consider enabling FP16 or BF16 mixed precision.

      mixed_precision: 'fp16'
    3. Model Parallelism: For extremely large models, model parallelism can be used to split the model across multiple GPUs. This requires specific model architecture support.

    4. Reduce Batch Size: Reducing the batch size is a straightforward way to decrease memory usage. Although this might affect the learning dynamics, it's often a necessary compromise.

    5. Use Efficient Data Types: Ensure that your data (e.g., input tensors) are using efficient data types. For instance, using float32 instead of float64 can halve the memory usage for those tensors.

    6. DeepSpeed and ZeRO Optimization: If you're using DeepSpeed, leveraging ZeRO optimization stages can drastically reduce memory usage. ZeRO-2 and ZeRO-3 are particularly effective at reducing memory footprint.

      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true }
      }
    7. CPU and Disk Offloading: For extremely large models, offloading parts of the model or optimizer states to the CPU or even disk can prevent OOM. This can introduce some performance overhead but allows for training larger models.

      offload_optimizer_device: cpu
      offload_param_device: cpu
    8. Use Efficient Models and Operators: Some model architectures and operations are more memory-efficient than others. Consider using models designed for efficiency, such as those with attention mechanisms that reduce memory usage.

    9. Memory Profiling: Use tools like PyTorch's torch.cuda.memory_summary() to profile memory usage and identify where the most memory is being consumed. This can help pinpoint areas for optimization.

    10. Dynamic Padding and Uniform Length Batching: Implement dynamic padding so that each batch is padded independently to the maximum length in that batch, rather than a fixed maximum length. This can save memory when dealing with variable-length sequences.
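    As a rough illustration of item 10, the sketch below pads each batch only to its own longest sequence and compares the number of token slots with padding to a fixed maximum length. It uses plain Python lists; `pad_batch`, `PAD_ID`, and the fixed length of 2048 are hypothetical choices for this example and are not taken from any library or from the config above.

```python
from typing import List

PAD_ID = 0  # assumed padding token id, for illustration only


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every sequence to the longest length found in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


batch = [[5, 6, 7], [8, 9], [10]]

# Dynamic padding: the batch is padded to its own maximum length (3).
dynamic = pad_batch(batch)
dynamic_slots = sum(len(seq) for seq in dynamic)

# Fixed padding: every sequence would occupy the full fixed length.
fixed_len = 2048
fixed_slots = len(batch) * fixed_len

print(dynamic_slots, fixed_slots)
```

    The gap between the two counts shrinks as the sequences in a batch approach the fixed length, so the saving depends on the length distribution of the dataset.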

    Remember, the effectiveness of these strategies can vary based on your specific model, dataset, and training regime. It may require some experimentation to find the right combination of techniques that works for your setup.
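    To make item 1 concrete, the sketch below shows the arithmetic behind trading micro-batch size for accumulation steps. `effective_batch_size` is a hypothetical helper written for this example; it assumes the usual definition of effective batch size as micro-batch size times accumulation steps times number of GPUs.

```python
def effective_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return micro_batch_size * grad_accum_steps * num_gpus


# Four samples per forward/backward pass, one pass per optimizer step.
before = effective_batch_size(micro_batch_size=4, grad_accum_steps=1)

# One sample per pass, four passes accumulated per optimizer step.
after = effective_batch_size(micro_batch_size=1, grad_accum_steps=4)

print(before, after)
```

    Both settings update the weights with gradients from the same number of samples, but the second holds activations for only one sample at a time.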

    [openaccess-ai-collective/axolotl] examples/code-llama/


    This is an example of CodeLLaMA configuration for 7b, 13b and 34b.

    The 7b variant fits on any 24GB VRAM GPU and will take up about 17 GB of VRAM during training if using qlora and 20 GB if using lora. On an RTX 4090 it trains 3 epochs of the default dataset in about 15 minutes.

    The 13b variant will fit if you change these settings to these values: gradient_accumulation_steps: 2 micro_batch_size: 1

    The 34b variant does not fit on 24GB of VRAM. You will need a GPU with 40+ GB of VRAM that also supports flash attention v2; an A6000 or A100 is a good choice.

    accelerate launch scripts/ examples/code-llama/[MODEL_SIZE]/qlora.yml


    accelerate launch scripts/ examples/code-llama/[MODEL_SIZE]/lora.yml
    [huggingface/accelerate] docs/source/basic_tutorials/

    CUDA Out-of-Memory

    One of the most frustrating errors when it comes to running training scripts is hitting "CUDA Out-of-Memory". The entire script needs to be restarted and any progress is lost.

    To address this problem, Accelerate provides the [find_executable_batch_size] utility that is heavily based on toma. This utility retries code that fails due to OOM (out-of-memory) conditions and automatically lowers batch sizes. For each OOM condition, the algorithm decreases the batch size by half and retries the code until it succeeds.
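    The halve-and-retry behavior described above can be sketched without a GPU. This is not Accelerate's implementation: `run_with_halving`, `FakeOutOfMemory`, and the threshold of 16 are hypothetical stand-ins used only to show the control flow.

```python
class FakeOutOfMemory(RuntimeError):
    """Stand-in for a CUDA OOM error so the sketch runs without a GPU."""


def run_with_halving(train_step, starting_batch_size: int) -> int:
    """Call train_step, halving the batch size after each simulated OOM."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            train_step(batch_size)
            return batch_size
        except FakeOutOfMemory:
            batch_size //= 2  # halve and retry
    raise RuntimeError("No executable batch size found.")


attempts = []


def fake_train_step(batch_size: int) -> None:
    attempts.append(batch_size)
    if batch_size > 16:  # pretend anything above 16 does not fit in memory
        raise FakeOutOfMemory(f"batch size {batch_size} too large")


found = run_with_halving(fake_train_step, starting_batch_size=128)
print(found, attempts)
```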

    To use [find_executable_batch_size], restructure your training function to include an inner function with find_executable_batch_size and build your dataloaders inside it. At a minimum, this only takes 4 new lines of code.

    <Tip warning={true}>

    The inner function must take batch size as the first parameter, but we do not pass one to it when called. The wrapper will handle this for you. Any object (models, optimizers) that consumes CUDA memory and is passed to the [Accelerator] also must be declared inside the inner function.

    </Tip>

      def training_function(args):
          accelerator = Accelerator()

    +     @find_executable_batch_size(starting_batch_size=args.batch_size)
    +     def inner_training_loop(batch_size):
    +         nonlocal accelerator # Ensure they can be used in our context
    +         accelerator.free_memory() # Free all lingering references
              model = get_model()
              optimizer = get_optimizer()
              train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
              lr_scheduler = get_scheduler(
                  optimizer,
                  num_training_steps=len(train_dataloader)*num_epochs
              )
              model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
                  model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
              )
              train(model, optimizer, train_dataloader, lr_scheduler)
              validate(model, eval_dataloader)
    +     inner_training_loop()
    [openaccess-ai-collective/axolotl] examples/llama-2/


    This is an example of a llama-2 configuration for 7b and 13b. The yaml file contains configuration for the 7b variant, but you can just as well use the same settings for 13b.

    The 7b variant fits on any 24GB VRAM GPU and will take up about 17 GB of VRAM during training if using qlora and 20 GB if using lora. On an RTX 4090 it trains 3 epochs of the default dataset in about 15 minutes.

    The 13b variant will fit if you change these settings to these values: gradient_accumulation_steps: 2 micro_batch_size: 1

    accelerate launch -m axolotl.cli.train examples/llama-2/qlora.yml


    accelerate launch -m axolotl.cli.train examples/llama-2/lora.yml

    To launch a full finetuning with 16-bit precision:

    accelerate launch -m axolotl.cli.train examples/llama-2/fft_optimized.yml
    [huggingface/transformers] docs/source/en/

    NVMe configuration

    ZeRO-Infinity allows offloading model states to the CPU and/or NVMe to save even more memory. Smart partitioning and tiling algorithms allow each GPU to send and receive very small amounts of data during offloading such that a modern NVMe can fit an even larger total memory pool than is available to your training process. ZeRO-Infinity requires ZeRO-3.

    Depending on the CPU and/or NVMe memory available, you can offload both the optimizer states and parameters, just one of them, or none. You should also make sure the nvme_path is pointing to an NVMe device, because while it still works with a normal hard drive or solid state drive, it'll be significantly slower. With a modern NVMe, you can expect peak transfer speeds of ~3.5GB/s for read and ~3GB/s for write operations. Lastly, run a benchmark on your training setup to determine the optimal aio configuration.

    The example ZeRO-3/Infinity configuration file below sets most of the parameter values to auto, but you could also manually add these values.

    {
        "fp16": {
            "enabled": "auto",
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto"
            }
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto"
            }
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 4,
                "fast_init": false
            },
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 5,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            },
            "aio": {
                "block_size": 262144,
                "queue_depth": 32,
                "thread_count": 1,
                "single_submit": false,
                "overlap_events": true
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "sub_group_size": 1e9,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": true
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 2000,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
    }
    [huggingface/peft] examples/fp4_finetuning/
    free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
    max_memory = f"{free_in_GB-2}GB"
    n_gpus = torch.cuda.device_count()
    max_memory = {i: max_memory for i in range(n_gpus)}
    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/input/v0_4_1/
    def check_mem_mismatch(cls, data):
        if (
            data.get("max_memory") is not None
            and data.get("gpu_memory_limit") is not None
        ):
            raise ValueError(
                "max_memory and gpu_memory_limit are mutually exclusive and cannot be used together."
            )
        return data
    [huggingface/transformers] tests/models/owlv2/
    [openaccess-ai-collective/axolotl] examples/openllama-3b/config.yml
    base_model: openlm-research/open_llama_3b_v2
    model_type: LlamaForCausalLM
    tokenizer_type: LlamaTokenizer
    load_in_8bit: false
    load_in_4bit: false
    strict: false
    push_dataset_to_hub:
    datasets:
      - path: teknium/GPT4-LLM-Cleaned
        type: alpaca
    dataset_prepared_path:
    val_set_size: 0.02
    adapter:
    lora_model_dir:
    sequence_len: 1024
    sample_packing: true
    lora_r:
    lora_alpha:
    lora_dropout:
    lora_target_modules:
    lora_target_linear:
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    output_dir: ./openllama-out
    gradient_accumulation_steps: 1
    micro_batch_size: 1
    num_epochs: 4
    optimizer: adamw_bnb_8bit
    torchdistx_path:
    lr_scheduler: cosine
    learning_rate: 0.000003
    train_on_inputs: false
    group_by_length: false
    float16: true
    bf16: false
    fp16: false
    tf32: false
    gradient_checkpointing: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    gptq_groupsize:
    gptq_model_v1:
    warmup_steps: 20
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.1
    fsdp:
    fsdp_config:
    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    [openaccess-ai-collective/axolotl] examples/falcon/config-7b-qlora.yml
    # 1b: tiiuae/falcon-rw-1b
    # 40b: tiiuae/falcon-40b
    base_model: tiiuae/falcon-7b
    # required by falcon custom model code:
    trust_remote_code: true
    model_type: AutoModelForCausalLM
    tokenizer_type: AutoTokenizer
    load_in_8bit: false
    # enable 4bit for QLoRA
    load_in_4bit: true
    gptq: false
    strict: false
    push_dataset_to_hub:
    datasets:
      - path: QingyiSi/Alpaca-CoT
        data_files:
          - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
        type: "alpaca:chat"
    dataset_prepared_path:
    val_set_size: 0.05
    # enable QLoRA
    adapter: qlora
    lora_model_dir:
    sequence_len: 2048
    max_packed_sequence_len:
    # hyperparameters from QLoRA paper Appendix B.2
    # "We find hyperparameters to be largely robust across datasets"
    lora_r: 64
    lora_alpha: 16
    # 0.1 for models up to 13B
    # 0.05 for 33B and 65B models
    lora_dropout: 0.05
    # add LoRA modules on all linear layers of the base model
    lora_target_modules:
    lora_target_linear: true
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    output_dir: ./qlora-out
    # QLoRA paper Table 9
    # - 16 for 7b & 13b
    # - 32 for 33b, 64 for 64b
    # Max size tested on A6000
    # - 7b: 40
    # - 40b: 4
    # decrease if OOM, increase for max VRAM utilization
    micro_batch_size: 1
    gradient_accumulation_steps: 2
    num_epochs: 4
    # Optimizer for QLoRA
    optimizer: paged_adamw_32bit
    torchdistx_path:
    lr_scheduler: cosine
    # QLoRA paper Table 9
    # - 2e-4 for 7b & 13b
    # - 1e-4 for 33b & 64b
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: true
    gradient_checkpointing: true
    # stop training after this many evaluation losses have increased in a row
    # early_stopping_patience: 3
    resume_from_checkpoint:
    auto_resume_from_checkpoints: true
    local_rank:
    logging_steps: 1
    xformers_attention: true
    flash_attention:
    gptq_groupsize:
    gptq_model_v1:
    warmup_steps: 10
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.000001
    fsdp:
    fsdp_config:
    special_tokens:
      pad_token: "<|endoftext|>"
      bos_token: "<|endoftext|>"
      eos_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] src/axolotl/utils/gradient_checkpointing/
    class Unsloth_Offloaded_Gradient_Checkpointer(  # pylint: disable=invalid-name
        torch.autograd.Function
    ):
        """
        Saves VRAM by smartly offloading to RAM.
        Tiny hit to performance, since we mask the movement via non blocking calls.
        """

        @staticmethod
        @torch.cuda.amp.custom_fwd
        def forward(ctx, forward_function, hidden_states, *args):
            saved_hidden_states = hidden_states.to("cpu", non_blocking=True)
            with torch.no_grad():
                output = forward_function(hidden_states, *args)
            ctx.save_for_backward(saved_hidden_states)
            ctx.forward_function = forward_function
            ctx.args = args
            return output

        @staticmethod
        @torch.cuda.amp.custom_bwd
        def backward(ctx, dY):
            (hidden_states,) = ctx.saved_tensors
            hidden_states = hidden_states.to("cuda", non_blocking=True).detach()
            hidden_states.requires_grad = True
            with torch.enable_grad():
                (output,) = ctx.forward_function(hidden_states, *ctx.args)
            torch.autograd.backward(output, dY)
            return (
                None,
                hidden_states.grad,
            ) + (
                None,
            ) * len(ctx.args)
    [huggingface/transformers] docs/source/en/

    Memory requirements

    Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. For example, to estimate the memory requirements for the bigscience/T0_3B model on a single GPU:

    $ python -c 'from transformers import AutoModel; \
    from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
    model = AutoModel.from_pretrained("bigscience/T0_3B"); \
    estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
    [...]
    Estimated memory needed for params, optim states and gradients for a:
    HW: Setup with 1 node, 1 GPU per node.
    SW: Model with 2783M total params, 65M largest layer params.
      per CPU  |  per GPU |   Options
       70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
       70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
       62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1
       62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0
        0.37GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=1
       15.56GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=0

    This means you either need a single 80GB GPU without CPU offload or an 8GB GPU and a ~60GB CPU to offload to (these are just the memory requirements for the parameters, optimizer states and gradients, and you'll need a bit more for the CUDA kernels and activations). You should also consider the tradeoff between cost and speed because it'll be cheaper to rent or buy a smaller GPU but it'll take longer to train your model.
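    The per-GPU figures in the table can be approximated with simple arithmetic. The sketch below is not DeepSpeed's estimator: it assumes about 18 bytes per parameter when weights, gradients, and optimizer states all stay on the GPU, about 2 bytes per parameter when the optimizer is offloaded to the CPU, and 4 bytes per parameter of the largest layer as working space. These constants and the helper name `estimate_gpu_gib` are assumptions, chosen because they roughly reproduce the 46.91GB and 5.43GB rows.

```python
GIB = 1024 ** 3


def estimate_gpu_gib(total_params: float, largest_layer_params: float,
                     bytes_per_param: int, num_gpus: int = 1) -> float:
    """Rough per-GPU memory for model states, in GiB (assumed constants)."""
    model_states = bytes_per_param * total_params / num_gpus
    largest_layer = 4 * largest_layer_params  # assumed fp32 working copy
    return (model_states + largest_layer) / GIB


total = 2783e6    # "2783M total params" from the output above
largest = 65e6    # "65M largest layer params" from the output above

no_offload = estimate_gpu_gib(total, largest, bytes_per_param=18)
optimizer_offload = estimate_gpu_gib(total, largest, bytes_per_param=2)

print(f"no offload: {no_offload:.2f} GiB")
print(f"optimizer offloaded to CPU: {optimizer_offload:.2f} GiB")
```

    Activations and CUDA kernels are not included, which is why the text above advises leaving headroom beyond these numbers.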

    If you have enough GPU memory, make sure you disable CPU/NVMe offload to make everything faster.

    [huggingface/transformers] tests/models/olmo/
    def get_config(self):
        return OlmoConfig(
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            intermediate_size=self.intermediate_size,
            hidden_act=self.hidden_act,
            hidden_dropout_prob=self.hidden_dropout_prob,
            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
            max_position_embeddings=self.max_position_embeddings,
            type_vocab_size=self.type_vocab_size,
            is_decoder=False,
            initializer_range=self.initializer_range,
            pad_token_id=self.pad_token_id,
        )
    [openaccess-ai-collective/axolotl] examples/redpajama/config-3b.yml
    base_model: togethercomputer/RedPajama-INCITE-Chat-3B-v1
    model_type: GPTNeoXForCausalLM
    tokenizer_type: AutoTokenizer
    trust_remote_code:
    load_in_8bit: false
    datasets:
      - path: vicgalle/alpaca-gpt4
        type: alpaca
    dataset_prepared_path:
    val_set_size: 0.02
    adapter:
    lora_model_dir:
    sequence_len: 2048
    max_packed_sequence_len:
    lora_r: 8
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_modules:
      - q_proj
      - v_proj
    lora_fan_in_fan_out: false
    wandb_project: redpajama-alpaca-3b
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    output_dir: ./redpajama-alpaca-3b
    batch_size: 4
    micro_batch_size: 1
    num_epochs: 4
    optimizer: adamw_bnb_8bit
    torchdistx_path:
    lr_scheduler: cosine
    learning_rate: 0.0000002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    tf32: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 5
    xformers_attention:
    flash_attention:
    gptq_groupsize:
    gptq_model_v1:
    warmup_steps: 20
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.0001
    fsdp:
    fsdp_config:
    tokens:
      pad_token: "<|padding|>"
      bos_token: "<|endoftext|>"
      eos_token: "<|endoftext|>"
      unk_token: "<|endoftext|>"
    [huggingface/transformers] src/transformers/models/olmo/
    # coding=utf-8
    # Copyright 2024 EleutherAI and the HuggingFace Inc. team. All rights reserved.
    #
    # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
    # and OPT implementations in this library. It has been modified from its
    # original forms to accommodate minor architectural differences compared
    # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    """ OLMo model configuration"""
    [huggingface/peft] examples/conditional_generation/accelerate_ds_zero3_cpu_offload_config.yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      gradient_accumulation_steps: 1
      gradient_clipping: 1.0
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 1
    rdzv_backend: static
    same_network: true
    use_cpu: false
    [openaccess-ai-collective/axolotl] examples/qwen/qwen2-moe-qlora.yaml
    base_model: Qwen/Qwen1.5-MoE-A2.7B
    trust_remote_code: true
    load_in_8bit: false
    load_in_4bit: true
    strict: false
    datasets:
      - path: mhenrichsen/alpaca_2k_test
        type: alpaca
    dataset_prepared_path:
    val_set_size: 0.05
    output_dir: ./out
    sequence_len: 1024 # supports up to 32k
    sample_packing: false
    pad_to_sequence_len: false
    adapter: qlora
    lora_model_dir:
    lora_r: 32
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_linear: true
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    gradient_accumulation_steps: 4
    micro_batch_size: 1
    num_epochs: 4
    optimizer: paged_adamw_8bit
    lr_scheduler: cosine
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: true
    gradient_checkpointing: true
    gradient_checkpointing_kwargs:
      use_reentrant: false
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    warmup_steps: 10
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.0
    fsdp:
    fsdp_config:
    special_tokens:
    [huggingface/peft] examples/causal_language_modeling/accelerate_ds_zero3_cpu_offload_config.yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      gradient_accumulation_steps: 1
      gradient_clipping: 1.0
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 1
    rdzv_backend: static
    same_network: true
    use_cpu: false
    [openaccess-ai-collective/axolotl] examples/xgen-7b/xgen-7b-8k-qlora.yml
    # An example finetuning Salesforce's XGen-7b model with 8k context using qlora
    # on Tim Dettmer's Guanaco dataset.
    base_model: Salesforce/xgen-7b-8k-base
    trust_remote_code: true
    model_type: AutoModelForCausalLM
    tokenizer_type: AutoTokenizer
    load_in_8bit: false
    # enable 4bit for QLoRA
    load_in_4bit: true
    gptq: false
    strict: false
    push_dataset_to_hub:
    datasets:
      - path: timdettmers/openassistant-guanaco
        data_files:
          - openassistant_best_replies_train.jsonl
        type: "completion"
    dataset_prepared_path:
    val_set_size: 0.05
    # enable QLoRA
    adapter: qlora
    lora_model_dir:
    sequence_len: 8192
    max_packed_sequence_len:
    # hyperparameters from QLoRA paper Appendix B.2
    # "We find hyperparameters to be largely robust across datasets"
    lora_r: 64
    lora_alpha: 16
    # 0.1 for models up to 13B
    # 0.05 for 33B and 65B models
    lora_dropout: 0.05
    # add LoRA modules on all linear layers of the base model
    lora_target_modules:
    lora_target_linear: true
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    output_dir: ./qlora-out
    # QLoRA paper Table 9
    # - 16 for 7b & 13b
    # - 32 for 33b, 64 for 64b
    # Max size tested on A6000
    # - 7b: 40
    # - 40b: 4
    # decrease if OOM, increase for max VRAM utilization
    micro_batch_size: 1
    gradient_accumulation_steps: 1
    num_epochs: 4
    # Optimizer for QLoRA
    optimizer: paged_adamw_32bit
    torchdistx_path:
    lr_scheduler: cosine
    # QLoRA paper Table 9
    # - 2e-4 for 7b & 13b
    # - 1e-4 for 33b & 64b
    learning_rate: 0.00002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: false
    gradient_checkpointing: true
    # stop training after this many evaluation losses have increased in a row
    # early_stopping_patience: 3
    resume_from_checkpoint:
    auto_resume_from_checkpoints: true
    local_rank:
    logging_steps: 1
    xformers_attention: true
    flash_attention:
    gptq_groupsize:
    gptq_model_v1:
    warmup_steps: 10
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.0
    special_tokens:
      eos_token: "<|endoftext|>"
      bos_token: "<|endoftext|>"
      unk_token: "<|endoftext|>"
      pad_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/qwen/qwen2-moe-lora.yaml
    base_model: Qwen/Qwen1.5-MoE-A2.7B
    trust_remote_code: true
    load_in_8bit: false
    load_in_4bit: false
    strict: false
    datasets:
      - path: mhenrichsen/alpaca_2k_test
        type: alpaca
    dataset_prepared_path:
    val_set_size: 0.05
    output_dir: ./out
    sequence_len: 1024 # supports up to 32k
    sample_packing: false
    pad_to_sequence_len: false
    adapter: lora
    lora_model_dir:
    lora_r: 32
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_linear: true
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    gradient_accumulation_steps: 4
    micro_batch_size: 1
    num_epochs: 4
    optimizer: paged_adamw_8bit
    lr_scheduler: cosine
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: true
    gradient_checkpointing: true
    gradient_checkpointing_kwargs:
      use_reentrant: false
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    warmup_steps: 10
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.0
    fsdp:
    fsdp_config:
    special_tokens:
    [openaccess-ai-collective/axolotl] docs/nccl.qmd
    title: NCCL
    description: Troubleshooting NCCL issues
    NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is configured via several environment variables. A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort:
    Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.

    Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. NVIDIA recommends disabling PCI access control services (ACS) as a possible solution if this is available to you.

    Forcing cross-GPU communication via NVLink may help without increasing timeouts. To verify that your configuration is leveraging NVLink run the following command:

    nvidia-smi nvlink --status

    To force NCCL to use NVLink, simply set this in the environment:

    export NCCL_P2P_LEVEL=NVL

    If NVLink is not available in your environment there are other options for NCCL_P2P_LEVEL in the table below:

    | NCCL_P2P_LEVEL | Description |
    | -------------- | ----------- |
    | PIX | P2P data transfers through no more than a single PCIe bridge. Faster data transfer rates compared to paths involving multiple bridges, but slower compared to direct GPU-to-GPU communication. |
    | PXB | P2P data transfers through multiple PCIe bridges but not going through the PCIe Host Bridge; this path involves a complex routing process, potentially incurring a moderate level of latency. |
    | PHB | P2P data transfers occur over the PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but might introduce additional latency compared to more direct paths (e.g., PIX, NVL). |

    To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:

    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

    It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:
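    As a minimal sketch of what enabling such logging can look like from Python, the snippet below sets two environment variables before any distributed initialization. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are real NCCL and PyTorch settings, but choosing these two variables and these values is an assumption made here for illustration; the original passage does not list which ones it intends.

```python
import os

# Assumed choice of settings; adjust verbosity to your needs.
debug_env = {
    "NCCL_DEBUG": "INFO",                 # NCCL logging verbosity
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # PyTorch distributed debug level
}

# These must be set before the process group is initialized to take effect.
os.environ.update(debug_env)

for name in debug_env:
    print(name, "=", os.environ[name])
```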


    Finally, if you believe your training job needs more time you can increase the timeout past 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See PyTorch init_process_group for documentation on this value.



    See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:

    • model

      base_model: ./llama-7b-hf # local or huggingface repo

      Note: The code will load the right architecture.

    • dataset

      datasets:
        # huggingface repo
        - path: vicgalle/alpaca-gpt4
          type: alpaca

        # huggingface repo with specific configuration/subset
        - path: EleutherAI/pile
          name: enron_emails
          type: completion # format from earlier
          field: text # Optional[str] default: text, field to use for completion data

        # huggingface repo with multiple named configurations/subsets
        - path: bigcode/commitpackft
          name:
            - ruby
            - python
            - typescript
          type: ... # unimplemented custom format

        # fastchat conversation
        # See 'conversation' options:
        - path: ...
          type: sharegpt
          conversation: chatml # default: vicuna_v1.1

        # local
        - path: data.jsonl # or json
          ds_type: json # see other options below
          type: alpaca

        # dataset with splits, but no train split
        - path: knowrohit07/know_sql
          type: context_qa.load_v2
          train_on_split: validation

        # loading from s3 or gcs
        # s3 creds will be loaded from the system default and gcs only supports public access
        - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
          ...

        # Loading Data From a Public URL
        # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
        - path: # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
          ds_type: json # this is the default, see other options below.

    • loading

      load_in_4bit: true
      load_in_8bit: true

      bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
      fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
      tf32: true # require >=ampere

      bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
      float16: true # use instead of fp16 when you don't want AMP

      Note: Repo does not do 4-bit quantization.

    • lora

      adapter: lora # 'qlora' or leave blank for full finetune
      lora_r: 8
      lora_alpha: 16
      lora_dropout: 0.05
      lora_target_modules:
        - q_proj
        - v_proj

    All Config Options

    See these docs for all config options.

    [huggingface/transformers] tests/models/qwen2_moe/
    def get_config(self):
        return Qwen2MoeConfig(
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            max_window_layers=self.max_window_layers,
            use_sliding_window=self.use_sliding_window,
            sliding_window=self.sliding_window,
            num_attention_heads=self.num_attention_heads,
            num_key_value_heads=self.num_key_value_heads,
            intermediate_size=self.intermediate_size,
            hidden_act=self.hidden_act,
            hidden_dropout_prob=self.hidden_dropout_prob,
            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
            max_position_embeddings=self.max_position_embeddings,
            expert_interval=self.expert_interval,
            moe_intermediate_size=self.moe_intermediate_size,
            shared_expert_intermediate_size=self.shared_expert_intermediate_size,
            shared_expert_gate=self.shared_expert_gate,
            num_experts_per_tok=self.num_experts_per_tok,
            num_experts=self.num_experts,
            norm_topk_prob=self.norm_topk_prob,
            output_router_logits=self.output_router_logits,
            router_aux_loss_coef=self.router_aux_loss_coef,
            type_vocab_size=self.type_vocab_size,
            is_decoder=False,
            initializer_range=self.initializer_range,
            pad_token_id=self.pad_token_id,
            bos_token_id=self.bos_token_id,
        )
    [openaccess-ai-collective/axolotl] examples/llama-2/fft_optimized.yml
    base_model: NousResearch/Llama-2-7b-hf
    model_type: LlamaForCausalLM
    tokenizer_type: LlamaTokenizer
    load_in_8bit: false
    load_in_4bit: false
    strict: false
    datasets:
      - path: mhenrichsen/alpaca_2k_test
        type: alpaca
    dataset_prepared_path: last_run_prepared
    val_set_size: 0.05
    output_dir: ./out
    sequence_len: 4096
    sample_packing: true
    pad_to_sequence_len: true
    adapter:
    lora_model_dir:
    lora_r:
    lora_alpha:
    lora_dropout:
    lora_target_linear:
    lora_fan_in_fan_out:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    gradient_accumulation_steps: 1
    micro_batch_size: 1
    num_epochs: 1
    optimizer: adamw_bnb_8bit
    lr_scheduler: cosine
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: false
    gradient_checkpointing: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    flash_attn_cross_entropy: false
    flash_attn_rms_norm: true
    flash_attn_fuse_qkv: false
    flash_attn_fuse_mlp: true
    warmup_steps: 100
    evals_per_epoch: 4
    eval_table_size:
    saves_per_epoch: 1
    debug:
    deepspeed: #deepspeed_configs/zero2.json # multi-gpu only
    weight_decay: 0.1
    fsdp:
    fsdp_config:
    special_tokens:
    [openaccess-ai-collective/axolotl] examples/llama-2/relora.yml
    base_model: NousResearch/Llama-2-7b-hf
    model_type: LlamaForCausalLM
    tokenizer_type: LlamaTokenizer
    load_in_8bit: false
    load_in_4bit: true
    strict: false
    datasets:
      - path: teknium/GPT4-LLM-Cleaned
        type: alpaca
    dataset_prepared_path:
    val_set_size: 0.05
    output_dir: ./relora-out
    adapter: qlora
    lora_model_dir:
    sequence_len: 4096
    sample_packing: true
    pad_to_sequence_len: true
    lora_r: 8
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_modules:
    lora_target_linear: true
    lora_fan_in_fan_out:
    relora_steps: 150
    relora_warmup_steps: 10
    relora_cpu_offload: false
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    gradient_accumulation_steps: 4
    micro_batch_size: 4
    num_epochs: 4
    optimizer: adamw_bnb_8bit
    lr_scheduler: cosine
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: false
    gradient_checkpointing: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    warmup_steps: 10
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.0
    fsdp:
    fsdp_config:
    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    [huggingface/transformers] tests/models/owlvit/
    def get_config(self):
        return OwlViTConfig.from_text_vision_configs(self.text_config, self.vision_config, projection_dim=64)
    [huggingface/accelerate] tests/
    def test_get_balanced_memory(self):
        model = ModelForTest()
        # model has size 236: linear1 64, batchnorm 72, linear2 100
        max_memory = get_balanced_memory(model, max_memory={0: 200, 1: 200})
        assert {0: 200, 1: 200} == max_memory

        # We should be able to set models on a non-contiguous sub-set of
        max_memory = get_balanced_memory(model, max_memory={0: 200, 2: 200})
        assert {0: 200, 2: 200} == max_memory

        max_memory = get_balanced_memory(model, max_memory={0: 300, 1: 300})
        assert {0: 215, 1: 300} == max_memory

        # Last device always get max memory to give more buffer and avoid accidental CPU offload
        max_memory = get_balanced_memory(model, max_memory={0: 300, 1: 500})
        assert {0: 215, 1: 500} == max_memory

        # Last device always get max memory to give more buffer, even if CPU is provided
        max_memory = get_balanced_memory(model, max_memory={0: 300, "cpu": 1000})
        assert {0: 300, "cpu": 1000} == max_memory

        # If we set a device to 0, it's not counted.
        max_memory = get_balanced_memory(model, max_memory={0: 0, 1: 300, 2: 300})
        assert {0: 0, 1: 215, 2: 300} == max_memory

        # If we set a device to 0, it's not counted.
        max_memory = get_balanced_memory(model, max_memory={0: 0, "cpu": 100})
        assert {0: 0, "cpu": 100} == max_memory
    [huggingface/transformers] tests/models/owlv2/
    def get_config(self):
        return Owlv2VisionConfig(
            image_size=self.image_size,
            patch_size=self.patch_size,
            num_channels=self.num_channels,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            intermediate_size=self.intermediate_size,
            dropout=self.dropout,
            attention_dropout=self.attention_dropout,
            initializer_range=self.initializer_range,
        )
    [huggingface/accelerate] tests/test_configs/0_28_0_mpi.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: MULTI_CPU
    downcast_bf16: 'no'
    ipex_config:
      ipex: true
    machine_rank: 0
    main_process_ip:
    main_process_port: 29500
    main_training_function: main
    mixed_precision: 'no'
    mpirun_config:
      mpirun_ccl: '1'
      mpirun_hostfile: /home/user/hostfile
    num_machines: 4
    num_processes: 16
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: true
    [huggingface/accelerate] benchmarks/


    On a setup with two Titan RTXs (24GB of RAM) and 32GB of RAM, we get the following benchmarks (T0pp does not run in float16, which is why it's not included).

    | Model | Model load time | Generation time | dtype | GPU 0 use | GPU 1 use | CPU use | Disk offload |
    |:-----:|:---------------:|:---------------:|:-----:|:---------:|:---------:|:-------:|:------------:|
    | GPT-J-6B | 8.7s | 0.05s per token | float16 | 11.7GB | 0GB | 0GB | no |
    | GPT-J-6B | 12.4s | 0.06s per token | float32 | 21.9GB | 1.5GB | 0GB | no |
    | GPT-Neo-X-20B | 30.9s | 0.08s per token | float16 | 21.5GB | 18GB | 0GB | no |
    | GPT-Neo-X-20B | 78.2s | 10.72s per token | float32 | 20.3GB | 22.7GB | 24.4GB | yes |
    | T0pp (11B) | 29.4s | 0.05s per token | float32 | 21.1GB | 21.3GB | 0GB | no |
    | OPT-30B | 34.5s | 2.37s per token | float16 | 20.7GB | 22.3GB | 14.1GB | no |
    | OPT-30B | 112.3s | 33.9s per token | float32 | 20.2GB | 21.2GB | 23.5GB | yes |

    Note on the results:

    • using two GPUs instead of one does not slow down generation
    • using CPU offload slows down a bit (see OPT-30b)
    • using disk offload slows down a lot (need to implement prefetching)

    You will also note that Accelerate does not use any more GPU and CPU RAM than necessary:

    • peak GPU memory is exactly the size of the model put on a given GPU
    • peak CPU memory is either the size of the biggest checkpoint shard or the part of the model offloaded on CPU, whichever is bigger.
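The spill-over pattern in the benchmark table (GPU 0 first, then GPU 1, then CPU, then disk) can be illustrated with a simplified greedy placement. This is only a sketch of the idea and not Accelerate's actual algorithm, which works per layer and reserves extra buffer; the function name and the per-device budgets are hypothetical.

```python
def place_greedily(model_gb, budgets):
    """Fill devices in the given order; return GB placed on each device.

    Anything that does not fit in the listed devices is assigned to "disk".
    """
    placement = {}
    remaining = model_gb
    for device, capacity in budgets.items():
        used = min(remaining, capacity)
        placement[device] = used
        remaining -= used
    placement["disk"] = remaining
    return placement


# Assumed usable capacity per device in GB (illustrative values only).
budgets = {"gpu0": 22.0, "gpu1": 22.0, "cpu": 24.0}

small = place_greedily(12.0, budgets)  # fits entirely on the first GPU
large = place_greedily(80.0, budgets)  # spills to the second GPU, CPU, and disk
print(small)
print(large)
```

A model that fits on the first device never touches the others, while a model larger than all budgets combined ends up partly on disk, which is the case the notes above describe as slowing generation down a lot.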
    [huggingface/peft] examples/sft/configs/fsdp_config_qlora.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: FSDP
    downcast_bf16: 'no'
    fsdp_config:
      fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
      fsdp_backward_prefetch: BACKWARD_PRE
      fsdp_cpu_ram_efficient_loading: true
      fsdp_forward_prefetch: false
      fsdp_offload_params: true
      fsdp_sharding_strategy: FULL_SHARD
      fsdp_state_dict_type: SHARDED_STATE_DICT
      fsdp_sync_module_states: true
      fsdp_use_orig_params: false
    machine_rank: 0
    main_training_function: main
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    [huggingface/transformers] tests/models/umt5/
    def get_config(self):
        return UMT5Config(
            vocab_size=self.vocab_size,
            d_model=self.hidden_size,
            d_ff=self.d_ff,
            d_kv=self.hidden_size // self.num_attention_heads,
            num_layers=self.num_hidden_layers,
            num_decoder_layers=self.decoder_layers,
            num_heads=self.num_attention_heads,
            relative_attention_num_buckets=self.relative_attention_num_buckets,
            dropout_rate=self.dropout_rate,
            initializer_factor=self.initializer_factor,
            eos_token_id=self.eos_token_id,
            bos_token_id=self.pad_token_id,
            pad_token_id=self.pad_token_id,
            decoder_start_token_id=self.decoder_start_token_id,
        )
    [huggingface/peft] examples/sft/configs/fsdp_config.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: FSDP
    downcast_bf16: 'no'
    fsdp_config:
      fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
      fsdp_backward_prefetch: BACKWARD_PRE
      fsdp_cpu_ram_efficient_loading: true
      fsdp_forward_prefetch: false
      fsdp_offload_params: false
      fsdp_sharding_strategy: FULL_SHARD
      fsdp_state_dict_type: SHARDED_STATE_DICT
      fsdp_sync_module_states: true
      fsdp_use_orig_params: false
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 8
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    [huggingface/transformers] docs/source/en/

    Select a ZeRO stage

    After you've installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. In order of fastest and most memory-efficient:

    | Fastest | Memory efficient |
    |------------------|------------------|
    | ZeRO-1 | ZeRO-3 + offload |
    | ZeRO-2 | ZeRO-3 |
    | ZeRO-2 + offload | ZeRO-2 + offload |
    | ZeRO-3 | ZeRO-2 |
    | ZeRO-3 + offload | ZeRO-1 |

    To find what works best for you, start with the fastest approach and if you run out of memory, try the next stage which is slower but more memory efficient. Feel free to work in whichever direction you prefer (starting with the most memory efficient or fastest) to discover the appropriate balance between speed and memory usage.

    A general process you can use is (start with batch size of 1):

    1. enable gradient checkpointing
    2. try ZeRO-2
    3. try ZeRO-2 and offload the optimizer
    4. try ZeRO-3
    5. try ZeRO-3 and offload parameters to the CPU
    6. try ZeRO-3 and offload parameters and the optimizer to the CPU
    7. try lowering various default values like a narrower search beam if you're using the [~GenerationMixin.generate] method
    8. try mixed half-precision (fp16 on older GPU architectures and bf16 on Ampere) over full-precision weights
    9. add more hardware if possible or enable Infinity to offload parameters and the optimizer to a NVMe
    10. once you're not running out of memory, measure effective throughput and then try to increase the batch size as large as you can to maximize GPU efficiency
    11. lastly, try to optimize your training setup by disabling some offload features or use a faster ZeRO stage and increasing/decreasing the batch size to find the best tradeoff between speed and memory usage
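Steps 2 through 6 of the process above can be sketched as an ordered list of `zero_optimization` sections to try, from fastest to most memory-efficient. The helper name `zero_config` is hypothetical; the keys (`stage`, `offload_optimizer`, `offload_param`, `device`, `pin_memory`) are the ones used in the DeepSpeed JSON configs quoted in this thread. A real config would also need the batch size, precision, and optimizer sections.

```python
import json


def zero_config(stage, offload_optimizer=False, offload_param=False):
    """Build the zero_optimization section for one rung of the ladder."""
    zero = {"stage": stage}
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_param:
        # Parameter offload is a ZeRO-3 feature, so it is only used with stage 3 here.
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {"zero_optimization": zero}


# Ordered from fastest to most memory-efficient, mirroring steps 2 to 6.
ladder = [
    zero_config(2),
    zero_config(2, offload_optimizer=True),
    zero_config(3),
    zero_config(3, offload_param=True),
    zero_config(3, offload_optimizer=True, offload_param=True),
]

for cfg in ladder:
    print(json.dumps(cfg))
```

Each printed line is one candidate fragment; the idea is to move to the next one only after the previous one has run out of memory at batch size 1.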
    [huggingface/peft] tests/
    CONFIG_TESTING_KWARGS = (
        {
            "text_encoder": {
                "r": 8,
                "lora_alpha": 32,
                "target_modules": ["k_proj", "q_proj", "v_proj", "out_proj", "fc1", "fc2"],
                "lora_dropout": 0.0,
                "bias": "none",
            },
            "unet": {
                "r": 8,
                "lora_alpha": 32,
                "target_modules": ["proj_in", "proj_out", "to_k", "to_q", "to_v", "to_out.0", "", ""],
                "lora_dropout": 0.0,
                "bias": "none",
            },
        },
        {
            "text_encoder": {
                "r": 8,
                "alpha": 32,
                "target_modules": ["k_proj", "q_proj", "v_proj", "out_proj", "fc1", "fc2"],
                "rank_dropout": 0.0,
                "module_dropout": 0.0,
            },
            "unet": {
                "r": 8,
                "alpha": 32,
                "target_modules": ["proj_in", "proj_out", "to_k", "to_q", "to_v", "to_out.0", "", ""],
                "rank_dropout": 0.0,
                "module_dropout": 0.0,
            },
        },
        {
            "text_encoder": {
                "r": 8,
                "target_modules": ["k_proj", "q_proj", "v_proj", "out_proj", "fc1", "fc2"],
                "module_dropout": 0.0,
            },
            "unet": {
                "r": 8,
                "target_modules": ["proj_in", "proj_out", "to_k", "to_q", "to_v", "to_out.0", "", ""],
                "module_dropout": 0.0,
            },
        },
        {
            "text_encoder": {
                "boft_block_num": 1,
                "boft_block_size": 0,
                "target_modules": ["k_proj", "q_proj", "v_proj", "out_proj", "fc1", "fc2"],
                "boft_dropout": 0.0,
            },
            "unet": {
                "boft_block_num": 1,
                "boft_block_size": 0,
                "target_modules": ["proj_in", "proj_out", "to_k", "to_q", "to_v", "to_out.0", "", ""],
                "boft_dropout": 0.0,
            },
        },
    )
    [huggingface/accelerate] tests/test_configs/0_12_0.yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config: {}
    distributed_type: 'NO'
    downcast_bf16: 'no'
    fsdp_config: {}
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 1
    use_cpu: false
    [huggingface/transformers] src/transformers/models/olmo/
    class OlmoConfig(PretrainedConfig):
        r"""
        This is the configuration class to store the configuration of a [`OlmoModel`]. It is used to instantiate an OLMo
        model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
        defaults will yield a similar configuration to that of the [allenai/OLMo-7B-hf](

        Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
        documentation from [`PretrainedConfig`] for more information.

        Args:
            vocab_size (`int`, *optional*, defaults to 50304):
                Vocabulary size of the OLMo model. Defines the number of different tokens that can be represented by the
                `inputs_ids` passed when calling [`OlmoModel`]
            hidden_size (`int`, *optional*, defaults to 4096):
                Dimension of the hidden representations.
            intermediate_size (`int`, *optional*, defaults to 11008):
                Dimension of the MLP representations.
            num_hidden_layers (`int`, *optional*, defaults to 32):
                Number of hidden layers in the Transformer decoder.
            num_attention_heads (`int`, *optional*, defaults to 32):
                Number of attention heads for each attention layer in the Transformer decoder.
            num_key_value_heads (`int`, *optional*):
                This is the number of key_value heads that should be used to implement Grouped Query Attention. If
                `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
                `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
                converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
                by meanpooling all the original heads within that group. For more details checkout [this
                paper]( If it is not specified, will default to
                `num_attention_heads`.
            hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
                The non-linear activation function (function or string) in the decoder.
            max_position_embeddings (`int`, *optional*, defaults to 2048):
                The maximum sequence length that this model might ever be used with.
            initializer_range (`float`, *optional*, defaults to 0.02):
                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
            use_cache (`bool`, *optional*, defaults to `True`):
                Whether or not the model should return the last key/values attentions (not used by all models). Only
                relevant if `config.is_decoder=True`.
            pad_token_id (`int`, *optional*, defaults to 1):
                Padding token id.
            bos_token_id (`int`, *optional*):
                Beginning of stream token id.
            eos_token_id (`int`, *optional*, defaults to 50279):
                End of stream token id.
            tie_word_embeddings (`bool`, *optional*, defaults to `False`):
                Whether to tie weight embeddings
            rope_theta (`float`, *optional*, defaults to 10000.0):
                The base period of the RoPE embeddings.
            rope_scaling (`Dict`, *optional*):
                Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
                strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
                `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
                `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
                these scaling strategies behave:
                This is an experimental feature, subject to breaking API changes in future versions.
            attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
                Whether to use a bias in the query, key, value and output projection layers during self-attention.
            attention_dropout (`float`, *optional*, defaults to 0.0):
                The dropout ratio for the attention probabilities.
            clip_qkv (`float`, *optional*):
                If not `None`, elements of query, key and value attention states are clipped so that their
                absolute value does not exceed this value.

        ```python
        >>> from transformers import OlmoModel, OlmoConfig

        >>> # Initializing a OLMo 7B style configuration
        >>> configuration = OlmoConfig()

        >>> # Initializing a model from the OLMo 7B style configuration
        >>> model = OlmoModel(configuration)

        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```"""

        model_type = "olmo"
        keys_to_ignore_at_inference = ["past_key_values"]

        def __init__(
            self,
            vocab_size=50304,
            hidden_size=4096,
            intermediate_size=11008,
            num_hidden_layers=32,
            num_attention_heads=32,
            num_key_value_heads=None,
            hidden_act="silu",
            max_position_embeddings=2048,
            initializer_range=0.02,
            use_cache=True,
            pad_token_id=1,
            bos_token_id=None,
            eos_token_id=50279,
            tie_word_embeddings=False,
            rope_theta=10000.0,
            rope_scaling=None,
            attention_bias=False,
            attention_dropout=0.0,
            clip_qkv=None,
            **kwargs,
        ):
            self.vocab_size = vocab_size
            self.max_position_embeddings = max_position_embeddings
            self.hidden_size = hidden_size
            self.intermediate_size = intermediate_size
            self.num_hidden_layers = num_hidden_layers
            self.num_attention_heads = num_attention_heads

            # for backward compatibility
            if num_key_value_heads is None:
                num_key_value_heads = num_attention_heads

            self.num_key_value_heads = num_key_value_heads
            self.hidden_act = hidden_act
            self.initializer_range = initializer_range
            self.use_cache = use_cache
            self.rope_theta = rope_theta
            self.rope_scaling = rope_scaling
            self._rope_scaling_validation()
            self.attention_bias = attention_bias
            self.attention_dropout = attention_dropout
            self.clip_qkv = clip_qkv

            super().__init__(
                pad_token_id=pad_token_id,
                bos_token_id=bos_token_id,
                eos_token_id=eos_token_id,
                tie_word_embeddings=tie_word_embeddings,
                **kwargs,
            )

        # Copied from transformers.models.llama.configuration_llama.LlamaConfig._rope_scaling_validation
        def _rope_scaling_validation(self):
            """
            Validate the `rope_scaling` configuration.
            """
            if self.rope_scaling is None:
                return

            if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
                raise ValueError(
                    "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
                    f"got {self.rope_scaling}"
                )
            rope_scaling_type = self.rope_scaling.get("type", None)
            rope_scaling_factor = self.rope_scaling.get("factor", None)
            if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
                raise ValueError(
                    f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
                )
            if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
                raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
    [huggingface/peft] examples/sft/configs/deepspeed_config_z3_qlora.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    deepspeed_config:
      deepspeed_multinode_launcher: standard
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    [huggingface/accelerate] tests/test_configs/0_11_0.yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config: {}
    distributed_type: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 1
    use_cpu: false

    High performance on consumer hardware

    Consider the memory requirements for training the following models on the ought/raft/twitter_complaints dataset with an A100 80GB GPU with more than 64GB of CPU RAM.

    | Model | Full Finetuning | PEFT-LoRA PyTorch | PEFT-LoRA DeepSpeed with CPU Offloading |
    | --------- | ---- | ---- | ---- |
    | bigscience/T0_3B (3B params) | 47.14GB GPU / 2.96GB CPU | 14.4GB GPU / 2.96GB CPU | 9.8GB GPU / 17.8GB CPU |
    | bigscience/mt0-xxl (12B params) | OOM GPU | 56GB GPU / 3GB CPU | 22GB GPU / 52GB CPU |
    | bigscience/bloomz-7b1 (7B params) | OOM GPU | 32GB GPU / 3.8GB CPU | 18.1GB GPU / 35GB CPU |

    With LoRA you can fully finetune a 12B parameter model that would've otherwise run out of memory on the 80GB GPU, and comfortably fit and train a 3B parameter model. When you look at the 3B parameter model's performance, it is comparable to a fully finetuned model at a fraction of the GPU memory.

    | Submission Name | Accuracy |
    | --------- | ---- |
    | Human baseline (crowdsourced) | 0.897 |
    | Flan-T5 | 0.892 |
    | lora-t0-3b | 0.863 |

    [!TIP] The bigscience/T0_3B model performance isn't optimized in the table above. You can squeeze even more performance out of it by playing around with the input instruction templates, LoRA hyperparameters, and other training related hyperparameters. The final checkpoint size of this model is just 19MB compared to 11GB of the full bigscience/T0_3B model. Learn more about the advantages of finetuning with PEFT in this blog post.
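The memory savings in the table come largely from how few parameters LoRA trains. As a minimal sketch, a rank-`r` adapter on a linear layer adds two low-rank factors with `r * (d_in + d_out)` parameters in total, while the frozen weight has `d_in * d_out`. The function names are hypothetical, and the sizes (a 4096 by 4096 projection with `r = 8`) are illustrative values rather than figures from the benchmark.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out


d = 4096  # illustrative hidden size
r = 8     # illustrative LoRA rank
trainable = lora_params(d, d, r)
frozen = full_params(d, d)
print(f"LoRA: {trainable} trainable vs {frozen} frozen "
      f"({100 * trainable / frozen:.2f}% of the layer)")
```

Because gradients and optimizer state are only kept for the trainable factors, the per-layer training overhead shrinks by roughly the same ratio, which is consistent with the small final checkpoint size mentioned in the tip above.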

    [openaccess-ai-collective/axolotl] docs/config.qmd
    # currently only supported on Llama and Mistral
    # Whether to bettertransformers
    # Whether to use xformers attention patch
    # Whether to use flash attention patch
    flash_attn_cross_entropy:  # Whether to use flash-attention cross entropy implementation - advanced use only
    flash_attn_rms_norm:  # Whether to use flash-attention rms norm implementation - advanced use only
    flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
    flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
    # Whether to use scaled-dot-product attention
    # Shifted-sparse attention (only llama) -
    # Resume from a specific checkpoint dir
    # If resume_from_checkpoint isn't set and you simply want it to start where it left off.
    # Be careful with this being turned on between different models.
    auto_resume_from_checkpoints: false
    # Don't mess with this, it's here for accelerate and torchrun
    # Add or change special tokens.
    # If you add tokens here, you don't need to add them to the `tokens` list.
      # bos_token: "<s>"
      # eos_token: "</s>"
      # unk_token: "<unk>"
      # pad_token: "[PAD]"
    # Add extra tokens.
    # FSDP
    # Deepspeed config path. e.g., deepspeed_configs/zero3.json
    # Advanced DDP Arguments
    # Path to torch distx for optim 'adamw_anyprecision'
    # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
    # Debug mode
    # Seed
    # Allow overwrite yml config using from cli
    [huggingface/accelerate] tests/deepspeed/ds_config_zero3.json
    {
      "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "bf16": {
        "enabled": "auto"
      },
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": "auto",
          "weight_decay": "auto",
          "torch_adam": true,
          "adam_w_mode": true
        }
      },
      "scheduler": {
        "type": "WarmupLR",
        "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": "auto"
        }
      },
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
        },
        "offload_param": {
          "device": "cpu",
          "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto"
      },
      "gradient_accumulation_steps": 1,
      "gradient_clipping": "auto",
      "steps_per_print": 2000,
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "wall_clock_breakdown": false
    }
    [huggingface/peft] examples/sft/configs/deepspeed_config.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    deepspeed_config:
      deepspeed_multinode_launcher: standard
      gradient_accumulation_steps: 4
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 8
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    [huggingface/accelerate] tests/fsdp/
    def test_peak_memory_usage(self):
        self.test_file_path = self.test_scripts_folder / ""
        cmd = get_launch_command(num_processes=2, num_machines=1, machine_rank=0)
        for spec, peak_mem_upper_bound in self.peak_memory_usage_upper_bound.items():
            cmd_config = cmd.copy()
            if "fp16" in spec:
                cmd_config.extend(["--mixed_precision=fp16"])
            else:
                cmd_config.extend(["--mixed_precision=no"])

            if "multi_gpu" in spec:
                continue
            else:
                cmd_config.extend(["--use_fsdp"])
                for i, strategy in enumerate(FSDP_SHARDING_STRATEGY):
                    if strategy.lower() in spec:
                        cmd_config.append(f"--fsdp_sharding_strategy={strategy}")
                        break

                if "cpu_offload" in spec:
                    cmd_config.append("--fsdp_offload_params=True")

                for policy in FSDP_AUTO_WRAP_POLICY:
                    if policy.lower() in spec:
                        cmd_config.append(f"--fsdp_auto_wrap_policy={policy}")
                        break

                if policy == "TRANSFORMER_BASED_WRAP":
                    cmd_config.append("--fsdp_transformer_layer_cls_to_wrap=BertLayer")
                elif policy == "SIZE_BASED_WRAP":
                    cmd_config.append("--fsdp_min_num_params=2000")

            cmd_config.extend(
                [
                    self.test_file_path,
                    f"--output_dir={self.tmpdir}",
                    f"--peak_memory_upper_bound={peak_mem_upper_bound}",
                    f"--n_train={self.n_train}",
                    f"--n_val={self.n_val}",
                ]
            )
            with patch_environment(omp_num_threads=1):
                execute_subprocess_async(cmd_config)
    [huggingface/accelerate] tests/deepspeed/ds_config_zero2.json
    {
      "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "bf16": {
        "enabled": "auto"
      },
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": "auto",
          "weight_decay": "auto",
          "torch_adam": true,
          "adam_w_mode": true
        }
      },
      "scheduler": {
        "type": "WarmupLR",
        "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": "auto"
        }
      },
      "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
      },
      "gradient_accumulation_steps": 1,
      "gradient_clipping": "auto",
      "steps_per_print": 2000,
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "wall_clock_breakdown": false
    }