To use Axolotl with multiple GPUs, specifically with 4 GPUs, you can follow the instructions provided in the Axolotl documentation for multi-node and multi-GPU training. The key is to configure the `accelerate` library correctly for your environment and ensure that your model and training script are compatible with distributed training. Here's a step-by-step guide based on the information from the Axolotl documentation and the `accelerate` library:
**Install Axolotl and Dependencies:** Ensure you have Axolotl and all necessary dependencies installed. This includes `accelerate`, `torch`, and any other libraries your training script requires.
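For reference, these are the same install commands the quickstart later in this document uses:

```bash
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
```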
**Configure Accelerate:** Use the `accelerate config` command to create a configuration for your multi-GPU setup. This command will prompt you to answer several questions regarding your environment.

```bash
accelerate config
```
When prompted, specify that you are using a multi-GPU setup. If you're running on a single machine with 4 GPUs, specify `1` for the number of machines and `4` for the number of GPUs per machine.
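For reference, those answers correspond roughly to a single-machine config like the sketch below (a minimal `~/.cache/huggingface/accelerate/default_config.yaml` for plain multi-GPU training; the exact fields `accelerate config` writes depend on your answers and version, so treat this as an assumption rather than the tool's literal output):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false
```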
**Prepare Your Training Script:** Ensure your training script is set up for distributed training. Axolotl and `accelerate` handle a lot of the boilerplate code required for PyTorch distributed training, but you should ensure your model, data loading, and training loop are compatible with distributed training.
**Launch Training:** Use the `accelerate launch` command to start your training script with the configuration you set up. Specify the number of processes equal to the number of GPUs you wish to use. For a 4 GPU setup, you would use:

```bash
accelerate launch --num_processes 4 your_training_script.py
```
Replace `your_training_script.py` with the path to your actual training script.
**(Optional) Advanced Configuration:** For more advanced setups, such as using specific GPUs, you can manually edit the configuration file created by `accelerate config` or set environment variables like `CUDA_VISIBLE_DEVICES` to control which GPUs are used.
**Monitoring and Debugging:** Keep an eye on the training process. You can use tools like `nvidia-smi` to monitor GPU utilization and ensure that the workload is evenly distributed across all GPUs.
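For example, assuming the standard `watch` utility is available on your system, you can refresh `nvidia-smi` every second while training runs:

```bash
watch -n 1 nvidia-smi
```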
Here's a simplified example of how your training script might look:
```python
import torch
from accelerate import Accelerator


def training_loop():
    # Your model, optimizer, and dataloaders setup here
    model = ...
    optimizer = ...
    train_dataloader = ...
    num_epochs = 3  # set to however many epochs you need

    accelerator = Accelerator()
    model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    for epoch in range(num_epochs):
        for batch in train_dataloader:
            # Your training logic here
            optimizer.zero_grad()
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()


if __name__ == "__main__":
    training_loop()
```
This script uses `accelerate` to prepare your model, optimizer, and dataloaders for distributed training across multiple GPUs.
---
title: Multi Node
description: How to use Axolotl on multiple machines
---
You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using one of the presets below:

`~/.cache/huggingface/accelerate/default_config.yaml`
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
machine_rank: 0 # Set to 0 for the main machine, increment by one for other machines
main_process_ip: 10.0.0.4 # Set to main machine's IP
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2 # Change to the number of machines
num_processes: 4 # That's the total number of GPUs, (for example: if you have 2 machines with 4 GPU, put 8)
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Configure your model to use FSDP with, for example:

```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```
On each machine you need a copy of Axolotl; we suggest using the same commit to ensure compatibility.
You will also need to have the same configuration file for your model on each machine.
On the main machine only, make sure the port you set as `main_process_port` is open in TCP and reachable by other machines.
All you have to do now is launch using accelerate on each machine as you usually would, and voilà: the processes will start once you have launched accelerate on every machine.
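For example, assuming each machine has the config above with its own `machine_rank`, you run the same launch command on every node (the config filename is illustrative):

```bash
# Run this on every machine; the rank comes from each machine's accelerate config
accelerate launch -m axolotl.cli.train your_config.yml
```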
Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.
Features:
|             | fp16/fp32 | lora | qlora | gptq | gptq w/flash attn | flash attn | xformers attn |
|-------------|:----------|:-----|:------|:-----|:------------------|:-----------|:--------------|
| llama       | ✅        | ✅   | ✅    | ✅   | ✅                | ✅         | ✅            |
| Mistral     | ✅        | ✅   | ✅    | ✅   | ✅                | ✅         | ✅            |
| Mixtral-MoE | ✅        | ✅   | ✅    | ❓   | ❓                | ❓         | ❓            |
| Mixtral8X22 | ✅        | ✅   | ✅    | ❓   | ❓                | ❓         | ❓            |
| Pythia      | ✅        | ✅   | ✅    | ❌   | ❌                | ❌         | ❓            |
| cerebras    | ✅        | ✅   | ✅    | ❌   | ❌                | ❌         | ❓            |
| btlm        | ✅        | ✅   | ✅    | ❌   | ❌                | ❌         | ❓            |
| mpt         | ✅        | ❌   | ❓    | ❌   | ❌                | ❌         | ❓            |
| falcon      | ✅        | ✅   | ✅    | ❌   | ❌                | ❌         | ❓            |
| gpt-j       | ✅        | ✅   | ✅    | ❌   | ❌                | ❓         | ❓            |
| XGen        | ✅        | ❓   | ✅    | ❓   | ❓                | ❓         | ✅            |
| phi         | ✅        | ✅   | ✅    | ❓   | ❓                | ❓         | ❓            |
| RWKV        | ✅        | ❓   | ❓    | ❓   | ❓                | ❓         | ❓            |
| Qwen        | ✅        | ✅   | ✅    | ❓   | ❓                | ❓         | ❓            |
| Gemma       | ✅        | ✅   | ✅    | ❓   | ❓                | ✅         | ❓            |
✅: supported ❌: not supported ❓: untested
Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.
Requirements: Python >=3.10 and Pytorch >=2.1.1.
```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
```
```bash
# preprocess datasets - optional but recommended
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

# inference
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./outputs/lora-out"

# gradio
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./outputs/lora-out" --gradio

# remote yaml files - the yaml config can be hosted on a public URL
# Note: the yaml config must directly link to the **raw** yaml
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
```
```bash
docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest
```
Or run on the current files for development:
docker compose up -d
<details>
<summary>Docker advanced</summary>

> [!TIP]
> If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.
A more powerful Docker command to run would be this:
```bash
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
```
It additionally:

- Passes the `--ipc` and `--ulimit` args.
- Mounts your working directory and Hugging Face cache into the container via the `--mount`/`-v` args.
- The `--name` argument simply makes it easier to refer to the container in vscode (`Dev Containers: Attach to Running Container...`) or in your terminal.
- The `--privileged` flag gives all capabilities to the container.
- The `--shm-size 10g` argument increases the shared memory size. Use this if you see `exitcode: -7` errors using deepspeed.

More information on the nvidia website.
</details>

Install python >=3.10

Install pytorch stable https://pytorch.org/get-started/locally/

Install Axolotl along with python dependencies

```bash
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
```

(Optional) Login to Huggingface to use gated models/datasets.

```bash
huggingface-cli login
```

Get the token at huggingface.co/settings/tokens
For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest
```bash
sudo apt update
sudo apt install -y python3.10

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --config python # pick 3.10 if given option
python -V # should be 3.10
```
```bash
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
```
Install Pytorch https://pytorch.org/get-started/locally/
Follow instructions on quickstart.
Run
```bash
pip3 install protobuf==3.20.3
pip3 install -U --ignore-installed requests Pillow psutil scipy
```
</details>

```bash
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```
Use a deep learning Linux OS with CUDA and PyTorch installed. Then follow the instructions in the quickstart.
Make sure to run the below to uninstall xla.
</details>

```bash
pip uninstall -y torch_xla[tpu]
```
Please use WSL or Docker!
Use the below instead of the install method in QuickStart.
```bash
pip3 install -e '.'
```
More info: mac.md
Please use this example notebook.
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:
```bash
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]"  # choose your clouds
sky check
```
Get the example YAMLs of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
```bash
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/axolotl
```
Use one command to launch:
```bash
# On-demand
HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

# Managed spot (auto-recovery on preemption)
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
```
To launch on GPU instances (both on-demand and spot instances) on public clouds (GCP, AWS, Azure, Lambda Labs, TensorDock, Vast.ai, and CUDO), you can use dstack.
Write a job description in YAML as below:
```yaml
# dstack.yaml
type: task

image: winglian/axolotl-cloud:main-20240429-py3.11-cu121-2.2.2

env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY

commands:
  - accelerate launch -m axolotl.cli.train config.yaml

ports:
  - 6006

resources:
  gpu:
    memory: 24GB..
    count: 2
```
Then simply run the job with the `dstack run` command. Append the `--spot` option if you want a spot instance. The `dstack run` command will show you the instance with the cheapest price across multiple cloud services:
```bash
pip install dstack
HUGGING_FACE_HUB_TOKEN=xxx WANDB_API_KEY=xxx dstack run . -f dstack.yaml # --spot
```
For further and fine-grained use cases, please refer to the official dstack documents and the detailed description of the axolotl example on the official repository.
Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
See these docs for more information on how to use different dataset formats.
See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:
model
```yaml
base_model: ./llama-7b-hf # local or huggingface repo
```
Note: The code will load the right architecture.
dataset
```yaml
datasets:
  # huggingface repo
  - path: vicgalle/alpaca-gpt4
    type: alpaca

  # huggingface repo with specific configuration/subset
  - path: EleutherAI/pile
    name: enron_emails
    type: completion # format from earlier
    field: text # Optional[str] default: text, field to use for completion data

  # huggingface repo with multiple named configurations/subsets
  - path: bigcode/commitpackft
    name:
      - ruby
      - python
      - typescript
    type: ... # unimplemented custom format

  # fastchat conversation
  # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
  - path: ...
    type: sharegpt
    conversation: chatml # default: vicuna_v1.1

  # local
  - path: data.jsonl # or json
    ds_type: json # see other options below
    type: alpaca

  # dataset with splits, but no train split
  - path: knowrohit07/know_sql
    type: context_qa.load_v2
    train_on_split: validation

  # loading from s3 or gcs
  # s3 creds will be loaded from the system default and gcs only supports public access
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
    ...

  # Loading Data From a Public URL
  # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
  - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
    ds_type: json # this is the default, see other options below.
```
loading
```yaml
load_in_4bit: true
load_in_8bit: true

bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
tf32: true # require >=ampere

bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
float16: true # use instead of fp16 when you don't want AMP
```
Note: Repo does not do 4-bit quantization.
lora
```yaml
adapter: lora # 'qlora' or leave blank for full finetune
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
```
See these docs for all config options.
Run
```bash
accelerate launch -m axolotl.cli.train your_config.yml
```
> [!TIP]
> You can also reference a config file that is hosted on a public URL, for example:
```bash
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
```
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.

- Set `dataset_prepared_path:` to a local folder for saving and loading the pre-tokenized dataset.
- Set `push_dataset_to_hub: hf_user/repo` to push it to Huggingface.
- Use `--debug` to see preprocessed examples.

```bash
python -m axolotl.cli.preprocess your_config.yml
```
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
```yaml
deepspeed: deepspeed_configs/zero1.json
```
```bash
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
```
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your `WANDB_API_KEY` environment variable is set (recommended) or log in to wandb with `wandb login`.
```yaml
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
```
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
```
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
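If you want to double-check the result after training, a small sketch like the one below confirms the tokens are in the vocabulary (the tokenizer path is illustrative; point it at the tokenizer saved with your run):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./outputs/lora-out")  # illustrative path

for tok in ["<s>", "</s>", "<unk>", "<|im_start|>", "<|im_end|>"]:
    # Tokens present in the vocabulary map to a single known id
    print(tok, tokenizer.convert_tokens_to_ids(tok))
```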
Axolotl allows you to load your model in an interactive terminal playground for quick experimentation. The config file is the same config file used for training.
Pass the appropriate flag to the inference command, depending upon what kind of model was trained:
```bash
python -m axolotl.cli.inference examples/your_config.yml --lora_model_dir="./lora-output-dir"
```
```bash
python -m axolotl.cli.inference examples/your_config.yml --base_model="./completed-model"
```
```bash
cat /tmp/prompt.txt | python -m axolotl.cli.inference examples/your_config.yml \
    --base_model="./completed-model" --prompter=None --load_in_8bit=True
```
With gradio hosting:
```bash
python -m axolotl.cli.inference examples/your_config.yml --gradio
```
Please use `--sample_packing False` if you have it on and receive an error similar to the one below:

```
RuntimeError: stack expects each tensor to be equal size, but got [1, 32, 1, 128] at entry 0 and [1, 32, 8, 128] at entry 1
```
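Equivalently, you can turn packing off in your config file rather than on the command line (a minimal sketch showing only the relevant key):

```yaml
sample_packing: false
```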
The following command will merge your LORA adapter with your base model. You can optionally pass the argument `--lora_model_dir` to specify the directory where your LORA adapter was saved; otherwise, this will be inferred from `output_dir` in your axolotl config file. The merged model is saved in the sub-directory `{lora_model_dir}/merged`.
```bash
python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"
```
You may need to use the `gpu_memory_limit` and/or `lora_on_cpu` config options to avoid running out of memory. If you still run out of CUDA memory, you can try to merge in system RAM with:
```bash
CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora ...
```
although this will be very slow; using the config options above is recommended instead.
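As a sketch, the relevant config keys might look like this (the values are illustrative placeholders; adjust them to your hardware):

```yaml
gpu_memory_limit: 20GiB # cap the GPU memory used when loading the model for the merge
lora_on_cpu: true       # keep the LoRA weights on CPU during the merge
```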
"""module for gpu capabilities"""
```bash
#!/bin/bash
#SBATCH --job-name=multigpu
#SBATCH -D .
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1         # number of MP tasks
#SBATCH --gres=gpu:4                # number of GPUs per node
#SBATCH --cpus-per-task=160         # number of cores per tasks
#SBATCH --time=01:59:00             # maximum execution time (HH:MM:SS)

######################
### Set enviroment ###
######################
source activateEnvironment.sh
export GPUS_PER_NODE=4
######################

export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"
export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
export SCRIPT_ARGS=" \
    --mixed_precision fp16 \
    --output_dir ${ACCELERATE_DIR}/examples/output \
    --with_tracking \
    "

accelerate launch --num_processes $GPUS_PER_NODE $SCRIPT $SCRIPT_ARGS
```
```python
class Bnb4bitTestMultiGpu(Base4bitTest):
    def setUp(self):
        super().setUp()

    def test_multi_gpu_loading(self):
        r"""
        This tests that the model has been loaded and can be used correctly on a multi-GPU setup.
        Let's just try to load a model on 2 GPUs and see if it works. The model we test has ~2GB
        of total, 3GB should suffice
        """
        model_parallel = AutoModelForCausalLM.from_pretrained(
            self.model_name, load_in_4bit=True, device_map="balanced"
        )

        # Check correct device map
        self.assertEqual(set(model_parallel.hf_device_map.values()), {0, 1})

        # Check that inference pass works on the model
        encoded_input = self.tokenizer(self.input_text, return_tensors="pt")

        # Second real batch
        output_parallel = model_parallel.generate(
            input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10
        )
        self.assertIn(
            self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS
        )
```
```bash
torchrun --nproc_per_node 8 --nnodes 1 train.py \
    --seed 100 \
    --model_name_or_path "mistralai/Mistral-7B-v0.1" \
    --dataset_name "smangrul/ultrachat-10k-chatml" \
    --chat_template_format "chatml" \
    --add_special_tokens False \
    --append_concat_token False \
    --splits "train,test" \
    --max_seq_len 2048 \
    --num_train_epochs 1 \
    --logging_steps 5 \
    --log_level "info" \
    --logging_strategy "steps" \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --push_to_hub \
    --hub_private_repo True \
    --hub_strategy "every_save" \
    --bf16 True \
    --packing True \
    --learning_rate 1e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --max_grad_norm 1.0 \
    --output_dir "mistral-sft-lora-multigpu" \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing True \
    --use_reentrant False \
    --dataset_text_field "content" \
    --use_peft_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target_modules "all-linear" \
    --use_4bit_quantization True \
    --use_nested_quant True \
    --bnb_4bit_compute_dtype "bfloat16" \
    --use_flash_attn True
```
```python
def multi_gpu_launcher(args):
    import torch.distributed.run as distrib_run

    current_env = prepare_multi_gpu_env(args)
    if not check_cuda_p2p_ib_support():
        message = "Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled."
        warn = False
        if "NCCL_P2P_DISABLE" not in current_env:
            current_env["NCCL_P2P_DISABLE"] = "1"
            warn = True
        if "NCCL_IB_DISABLE" not in current_env:
            current_env["NCCL_IB_DISABLE"] = "1"
            warn = True
        if warn:
            logger.warning(message)

    debug = getattr(args, "debug", False)
    args = _filter_args(
        args,
        distrib_run.get_args_parser(),
        ["--training_script", args.training_script, "--training_script_args", args.training_script_args],
    )

    with patch_environment(**current_env):
        try:
            distrib_run.run(args)
        except Exception:
            if is_rich_available() and debug:
                console = get_console()
                console.print("\n[bold red]Using --debug, `torch.distributed` Stack Trace:[/bold red]")
                console.print_exception(suppress=[__file__], show_locals=False)
            else:
                raise
```
# Example notebook for running Axolotl on google colab
```python
import torch

assert torch.cuda.is_available()
```
## Install Axolotl and dependencies
```
!pip install torch=="2.1.2"
!pip install -e git+https://github.com/OpenAccess-AI-Collective/axolotl#egg=axolotl
!pip install flash-attn=="2.5.0"
!pip install deepspeed=="0.13.1"
!pip install mlflow=="2.13.0"
```
## Create a yaml config file
```python
import yaml

yaml_string = """
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
"""
# Note: the `datasets:` entry is empty in this excerpt; fill in your dataset before running.

yaml_dict = yaml.safe_load(yaml_string)

file_path = 'test_axolotl.yaml'

with open(file_path, 'w') as file:
    yaml.dump(yaml_dict, file)
```
## Launch the training
```
!accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml
```
## Play with inference
```
!accelerate launch -m axolotl.cli.inference /content/test_axolotl.yaml --qlora_model_dir="./qlora-out" --gradio
```
The following command shows how to fine-tune wav2vec2-base for 🌎 Language Identification on the CommonLanguage dataset.
```bash
python run_audio_classification.py \
    --model_name_or_path facebook/wav2vec2-base \
    --dataset_name common_language \
    --audio_column_name audio \
    --label_column_name language \
    --output_dir wav2vec2-base-lang-id \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --fp16 \
    --learning_rate 3e-4 \
    --max_length_seconds 16 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 1 \
    --dataloader_num_workers 8 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model accuracy \
    --save_total_limit 3 \
    --seed 0 \
    --push_to_hub
```
On 4 V100 GPUs (16GB), this script should run in ~1 hour and yield accuracy of 79.45%.
👀 See the results here: anton-l/wav2vec2-base-lang-id
```python
def check_mem_mismatch(cls, data):
    if (
        data.get("max_memory") is not None
        and data.get("gpu_memory_limit") is not None
    ):
        raise ValueError(
            "max_memory and gpu_memory_limit are mutually exclusive and cannot be used together."
        )
    return data
```
The following command shows how to use Dataset Streaming mode to fine-tune XLS-R on Common Voice using 4 GPUs in half-precision.
Streaming mode imposes several constraints on training:

- A pre-built tokenizer has to be passed via `--tokenizer_name_or_path`.
- `--num_train_epochs` has to be replaced by `--max_steps`. Similarly, all other epoch-based arguments have to be replaced by step-based ones.
- The `--shuffle_buffer_size` argument controls how many examples we can pre-download before shuffling them.

```bash
torchrun \
    --nproc_per_node 4 run_speech_recognition_ctc_streaming.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
    --tokenizer_name_or_path="anton-l/wav2vec2-tokenizer-turkish" \
    --dataset_config_name="tr" \
    --train_split_name="train+validation" \
    --eval_split_name="test" \
    --output_dir="wav2vec2-xls-r-common_voice-tr-ft" \
    --overwrite_output_dir \
    --max_steps="5000" \
    --per_device_train_batch_size="8" \
    --gradient_accumulation_steps="2" \
    --learning_rate="5e-4" \
    --warmup_steps="500" \
    --eval_strategy="steps" \
    --text_column_name="sentence" \
    --save_steps="500" \
    --eval_steps="500" \
    --logging_steps="1" \
    --layerdrop="0.0" \
    --eval_metrics wer cer \
    --save_total_limit="1" \
    --mask_time_prob="0.3" \
    --mask_time_length="10" \
    --mask_feature_prob="0.1" \
    --mask_feature_length="64" \
    --freeze_feature_encoder \
    --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
    --max_duration_in_seconds="20" \
    --shuffle_buffer_size="500" \
    --fp16 \
    --push_to_hub \
    --do_train --do_eval \
    --gradient_checkpointing
```
On 4 V100 GPUs, this script should run in ca. 3h 31min and yield a CTC loss of 0.35 and word error rate of 0.29.
```python
import torch


def gpu_memory_usage(device=0):
    # Currently allocated CUDA memory on `device`, in GiB
    return torch.cuda.memory_allocated(device) / 1024.0**3
```
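For instance, a small illustrative loop (not part of the original snippet) that prints the allocated memory for every visible GPU:

```python
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {gpu_memory_usage(i):.2f} GiB allocated")
```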
```python
class MixedInt8TestMultiGpu(BaseMixedInt8Test):
    def setUp(self):
        super().setUp()

    def test_multi_gpu_loading(self):
        r"""
        This tests that the model has been loaded and can be used correctly on a multi-GPU setup.
        Let's just try to load a model on 2 GPUs and see if it works. The model we test has ~2GB
        of total, 3GB should suffice
        """
        model_parallel = AutoModelForCausalLM.from_pretrained(
            self.model_name, load_in_8bit=True, device_map="balanced"
        )

        # Check correct device map
        self.assertEqual(set(model_parallel.hf_device_map.values()), {0, 1})

        # Check that inference pass works on the model
        encoded_input = self.tokenizer(self.input_text, return_tensors="pt")

        # Second real batch
        output_parallel = model_parallel.generate(
            input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10
        )
        self.assertIn(
            self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS
        )
```
When training on multiple GPUs, you can specify the number of GPUs to use and in what order. This can be useful for instance when you have GPUs with different computing power and want to use the faster GPU first. The selection process works for both DistributedDataParallel and DataParallel to use only a subset of the available GPUs, and you don't need Accelerate or the DeepSpeed integration.
For example, if you have 4 GPUs and you only want to use the first 2:
<hfoptions id="select-gpu">
<hfoption id="torchrun">

Use the `--nproc_per_node` to select how many GPUs to use.

```bash
torchrun --nproc_per_node=2 trainer-program.py ...
```

</hfoption>
<hfoption id="Accelerate">

Use `--num_processes` to select how many GPUs to use.

```bash
accelerate launch --num_processes 2 trainer-program.py ...
```

</hfoption>
<hfoption id="DeepSpeed">

Use `--num_gpus` to select how many GPUs to use.

```bash
deepspeed --num_gpus 2 trainer-program.py ...
```

</hfoption>
</hfoptions>
Now, to select which GPUs to use and their order, you'll use the `CUDA_VISIBLE_DEVICES` environment variable. It is easiest to set the environment variable in `~/.bashrc` or another startup config file. `CUDA_VISIBLE_DEVICES` is used to map which GPUs are used. For example, if you have 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2:
```bash
CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
```
Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to `cuda:0` and `cuda:1` respectively. You can also reverse the order of the GPUs to use 2 first. Now, the mapping is `cuda:1` for GPU 0 and `cuda:0` for GPU 2.
```bash
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```
You can also set the `CUDA_VISIBLE_DEVICES` environment variable to an empty value to create an environment without GPUs.

```bash
CUDA_VISIBLE_DEVICES= python trainer-program.py ...
```

<Tip warning={true}>

As with any environment variable, they can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line.

</Tip>

`CUDA_DEVICE_ORDER` is an alternative environment variable you can use to control how the GPUs are ordered. You can either order them by:
- PCIe bus order, matching the order reported by `nvidia-smi` and `rocm-smi` for NVIDIA and AMD GPUs respectively:

```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```

- Compute speed, fastest GPU first:

```bash
export CUDA_DEVICE_ORDER=FASTEST_FIRST
```
`CUDA_DEVICE_ORDER` is especially useful if your training setup consists of an older and a newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set `CUDA_DEVICE_ORDER=FASTEST_FIRST` to always use the newer and faster GPU first (`nvidia-smi` or `rocm-smi` still reports the GPUs in their PCIe order). Or you could also set `export CUDA_VISIBLE_DEVICES=1,0`.
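As a quick sanity check, the sketch below (illustrative, not taken from the original docs) prints what PyTorch actually sees after `CUDA_VISIBLE_DEVICES` is applied:

```python
import torch

# With CUDA_VISIBLE_DEVICES=2,0 set in the environment, PyTorch re-indexes the
# visible devices: cuda:0 is physical GPU 2 and cuda:1 is physical GPU 0.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```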
When training larger models there are essentially three options:

Let's start with the case of a single GPU.

If you have bought an expensive high-end GPU, make sure you give it the correct power and sufficient cooling.

Power:

Some high-end consumer GPU cards have 2 and sometimes 3 PCI-E 8-pin power sockets. Make sure you have as many independent 12V PCI-E 8-pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as a pigtail cable). That is, if you have 2 sockets on the GPU, you want 2 PCI-E 8-pin cables running from the PSU to the card, and not one cable that has 2 PCI-E 8-pin connectors at the end! Otherwise you won't get the card's full performance.

Each PCI-E 8-pin power cable needs to be plugged into a 12V rail on the PSU side and can deliver up to 150W of power.

Some other cards may use PCI-E 12-pin connectors, and these can deliver up to 500-600W of power.

Low-end cards may use 6-pin connectors, which supply up to 75W of power.

Additionally you want a high-end power supply (PSU) with stable voltage. Some lower-quality PSUs may not give the card the stable voltage it needs to run at its peak.

And of course the PSU needs to have enough unused wattage to power the card.

Cooling:

When a GPU gets overheated it will start throttling and will not deliver full performance, and it may even shut down if it gets too hot.

It's hard to tell the exact best temperature to strive for when a GPU is heavily loaded, but probably anything under +80°C is good, and lower is better - perhaps 70-75°C is an excellent range to be in. Throttling is likely to start at around 84-90°C. Besides throttling performance, a prolonged very high temperature is also likely to reduce the lifespan of a GPU.

Next, let's have a look at one of the most important aspects of having multiple GPUs: connectivity.

If you use multiple GPUs, the way the cards are interconnected can have a huge impact on the total training time. If the GPUs are on the same physical node, you can run:
```bash
nvidia-smi topo -m
```

and it will tell you how the GPUs are interconnected. On a machine with dual GPUs connected with NVLink, you will most likely see something like:

```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            N/A
GPU1    NV2      X      0-23            N/A
```

on a different machine without NVLink we may see:

```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-11            N/A
GPU1    PHB      X      0-11            N/A
```
The report includes this legend:

```
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
So the first report, `NV2`, tells us the GPUs are interconnected with 2 NVLinks, while in the second report, `PHB`, we have a typical consumer-level PCIe+Bridge setup.

Check what type of connectivity you have on your setup. Some of these will make the communication between cards faster (e.g. NVLink), others slower (e.g. PHB).

Depending on the type of scalability solution used, the connectivity speed could have a major or a minor impact. If the GPUs need to sync rarely, as in DDP, the impact of a slower connection will be less significant. If the GPUs need to send messages to each other often, as in ZeRO-DP, then faster connectivity becomes extremely important to achieve faster training.

NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia.

Each new generation provides faster bandwidth; for example, here is a quote from Nvidia Ampere GA102 GPU Architecture:

> Third-Generation NVLink® GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)

So the higher the `X` you get in the `NVX` report in the output of `nvidia-smi topo -m`, the better. The generation will depend on your GPU architecture.

Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext.

The results are:

| NVlink | Time |
| ------ | ---: |
| Y      | 101s |
| N      | 131s |

You can see that NVLink completes the training ~23% faster. In the second benchmark we use `NCCL_P2P_DISABLE=1` to tell the GPUs not to use NVLink.

Here is the full benchmark code and outputs:
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```
Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`)

Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`
```python
class AxolotlConfigWCapabilities(AxolotlInputConfig):
    """wrapper to validate gpu capabilities with the configured options"""

    capabilities: GPUCapabilities

    @model_validator(mode="after")
    def check_bf16(self):
        if self.capabilities.bf16:
            if not self.bf16 and not self.bfloat16:
                LOG.info(
                    "bf16 support detected, but not enabled for this configuration."
                )
        else:
            if (
                not self.merge_lora
                and not self.is_preprocess
                and (self.bf16 is True or self.bfloat16 is True)
            ):
                raise ValueError(
                    "bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above."
                )
        return self

    @model_validator(mode="before")
    @classmethod
    def check_sample_packing_w_sdpa_bf16(cls, data):
        is_sm_90: bool = (
            data["capabilities"]
            and data["capabilities"].get("compute_capability") == "sm_90"
        )
        if (
            data.get("sample_packing")
            and data.get("sdp_attention")
            and (data.get("bfloat16") or data.get("bf16"))
            and not is_sm_90
        ):
            # https://github.com/pytorch/pytorch/blob/1b03423526536b5f3d35bdfa95ccc6197556cf9b/test/test_transformers.py#L2440-L2450
            LOG.warning(
                "sample_packing & torch sdpa with bf16 is unsupported may results in 0.0 loss. "
                "This may work on H100s."
            )
        return data

    @model_validator(mode="before")
    @classmethod
    def check_fsdp_deepspeed(cls, data):
        if data.get("deepspeed") and data.get("fsdp"):
            raise ValueError("deepspeed and fsdp cannot be used together.")
        return data
```
```python
options_to_group = {
    "multi_gpu": "Distributed GPUs",
    "tpu": "TPU",
    "use_deepspeed": "DeepSpeed Arguments",
    "use_fsdp": "FSDP Arguments",
    "use_megatron_lm": "Megatron-LM Arguments",
}
```
In this section, we will look at how to use QLoRA and DeepSpeed Stage-3 for finetuning a 70B Llama model on 2x40GB GPUs.
For this, we first need `bitsandbytes>=0.43.0`, `accelerate>=0.28.0`, `transformers>4.38.2`, `trl>0.7.11` and `peft>0.9.0`. We need to set `zero3_init_flag` to true when using Accelerate config. Below is the config which can be found at deepspeed_config_z3_qlora.yaml:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Launch command is given below which is available at run_peft_qlora_deepspeed_stage3.sh:
```bash
accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml" train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-qlora-dsz3" \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16" \
--bnb_4bit_quant_storage_dtype "bfloat16"
```
Notice the new argument being passed, `bnb_4bit_quant_storage_dtype`, which denotes the data type for packing the 4-bit parameters. For example, when it is set to `bfloat16`, 32/4 = 8 4-bit params are packed together post quantization.
In terms of training code, the important code changes are:
```diff
...

bnb_config = BitsAndBytesConfig(
    load_in_4bit=args.use_4bit_quantization,
    bnb_4bit_quant_type=args.bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=args.use_nested_quant,
+   bnb_4bit_quant_storage=quant_storage_dtype,
)

...

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
+   torch_dtype=quant_storage_dtype or torch.float32,
)
```
Notice that `torch_dtype` for `AutoModelForCausalLM` is the same as the `bnb_4bit_quant_storage` data type. That's it. Everything else is handled by Trainer and TRL.
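Putting the two changes together, a minimal sketch of the relevant model-loading code might look like the following (the model name, quant type, and dtype choices are assumptions mirroring the launch command above, not a complete training script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: mirrors --bnb_4bit_quant_storage_dtype "bfloat16" from the launch command.
quant_storage_dtype = torch.bfloat16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumed quant type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=quant_storage_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=quant_storage_dtype,       # must match bnb_4bit_quant_storage
)
```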
In the above example, the memory consumed per GPU is 36.6 GB. Therefore, what previously required 8x80GB GPUs with DeepSpeed Stage 3 + LoRA, or a couple of 80GB GPUs with DDP + QLoRA, now fits on 2x40GB GPUs. This makes finetuning of large models more accessible.
```bash
#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH -D .
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --nodes=4                   # number of nodes
#SBATCH --ntasks-per-node=1         # number of MP tasks
#SBATCH --gres=gpu:4                # number of GPUs per node
#SBATCH --cpus-per-task=160         # number of cores per task
#SBATCH --time=01:59:00             # maximum execution time (HH:MM:SS)

######################
### Set environment ##
######################
source activateEnvironment.sh
export GPUS_PER_NODE=4
######################

######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
######################

export LAUNCHER="accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port 29500 \
    "
export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"
export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
export SCRIPT_ARGS=" \
    --mixed_precision fp16 \
    --output_dir ${ACCELERATE_DIR}/examples/output \
    "

# This step is necessary because accelerate launch does not handle multiline arguments properly
export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
srun $CMD
```
In this example, we'll see how to use PEFT to perform SFT on various distributed setups.

QLoRA uses 4-bit quantization to drastically reduce the GPU memory consumed by the base model, while using LoRA for parameter-efficient fine-tuning. The command to use QLoRA is present at `run_peft.sh`.
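As a rough sketch of what that recipe looks like in code (the model name and hyperparameters below are placeholders, loosely following the launch command earlier in this document):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base model (where the memory saving comes from) ...
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model
    quantization_config=bnb_config,
)

# ... plus LoRA adapters (the small set of trainable parameters).
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```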
Note: `use_reentrant` needs to be `True` when using gradient checkpointing with QLoRA, otherwise QLoRA leads to high GPU memory consumption (a configuration sketch for this flag follows the multi-GPU note below).

Unsloth enables finetuning Mistral/Llama 2-5x faster with 70% less memory. It achieves this by reducing data upcasting, using Flash Attention 2, custom Triton kernels for RoPE embeddings, RMS LayerNorm and cross-entropy loss, and clever manual autograd computation that reduces the FLOPs during QLoRA finetuning. Below is the list of optimizations from the Unsloth blogpost mistral-benchmark. The command to use QLoRA with Unsloth is present at `run_unsloth_peft.sh`.
![Optimization in Unsloth to speed up QLoRA finetuning while reducing GPU memory usage](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/Unsloth.png)

To speed up QLoRA finetuning when you have access to multiple GPUs, look at the launch command at `run_peft_multigpu.sh`. This example performs DDP on 8 GPUs.
Note: `use_reentrant` needs to be `False` when using gradient checkpointing with multi-GPU QLoRA, otherwise it will lead to errors. However, this leads to huge GPU memory consumption.

When you have access to multiple GPUs, it is better to use normal LoRA with DeepSpeed/FSDP. To use LoRA with DeepSpeed, refer to the docs at PEFT with DeepSpeed.
Similarly, to use LoRA with FSDP, refer to the docs at PEFT with FSDP.
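For reference, here is a hedged sketch of how the `use_reentrant` flag from the two notes above is typically passed via `transformers` (argument names assume a reasonably recent release; the referenced scripts may wire this up differently):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    bf16=True,
    gradient_checkpointing=True,
    # True for single-GPU QLoRA, False for multi-GPU (DDP) QLoRA, per the notes above.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```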
```bash
#!/bin/bash -l
#SBATCH --job-name=multicpu
#SBATCH --nodes=2                   # number of nodes
#SBATCH --ntasks-per-node=1         # number of MP tasks
#SBATCH --exclusive
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j

######################
### Set environment ##
######################
source activateEnvironment.sh
######################

######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
######################

# Setup env variables for distributed jobs
export MASTER_PORT="${MASTER_PORT:-29555}"
echo "head_node_ip=${head_node_ip}"
echo "MASTER_PORT=${MASTER_PORT}"

INSTANCES_PER_NODE="${INSTANCES_PER_NODE:-1}"

if [[ $SLURM_NNODES == 1 ]] && [[ $INSTANCES_PER_NODE == 1 ]]; then
    export CCL_WORKER_COUNT=0
    LAUNCHER=""
else
    # Setup env variables for distributed jobs
    export CCL_WORKER_COUNT="${CCL_WORKER_COUNT:-2}"
    echo "CCL_WORKER_COUNT=${CCL_WORKER_COUNT}"

    # Write hostfile (one host per line)
    HOSTFILE_PATH=hostfile
    scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'print "$_" x 1' > ${HOSTFILE_PATH}

    export LAUNCHER="accelerate launch \
        --num_processes $((SLURM_NNODES * ${INSTANCES_PER_NODE})) \
        --num_machines $SLURM_NNODES \
        --rdzv_backend c10d \
        --main_process_ip $head_node_ip \
        --main_process_port $MASTER_PORT \
        --mpirun_hostfile $HOSTFILE_PATH \
        --mpirun_ccl $CCL_WORKER_COUNT"
fi

export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"
export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
export SCRIPT_ARGS=" \
    --cpu \
    --output_dir ${ACCELERATE_DIR}/examples/output \
    "

# This step is necessary because accelerate launch does not handle multiline arguments properly
export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
# Print the command
echo $CMD
echo ""
# Run the command
eval $CMD
```
🤗 Accelerate has integrated IPEX; all you need to do is enable it through the config.
Scenario 1: Acceleration of non-distributed CPU training
Run `accelerate config` on your machine:
```
$ accelerate config
------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:yes
Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
```
This will generate a config file that will be used automatically to properly set the default options when doing `accelerate launch my_script.py --args_to_my_script`. For instance, here is how you would run the NLP example `examples/nlp_example.py` (from the root of the repo) with IPEX enabled.
The `default_config.yaml` that is generated after `accelerate config`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
ipex_config:
  ipex: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
```
```bash
accelerate launch examples/nlp_example.py
```
Scenario 2: Acceleration of distributed CPU training

For distributed CPU training we use Intel oneCCL for communication, combined with the Intel® MPI library to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. You can refer to the installation guide here.
Run `accelerate config` on your machine (node0):
```
$ accelerate config
------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------
Which type of machine are you using?
multi-CPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
------------------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 36.112.23.24
What is the port you will use to communicate with the main process? 29500
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:yes
Do you want accelerate to launch mpirun? [yes/NO]: yes
Please enter the path to the hostfile to use with mpirun [~/hostfile]: ~/hostfile
Enter the number of oneCCL worker threads [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
How many processes should be used for distributed training? [1]:16
------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
```
For instance, here is how you would run the NLP example `examples/nlp_example.py` (from the root of the repo) with IPEX enabled for distributed CPU training.
The `default_config.yaml` that is generated after `accelerate config`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
downcast_bf16: 'no'
ipex_config:
  ipex: true
machine_rank: 0
main_process_ip: 36.112.23.24
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /home/user/hostfile
num_machines: 4
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
```
Set the following environment variables and use Intel MPI to launch the training.
On node0, you need to create a configuration file that contains the IP addresses of each node (for example, `hostfile`) and pass that configuration file path as an argument. If you selected to have Accelerate launch `mpirun`, ensure that the location of your hostfile matches the path in the config.
```bash
$ cat hostfile
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
xxx.xxx.xxx.xxx #node2 ip
xxx.xxx.xxx.xxx #node3 ip
```
When Accelerate is launching `mpirun`, source the oneCCL bindings `setvars.sh` to get your Intel MPI environment, and then run your script using `accelerate launch`. Note that the Python script and environment need to exist on all of the machines being used for multi-CPU training.
```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
accelerate launch examples/nlp_example.py
```
Otherwise, if you selected not to have Accelerate launch `mpirun`, run the following command on node0, and 16-process DDP will be enabled across node0, node1, node2 and node3 with BF16 mixed precision. When using this method, the Python script, Python environment, and accelerate config file need to be present on all of the machines used for multi-CPU training.
```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx  # node0 ip
export CCL_ATL_TRANSPORT=ofi
mpirun -f hostfile -n 16 -ppn 4 accelerate launch examples/nlp_example.py
```