Phorm.ai
OpenAccess-AI-Collective/axolotl

    how to use multi gpu with axolotl 4x gpu

    To use Axolotl with multiple GPUs, specifically with 4 GPUs, you can follow the instructions provided in the Axolotl documentation for multi-node and multi-GPU training. The key is to configure the accelerate library correctly for your environment and ensure that your model and training script are compatible with distributed training. Here's a step-by-step guide based on the information from the Axolotl documentation and the accelerate library:

    1. Install Axolotl and Dependencies: Ensure you have Axolotl and all necessary dependencies installed. This includes accelerate, torch, and any other libraries your training script requires.

    2. Configure Accelerate: Use the accelerate config command to create a configuration for your multi-GPU setup. This command will prompt you to answer several questions regarding your environment.

      accelerate config

      When prompted, specify that you are using a multi-GPU setup. If you're running on a single machine with 4 GPUs, you can specify 1 for the number of machines and 4 for the number of GPUs per machine. A sketch of the resulting configuration file is shown after this list.

    3. Prepare Your Training Script: Ensure your training script is set up for distributed training. Axolotl and accelerate handle a lot of the boilerplate code required for PyTorch distributed training, but you should ensure your model, data loading, and training loop are compatible with distributed training.

    4. Launch Training: Use the accelerate launch command to start your training script with the configuration you set up. Specify the number of processes equal to the number of GPUs you wish to use. For a 4 GPU setup, you would use:

      accelerate launch --num_processes 4 your_training_script.py

      Replace your_training_script.py with the path to your actual training script.

    5. (Optional) Advanced Configuration: For more advanced setups, such as using specific GPUs, you can manually edit the configuration file created by accelerate config or set environment variables like CUDA_VISIBLE_DEVICES to control which GPUs are used.

    6. Monitoring and Debugging: Keep an eye on the training process. You can use tools like nvidia-smi to monitor GPU utilization and ensure that the workload is evenly distributed across all GPUs.
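
    For reference, here is a minimal sketch of what the generated accelerate configuration might look like for a single machine with 4 GPUs using plain data-parallel (multi-GPU) training. The exact fields depend on your answers to accelerate config, so treat these values as illustrative rather than canonical:

      # ~/.cache/huggingface/accelerate/default_config.yaml (illustrative)
      compute_environment: LOCAL_MACHINE
      distributed_type: MULTI_GPU
      machine_rank: 0
      main_training_function: main
      mixed_precision: bf16
      num_machines: 1
      num_processes: 4  # one process per GPU
      use_cpu: false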

    Here's a simplified example of how your training script might look:

    import torch
    from accelerate import Accelerator

    def training_loop():
        # Your model, optimizer, and dataloaders setup here
        model = ...
        optimizer = ...
        train_dataloader = ...
        num_epochs = ...  # number of training epochs

        accelerator = Accelerator()
        model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

        for epoch in range(num_epochs):
            for batch in train_dataloader:
                # Your training logic here
                optimizer.zero_grad()
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                optimizer.step()

    if __name__ == "__main__":
        training_loop()

    This script uses accelerate to prepare your model, optimizer, and dataloaders for distributed training across multiple GPUs.
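
    For Axolotl specifically, you usually do not write the training script yourself; you launch Axolotl's CLI module with your config file. A minimal sketch for a 4-GPU run (assuming your_config.yml is a valid Axolotl config) would be:

      accelerate launch --num_processes 4 -m axolotl.cli.train your_config.yml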

    Search results
    [openaccess-ai-collective/axolotl] docs/multi-node.qmd
    ---
    title: Multi Node
    description: How to use Axolotl on multiple machines
    ---
    
    You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using one of the presets below:
    
    ~/.cache/huggingface/accelerate/default_config.yaml
    ```yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: FSDP
    downcast_bf16: 'no'
    machine_rank: 0 # Set to 0 for the main machine, increment by one for other machines
    main_process_ip: 10.0.0.4 # Set to main machine's IP
    main_process_port: 5000
    main_training_function: main
    mixed_precision: bf16
    num_machines: 2 # Change to the number of machines
    num_processes: 4 # That's the total number of GPUs, (for example: if you have 2 machines with 4 GPU, put 8)
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    ```

    Configure your model to use FSDP, for example:

    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

    Machine configuration

    On each machine you need a copy of Axolotl; we suggest using the same commit to ensure compatibility.

    You will also need to have the same configuration file for your model on each machine.

    On the main machine only, make sure the port you set as main_process_port is open in TCP and reachable by other machines.

    All you have to do now is launch with accelerate on each machine as you usually would; the processes will start once accelerate has been launched on every machine.
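
    For example, with the preset above (2 machines, 4 GPUs each, 8 processes total), a sketch of the launch sequence could look like the following, assuming each machine's accelerate config sets its own machine_rank and your_config.yml is your Axolotl model config:

      # on the main machine (machine_rank: 0)
      accelerate launch -m axolotl.cli.train your_config.yml

      # on the second machine (machine_rank: 1 in its accelerate config)
      accelerate launch -m axolotl.cli.train your_config.yml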

    [openaccess-ai-collective/axolotl] README.md

    Axolotl

    Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.

    Features:

    • Train various Huggingface models such as llama, pythia, falcon, mpt
    • Supports fullfinetune, lora, qlora, relora, and gptq
    • Customize configurations using a simple yaml file or CLI overwrite
    • Load different dataset formats, use custom formats, or bring your own tokenized datasets
    • Integrated with xformer, flash attention, rope scaling, and multipacking
    • Works with single GPU or multiple GPUs via FSDP or Deepspeed
    • Easily run with Docker locally or on the cloud
    • Log results and optionally checkpoints to wandb or mlflow
    • And more!

    Axolotl supports

    |             | fp16/fp32 | lora | qlora | gptq | gptq w/flash attn | flash attn | xformers attn |
    |-------------|:----------|:-----|-------|------|-------------------|------------|---------------|
    | llama       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Mistral     | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Mixtral-MoE | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | Mixtral8X22 | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | Pythia      | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | cerebras    | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | btlm        | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | mpt         | ✅ | ❌ | ❓ | ❌ | ❌ | ❌ | ❓ |
    | falcon      | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | gpt-j       | ✅ | ✅ | ✅ | ❌ | ❌ | ❓ | ❓ |
    | XGen        | ✅ | ❓ | ✅ | ❓ | ❓ | ❓ | ✅ |
    | phi         | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | RWKV        | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
    | Qwen        | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | Gemma       | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |

    ✅: supported ❌: not supported ❓: untested

    Quickstart ⚡

    Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.

    Requirements: Python >=3.10 and Pytorch >=2.1.1.

    git clone https://github.com/OpenAccess-AI-Collective/axolotl
    cd axolotl

    pip3 install packaging ninja
    pip3 install -e '.[flash-attn,deepspeed]'

    Usage

    # preprocess datasets - optional but recommended
    CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

    # finetune lora
    accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

    # inference
    accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
        --lora_model_dir="./outputs/lora-out"

    # gradio
    accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
        --lora_model_dir="./outputs/lora-out" --gradio

    # remote yaml files - the yaml config can be hosted on a public URL
    # Note: the yaml config must directly link to the **raw** yaml
    accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml

    Advanced Setup

    Environment

    Docker

    docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest

    Or run on the current files for development:

    docker compose up -d

    [!Tip] If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.

    <details> <summary>Docker advanced</summary>

    A more powerful Docker command to run would be this:

    docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

    It additionally:

    • Prevents memory issues when running e.g. deepspeed (e.g. you could hit SIGBUS/signal 7 error) through --ipc and --ulimit args.
    • Persists the downloaded HF data (models etc.) and your modifications to axolotl code through --mount/-v args.
    • The --name argument simply makes it easier to refer to the container in vscode (Dev Containers: Attach to Running Container...) or in your terminal.
    • The --privileged flag gives all capabilities to the container.
    • The --shm-size 10g argument increases the shared memory size. Use this if you see exitcode: -7 errors using deepspeed.

    More information on nvidia website

    </details>

    Conda/Pip venv

    1. Install python >=3.10

    2. Install pytorch stable https://pytorch.org/get-started/locally/

    3. Install Axolotl along with python dependencies

      pip3 install packaging
      pip3 install -e '.[flash-attn,deepspeed]'
    4. (Optional) Login to Huggingface to use gated models/datasets.

      huggingface-cli login

      Get the token at huggingface.co/settings/tokens

    Cloud GPU

    For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest

    Bare Metal Cloud GPU

    LambdaLabs
    <details> <summary>Click to Expand</summary>
    1. Install python

       sudo apt update
       sudo apt install -y python3.10
       sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
       sudo update-alternatives --config python # pick 3.10 if given option
       python -V # should be 3.10

    2. Install pip

       wget https://bootstrap.pypa.io/get-pip.py
       python get-pip.py

    3. Install Pytorch https://pytorch.org/get-started/locally/

    4. Follow instructions on quickstart.

    5. Run

       pip3 install protobuf==3.20.3
       pip3 install -U --ignore-installed requests Pillow psutil scipy

    6. Set path

       export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
    </details>
    GCP
    <details> <summary>Click to Expand</summary>

    Use a Deeplearning linux OS with cuda and pytorch installed. Then follow instructions on quickstart.

    Make sure to run the below to uninstall xla.

    pip uninstall -y torch_xla[tpu]
    </details>

    Windows

    Please use WSL or Docker!

    Mac

    Use the below instead of the install method in QuickStart.

    pip3 install -e '.'
    

    More info: mac.md

    Google Colab

    Please use this example notebook.

    Launching on public clouds via SkyPilot

    To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:

    pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds sky check

    Get the example YAMLs of using Axolotl to finetune mistralai/Mistral-7B-v0.1:

    git clone https://github.com/skypilot-org/skypilot.git
    cd skypilot/llm/axolotl
    

    Use one command to launch:

    # On-demand
    HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

    # Managed spot (auto-recovery on preemption)
    HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET

    Launching on public clouds via dstack

    To launch on GPU instance (both on-demand and spot instances) on public clouds (GCP, AWS, Azure, Lambda Labs, TensorDock, Vast.ai, and CUDO), you can use dstack.

    Write a job description in YAML as below:

    # dstack.yaml
    type: task

    image: winglian/axolotl-cloud:main-20240429-py3.11-cu121-2.2.2

    env:
      - HUGGING_FACE_HUB_TOKEN
      - WANDB_API_KEY

    commands:
      - accelerate launch -m axolotl.cli.train config.yaml

    ports:
      - 6006

    resources:
      gpu:
        memory: 24GB..
        count: 2

    then, simply run the job with the dstack run command. Append the --spot option if you want a spot instance. The dstack run command will show you the instance with the cheapest price across multiple cloud services:

    pip install dstack
    HUGGING_FACE_HUB_TOKEN=xxx WANDB_API_KEY=xxx dstack run . -f dstack.yaml # --spot

    For further and fine-grained use cases, please refer to the official dstack documents and the detailed description of axolotl example on the official repository.

    Dataset

    Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.

    See these docs for more information on how to use different dataset formats.
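
    As an illustration, a JSONL file for the alpaca prompt format used in the examples below contains one JSON object per line; the exact fields depend on the `type:` you configure:

      {"instruction": "Summarize the text.", "input": "Axolotl streamlines fine-tuning.", "output": "Axolotl makes fine-tuning easier."}
      {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}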

    Config

    See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:

    • model

      base_model: ./llama-7b-hf # local or huggingface repo

      Note: The code will load the right architecture.

    • dataset

      datasets:
        # huggingface repo
        - path: vicgalle/alpaca-gpt4
          type: alpaca

        # huggingface repo with specific configuration/subset
        - path: EleutherAI/pile
          name: enron_emails
          type: completion # format from earlier
          field: text # Optional[str] default: text, field to use for completion data

        # huggingface repo with multiple named configurations/subsets
        - path: bigcode/commitpackft
          name:
            - ruby
            - python
            - typescript
          type: ... # unimplemented custom format

        # fastchat conversation
        # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
        - path: ...
          type: sharegpt
          conversation: chatml # default: vicuna_v1.1

        # local
        - path: data.jsonl # or json
          ds_type: json # see other options below
          type: alpaca

        # dataset with splits, but no train split
        - path: knowrohit07/know_sql
          type: context_qa.load_v2
          train_on_split: validation

        # loading from s3 or gcs
        # s3 creds will be loaded from the system default and gcs only supports public access
        - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
          ...

        # Loading Data From a Public URL
        # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
        - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
          ds_type: json # this is the default, see other options below.
    • loading

      load_in_4bit: true
      load_in_8bit: true

      bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
      fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
      tf32: true # require >=ampere

      bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
      float16: true # use instead of fp16 when you don't want AMP

      Note: Repo does not do 4-bit quantization.

    • lora

      adapter: lora # 'qlora' or leave blank for full finetune
      lora_r: 8
      lora_alpha: 16
      lora_dropout: 0.05
      lora_target_modules:
        - q_proj
        - v_proj

    All Config Options

    See these docs for all config options.

    Train

    Run

    accelerate launch -m axolotl.cli.train your_config.yml

    [!TIP] You can also reference a config file that is hosted on a public URL, for example accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml

    Preprocess dataset

    You can optionally pre-tokenize your dataset with the following before finetuning. This is recommended for large datasets.

    • Set dataset_prepared_path: to a local folder for saving and loading pre-tokenized dataset.
    • (Optional): Set push_dataset_to_hub: hf_user/repo to push it to Huggingface.
    • (Optional): Use --debug to see preprocessed examples.
    python -m axolotl.cli.preprocess your_config.yml

    Multi-GPU

    Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.

    DeepSpeed

    Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated

    We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.

    deepspeed: deepspeed_configs/zero1.json
    accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
    FSDP
    • llama FSDP
    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
    FSDP + QLoRA

    Axolotl supports training with FSDP and QLoRA, see these docs for more information.

    Weights & Biases Logging

    Make sure your WANDB_API_KEY environment variable is set (recommended) or you login to wandb with wandb login.

    • wandb options
    wandb_mode:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    Special Tokens

    It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:

    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    tokens: # these are delimiters
      - "<|im_start|>"
      - "<|im_end|>"

    When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.

    Inference Playground

    Axolotl allows you to load your model in an interactive terminal playground for quick experimentation. The config file is the same config file used for training.

    Pass the appropriate flag to the inference command, depending upon what kind of model was trained:

    • Pretrained LORA:
      python -m axolotl.cli.inference examples/your_config.yml --lora_model_dir="./lora-output-dir"
    • Full weights finetune:
      python -m axolotl.cli.inference examples/your_config.yml --base_model="./completed-model"
    • Full weights finetune w/ a prompt from a text file:
      cat /tmp/prompt.txt | python -m axolotl.cli.inference examples/your_config.yml \
          --base_model="./completed-model" --prompter=None --load_in_8bit=True

    • With gradio hosting:

    python -m axolotl.cli.inference examples/your_config.yml --gradio

    Please use --sample_packing False if you have it enabled and receive an error similar to the one below:

    RuntimeError: stack expects each tensor to be equal size, but got [1, 32, 1, 128] at entry 0 and [1, 32, 8, 128] at entry 1
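
    For example, if your training config enabled sample packing, an inference invocation that turns it off might look like this (the flag mirrors the sample_packing config key; treat the exact invocation as illustrative):

      python -m axolotl.cli.inference examples/your_config.yml --gradio --sample_packing False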

    Merge LORA to base

    The following command will merge your LORA adapter with your base model. You can optionally pass the argument --lora_model_dir to specify the directory where your LORA adapter was saved; otherwise, this will be inferred from output_dir in your axolotl config file. The merged model is saved in the sub-directory {lora_model_dir}/merged.

    python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"

    You may need to use the gpu_memory_limit and/or lora_on_cpu config options to avoid running out of memory. If you still run out of CUDA memory, you can try to merge in system RAM with

    CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora ...

    although this will be very slow; using the config options above is recommended instead.
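
    As a sketch, the two config options mentioned above could be added to your axolotl config before merging; the values here are illustrative:

      gpu_memory_limit: 20GiB  # cap how much GPU memory is used when loading the model
      lora_on_cpu: true        # keep the LoRA weights on CPU during the merge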

    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/internals/__init__.py
    """module for gpu capabilities"""

    [huggingface/accelerate] examples/slurm/submit_multigpu.sh
    #!/bin/bash
    #SBATCH --job-name=multigpu
    #SBATCH -D .
    #SBATCH --output=O-%x.%j
    #SBATCH --error=E-%x.%j
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1   # number of MP tasks
    #SBATCH --gres=gpu:4          # number of GPUs per node
    #SBATCH --cpus-per-task=160   # number of cores per tasks
    #SBATCH --time=01:59:00       # maximum execution time (HH:MM:SS)

    ######################
    ### Set enviroment ###
    ######################
    source activateEnvironment.sh
    export GPUS_PER_NODE=4
    ######################

    export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"
    export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
    export SCRIPT_ARGS=" \
        --mixed_precision fp16 \
        --output_dir ${ACCELERATE_DIR}/examples/output \
        --with_tracking \
        "

    accelerate launch --num_processes $GPUS_PER_NODE $SCRIPT $SCRIPT_ARGS
    [huggingface/transformers] tests/quantization/bnb/test_4bit.py
    class Bnb4bitTestMultiGpu(Base4bitTest):
        def setUp(self):
            super().setUp()

        def test_multi_gpu_loading(self):
            r"""
            This tests that the model has been loaded and can be used correctly on a multi-GPU setup.
            Let's just try to load a model on 2 GPUs and see if it works. The model we test has ~2GB
            of total, 3GB should suffice
            """
            model_parallel = AutoModelForCausalLM.from_pretrained(
                self.model_name, load_in_4bit=True, device_map="balanced"
            )

            # Check correct device map
            self.assertEqual(set(model_parallel.hf_device_map.values()), {0, 1})

            # Check that inference pass works on the model
            encoded_input = self.tokenizer(self.input_text, return_tensors="pt")

            # Second real batch
            output_parallel = model_parallel.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
            self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
    [huggingface/peft] examples/sft/run_peft_multigpu.sh
    torchrun --nproc_per_node 8 --nnodes 1 train.py \
      --seed 100 \
      --model_name_or_path "mistralai/Mistral-7B-v0.1" \
      --dataset_name "smangrul/ultrachat-10k-chatml" \
      --chat_template_format "chatml" \
      --add_special_tokens False \
      --append_concat_token False \
      --splits "train,test" \
      --max_seq_len 2048 \
      --num_train_epochs 1 \
      --logging_steps 5 \
      --log_level "info" \
      --logging_strategy "steps" \
      --evaluation_strategy "epoch" \
      --save_strategy "epoch" \
      --push_to_hub \
      --hub_private_repo True \
      --hub_strategy "every_save" \
      --bf16 True \
      --packing True \
      --learning_rate 1e-4 \
      --lr_scheduler_type "cosine" \
      --weight_decay 1e-4 \
      --warmup_ratio 0.0 \
      --max_grad_norm 1.0 \
      --output_dir "mistral-sft-lora-multigpu" \
      --per_device_train_batch_size 8 \
      --per_device_eval_batch_size 8 \
      --gradient_accumulation_steps 8 \
      --gradient_checkpointing True \
      --use_reentrant False \
      --dataset_text_field "content" \
      --use_peft_lora True \
      --lora_r 8 \
      --lora_alpha 16 \
      --lora_dropout 0.1 \
      --lora_target_modules "all-linear" \
      --use_4bit_quantization True \
      --use_nested_quant True \
      --bnb_4bit_compute_dtype "bfloat16" \
      --use_flash_attn True
    [huggingface/accelerate] src/accelerate/commands/launch.py
    def multi_gpu_launcher(args):
        import torch.distributed.run as distrib_run

        current_env = prepare_multi_gpu_env(args)
        if not check_cuda_p2p_ib_support():
            message = "Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled."
            warn = False
            if "NCCL_P2P_DISABLE" not in current_env:
                current_env["NCCL_P2P_DISABLE"] = "1"
                warn = True
            if "NCCL_IB_DISABLE" not in current_env:
                current_env["NCCL_IB_DISABLE"] = "1"
                warn = True
            if warn:
                logger.warning(message)

        debug = getattr(args, "debug", False)
        args = _filter_args(
            args,
            distrib_run.get_args_parser(),
            ["--training_script", args.training_script, "--training_script_args", args.training_script_args],
        )

        with patch_environment(**current_env):
            try:
                distrib_run.run(args)
            except Exception:
                if is_rich_available() and debug:
                    console = get_console()
                    console.print("\n[bold red]Using --debug, `torch.distributed` Stack Trace:[/bold red]")
                    console.print_exception(suppress=[__file__], show_locals=False)
                else:
                    raise
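
    As the launcher above shows, accelerate disables P2P and InfiniBand communication automatically on RTX 4000-series cards. If you hit NCCL communication issues on other consumer GPUs, a commonly used workaround (illustrative, not Axolotl-specific) is to set the same environment variables yourself for a single run:

      NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch --num_processes 4 -m axolotl.cli.train your_config.yml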
    [openaccess-ai-collective/axolotl] examples/colab-notebooks/colab-axolotl-example.ipynb
    # Example notebook for running Axolotl on google colab
    

    import torch

    Check that a GPU is available; a T4 (free tier) is enough to run this notebook

    assert (torch.cuda.is_available()==True)

    ## Install Axolotl and dependencies
    

    !pip install torch=="2.1.2"
    !pip install -e git+https://github.com/OpenAccess-AI-Collective/axolotl#egg=axolotl
    !pip install flash-attn=="2.5.0"
    !pip install deepspeed=="0.13.1"
    !pip install mlflow=="2.13.0"

    ## Create a yaml config file
    

    import yaml

    Your YAML string

    yaml_string = """
    base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
    model_type: LlamaForCausalLM
    tokenizer_type: LlamaTokenizer

    load_in_8bit: false
    load_in_4bit: true
    strict: false

    datasets:
      - path: mhenrichsen/alpaca_2k_test
        type: alpaca
    dataset_prepared_path:
    val_set_size: 0.05
    output_dir: ./outputs/qlora-out

    adapter: qlora
    lora_model_dir:

    sequence_len: 4096
    sample_packing: true
    eval_sample_packing: false
    pad_to_sequence_len: true

    lora_r: 32
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_modules:
    lora_target_linear: true
    lora_fan_in_fan_out:

    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:

    gradient_accumulation_steps: 4
    micro_batch_size: 2
    num_epochs: 4
    optimizer: paged_adamw_32bit
    lr_scheduler: cosine
    learning_rate: 0.0002

    train_on_inputs: false
    group_by_length: false
    bf16: auto
    fp16:
    tf32: false

    gradient_checkpointing: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true

    warmup_steps: 10
    evals_per_epoch: 4
    saves_per_epoch: 1
    debug:
    deepspeed:
    weight_decay: 0.0
    fsdp:
    fsdp_config:
    special_tokens:
    """

    Convert the YAML string to a Python dictionary

    yaml_dict = yaml.safe_load(yaml_string)

    Specify your file path

    file_path = 'test_axolotl.yaml'

    Write the YAML file

    with open(file_path, 'w') as file:
        yaml.dump(yaml_dict, file)

    ## Launch the training
    

    By using the !, the command will be executed as a bash command

    !accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml

    ## Play with inference
    

    By using the !, the command will be executed as a bash command

    !accelerate launch -m axolotl.cli.inference /content/test_axolotl.yaml \
        --qlora_model_dir="./qlora-out" --gradio

    [huggingface/transformers] examples/pytorch/audio-classification/README.md

    Multi-GPU

    The following command shows how to fine-tune wav2vec2-base for 🌎 Language Identification on the CommonLanguage dataset.

    python run_audio_classification.py \
      --model_name_or_path facebook/wav2vec2-base \
      --dataset_name common_language \
      --audio_column_name audio \
      --label_column_name language \
      --output_dir wav2vec2-base-lang-id \
      --overwrite_output_dir \
      --remove_unused_columns False \
      --do_train \
      --do_eval \
      --fp16 \
      --learning_rate 3e-4 \
      --max_length_seconds 16 \
      --attention_mask False \
      --warmup_ratio 0.1 \
      --num_train_epochs 10 \
      --per_device_train_batch_size 8 \
      --gradient_accumulation_steps 4 \
      --per_device_eval_batch_size 1 \
      --dataloader_num_workers 8 \
      --logging_strategy steps \
      --logging_steps 10 \
      --eval_strategy epoch \
      --save_strategy epoch \
      --load_best_model_at_end True \
      --metric_for_best_model accuracy \
      --save_total_limit 3 \
      --seed 0 \
      --push_to_hub

    On 4 V100 GPUs (16GB), this script should run in ~1 hour and yield accuracy of 79.45%.

    👀 See the results here: anton-l/wav2vec2-base-lang-id

    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/input/v0_4_1/__init__.py
    def check_mem_mismatch(cls, data):
        if (
            data.get("max_memory") is not None
            and data.get("gpu_memory_limit") is not None
        ):
            raise ValueError(
                "max_memory and gpu_memory_limit are mutually exclusive and cannot be used together."
            )
        return data
    [huggingface/transformers] examples/pytorch/speech-recognition/README.md

    Multi GPU CTC with Dataset Streaming

    The following command shows how to use Dataset Streaming mode to fine-tune XLS-R on Common Voice using 4 GPUs in half-precision.

    Streaming mode imposes several constraints on training:

    1. We need to construct a tokenizer beforehand and define it via --tokenizer_name_or_path.
    2. --num_train_epochs has to be replaced by --max_steps. Similarly, all other epoch-based arguments have to be replaced by step-based ones.
    3. Full dataset shuffling on each epoch is not possible, since we don't have the whole dataset available at once. However, the --shuffle_buffer_size argument controls how many examples we can pre-download before shuffling them.
    torchrun \
      --nproc_per_node 4 run_speech_recognition_ctc_streaming.py \
      --dataset_name="common_voice" \
      --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
      --tokenizer_name_or_path="anton-l/wav2vec2-tokenizer-turkish" \
      --dataset_config_name="tr" \
      --train_split_name="train+validation" \
      --eval_split_name="test" \
      --output_dir="wav2vec2-xls-r-common_voice-tr-ft" \
      --overwrite_output_dir \
      --max_steps="5000" \
      --per_device_train_batch_size="8" \
      --gradient_accumulation_steps="2" \
      --learning_rate="5e-4" \
      --warmup_steps="500" \
      --eval_strategy="steps" \
      --text_column_name="sentence" \
      --save_steps="500" \
      --eval_steps="500" \
      --logging_steps="1" \
      --layerdrop="0.0" \
      --eval_metrics wer cer \
      --save_total_limit="1" \
      --mask_time_prob="0.3" \
      --mask_time_length="10" \
      --mask_feature_prob="0.1" \
      --mask_feature_length="64" \
      --freeze_feature_encoder \
      --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
      --max_duration_in_seconds="20" \
      --shuffle_buffer_size="500" \
      --fp16 \
      --push_to_hub \
      --do_train --do_eval \
      --gradient_checkpointing

    On 4 V100 GPUs, this script should run in ca. 3h 31min and yield a CTC loss of 0.35 and word error rate of 0.29.

    [openaccess-ai-collective/axolotl] src/axolotl/utils/bench.py
    def gpu_memory_usage(device=0):
        return torch.cuda.memory_allocated(device) / 1024.0**3
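
    A minimal usage sketch for this helper (assuming axolotl is installed and CUDA is available) that prints per-GPU allocated memory during a multi-GPU run:

      import torch
      from axolotl.utils.bench import gpu_memory_usage

      for device in range(torch.cuda.device_count()):
          # torch reports allocated memory in bytes; the helper converts it to GiB
          print(f"cuda:{device} allocated: {gpu_memory_usage(device):.2f} GiB")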
    [huggingface/transformers] tests/quantization/bnb/test_mixed_int8.py
    class MixedInt8TestMultiGpu(BaseMixedInt8Test):
        def setUp(self):
            super().setUp()

        def test_multi_gpu_loading(self):
            r"""
            This tests that the model has been loaded and can be used correctly on a multi-GPU setup.
            Let's just try to load a model on 2 GPUs and see if it works. The model we test has ~2GB
            of total, 3GB should suffice
            """
            model_parallel = AutoModelForCausalLM.from_pretrained(
                self.model_name, load_in_8bit=True, device_map="balanced"
            )

            # Check correct device map
            self.assertEqual(set(model_parallel.hf_device_map.values()), {0, 1})

            # Check that inference pass works on the model
            encoded_input = self.tokenizer(self.input_text, return_tensors="pt")

            # Second real batch
            output_parallel = model_parallel.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
            self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
    [huggingface/transformers] docs/source/en/perf_train_gpu_many.md

    GPU selection

    When training on multiple GPUs, you can specify the number of GPUs to use and in what order. This can be useful for instance when you have GPUs with different computing power and want to use the faster GPU first. The selection process works for both DistributedDataParallel and DataParallel to use only a subset of the available GPUs, and you don't need Accelerate or the DeepSpeed integration.

    Number of GPUs

    For example, if you have 4 GPUs and you only want to use the first 2:


    Use the --nproc_per_node to select how many GPUs to use.

    torchrun --nproc_per_node=2 trainer-program.py ...

    Use --num_processes to select how many GPUs to use.

    accelerate launch --num_processes 2 trainer-program.py ...

    Use --num_gpus to select how many GPUs to use.

    deepspeed --num_gpus 2 trainer-program.py ...

    Order of GPUs

    Now, to select which GPUs to use and their order, you'll use the CUDA_VISIBLE_DEVICES environment variable. It is easiest to set the environment variable in ~/.bashrc or another startup config file. CUDA_VISIBLE_DEVICES is used to map which GPUs are used. For example, if you have 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2:

    CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...

    Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to cuda:0 and cuda:1 respectively. You can also reverse the order of the GPUs to use 2 first. Now, the mapping is cuda:1 for GPU 0 and cuda:0 for GPU 2.

    CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...

    You can also set the CUDA_VISIBLE_DEVICES environment variable to an empty value to create an environment without GPUs.

    CUDA_VISIBLE_DEVICES= python trainer-program.py ...

    As with any environment variable, they can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was setup and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line.


    CUDA_DEVICE_ORDER is an alternative environment variable you can use to control how the GPUs are ordered. You can either order them by:

    1. PCIe bus IDs that match the order of nvidia-smi and rocm-smi for NVIDIA and AMD GPUs respectively

       export CUDA_DEVICE_ORDER=PCI_BUS_ID

    2. GPU compute ability

       export CUDA_DEVICE_ORDER=FASTEST_FIRST

    The CUDA_DEVICE_ORDER is especially useful if your training setup consists of an older and newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set CUDA_DEVICE_ORDER=FASTEST_FIRST to always use the newer and faster GPU first (nvidia-smi or rocm-smi still reports the GPUs in their PCIe order). Or you could also set export CUDA_VISIBLE_DEVICES=1,0.
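
    Tying this back to the original 4-GPU question, a sketch of pinning a run to four specific GPUs with Axolotl's trainer (device IDs and config name are illustrative) would be:

      CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 -m axolotl.cli.train your_config.yml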

    [huggingface/transformers] docs/source/it/perf_hardware.md

    GPU

    When training bigger models there are essentially three options:

    Let's start with the case where you have a single GPU.

    Power and Cooling

    If you bought an expensive high-end GPU, make sure you give it the correct power and sufficient cooling.

    Power:

    Some high-end consumer GPU cards have 2 and sometimes 3 PCI-E 8-pin power sockets. Make sure you have as many independent 12V PCI-E 8-pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as a pigtail cable). That is, if you have 2 sockets on the GPU, you want 2 PCI-E 8-pin cables going from your power supply to the card, and not one cable that has 2 PCI-E 8-pin connectors at its end! Otherwise you won't get the card's full performance.

    Each PCI-E 8-pin power cable needs to be plugged into a 12V rail on the power-supply side and can deliver up to 150W of power.

    Some other cards may use PCI-E 12-pin connectors, and these can deliver up to 500-600W of power.

    Low-end cards may use 6-pin connectors, which supply up to 75W of power.

    Additionally, you want a high-end power supply (PSU) that provides stable voltage. Some lower-quality PSUs may not give the card the stable voltage it needs to operate at its peak.

    And of course the PSU needs to have enough unused watts to power the card.

    Cooling:

    When a GPU gets overheated it will start throttling and will not deliver full performance, and it may even shut down if it gets too hot.

    It's hard to say what the exact best temperature to aim for is when a GPU is heavily loaded, but probably anything under +80°C is fine; lower is better, and perhaps 70-75°C is an excellent range to be in. Throttling is likely to start at around 84-90°C. Besides throttling performance, a prolonged very high temperature is also likely to reduce the lifespan of the GPU.

    Diamo quindi un'occhiata a uno degli aspetti più importanti quando si hanno più GPU: la connettività.

    Connettività multi-GPU

    Se utilizzi più GPU, il modo in cui le schede sono interconnesse può avere un enorme impatto sul tempo totale di allenamento. Se le GPU si trovano sullo stesso nodo fisico, puoi eseguire:

    nvidia-smi topo -m

    and it will tell you how the GPUs are interconnected. On a machine with dual GPUs connected with NVLink, you will most likely see something like:

            GPU0    GPU1    CPU Affinity    NUMA Affinity
    GPU0     X      NV2     0-23            N/A
    GPU1    NV2      X      0-23            N/A
    

    On a different machine without NVLink we may see:

            GPU0    GPU1    CPU Affinity    NUMA Affinity
    GPU0     X      PHB     0-11            N/A
    GPU1    PHB      X      0-11            N/A
    

    The report includes this legend:

      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing at most a single PCIe bridge
      NV#  = Connection traversing a bonded set of # NVLinks
    

    So the first report, NV2, tells us the GPUs are interconnected with 2 NVLinks, while the second report, PHB, shows a typical consumer-level PCIe+Bridge setup.

    Check what kind of connectivity you have on your setup. Some of these will make the communication between cards faster (e.g. NVLink), some slower (e.g. PHB).

    Depending on the type of scalability solution used, the connectivity speed could have a major or a minor impact. If the GPUs rarely need to sync, as in DDP, the impact of a slower connection will be less significant. If the GPUs need to exchange messages often, as in ZeRO-DP, then faster connectivity becomes extremely important to achieve faster training.

    NVlink

    NVLink is a wire-based serial multi-lane near-range communication link developed by Nvidia.

    Each new generation provides faster bandwidth; for example, here is a quote from the Nvidia Ampere GA102 GPU Architecture whitepaper:

    Third-Generation NVLink® GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)

    So the higher X you get in the NVX report in the output of nvidia-smi topo -m, the better. The generation will depend on your GPU architecture.
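    If you want to inspect the NVLink links directly (assuming a reasonably recent NVIDIA driver; the output format varies by version), nvidia-smi also has a dedicated subcommand:

    # show per-link NVLink status and speed for each GPU
    nvidia-smi nvlink --status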

    Let's compare training an openai-community/gpt2 language model on a small sample of wikitext.

    The results are:

    | NVlink | Time |
    | ------ | ---: |
    | Y      | 101s |
    | N      | 131s |

    You can see that NVLink completes the training about 23% faster. In the second benchmark we use NCCL_P2P_DISABLE=1 to tell the GPUs not to use NVLink.

    Here is the full benchmark code and outputs:

    # DDP w/ NVLink
    rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
    --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
    --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

    {'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

    # DDP w/o NVLink
    rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
    --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
    --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

    {'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}

    Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (NV2 in nvidia-smi topo -m)
    Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0

    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/models/input/v0_4_1/__init__.py
    class AxolotlConfigWCapabilities(AxolotlInputConfig):
        """wrapper to valdiate gpu capabilities with the configured options"""

        capabilities: GPUCapabilities

        @model_validator(mode="after")
        def check_bf16(self):
            if self.capabilities.bf16:
                if not self.bf16 and not self.bfloat16:
                    LOG.info(
                        "bf16 support detected, but not enabled for this configuration."
                    )
            else:
                if (
                    not self.merge_lora
                    and not self.is_preprocess
                    and (self.bf16 is True or self.bfloat16 is True)
                ):
                    raise ValueError(
                        "bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above."
                    )
            return self

        @model_validator(mode="before")
        @classmethod
        def check_sample_packing_w_sdpa_bf16(cls, data):
            is_sm_90: bool = (
                data["capabilities"]
                and data["capabilities"].get("compute_capability") == "sm_90"
            )
            if (
                data.get("sample_packing")
                and data.get("sdp_attention")
                and (data.get("bfloat16") or data.get("bf16"))
                and not is_sm_90
            ):
                # https://github.com/pytorch/pytorch/blob/1b03423526536b5f3d35bdfa95ccc6197556cf9b/test/test_transformers.py#L2440-L2450
                LOG.warning(
                    "sample_packing & torch sdpa with bf16 is unsupported may results in 0.0 loss. "
                    "This may work on H100s."
                )
            return data

        @model_validator(mode="before")
        @classmethod
        def check_fsdp_deepspeed(cls, data):
            if data.get("deepspeed") and data.get("fsdp"):
                raise ValueError("deepspeed and fsdp cannot be used together.")
            return data
    [huggingface/accelerate] src/accelerate/commands/launch.py
    options_to_group = {
        "multi_gpu": "Distributed GPUs",
        "tpu": "TPU",
        "use_deepspeed": "DeepSpeed Arguments",
        "use_fsdp": "FSDP Arguments",
        "use_megatron_lm": "Megatron-LM Arguments",
    }
    [huggingface/peft] docs/source/accelerate/deepspeed.md

    Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs

    In this section, we will look at how to use QLoRA and DeepSpeed Stage-3 for finetuning a 70B Llama model on 2x40GB GPUs. For this, we first need bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11 and peft>0.9.0. We need to set zero3_init_flag to true when using the Accelerate config. Below is the config, which can be found at deepspeed_config_z3_qlora.yaml:

    compute_environment: LOCAL_MACHINE
    debug: false
    deepspeed_config:
      deepspeed_multinode_launcher: standard
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false

    The launch command is given below; it is available at run_peft_qlora_deepspeed_stage3.sh:

    accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
    --seed 100 \
    --model_name_or_path "meta-llama/Llama-2-70b-hf" \
    --dataset_name "smangrul/ultrachat-10k-chatml" \
    --chat_template_format "chatml" \
    --add_special_tokens False \
    --append_concat_token False \
    --splits "train,test" \
    --max_seq_len 2048 \
    --num_train_epochs 1 \
    --logging_steps 5 \
    --log_level "info" \
    --logging_strategy "steps" \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --push_to_hub \
    --hub_private_repo True \
    --hub_strategy "every_save" \
    --bf16 True \
    --packing True \
    --learning_rate 1e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --max_grad_norm 1.0 \
    --output_dir "llama-sft-qlora-dsz3" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing True \
    --use_reentrant True \
    --dataset_text_field "content" \
    --use_flash_attn True \
    --use_peft_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target_modules "all-linear" \
    --use_4bit_quantization True \
    --use_nested_quant True \
    --bnb_4bit_compute_dtype "bfloat16" \
    --bnb_4bit_quant_storage_dtype "bfloat16"
    

    Notice the new argument being passed, bnb_4bit_quant_storage_dtype, which denotes the data type used for packing the 4-bit parameters. For example, when it is set to bfloat16, 16/4 = 4 4-bit params are packed together after quantization.

    In terms of training code, the important code changes are:

    ...
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit_quantization,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    +   bnb_4bit_quant_storage=quant_storage_dtype,
    )
    ...
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True,
        attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
    +   torch_dtype=quant_storage_dtype or torch.float32,
    )

    Notice that the torch_dtype for AutoModelForCausalLM is the same as the bnb_4bit_quant_storage data type. That's it; everything else is handled by Trainer and TRL.

    Memory usage

    In the above example, the memory consumed per GPU is 36.6 GB. Therefore, what took 8X80GB GPUs with DeepSpeed Stage 3+LoRA and a couple of 80GB GPUs with DDP+QLoRA now requires 2X40GB GPUs. This makes finetuning of large models more accessible.

    [huggingface/accelerate] examples/slurm/submit_multinode.sh
    #!/bin/bash
    #SBATCH --job-name=multinode
    #SBATCH -D .
    #SBATCH --output=O-%x.%j
    #SBATCH --error=E-%x.%j
    #SBATCH --nodes=4                  # number of nodes
    #SBATCH --ntasks-per-node=1        # number of MP tasks
    #SBATCH --gres=gpu:4               # number of GPUs per node
    #SBATCH --cpus-per-task=160        # number of cores per tasks
    #SBATCH --time=01:59:00            # maximum execution time (HH:MM:SS)

    ######################
    ### Set enviroment ###
    ######################
    source activateEnvironment.sh
    export GPUS_PER_NODE=4
    ######################

    ######################
    #### Set network #####
    ######################
    head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
    ######################

    export LAUNCHER="accelerate launch \
        --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
        --num_machines $SLURM_NNODES \
        --rdzv_backend c10d \
        --main_process_ip $head_node_ip \
        --main_process_port 29500 \
        "
    export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"
    export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
    export SCRIPT_ARGS=" \
        --mixed_precision fp16 \
        --output_dir ${ACCELERATE_DIR}/examples/output \
        "

    # This step is necessary because accelerate launch does not handle multiline arguments properly
    export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
    srun $CMD
    [huggingface/peft] examples/sft/README.md

    Supervised Fine-tuning (SFT) with PEFT

    In this example, we'll see how to use PEFT to perform SFT on various distributed setups.

    Single GPU SFT with QLoRA

    QLoRA uses 4-bit quantization of the base model to drastically reduce the GPU memory it consumes, while using LoRA for parameter-efficient fine-tuning. The command to use QLoRA is present at run_peft.sh.

    Note:

    1. At present, use_reentrant needs to be True when using gradient checkpointing with QLoRA, otherwise QLoRA leads to high GPU memory consumption.

    Single GPU SFT with QLoRA using Unsloth

    Unsloth enables finetuning Mistral/Llama models 2-5x faster with 70% less memory. It achieves this by reducing data upcasting, using Flash Attention 2, custom Triton kernels for RoPE embeddings, RMS LayerNorm and cross-entropy loss, and clever manual autograd computation to reduce the FLOPs during QLoRA finetuning. Below is the list of optimizations from the Unsloth blogpost mistral-benchmark. The command to use QLoRA with Unsloth is present at run_unsloth_peft.sh.

    ![Optimization in Unsloth to speed up QLoRA finetuning while reducing GPU memory usage](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/Unsloth.png)

    Multi-GPU SFT with QLoRA

    To speed up QLoRA finetuning when you have access to multiple GPUs, look at the launch command at run_peft_multigpu.sh. This example performs DDP on 8 GPUs.

    Note:

    1. At present, use_reentrant needs to be False when using gradient checkpointing with multi-GPU QLoRA, otherwise it will lead to errors. However, this leads to huge GPU memory consumption.
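    As a rough sketch (not the actual contents of run_peft_multigpu.sh), an 8-GPU DDP launch of the same train.py script used in the DeepSpeed example above might look like this, with use_reentrant flipped to False per the note:

    # hypothetical 8-GPU DDP launch; flag names mirror the train.py invocation shown earlier,
    # check run_peft_multigpu.sh for the authoritative command
    accelerate launch --multi_gpu --num_processes 8 train.py \
    --model_name_or_path "meta-llama/Llama-2-70b-hf" \
    --use_peft_lora True \
    --use_4bit_quantization True \
    --bf16 True \
    --gradient_checkpointing True \
    --use_reentrant False \
    --output_dir "llama-sft-qlora-ddp"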

    Multi-GPU SFT with LoRA and DeepSpeed

    When you have access to multiple GPUs, it is better to use normal LoRA with DeepSpeed/FSDP. To use LoRA with DeepSpeed, refer to the docs at PEFT with DeepSpeed.

    Multi-GPU SFT with LoRA and FSDP

    When you have access to multiple GPUs, it is better to use normal LoRA with DeepSpeed/FSDP. To use LoRA with FSDP, refer to the docs at PEFT with FSDP.

    [huggingface/accelerate] examples/slurm/submit_multicpu.sh
    #!/bin/bash -l
    #SBATCH --job-name=multicpu
    #SBATCH --nodes=2                  # number of Nodes
    #SBATCH --ntasks-per-node=1        # number of MP tasks
    #SBATCH --exclusive
    #SBATCH --output=O-%x.%j
    #SBATCH --error=E-%x.%j

    ######################
    ### Set enviroment ###
    ######################
    source activateEnvironment.sh
    ######################

    ######################
    #### Set network #####
    ######################
    head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
    ######################

    # Setup env variables for distributed jobs
    export MASTER_PORT="${MASTER_PORT:-29555 }"
    echo "head_node_ip=${head_node_ip}"
    echo "MASTER_PORT=${MASTER_PORT}"

    INSTANCES_PER_NODE="${INSTANCES_PER_NODE:-1}"

    if [[ $SLURM_NNODES == 1 ]] && [[ $INSTANCES_PER_NODE == 1 ]]; then
        export CCL_WORKER_COUNT=0
        LAUNCHER=""
    else
        # Setup env variables for distributed jobs
        export CCL_WORKER_COUNT="${CCL_WORKER_COUNT:-2}"
        echo "CCL_WORKER_COUNT=${CCL_WORKER_COUNT}"

        # Write hostfile
        HOSTFILE_PATH=hostfile
        scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomb; print "$_"x1'> ${HOSTFILE_PATH}

        export LAUNCHER="accelerate launch \
            --num_processes $((SLURM_NNODES * ${INSTANCES_PER_NODE})) \
            --num_machines $SLURM_NNODES \
            --rdzv_backend c10d \
            --main_process_ip $head_node_ip \
            --main_process_port $MASTER_PORT \
            --mpirun_hostfile $HOSTFILE_PATH \
            --mpirun_ccl $CCL_WORKER_COUNT"
    fi

    # This step is necessary because accelerate launch does not handle multiline arguments properly
    export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"
    export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
    export SCRIPT_ARGS=" \
        --cpu \
        --output_dir ${ACCELERATE_DIR}/examples/output \
        "

    # This step is necessary because accelerate launch does not handle multiline arguments properly
    export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"

    # Print the command
    echo $CMD
    echo ""

    # Run the command
    eval $CMD
    [huggingface/accelerate] docs/source/usage_guides/ipex.md

    How It Works For Training optimization in CPU

    🤗 Accelerate has integrated IPEX; all you need to do is enable it through the config.

    Scenario 1: Acceleration of No distributed CPU training

    Run accelerate config on your machine:

    $ accelerate config
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    In which compute environment are you running?
    This machine
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    Which type of machine are you using?
    No distributed training
    Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:yes
    Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:yes
    Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
    Do you want to use DeepSpeed? [yes/NO]: NO
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    Do you wish to use FP16 or BF16 (mixed precision)?
    bf16

    This will generate a config file that will be used automatically to properly set the default options when doing

    accelerate launch my_script.py --args_to_my_script

    For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled. Below is the default_config.yaml that is generated after accelerate config:

    compute_environment: LOCAL_MACHINE
    distributed_type: 'NO'
    downcast_bf16: 'no'
    ipex_config:
      ipex: true
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 1
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: true

    accelerate launch examples/nlp_example.py

    Scenario 2: Acceleration of distributed CPU training

    We use Intel oneCCL for communication, combined with the Intel® MPI library, to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. You can refer to the installation guide here.

    Run accelerate config on your machine (node0):

    $ accelerate config
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    In which compute environment are you running?
    This machine
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    Which type of machine are you using?
    multi-CPU
    How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    What is the rank of this machine?
    0
    What is the IP address of the machine that will host the main process? 36.112.23.24
    What is the port you will use to communicate with the main process? 29500
    Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
    Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:yes
    Do you want accelerate to launch mpirun? [yes/NO]: yes
    Please enter the path to the hostfile to use with mpirun [~/hostfile]: ~/hostfile
    Enter the number of oneCCL worker threads [1]: 1
    Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
    How many processes should be used for distributed training? [1]:16
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    Do you wish to use FP16 or BF16 (mixed precision)?
    bf16

    For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled for distributed CPU training.

    Below is the default_config.yaml that is generated after accelerate config:

    compute_environment: LOCAL_MACHINE
    distributed_type: MULTI_CPU
    downcast_bf16: 'no'
    ipex_config:
      ipex: true
    machine_rank: 0
    main_process_ip: 36.112.23.24
    main_process_port: 29500
    main_training_function: main
    mixed_precision: bf16
    mpirun_config:
      mpirun_ccl: '1'
      mpirun_hostfile: /home/user/hostfile
    num_machines: 4
    num_processes: 16
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: true

    Set the following environment variables and use Intel MPI to launch the training.

    On node0, you need to create a configuration file that contains the IP addresses of each node (for example, hostfile) and pass that configuration file path as an argument. If you selected to have Accelerate launch mpirun, ensure that the location of your hostfile matches the path in the config.

    $ cat hostfile
    xxx.xxx.xxx.xxx #node0 ip
    xxx.xxx.xxx.xxx #node1 ip
    xxx.xxx.xxx.xxx #node2 ip
    xxx.xxx.xxx.xxx #node3 ip

    When Accelerate is launching mpirun, source the oneCCL bindings setvars.sh to get your Intel MPI environment, and then run your script using accelerate launch. Note that the Python script and environment need to exist on all of the machines being used for multi-CPU training.

    oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
    source $oneccl_bindings_for_pytorch_path/env/setvars.sh
    accelerate launch examples/nlp_example.py

    Otherwise, if you selected not to have Accelerate launch mpirun, run the following command on node0, and 16-process DDP will be enabled across node0, node1, node2 and node3 with BF16 mixed precision. When using this method, the Python script, Python environment, and accelerate config file need to be present on all of the machines used for multi-CPU training.

    oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
    source $oneccl_bindings_for_pytorch_path/env/setvars.sh
    export CCL_WORKER_COUNT=1
    export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
    export CCL_ATL_TRANSPORT=ofi
    mpirun -f hostfile -n 16 -ppn 4 accelerate launch examples/nlp_example.py