Phorm.ai
OpenAccess-AI-Collective/axolotl

    Do you see anything wrong with this config?

    base_model: Qwen/Qwen2.5-32B
    base_model_config: Qwen/Qwen2.5-32B
    model_type: AutoModel
    tokenizer_type: AutoTokenizer
    trust_remote_code: true
    shuffle_merged_datasets: true
    datasets:
      - path: /workspace/finetuning/outputs/orion-v1.0.0.jsonl
        ds_type: json
        type: input_output
    output_dir: /workspace/orion
    warmup_steps: 50
    sequence_len: 4096
    sample_packing: true
    pad_to_sequence_len: true
    wandb_project: lunalust
    wandb_entity: orionsoftware
    gradient_accumulation_steps: 1
    micro_batch_size: 1
    eval_batch_size: 1
    num_epochs: 1
    optimizer: adamw_torch
    lr_scheduler: constant
    learning_rate: 0.00002
    bf16: true
    gradient_checkpointing: true
    logging_steps: 1
    flash_attention: true
    saves_per_epoch: 1
    deepspeed: /workspace/axolotl/deepspeed_configs/zero3.json
    weight_decay: 0.01
    special_tokens:
      eos_token: "<|im_end|>"
      pad_token: "<|endoftext|>"
      sep_token: "<sep>"
      unk_token: "<unk>"
      additional_special_tokens: ["<|im_start|>", "<sep>"]
    tokens:
      - "<|im_start|>"
      - "_[username]"
      - "_[moanshort]"
      - "_[moanlong]"

    The configuration looks broadly reasonable for a full-parameter fine-tune of a Qwen model with axolotl and DeepSpeed, but a few points deserve correction or a closer look:

    1. model_type and tokenizer_type: model_type: AutoModel loads the bare transformer without a language-modeling head, which is not what you want for causal-LM fine-tuning. Use AutoModelForCausalLM (or the architecture-specific Qwen2ForCausalLM), or omit model_type and let the class be inferred from the repository's config.json. tokenizer_type: AutoTokenizer is fine; the Auto* classes resolve the concrete classes from the configuration files in the model repository.
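
    For example, the relevant lines could look like the sketch below (assuming a standard causal-LM fine-tune; Qwen2ForCausalLM is an optional, more explicit alternative):

        base_model: Qwen/Qwen2.5-32B
        model_type: AutoModelForCausalLM   # loads the LM head; plain AutoModel would not
        tokenizer_type: AutoTokenizer
        trust_remote_code: true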

    2. base_model_config: This field is only needed when the configuration files live somewhere other than base_model, or when you want to override them. Since Qwen/Qwen2.5-32B ships its own config, pointing this at the same repo is redundant; you can drop it.

    3. trust_remote_code: Setting this to true allows custom modeling code from the model repository to be executed. Qwen2.5 uses the Qwen2 architecture that ships with recent transformers releases, so the flag is likely unnecessary here, but it is fine to keep as long as you trust the source.

    4. ds_type: In axolotl, ds_type and type are separate fields: ds_type gives the on-disk format of a local dataset file (json, csv, parquet, ...), while type selects the prompt strategy (here input_output). Your config sets both, so nothing needs to change here.
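
    For reference, here is the datasets stanza with comments spelling out my reading of the two fields (treat the comments as interpretation, not official documentation):

        datasets:
          - path: /workspace/finetuning/outputs/orion-v1.0.0.jsonl
            ds_type: json        # on-disk format of the local dataset file
            type: input_output   # prompt strategy: explicit input/output segments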

    5. deepspeed Configuration Path: The deepspeed field expects a path to a DeepSpeed configuration file. Make sure /workspace/axolotl/deepspeed_configs/zero3.json exists and matches your hardware; it controls the ZeRO stage, optimizer/parameter offloading, bucket sizes, and related settings (the zero2/zero3 examples in the search results below show the typical structure, with "auto" values that the trainer fills in).

    6. optimizer: adamw_torch selects PyTorch's native AdamW. Note that if the DeepSpeed JSON also defines an optimizer block, DeepSpeed will use that definition, so keep the two consistent (or leave the optimizer out of the DeepSpeed config).

    7. lr_scheduler: constant keeps the learning rate fixed for the entire run, and the standard transformers constant schedule ignores warmup steps, so warmup_steps: 50 would have no effect. If the warmup is intended, use constant_with_warmup; otherwise drop warmup_steps. For a single-epoch run, cosine or linear decay are also worth considering.
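
    If the 50-step warmup is intentional, a minimal adjustment would be the sketch below (assuming axolotl forwards lr_scheduler to the standard transformers scheduler names, of which constant_with_warmup is one):

        learning_rate: 0.00002
        lr_scheduler: constant_with_warmup   # flat LR after the warmup ramp
        warmup_steps: 50                     # now actually applied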

    8. bf16: You've enabled bfloat16 mixed precision training. Ensure your hardware (GPUs) supports BF16 operations for efficient training.

    9. flash_attention: FlashAttention does not depend on DeepSpeed; it requires the flash-attn package to be installed and an Ampere-or-newer GPU, and it works with the bf16: true setting you already have. Verify that the package is available in your training environment.

    10. special_tokens and tokens: special_tokens reassigns the tokenizer's special-token roles, while tokens adds new vocabulary entries and causes the embedding matrix to be resized. Qwen2.5's tokenizer already knows <|im_start|>, <|im_end|>, and <|endoftext|>, so re-registering those is harmless, but entries like _[username], _[moanshort], and _[moanlong] (along with <sep> and <unk>, if the tokenizer does not already define them) start with untrained embeddings. Make sure they actually occur in your training data so they get learned, and use the same tokenizer at inference time.
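
    Annotated, the token section reads roughly as below; the comments reflect my understanding of how axolotl treats each field rather than project documentation:

        special_tokens:            # remaps the tokenizer's special-token roles (eos, pad, ...)
          eos_token: "<|im_end|>"
          pad_token: "<|endoftext|>"
          sep_token: "<sep>"
          unk_token: "<unk>"
          additional_special_tokens: ["<|im_start|>", "<sep>"]
        tokens:                    # new vocabulary entries; the embedding matrix is resized,
          - "<|im_start|>"         # and embeddings for genuinely new tokens start untrained
          - "_[username]"
          - "_[moanshort]"
          - "_[moanlong]"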

    Overall, the configuration seems tailored for a specific fine-tuning task with DeepSpeed optimization. Ensure that all paths, model types, and custom settings align with your project's requirements and the capabilities of your training environment.

    Search results
    [openaccess-ai-collective/axolotl] examples/redpajama/config-3b.yml
    base_model: togethercomputer/RedPajama-INCITE-Chat-3B-v1 model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: load_in_8bit: false datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 8 lora_alpha: 16 lora_dropout: 0.05 lora_target_modules: - q_proj - v_proj lora_fan_in_fan_out: false wandb_project: redpajama-alpaca-3b wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/redpajama-alpaca-3b batch_size: 4 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0000002 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 5 xformers_attention: flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0001 fsdp: fsdp_config: tokens: pad_token: "<|padding|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>" unk_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/falcon/config-7b.yml
    base_model: tiiuae/falcon-7b trust_remote_code: true model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca:chat dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/falcon-7b batch_size: 2 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.00003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: true flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 40 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: "<|endoftext|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/mamba/config.yml
    base_model: state-spaces/mamba-2.8b model_type: MambaLMHeadModel tokenizer_type: AutoTokenizer tokenizer_config: EleutherAI/gpt-neox-20b load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.0 output_dir: ./outputs/out sequence_len: 2048 sample_packing: false pad_to_sequence_len: false wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 1 num_epochs: 2 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 5e-5 train_on_inputs: false group_by_length: true bf16: auto fp16: tf32: true gradient_checkpointing: false early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: tokens: save_safetensors: False
    [openaccess-ai-collective/axolotl] examples/pythia-12b/config.yml
    base_model: EleutherAI/pythia-12b-deduped base_model_ignore_patterns: pytorch* # prefer safetensors model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false device_map: auto datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: 2048 lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: true # pythia/GPTNeoX lora specific wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/pythia-12b gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 5 learning_rate: 0.00003 optimizer: adamw_bnb_8bit lr_scheduler: cosine train_on_inputs: false group_by_length: false bf16: false fp16: false float16: true tf32: true flash_optimum: true early_stopping_patience: resume_from_checkpoint: local_rank: gradient_checkpointing: true fsdp: fsdp_config:
    [openaccess-ai-collective/axolotl] examples/mpt-7b/config.yml
    base_model: mosaicml/mpt-7b tokenizer_type: AutoTokenizer trust_remote_code: true # required for mpt as their model class is not merged into transformers yet load_in_8bit: false datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 8 lora_alpha: 16 lora_dropout: 0.05 lora_target_modules: - q_proj - v_proj lora_fan_in_fan_out: false wandb_project: mpt-alpaca-7b wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/mpt-alpaca-7b gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0000002 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 5 xformers_attention: flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0001 fsdp: fsdp_config: tokens: pad_token: "<|padding|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>" unk_token: "<|endoftext|>"
    [openaccess-ai-collective/axolotl] examples/stablelm-2/1.6b/fft.yml
    base_model: stabilityai/stablelm-2-1_6b model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: true load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] examples/openllama-3b/config.yml
    base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: lora_alpha: lora_dropout: lora_target_modules: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/openllama-out gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false float16: true bf16: false fp16: false tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
    [huggingface/transformers] tests/deepspeed/ds_config_zero3.json
    { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
    [huggingface/transformers] tests/deepspeed/ds_config_zero2.json
    { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
    [huggingface/peft] examples/conditional_generation/multitask_prompt_tuning.ipynb

    from datasets import load_dataset from transformers import set_seed, AutoModelForSeq2SeqLM, AutoTokenizer from peft import get_peft_model, MultitaskPromptTuningConfig, TaskType, MultitaskPromptTuningInit

    set_seed(42)

    model_name = "google/flan-t5-base"

    peft_config = MultitaskPromptTuningConfig( tokenizer_name_or_path=model_name, num_tasks=2, task_type=TaskType.SEQ_2_SEQ_LM, prompt_tuning_init=MultitaskPromptTuningInit.TEXT, num_virtual_tokens=50, num_transformer_submodules=1, prompt_tuning_init_text="classify the following into either positive or negative, or entailment, neutral or contradiction:", )

    tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSeq2SeqLM.from_pretrained(model_name) model = get_peft_model(model, peft_config)

    model = model.cuda()

    def send_to_device(batch): for i in batch: batch[i] = batch[i].cuda() return batch

    def get_sst2(split: str): examples = load_dataset("sst2")[split] result_examples = [] for example in examples: result_examples.append({})

        result_examples[-1]["input"] = example["sentence"].strip() + "</s>"
        result_examples[-1]["output"] = (
            f"positive{tokenizer.eos_token}" if example["label"] == 1 else f"negative{tokenizer.eos_token}"
        )
        result_examples[-1]["task_id"] = 0
    
    return result_examples
    

    def get_mnli(split: str): examples = load_dataset("multi_nli")[split] result_examples = [] for example in examples: result_examples.append({})

        result_examples[-1]["input"] = example["premise"].strip() + " " + example["hypothesis"].strip() + "</s>"
    
        if example["label"] == 0:
            result_examples[-1]["output"] = f"entailment{tokenizer.eos_token}"
        elif example["label"] == 1:
            result_examples[-1]["output"] = f"neutral{tokenizer.eos_token}"
        else:
            result_examples[-1]["output"] = f"contradiction{tokenizer.eos_token}"
    
        result_examples[-1]["task_id"] = 1
    
    return result_examples
    

    from typing import Tuple from torch.utils.data import Dataset, DataLoader import torch

    class MyDataset(Dataset): def __init__(self, split: str, mode: str = "source") -> None: super().__init__()

        if split == "train":
            if mode == "source":
                self.examples = get_sst2(split) + get_mnli(split)
            elif mode == "target":
                self.examples = get_sst2(split)
        if split == "val":
            self.examples = get_sst2("validation")
        if split == "test":
            self.examples = get_sst2("validation")
    
    def __getitem__(self, index) -> dict:
        return self.examples[index]
    
    def __len__(self) -> int:
        return len(self.examples)
    

    def collate_fn(batch: dict) -> Tuple[torch.Tensor, torch.Tensor]: input = [i["input"] for i in batch] input = tokenizer(input, add_special_tokens=False, return_tensors="pt", padding=True)

    output = [i["output"] for i in batch]
    output = tokenizer(output, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
    output[output == tokenizer.pad_token_id] = -100
    
    task_ids = [i["task_id"] for i in batch]
    task_ids = torch.tensor(task_ids)
    
    return {
        "input_ids": input.input_ids,
        "attention_mask": input.attention_mask,
        "labels": output,
        "task_ids": task_ids,
    }
    

    train = DataLoader(MyDataset("train"), shuffle=True, batch_size=8, collate_fn=collate_fn) val = DataLoader(MyDataset("val"), shuffle=False, batch_size=8, collate_fn=collate_fn) test = DataLoader(MyDataset("test"), shuffle=False, batch_size=8, collate_fn=collate_fn)

    ## source training
    

    from torch.optim.adamw import AdamW from transformers import get_cosine_schedule_with_warmup from tqdm import tqdm from sklearn.metrics import f1_score

    POSITIVE_TOKEN_ID = tokenizer(" positive", add_special_tokens=False)["input_ids"][0] NEGATIVE_TOKEN_ID = tokenizer(" negative", add_special_tokens=False)["input_ids"][0]

    def classify(batch): batch = send_to_device(batch) # we pass labels here since we need to generate and peft doesn't support generation yet. # No clue how to get around this scores = model(**batch).logits preds = [] for i in range(scores.shape[0]): if scores[i, 0, POSITIVE_TOKEN_ID] > scores[i, 0, NEGATIVE_TOKEN_ID]: preds.append(POSITIVE_TOKEN_ID) else: preds.append(NEGATIVE_TOKEN_ID) return preds

    @torch.inference_mode() def evaluate(model, data): loss = 0 preds = [] golds = []

    for batch in tqdm(data):
        batch = send_to_device(batch)
        loss += model(**batch).loss
        golds.extend(batch["labels"][:, 0].tolist())
        preds.extend(classify(batch))
    
    return loss / len(val), f1_score(golds, preds, pos_label=POSITIVE_TOKEN_ID)
    

    optimizer = AdamW(model.parameters(), lr=1e-4) scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))

    n = 1000 step = 0 train_ = tqdm(train)

    val_loss, f1 = evaluate(model, val) print( f""" before source training val loss = {val_loss} f1 = {f1}""" )

    for batch in train_: if step % n == 0: val_loss, f1 = evaluate(model, val) print( f""" step = {step} val loss = {val_loss} f1 = {f1}""" ) model.save_pretrained(f"checkpoints_source/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    train_.set_postfix(train_loss=loss)
    
    ## target training
    

    train = DataLoader(MyDataset("train", "target"), shuffle=True, batch_size=8, collate_fn=collate_fn) val = DataLoader(MyDataset("val", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn) test = DataLoader(MyDataset("test", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn)

    #### create a fresh model
    

    peft_config = MultitaskPromptTuningConfig( tokenizer_name_or_path=model_name, num_tasks=1, task_type=TaskType.SEQ_2_SEQ_LM, prompt_tuning_init=MultitaskPromptTuningInit.EXACT_SOURCE_TASK, prompt_tuning_init_state_dict_path="checkpoints_source/50000/adapter_model.bin", num_virtual_tokens=50, num_transformer_submodules=1, )

    tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSeq2SeqLM.from_pretrained(model_name) model = get_peft_model(model, peft_config)

    model = model.cuda()

    optimizer = AdamW(model.parameters(), lr=1e-4) scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))

    n = 1000 step = 0 train_ = tqdm(train)

    val_loss, f1 = evaluate(model, val) print( f""" before target training val loss = {val_loss} f1 = {f1}""" )

    for batch in train_: if step % n == 0: val_loss, f1 = evaluate(model, val) print( f""" step = {step} val loss = {val_loss} f1 = {f1}""" ) model.save_pretrained(f"checkpoints_target/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    train_.set_postfix(train_loss=loss)
    

    load last checkpoint for now

    from peft import set_peft_model_state_dict

    sd_6000 = torch.load("checkpoints_target/6000/adapter_model.bin") set_peft_model_state_dict(model, sd_6000)

    evaluate val

    val_loss, f1 = evaluate(model, val) print( f""" final val loss = {val_loss} f1 = {f1}""" )

    evaluate test

    test_loss, f1 = evaluate(model, test) print( f""" final test loss = {test_loss} f1 = {f1}""" )

    [huggingface/transformers] utils/check_config_attributes.py
    SPECIAL_CASES_TO_ALLOW = { # 'max_position_embeddings' is not used in modeling file, but needed for eval frameworks like Huggingface's lighteval (https://github.com/huggingface/lighteval/blob/af24080ea4f16eaf1683e353042a2dfc9099f038/src/lighteval/models/base_model.py#L264). # periods and offsers are not used in modeling file, but used in the configuration file to define `layers_block_type` and `layers_num_experts`. "JambaConfig": [ "max_position_embeddings", "attn_layer_offset", "attn_layer_period", "expert_layer_offset", "expert_layer_period", ], # used to compute the property `self.chunk_length` "EncodecConfig": ["overlap"], # used to compute the property `self.layers_block_type` "RecurrentGemmaConfig": ["block_types"], # used as in the config to define `intermediate_size` "MambaConfig": ["expand"], # used as `self.bert_model = BertModel(config, ...)` "DPRConfig": True, "FuyuConfig": True, # not used in modeling files, but it's an important information "FSMTConfig": ["langs"], # used internally in the configuration class file "GPTNeoConfig": ["attention_types"], # used internally in the configuration class file "EsmConfig": ["is_folding_model"], # used during training (despite we don't have training script for these models yet) "Mask2FormerConfig": ["ignore_value"], # `ignore_value` used during training (despite we don't have training script for these models yet) # `norm` used in conversion script (despite not using in the modeling file) "OneFormerConfig": ["ignore_value", "norm"], # used during preprocessing and collation, see `collating_graphormer.py` "GraphormerConfig": ["spatial_pos_max"], # used internally in the configuration class file "T5Config": ["feed_forward_proj"], # used internally in the configuration class file # `tokenizer_class` get default value `T5Tokenizer` intentionally "MT5Config": ["feed_forward_proj", "tokenizer_class"], "UMT5Config": ["feed_forward_proj", "tokenizer_class"], # used internally in the configuration class file "LongT5Config": ["feed_forward_proj"], # used internally in the configuration class file "Pop2PianoConfig": ["feed_forward_proj"], # used internally in the configuration class file "SwitchTransformersConfig": ["feed_forward_proj"], # having default values other than `1e-5` - we can't fix them without breaking "BioGptConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "GLPNConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "SegformerConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "CvtConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "PerceiverConfig": ["layer_norm_eps"], # used internally to calculate the feature size "InformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "TimeSeriesTransformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "AutoformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate `mlp_dim` "SamVisionConfig": ["mlp_ratio"], # For (head) training, but so far not implemented "ClapAudioConfig": ["num_classes"], # Not used, but providing useful information to users "SpeechT5HifiGanConfig": ["sampling_rate"], # used internally in the configuration class file "UdopConfig": ["feed_forward_proj"], # Actually used in the config or generation 
config, in that case necessary for the sub-components generation "SeamlessM4TConfig": [ "max_new_tokens", "t2u_max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", ], # Actually used in the config or generation config, in that case necessary for the sub-components generation "SeamlessM4Tv2Config": [ "max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", "t2u_variance_pred_dropout", "t2u_variance_predictor_embed_dim", "t2u_variance_predictor_hidden_dim", "t2u_variance_predictor_kernel_size", ], }
    [huggingface/peft] examples/conditional_generation/peft_prefix_tuning_seq2seq.ipynb

    from transformers import AutoModelForSeq2SeqLM from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType import torch from datasets import load_dataset import os

    os.environ["TOKENIZERS_PARALLELISM"] = "false" os.environ["CUDA_VISIBLE_DEVICES"] = "3" from transformers import AutoTokenizer from torch.utils.data import DataLoader from transformers import default_data_collator, get_linear_schedule_with_warmup from tqdm import tqdm from datasets import load_dataset

    device = "cuda" model_name_or_path = "t5-large" tokenizer_name_or_path = "t5-large"

    checkpoint_name = "financial_sentiment_analysis_prefix_tuning_v1.pt" text_column = "sentence" label_column = "text_label" max_length = 128 lr = 1e-2 num_epochs = 5 batch_size = 8

    creating model

    peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() model

    loading dataset

    dataset = load_dataset("financial_phrasebank", "sentences_allagree") dataset = dataset["train"].train_test_split(test_size=0.1) dataset["validation"] = dataset["test"] del dataset["test"]

    classes = dataset["train"].features["label"].names dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["label"]]}, batched=True, num_proc=1, )

    dataset["train"][0]

    data preprocessing

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def preprocess_function(examples): inputs = examples[text_column] targets = examples[label_column] model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt") labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt") labels = labels["input_ids"] labels[labels == tokenizer.pad_token_id] = -100 model_inputs["labels"] = labels return model_inputs

    processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=False, desc="Running tokenizer on dataset", )

    train_dataset = processed_datasets["train"] eval_dataset = processed_datasets["validation"]

    train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

    optimizer and lr scheduler

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr) lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), )

    training and evaluation

    model = model.to(device)

    for epoch in range(num_epochs): model.train() total_loss = 0 for step, batch in enumerate(tqdm(train_dataloader)): batch = {k: v.to(device) for k, v in batch.items()} outputs = model(**batch) loss = outputs.loss total_loss += loss.detach().float() loss.backward() optimizer.step() lr_scheduler.step() optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )
    
    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
    

    print accuracy

    correct = 0 total = 0 for pred, true in zip(eval_preds, dataset["validation"]["text_label"]): if pred.strip() == true.strip(): correct += 1 total += 1 accuracy = correct / total * 100 print(f"{accuracy=} % on the evaluation dataset") print(f"{eval_preds[:10]=}") print(f"{dataset['validation']['text_label'][:10]=}")

    saving model

    peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}" model.save_pretrained(peft_model_id)

    ckpt = f"{peft_model_id}/adapter_model.bin" !du -h $ckpt

    from peft import PeftModel, PeftConfig

    peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"

    config = PeftConfig.from_pretrained(peft_model_id) model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path) model = PeftModel.from_pretrained(model, peft_model_id)

    model.eval() i = 107 inputs = tokenizer(dataset["validation"][text_column][i], return_tensors="pt") print(dataset["validation"][text_column][i]) print(inputs)

    with torch.no_grad(): outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=10) print(outputs) print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

    [huggingface/peft] examples/boft_dreambooth/train_dreambooth.sh
    IDX=$1 PROMPT_IDX=$((IDX % 25)) CLASS_IDX=$((IDX % 30)) # Define the UNIQUE_TOKEN, CLASS_TOKENs, and SUBJECT_NAMES UNIQUE_TOKEN="qwe" SUBJECT_NAMES=( "backpack" "backpack_dog" "bear_plushie" "berry_bowl" "can" "candle" "cat" "cat2" "clock" "colorful_sneaker" "dog" "dog2" "dog3" "dog5" "dog6" "dog7" "dog8" "duck_toy" "fancy_boot" "grey_sloth_plushie" "monster_toy" "pink_sunglasses" "poop_emoji" "rc_car" "red_cartoon" "robot_toy" "shiny_sneaker" "teapot" "vase" "wolf_plushie" ) CLASS_TOKENs=( "backpack" "backpack" "stuffed animal" "bowl" "can" "candle" "cat" "cat" "clock" "sneaker" "dog" "dog" "dog" "dog" "dog" "dog" "dog" "toy" "boot" "stuffed animal" "toy" "glasses" "toy" "toy" "cartoon" "toy" "sneaker" "teapot" "vase" "stuffed animal" ) CLASS_TOKEN=${CLASS_TOKENs[$CLASS_IDX]} SELECTED_SUBJECT=${SUBJECT_NAMES[$CLASS_IDX]} if [[ $CLASS_IDX =~ ^(0|1|2|3|4|5|8|9|17|18|19|20|21|22|23|24|25|26|27|28|29)$ ]]; then PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a wheat field in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a tree and autumn leaves in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with the Eiffel Tower in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating on top of water." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating in an ocean of milk." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of green grass with sunflowers around it." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a mirror." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of the sidewalk in a crowded street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a dirt road." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a white rug." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} with a wheat field in the background" "a ${CLASS_TOKEN} with a tree and autumn leaves in the background" "a ${CLASS_TOKEN} with the Eiffel Tower in the background" "a ${CLASS_TOKEN} floating on top of water" "a ${CLASS_TOKEN} floating in an ocean of milk" "a ${CLASS_TOKEN} on top of green grass with sunflowers around it" "a ${CLASS_TOKEN} on top of a mirror" "a ${CLASS_TOKEN} on top of the sidewalk in a crowded street" "a ${CLASS_TOKEN} on top of a dirt road" "a ${CLASS_TOKEN} on top of a white rug" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) else PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a red hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a santa hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a rainbow scarf." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a black top hat and a monocle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a chef outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a firefighter outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a police outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing pink glasses." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a yellow shirt." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a purple wizard outfit." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} wearing a red hat" "a ${CLASS_TOKEN} wearing a santa hat" "a ${CLASS_TOKEN} wearing a rainbow scarf" "a ${CLASS_TOKEN} wearing a black top hat and a monocle" "a ${CLASS_TOKEN} in a chef outfit" "a ${CLASS_TOKEN} in a firefighter outfit" "a ${CLASS_TOKEN} in a police outfit" "a ${CLASS_TOKEN} wearing pink glasses" "a ${CLASS_TOKEN} wearing a yellow shirt" "a ${CLASS_TOKEN} in a purple wizard outfit" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) fi VALIDATION_PROMPT=${PROMPT_LIST[@]} INSTANCE_PROMPT="a photo of ${UNIQUE_TOKEN} ${CLASS_TOKEN}" CLASS_PROMPT="a photo of ${CLASS_TOKEN}" export MODEL_NAME="stabilityai/stable-diffusion-2-1" # export MODEL_NAME="runwayml/stable-diffusion-v1-5" PEFT_TYPE="boft" BLOCK_NUM=8 BLOCK_SIZE=0 N_BUTTERFLY_FACTOR=1 export PROJECT_NAME="dreambooth_${PEFT_TYPE}" export RUN_NAME="${SELECTED_SUBJECT}_${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}" export INSTANCE_DIR="./data/dreambooth/dataset/${SELECTED_SUBJECT}" export CLASS_DIR="./data/class_data/${CLASS_TOKEN}" export OUTPUT_DIR="./data/output/${PEFT_TYPE}" accelerate launch train_dreambooth.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --class_data_dir="$CLASS_DIR" \ --output_dir=$OUTPUT_DIR \ --wandb_project_name=$PROJECT_NAME \ --wandb_run_name=$RUN_NAME \ --with_prior_preservation --prior_loss_weight=1.0 \ --instance_prompt="$INSTANCE_PROMPT" \ --validation_prompt="$VALIDATION_PROMPT" \ --class_prompt="$CLASS_PROMPT" \ --resolution=512 \ --train_batch_size=1 \ --num_dataloader_workers=2 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --num_class_images=200 \ --use_boft \ --boft_block_num=$BLOCK_NUM \ --boft_block_size=$BLOCK_SIZE \ --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \ --boft_dropout=0.1 \ --boft_bias="boft_only" \ --learning_rate=3e-5 \ --max_train_steps=1010 \ --checkpointing_steps=200 \ --validation_steps=200 \ --enable_xformers_memory_efficient_attention \ --report_to="wandb" \
    [huggingface/peft] examples/causal_language_modeling/accelerate_ds_zero3_cpu_offload_config.yaml
    compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: none offload_param_device: none zero3_init_flag: true zero3_save_16bit_model: true zero_stage: 3 distributed_type: DEEPSPEED downcast_bf16: 'no' dynamo_backend: 'NO' fsdp_config: {} machine_rank: 0 main_training_function: main megatron_lm_config: {} mixed_precision: 'no' num_machines: 1 num_processes: 1 rdzv_backend: static same_network: true use_cpu: false
    [huggingface/peft] examples/conditional_generation/peft_prompt_tuning_seq2seq.ipynb

    import os

    import torch from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit from torch.utils.data import DataLoader from tqdm import tqdm from datasets import load_dataset

    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    device = "cuda" model_name_or_path = "t5-large" tokenizer_name_or_path = "t5-large"

    checkpoint_name = "financial_sentiment_analysis_prompt_tuning_v1.pt" text_column = "sentence" label_column = "text_label" max_length = 128 lr = 1 num_epochs = 5 batch_size = 8

    creating model

    peft_config = PromptTuningConfig( task_type=TaskType.SEQ_2_SEQ_LM, prompt_tuning_init=PromptTuningInit.TEXT, num_virtual_tokens=20, prompt_tuning_init_text="What is the sentiment of this article?\n", inference_mode=False, tokenizer_name_or_path=model_name_or_path, )

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() model

    loading dataset

    dataset = load_dataset("financial_phrasebank", "sentences_allagree") dataset = dataset["train"].train_test_split(test_size=0.1) dataset["validation"] = dataset["test"] del dataset["test"]

    classes = dataset["train"].features["label"].names dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["label"]]}, batched=True, num_proc=1, )

    dataset["train"][0]

    data preprocessing

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])

    def preprocess_function(examples): inputs = examples[text_column] targets = examples[label_column] model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt") labels = tokenizer( targets, max_length=target_max_length, padding="max_length", truncation=True, return_tensors="pt" ) labels = labels["input_ids"] labels[labels == tokenizer.pad_token_id] = -100 model_inputs["labels"] = labels return model_inputs

    processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=False, desc="Running tokenizer on dataset", )

    train_dataset = processed_datasets["train"] eval_dataset = processed_datasets["validation"]

    train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

    optimizer and lr scheduler

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr) lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), )

    training and evaluation

    model = model.to(device)

    for epoch in range(num_epochs): model.train() total_loss = 0 for step, batch in enumerate(tqdm(train_dataloader)): batch = {k: v.to(device) for k, v in batch.items()} outputs = model(**batch) loss = outputs.loss total_loss += loss.detach().float() loss.backward() optimizer.step() lr_scheduler.step() optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )
    
    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
    

    print accuracy

    correct = 0 total = 0 for pred, true in zip(eval_preds, dataset["validation"]["text_label"]): if pred.strip() == true.strip(): correct += 1 total += 1 accuracy = correct / total * 100 print(f"{accuracy=} % on the evaluation dataset") print(f"{eval_preds[:10]=}") print(f"{dataset['validation']['text_label'][:10]=}")

    saving model

    peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}" model.save_pretrained(peft_model_id)

    ckpt = f"{peft_model_id}/adapter_model.bin" !du -h $ckpt

    from peft import PeftModel, PeftConfig

    peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"

    config = PeftConfig.from_pretrained(peft_model_id) model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path) model = PeftModel.from_pretrained(model, peft_model_id)

    model.eval() i = 107 input_ids = tokenizer(dataset["validation"][text_column][i], return_tensors="pt").input_ids print(dataset["validation"][text_column][i]) print(input_ids)

    with torch.no_grad(): outputs = model.generate(input_ids=input_ids, max_new_tokens=10) print(outputs) print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

    [huggingface/accelerate] examples/by_feature/megatron_lm_gpt_pretraining.py
    continue with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps }" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss # New Code # For Megatron-LM, the losses are already averaged across the data parallel group if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses.append(loss) else: losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) try: if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses = torch.tensor(losses) else: losses = torch.cat(losses) eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) # this is causing some issue with Megatron-LM when using `wandb` at the end of the main function. # Everything works fine inspite of commenting this out. (wandb finishes/closes the run without error) # if args.with_tracking: # accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() # New Code # For Megatron-LM, we need to save the model using `accelerator.save_state` if accelerator.distributed_type == DistributedType.MEGATRON_LM: accelerator.save_state(args.output_dir) else: unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
    [huggingface/accelerate] examples/by_feature/deepspeed_with_config_support.py
    def parse_args(): parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task") parser.add_argument( "--dataset_name", type=str, default=None, help="The name of the dataset to use (via the datasets library).", ) parser.add_argument( "--dataset_config_name", type=str, default=None, help="The configuration name of the dataset to use (via the datasets library).", ) parser.add_argument( "--train_file", type=str, default=None, help="A csv or a json file containing the training data." ) parser.add_argument( "--validation_file", type=str, default=None, help="A csv or a json file containing the validation data." ) parser.add_argument( "--validation_split_percentage", default=5, help="The percentage of the train set used as validation set in case there's no validation split", ) parser.add_argument( "--model_name_or_path", type=str, help="Path to pretrained model or model identifier from huggingface.co/models.", required=False, ) parser.add_argument( "--config_name", type=str, default=None, help="Pretrained config name or path if not the same as model_name", ) parser.add_argument( "--tokenizer_name", type=str, default=None, help="Pretrained tokenizer name or path if not the same as model_name", ) parser.add_argument( "--use_slow_tokenizer", action="store_true", help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", ) parser.add_argument( "--per_device_train_batch_size", type=int, default=8, help="Batch size (per device) for the training dataloader.", ) parser.add_argument( "--per_device_eval_batch_size", type=int, default=8, help="Batch size (per device) for the evaluation dataloader.", ) parser.add_argument( "--learning_rate", type=float, default=5e-5, help="Initial learning rate (after the potential warmup period) to use.", ) parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.") parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") parser.add_argument( "--max_train_steps", type=int, default=None, help="Total number of training steps to perform. If provided, overrides num_train_epochs.", ) parser.add_argument( "--gradient_accumulation_steps", type=int, default=1, help="Number of updates steps to accumulate before performing a backward/update pass.", ) parser.add_argument( "--lr_scheduler_type", type=SchedulerType, default="linear", help="The scheduler type to use.", choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], ) parser.add_argument( "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler." ) parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.") parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.") parser.add_argument( "--model_type", type=str, default=None, help="Model type to use if training from scratch.", choices=MODEL_TYPES, ) parser.add_argument( "--block_size", type=int, default=None, help=( "Optional input sequence length after tokenization. The training dataset will be truncated in block of" " this size for training. Default to the model max input length for single sentence inputs (take into" " account special tokens)." 
), ) parser.add_argument( "--preprocessing_num_workers", type=int, default=None, help="The number of processes to use for the preprocessing.", ) parser.add_argument( "--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets" ) parser.add_argument( "--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files." ) parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.") parser.add_argument( "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") parser.add_argument( "--checkpointing_steps", type=str, default=None, help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.", ) parser.add_argument( "--resume_from_checkpoint", type=str, default=None, help="If the training should continue from a checkpoint folder.", ) # New Code # # Whether to load the best model at the end of training parser.add_argument( "--load_best_model", action="store_true", help="Whether to load the best model at the end of training", ) parser.add_argument( "--with_tracking", action="store_true", help="Whether to enable experiment trackers for logging.", ) parser.add_argument( "--report_to", type=str, default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' ' `"wandb"`, `"comet_ml"`, and `"dvclive"`. Use `"all"` (default) to report to all integrations.' "Only applicable when `--with_tracking` is passed." ), ) args = parser.parse_args() # Sanity checks if args.dataset_name is None and args.train_file is None and args.validation_file is None: raise ValueError("Need either a dataset name or a training/validation file.") else: if args.train_file is not None: extension = args.train_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file." if args.validation_file is not None: extension = args.validation_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file." if args.push_to_hub: assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." return args
    [huggingface/accelerate] docs/source/usage_guides/deepspeed.md

    How it works?

    Pre-Requisites: Install DeepSpeed version >=0.6.5. Please refer to the DeepSpeed Installation details for more information.

    We will first look at easy to use integration via accelerate config. Followed by more flexible and feature rich deepspeed config file integration.

    Accelerate DeepSpeed Plugin

    On your machine(s) just run:

    accelerate config

    and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed to which you should answer no. Then answer the following questions to generate a basic DeepSpeed config. This will generate a config file that will be used automatically to properly set the default options when doing

    accelerate launch my_script.py --args_to_my_script

    For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with DeepSpeed Plugin:

    ZeRO Stage-2 DeepSpeed Plugin Example

    compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: none offload_param_device: none zero3_init_flag: true zero_stage: 2 distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false
    accelerate launch examples/nlp_example.py --mixed_precision fp16

    ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example

    compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: cpu offload_param_device: cpu zero3_init_flag: true zero3_save_16bit_model: true zero_stage: 3 distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false
    accelerate launch examples/nlp_example.py --mixed_precision fp16

    Currently, Accelerate supports the following configuration options through the CLI:

    • `zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
    • `gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
    • `gradient_clipping`: Enable gradient clipping with value.
    • `offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
    • `offload_optimizer_nvme_path`: Decides the NVMe path to offload optimizer states to. If unspecified, will default to 'none'.
    • `offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
    • `offload_param_nvme_path`: Decides the NVMe path to offload parameters to. If unspecified, will default to 'none'.
    • `zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
    • `zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
    • `mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.
    • `deepspeed_moe_layer_cls_names`: Comma-separated list of transformer Mixture-of-Experts (MoE) layer class names (case-sensitive) to wrap, e.g., `MixtralSparseMoeBlock`, `Qwen2MoeSparseMoeBlock`, `JetMoEAttention,JetMoEBlock` ...
    • `deepspeed_hostfile`: DeepSpeed hostfile for configuring multi-node compute resources.
    • `deepspeed_exclusion_filter`: DeepSpeed exclusion filter string when using a multi-node setup.
    • `deepspeed_inclusion_filter`: DeepSpeed inclusion filter string when using a multi-node setup.
    • `deepspeed_multinode_launcher`: DeepSpeed multi-node launcher to use. If unspecified, will default to `pdsh`.
    • `deepspeed_config_file`: Path to the DeepSpeed config file in `json` format. See the next section for more details on this.
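    If you prefer to set these options in code rather than through the questionnaire, the plugin can also be constructed programmatically. Below is a minimal sketch; the specific values (ZeRO Stage-2, two gradient-accumulation steps, fp16) are illustrative only, not a recommendation:

    ```python
    # Minimal sketch: configuring the DeepSpeed plugin in code instead of via
    # `accelerate config`. The values used here are placeholders for illustration.
    from accelerate import Accelerator, DeepSpeedPlugin

    deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
    accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
    ```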

    To be able to tweak more options, you will need to use a DeepSpeed config file.

    DeepSpeed Config File

    On your machine(s) just run:

    accelerate config

    and answer the questions asked. It will ask whether you want to use a config file for deepspeed to which you answer yes and provide the path to the deepspeed config file. This will generate a config file that will be used automatically to properly set the default options when doing

    accelerate launch my_script.py --args_to_my_script

    For instance, here is how you would run the NLP example examples/by_feature/deepspeed_with_config_support.py (from the root of the repo) with DeepSpeed Config File:

    ZeRO Stage-2 DeepSpeed Config File Example

    ```yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage2_config.json
      zero3_init_flag: true
    distributed_type: DEEPSPEED
    fsdp_config: {}
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    mixed_precision: fp16
    num_machines: 1
    num_processes: 2
    use_cpu: false
    ```

    with the contents of zero_stage2_config.json being:

    ```json
    {
        "fp16": {
            "enabled": true,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "weight_decay": "auto",
                "torch_adam": true,
                "adam_w_mode": true
            }
        },
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto",
                "total_num_steps": "auto"
            }
        },
        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": true,
            "allgather_bucket_size": 2e8,
            "overlap_comm": true,
            "reduce_scatter": true,
            "reduce_bucket_size": "auto",
            "contiguous_gradients": true
        },
        "gradient_accumulation_steps": 1,
        "gradient_clipping": "auto",
        "steps_per_print": 2000,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
    }
    ```
    ```bash
    accelerate launch examples/by_feature/deepspeed_with_config_support.py \
        --config_name "gpt2-large" \
        --tokenizer_name "gpt2-large" \
        --dataset_name "wikitext" \
        --dataset_config_name "wikitext-2-raw-v1" \
        --block_size 128 \
        --output_dir "./clm/clm_deepspeed_stage2_accelerate" \
        --learning_rate 5e-4 \
        --per_device_train_batch_size 24 \
        --per_device_eval_batch_size 24 \
        --num_train_epochs 3 \
        --with_tracking \
        --report_to "wandb"
    ```

    ZeRO Stage-3 with CPU offload DeepSpeed Config File Example

    ```yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage3_offload_config.json
      zero3_init_flag: true
    distributed_type: DEEPSPEED
    fsdp_config: {}
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    mixed_precision: fp16
    num_machines: 1
    num_processes: 2
    use_cpu: false
    ```

    with the contents of zero_stage3_offload_config.json being:

    ```json
    {
        "fp16": {
            "enabled": true,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "weight_decay": "auto"
            }
        },
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto",
                "total_num_steps": "auto"
            }
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": true
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": true
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "sub_group_size": 1e9,
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": "auto"
        },
        "gradient_accumulation_steps": 1,
        "gradient_clipping": "auto",
        "steps_per_print": 2000,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
    }
    ```
    ```bash
    accelerate launch examples/by_feature/deepspeed_with_config_support.py \
        --config_name "gpt2-large" \
        --tokenizer_name "gpt2-large" \
        --dataset_name "wikitext" \
        --dataset_config_name "wikitext-2-raw-v1" \
        --block_size 128 \
        --output_dir "./clm/clm_deepspeed_stage3_offload_accelerate" \
        --learning_rate 5e-4 \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 32 \
        --num_train_epochs 3 \
        --with_tracking \
        --report_to "wandb"
    ```

    ZeRO++ Config Example

    You can use the features of ZeRO++ by using the appropriate config parameters. Note that ZeRO++ is an extension of ZeRO Stage 3. Here is how the config file can be modified, from DeepSpeed's ZeRO++ tutorial:

    ```json
    {
        "zero_optimization": {
            "stage": 3,
            "reduce_bucket_size": "auto",
            "zero_quantized_weights": true,
            "zero_hpz_partition_size": 8,
            "zero_quantized_gradients": true,
            "contiguous_gradients": true,
            "overlap_comm": true
        }
    }
    ```

    For hierarchical partitioning, the partition size zero_hpz_partition_size should ideally be set to the number of GPUs per node. (For example, the above config file assumes 8 GPUs per node)

    Important code changes when using DeepSpeed Config File

    1. DeepSpeed Optimizers and Schedulers. For more information on these, see the DeepSpeed Optimizers and DeepSpeed Schedulers documentation. We will look at the changes needed in the code when using these.

      a. DS Optim + DS Scheduler: The case when both optimizer and scheduler keys are present in the DeepSpeed config file. In this situation, those will be used and the user has to use accelerate.utils.DummyOptim and accelerate.utils.DummyScheduler to replace the PyTorch/Custom optimizers and schedulers in their code. Below is the snippet from examples/by_feature/deepspeed_with_config_support.py showing this:

      ```python
      # Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
      optimizer_cls = (
          torch.optim.AdamW
          if accelerator.state.deepspeed_plugin is None
          or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
          else DummyOptim
      )
      optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)

      # Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
      if (
          accelerator.state.deepspeed_plugin is None
          or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
      ):
          lr_scheduler = get_scheduler(
              name=args.lr_scheduler_type,
              optimizer=optimizer,
              num_warmup_steps=args.num_warmup_steps,
              num_training_steps=args.max_train_steps,
          )
      else:
          lr_scheduler = DummyScheduler(
              optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
          )
      ```

      b. Custom Optim + Custom Scheduler: The case when both optimizer and scheduler keys are absent in the DeepSpeed config file. In this situation, no code changes are needed from the user and this is the case when using integration via DeepSpeed Plugin. In the above example we can see that the code remains unchanged if the optimizer and scheduler keys are absent in the DeepSpeed config file.

      c. Custom Optim + DS Scheduler: The case when only the scheduler key is present in the DeepSpeed config file. In this situation, the user has to use accelerate.utils.DummyScheduler to replace the PyTorch/custom scheduler in their code (see the sketch after this list).

      d. DS Optim + Custom Scheduler: The case when only optimizer key is present in the DeepSpeed config file. This will result in an error because you can only use DS Scheduler when using DS Optim.
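      Returning to case (c), here is a minimal sketch of what that looks like in practice. It assumes the script is started with accelerate launch and a DeepSpeed config file that defines a scheduler but no optimizer; the model and hyperparameter values are placeholders:

      ```python
      # Minimal sketch of case (c): `scheduler` is present in the DeepSpeed config
      # file, `optimizer` is not. Placeholder model and hyperparameters; intended
      # to be run via `accelerate launch` with such a config file.
      import torch
      from accelerate import Accelerator
      from accelerate.utils import DummyScheduler

      accelerator = Accelerator()
      model = torch.nn.Linear(10, 10)  # placeholder model

      # Real (custom) optimizer, since `optimizer` is absent from the DeepSpeed config file
      optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

      # Dummy scheduler, since `scheduler` *is* defined in the DeepSpeed config file
      lr_scheduler = DummyScheduler(optimizer, total_num_steps=1000, warmup_num_steps=100)

      # In a real script the training dataloader would be prepared here as well
      model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)
      ```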

    2. Notice the auto values in the above example DeepSpeed config files. These are handled automatically by the prepare method based on the model, dataloaders, dummy optimizer, and dummy schedulers passed to it. Only the auto fields shown in the examples above are handled by prepare; the rest have to be specified explicitly by the user.

    The auto values are calculated as:

    • reduce_bucket_size: hidden_size * hidden_size
    • stage3_prefetch_bucket_size: int(0.9 * hidden_size * hidden_size)
    • stage3_param_persistence_threshold: 10 * hidden_size

    For the auto feature to work for these three config entries, Accelerate will use model.config.hidden_size or max(model.config.hidden_sizes) as hidden_size. If neither of these is available, launching will fail and you will have to set these three config entries manually. Remember that the first two entries are communication buffers: the larger they are, the more efficient the communication, but also the more GPU memory they consume, so this is a tunable performance trade-off.
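    For instance, for a model whose config reports hidden_size = 4096 (a hypothetical value used only for illustration), the formulas above resolve to:

    ```python
    # Worked example of the `auto` calculations above for a hypothetical model
    # with hidden_size = 4096.
    hidden_size = 4096
    reduce_bucket_size = hidden_size * hidden_size                        # 16_777_216
    stage3_prefetch_bucket_size = int(0.9 * hidden_size * hidden_size)    # 15_099_494
    stage3_param_persistence_threshold = 10 * hidden_size                 # 40_960
    ```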

    Things to note when using DeepSpeed Config File

    Below is a sample script using deepspeed_config_file in different scenarios.

    Code test.py:

    ```python
    from accelerate import Accelerator
    from accelerate.state import AcceleratorState


    def main():
        accelerator = Accelerator()
        accelerator.print(f"{AcceleratorState()}")


    if __name__ == "__main__":
        main()
    ```

    Scenario 1: A manually edited accelerate config file that contains deepspeed_config_file together with other DeepSpeed entries.

    1. Content of the accelerate config:
    ```yaml
    command_file: null
    commands: null
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      gradient_accumulation_steps: 1
      gradient_clipping: 1.0
      offload_optimizer_device: 'cpu'
      offload_param_device: 'cpu'
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
      deepspeed_config_file: 'ds_config.json'
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    gpu_ids: null
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    megatron_lm_config: {}
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_name: null
    tpu_zone: null
    use_cpu: false
    ```
    2. ds_config.json:
    ```json
    {
        "bf16": {
            "enabled": true
        },
        "zero_optimization": {
            "stage": 3,
            "stage3_gather_16bit_weights_on_model_save": false,
            "offload_optimizer": {
                "device": "none"
            },
            "offload_param": {
                "device": "none"
            }
        },
        "gradient_clipping": 1.0,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": 10,
        "steps_per_print": 2000000
    }
    ```
    3. Output of accelerate launch test.py:
    ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be ignored: ['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 'offload_optimizer_device', 'offload_param_device', 'zero3_save_16bit_model', 'mixed_precision']. Please specify them appropriately in the DeepSpeed config file. If you are using an accelerate config file, remove other config variables mentioned in the above specified list. The easiest method is to create a new config following the questionnaire via `accelerate config`. It will only ask for the necessary config variables when using `deepspeed_config_file`.

    Scenario 2: Use the solution suggested by the error message to create a new accelerate config and check that no ambiguity error is thrown.

    1. Run accelerate config:
    ```
    $ accelerate config
    -------------------------------------------------------------------------------
    In which compute environment are you running?
    This machine
    -------------------------------------------------------------------------------
    Which type of machine are you using?
    multi-GPU
    How many different machines will you use (use more than 1 for multi-node training)? [1]:
    Do you wish to optimize your script with torch dynamo?[yes/NO]:
    Do you want to use DeepSpeed? [yes/NO]: yes
    Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
    Please enter the path to the json DeepSpeed config file: ds_config.json
    Do you want to enable deepspeed.zero.Init when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
    How many GPU(s) should be used for distributed training? [1]:4
    accelerate configuration saved at ds_config_sample.yaml
    ```

    
    2. Content of the `accelerate` config:
    
    ```yaml
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      deepspeed_config_file: ds_config.json
      zero3_init_flag: true
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    num_machines: 1
    num_processes: 4
    rdzv_backend: static
    same_network: true
    use_cpu: false
    ```

    3. Output of accelerate launch test.py:
    ```
    Distributed environment: DEEPSPEED  Backend: nccl
    Num processes: 4
    Process index: 0
    Local process index: 0
    Device: cuda:0
    Mixed precision type: bf16
    ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': False, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 10, 'steps_per_print': inf, 'fp16': {'enabled': False}}
    ```

    Scenario 3: Setting the accelerate launch command arguments related to DeepSpeed to "auto" in the DeepSpeed configuration file and checking that things work as expected.

    1. New ds_config.json with "auto" for the accelerate launch DeepSpeed command arguments:
    ```json
    {
        "bf16": {
            "enabled": "auto"
        },
        "zero_optimization": {
            "stage": "auto",
            "stage3_gather_16bit_weights_on_model_save": "auto",
            "offload_optimizer": {
                "device": "auto"
            },
            "offload_param": {
                "device": "auto"
            }
        },
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "steps_per_print": 2000000
    }
    ```
    2. Output of accelerate launch --mixed_precision="fp16" --zero_stage=3 --gradient_accumulation_steps=5 --gradient_clipping=1.0 --offload_param_device="cpu" --offload_optimizer_device="nvme" --zero3_save_16bit_model="true" test.py:
    ```
    Distributed environment: DEEPSPEED  Backend: nccl
    Num processes: 4
    Process index: 0
    Local process index: 0
    Device: cuda:0
    Mixed precision type: fp16
    ds_config: {'bf16': {'enabled': False}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': True, 'offload_optimizer': {'device': 'nvme'}, 'offload_param': {'device': 'cpu'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 5, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
    ```

    Note:

    1. Remaining "auto" values are handled in the accelerator.prepare() call, as explained in point 2 of "Important code changes when using DeepSpeed Config File".
    2. Only when gradient_accumulation_steps is "auto" will the value passed when creating the Accelerator object via Accelerator(gradient_accumulation_steps=k) be used. When using the DeepSpeed Plugin, the value from the plugin is used and overrides the value passed when creating the Accelerator object.
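    As a small illustration of note 2, assuming a DeepSpeed config file where gradient_accumulation_steps is set to "auto" (the value 4 below is arbitrary):

    ```python
    # With "gradient_accumulation_steps": "auto" in the DeepSpeed config file, the
    # value passed here takes effect. With a DeepSpeed Plugin instead of a config
    # file, the plugin's own value would override it.
    from accelerate import Accelerator

    accelerator = Accelerator(gradient_accumulation_steps=4)
    ```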