base_model: meta-llama/Meta-Llama-3-8B
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: "Replete-AI/Sandevistan"
type: sharegpt
conversation: llama3
field_human: human
field_model: pneuma
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/out
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
wandb_project: bab687a65317f91849c9af35a924b78fe525ec37
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
pad_token:
Yes, there are a few issues and potential improvements in the provided configuration file:

1. Unusual dataset configuration: the `datasets` section includes fields like `conversation`, `field_human`, and `field_model`, which do not appear in the standard configuration examples in the OpenAccess-AI-Collective/axolotl codebase. If these fields are meant for a custom dataset processing script, make sure that script actually interprets them; otherwise they may be unnecessary or incorrectly named.

2. Token stored as plain text in `wandb_project`: the `wandb_project` field appears to contain a token or key (`bab687a65317f91849c9af35a924b78fe525ec37`). This field is normally the name of the Weights & Biases project. If the value is a sensitive token, it should not be committed in plain text; use environment variables or wandb's login mechanism to handle authentication securely.

3. Empty or unset fields: `wandb_entity`, `wandb_watch`, `wandb_name`, `wandb_log_model`, `early_stopping_patience`, `resume_from_checkpoint`, `xformers_attention`, `eval_table_size`, `deepspeed`, `fsdp`, and `fsdp_config` are left empty. If they are not needed for this run, consider removing them to keep the file tidy; if they are needed, set them explicitly.

4. `special_tokens` configuration: the `special_tokens` section lists `pad_token` without assigning it a value. If the model or tokenizer requires a padding token, specify it explicitly; otherwise drop the section.

5. Mixed precision (`bf16`, `fp16`, `tf32`): `bf16` is set to `auto`, `fp16` is empty, and `tf32` is `false`. Make sure this matches your hardware capabilities and training requirements; typically you pick one precision mode for your hardware (for example, `bf16` on Ampere-class GPUs or `fp16` on older NVIDIA GPUs with Tensor Cores).

6. Gradient accumulation and micro batch size: with `gradient_accumulation_steps: 8` and `micro_batch_size: 1`, the effective batch size per GPU is 8. Check that this matches your memory constraints and throughput targets; adjusting these two values can significantly change training speed and memory usage.

7. `load_in_8bit` and `load_in_4bit`: both are set to `false`. If you are exploring memory-efficient training techniques, these quantized loading modes are worth investigating; if your setup does not support them, leaving them as `false` is appropriate.

8. `gradient_checkpointing_kwargs`: this block only contains `use_reentrant: false`. If no other keyword arguments are needed for gradient checkpointing, you can either keep it as is or drop the block; just make sure any arguments you do need are included.

Review and adjust the configuration based on these observations and the specific requirements of your training setup and model architecture. A sketch of the relevant changes is shown below.
base_model: state-spaces/mamba-2.8b model_type: MambaLMHeadModel tokenizer_type: AutoTokenizer tokenizer_config: EleutherAI/gpt-neox-20b load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.0 output_dir: ./outputs/out sequence_len: 2048 sample_packing: false pad_to_sequence_len: false wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 1 num_epochs: 2 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 5e-5 train_on_inputs: false group_by_length: true bf16: auto fp16: tf32: true gradient_checkpointing: false early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: tokens: save_safetensors: False
base_model: meta-llama/Meta-Llama-3-8B model_type: LlamaForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: tatsu-lab/alpaca type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 8192 sample_packing: true pad_to_sequence_len: true wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 8 micro_batch_size: 1 num_epochs: 1 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 2e-5 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: false early_stopping_patience: resume_from_checkpoint: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 100 evals_per_epoch: 2 eval_table_size: saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: <|end_of_text|>
base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: lora_alpha: lora_dropout: lora_target_modules: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/openllama-out gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false float16: true bf16: false fp16: false tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
base_model: togethercomputer/RedPajama-INCITE-Chat-3B-v1 model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: load_in_8bit: false datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 8 lora_alpha: 16 lora_dropout: 0.05 lora_target_modules: - q_proj - v_proj lora_fan_in_fan_out: false wandb_project: redpajama-alpaca-3b wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/redpajama-alpaca-3b batch_size: 4 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0000002 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 5 xformers_attention: flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0001 fsdp: fsdp_config: tokens: pad_token: "<|padding|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>" unk_token: "<|endoftext|>"
base_model: tiiuae/falcon-7b trust_remote_code: true model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca:chat dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/falcon-7b batch_size: 2 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.00003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: true flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 40 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: "<|endoftext|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>"
base_model: EleutherAI/pythia-12b-deduped base_model_ignore_patterns: pytorch* # prefer safetensors model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false device_map: auto datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: 2048 lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: true # pythia/GPTNeoX lora specific wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/pythia-12b gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 5 learning_rate: 0.00003 optimizer: adamw_bnb_8bit lr_scheduler: cosine train_on_inputs: false group_by_length: false bf16: false fp16: false float16: true tf32: true flash_optimum: true early_stopping_patience: resume_from_checkpoint: local_rank: gradient_checkpointing: true fsdp: fsdp_config:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
IDX=$1 PROMPT_IDX=$((IDX % 25)) CLASS_IDX=$((IDX % 30)) # Define the UNIQUE_TOKEN, CLASS_TOKENs, and SUBJECT_NAMES UNIQUE_TOKEN="qwe" SUBJECT_NAMES=( "backpack" "backpack_dog" "bear_plushie" "berry_bowl" "can" "candle" "cat" "cat2" "clock" "colorful_sneaker" "dog" "dog2" "dog3" "dog5" "dog6" "dog7" "dog8" "duck_toy" "fancy_boot" "grey_sloth_plushie" "monster_toy" "pink_sunglasses" "poop_emoji" "rc_car" "red_cartoon" "robot_toy" "shiny_sneaker" "teapot" "vase" "wolf_plushie" ) CLASS_TOKENs=( "backpack" "backpack" "stuffed animal" "bowl" "can" "candle" "cat" "cat" "clock" "sneaker" "dog" "dog" "dog" "dog" "dog" "dog" "dog" "toy" "boot" "stuffed animal" "toy" "glasses" "toy" "toy" "cartoon" "toy" "sneaker" "teapot" "vase" "stuffed animal" ) CLASS_TOKEN=${CLASS_TOKENs[$CLASS_IDX]} SELECTED_SUBJECT=${SUBJECT_NAMES[$CLASS_IDX]} if [[ $CLASS_IDX =~ ^(0|1|2|3|4|5|8|9|17|18|19|20|21|22|23|24|25|26|27|28|29)$ ]]; then PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a wheat field in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a tree and autumn leaves in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with the Eiffel Tower in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating on top of water." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating in an ocean of milk." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of green grass with sunflowers around it." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a mirror." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of the sidewalk in a crowded street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a dirt road." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a white rug." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} with a wheat field in the background" "a ${CLASS_TOKEN} with a tree and autumn leaves in the background" "a ${CLASS_TOKEN} with the Eiffel Tower in the background" "a ${CLASS_TOKEN} floating on top of water" "a ${CLASS_TOKEN} floating in an ocean of milk" "a ${CLASS_TOKEN} on top of green grass with sunflowers around it" "a ${CLASS_TOKEN} on top of a mirror" "a ${CLASS_TOKEN} on top of the sidewalk in a crowded street" "a ${CLASS_TOKEN} on top of a dirt road" "a ${CLASS_TOKEN} on top of a white rug" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) else PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a red hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a santa hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a rainbow scarf." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a black top hat and a monocle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a chef outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a firefighter outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a police outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing pink glasses." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a yellow shirt." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a purple wizard outfit." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} wearing a red hat" "a ${CLASS_TOKEN} wearing a santa hat" "a ${CLASS_TOKEN} wearing a rainbow scarf" "a ${CLASS_TOKEN} wearing a black top hat and a monocle" "a ${CLASS_TOKEN} in a chef outfit" "a ${CLASS_TOKEN} in a firefighter outfit" "a ${CLASS_TOKEN} in a police outfit" "a ${CLASS_TOKEN} wearing pink glasses" "a ${CLASS_TOKEN} wearing a yellow shirt" "a ${CLASS_TOKEN} in a purple wizard outfit" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) fi VALIDATION_PROMPT=${PROMPT_LIST[@]} INSTANCE_PROMPT="a photo of ${UNIQUE_TOKEN} ${CLASS_TOKEN}" CLASS_PROMPT="a photo of ${CLASS_TOKEN}" export MODEL_NAME="stabilityai/stable-diffusion-2-1" # export MODEL_NAME="runwayml/stable-diffusion-v1-5" PEFT_TYPE="boft" BLOCK_NUM=8 BLOCK_SIZE=0 N_BUTTERFLY_FACTOR=1 export PROJECT_NAME="dreambooth_${PEFT_TYPE}" export RUN_NAME="${SELECTED_SUBJECT}_${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}" export INSTANCE_DIR="./data/dreambooth/dataset/${SELECTED_SUBJECT}" export CLASS_DIR="./data/class_data/${CLASS_TOKEN}" export OUTPUT_DIR="./data/output/${PEFT_TYPE}" accelerate launch train_dreambooth.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --class_data_dir="$CLASS_DIR" \ --output_dir=$OUTPUT_DIR \ --wandb_project_name=$PROJECT_NAME \ --wandb_run_name=$RUN_NAME \ --with_prior_preservation --prior_loss_weight=1.0 \ --instance_prompt="$INSTANCE_PROMPT" \ --validation_prompt="$VALIDATION_PROMPT" \ --class_prompt="$CLASS_PROMPT" \ --resolution=512 \ --train_batch_size=1 \ --num_dataloader_workers=2 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --num_class_images=200 \ --use_boft \ --boft_block_num=$BLOCK_NUM \ --boft_block_size=$BLOCK_SIZE \ --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \ --boft_dropout=0.1 \ --boft_bias="boft_only" \ --learning_rate=3e-5 \ --max_train_steps=1010 \ --checkpointing_steps=200 \ --validation_steps=200 \ --enable_xformers_memory_efficient_attention \ --report_to="wandb" \
SPECIAL_CASES_TO_ALLOW = { # 'max_position_embeddings' is not used in modeling file, but needed for eval frameworks like Huggingface's lighteval (https://github.com/huggingface/lighteval/blob/af24080ea4f16eaf1683e353042a2dfc9099f038/src/lighteval/models/base_model.py#L264). # periods and offsers are not used in modeling file, but used in the configuration file to define `layers_block_type` and `layers_num_experts`. "JambaConfig": [ "max_position_embeddings", "attn_layer_offset", "attn_layer_period", "expert_layer_offset", "expert_layer_period", ], # used to compute the property `self.chunk_length` "EncodecConfig": ["overlap"], # used to compute the property `self.layers_block_type` "RecurrentGemmaConfig": ["block_types"], # used as in the config to define `intermediate_size` "MambaConfig": ["expand"], # used as `self.bert_model = BertModel(config, ...)` "DPRConfig": True, "FuyuConfig": True, # not used in modeling files, but it's an important information "FSMTConfig": ["langs"], # used internally in the configuration class file "GPTNeoConfig": ["attention_types"], # used internally in the configuration class file "EsmConfig": ["is_folding_model"], # used during training (despite we don't have training script for these models yet) "Mask2FormerConfig": ["ignore_value"], # `ignore_value` used during training (despite we don't have training script for these models yet) # `norm` used in conversion script (despite not using in the modeling file) "OneFormerConfig": ["ignore_value", "norm"], # used during preprocessing and collation, see `collating_graphormer.py` "GraphormerConfig": ["spatial_pos_max"], # used internally in the configuration class file "T5Config": ["feed_forward_proj"], # used internally in the configuration class file # `tokenizer_class` get default value `T5Tokenizer` intentionally "MT5Config": ["feed_forward_proj", "tokenizer_class"], "UMT5Config": ["feed_forward_proj", "tokenizer_class"], # used internally in the configuration class file "LongT5Config": ["feed_forward_proj"], # used internally in the configuration class file "Pop2PianoConfig": ["feed_forward_proj"], # used internally in the configuration class file "SwitchTransformersConfig": ["feed_forward_proj"], # having default values other than `1e-5` - we can't fix them without breaking "BioGptConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "GLPNConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "SegformerConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "CvtConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "PerceiverConfig": ["layer_norm_eps"], # used internally to calculate the feature size "InformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "TimeSeriesTransformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "AutoformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate `mlp_dim` "SamVisionConfig": ["mlp_ratio"], # For (head) training, but so far not implemented "ClapAudioConfig": ["num_classes"], # Not used, but providing useful information to users "SpeechT5HifiGanConfig": ["sampling_rate"], # used internally in the configuration class file "UdopConfig": ["feed_forward_proj"], # Actually used in the config or generation 
config, in that case necessary for the sub-components generation "SeamlessM4TConfig": [ "max_new_tokens", "t2u_max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", ], # Actually used in the config or generation config, in that case necessary for the sub-components generation "SeamlessM4Tv2Config": [ "max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", "t2u_variance_pred_dropout", "t2u_variance_predictor_embed_dim", "t2u_variance_predictor_hidden_dim", "t2u_variance_predictor_kernel_size", ], }
from datasets import load_dataset
from transformers import set_seed, AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_model, MultitaskPromptTuningConfig, TaskType, MultitaskPromptTuningInit
set_seed(42)
model_name = "google/flan-t5-base"
peft_config = MultitaskPromptTuningConfig(
    tokenizer_name_or_path=model_name,
    num_tasks=2,
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=MultitaskPromptTuningInit.TEXT,
    num_virtual_tokens=50,
    num_transformer_submodules=1,
    prompt_tuning_init_text="classify the following into either positive or negative, or entailment, neutral or contradiction:",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = get_peft_model(model, peft_config)
model = model.cuda()
def send_to_device(batch):
    for i in batch:
        batch[i] = batch[i].cuda()
    return batch
def get_sst2(split: str):
    examples = load_dataset("sst2")[split]
    result_examples = []
    for example in examples:
        result_examples.append({})
        result_examples[-1]["input"] = example["sentence"].strip() + "</s>"
        result_examples[-1]["output"] = (
            f"positive{tokenizer.eos_token}" if example["label"] == 1 else f"negative{tokenizer.eos_token}"
        )
        result_examples[-1]["task_id"] = 0
    return result_examples
def get_mnli(split: str):
    examples = load_dataset("multi_nli")[split]
    result_examples = []
    for example in examples:
        result_examples.append({})
        result_examples[-1]["input"] = example["premise"].strip() + " " + example["hypothesis"].strip() + "</s>"
        if example["label"] == 0:
            result_examples[-1]["output"] = f"entailment{tokenizer.eos_token}"
        elif example["label"] == 1:
            result_examples[-1]["output"] = f"neutral{tokenizer.eos_token}"
        else:
            result_examples[-1]["output"] = f"contradiction{tokenizer.eos_token}"
        result_examples[-1]["task_id"] = 1
    return result_examples
from typing import Tuple

from torch.utils.data import Dataset, DataLoader
import torch
class MyDataset(Dataset):
    def __init__(self, split: str, mode: str = "source") -> None:
        super().__init__()
        if split == "train":
            if mode == "source":
                self.examples = get_sst2(split) + get_mnli(split)
            elif mode == "target":
                self.examples = get_sst2(split)
        if split == "val":
            self.examples = get_sst2("validation")
        if split == "test":
            self.examples = get_sst2("validation")

    def __getitem__(self, index) -> dict:
        return self.examples[index]

    def __len__(self) -> int:
        return len(self.examples)
def collate_fn(batch: list) -> dict:
    input = [i["input"] for i in batch]
    input = tokenizer(input, add_special_tokens=False, return_tensors="pt", padding=True)
    output = [i["output"] for i in batch]
    output = tokenizer(output, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
    output[output == tokenizer.pad_token_id] = -100
    task_ids = [i["task_id"] for i in batch]
    task_ids = torch.tensor(task_ids)
    return {
        "input_ids": input.input_ids,
        "attention_mask": input.attention_mask,
        "labels": output,
        "task_ids": task_ids,
    }
train = DataLoader(MyDataset("train"), shuffle=True, batch_size=8, collate_fn=collate_fn)
val = DataLoader(MyDataset("val"), shuffle=False, batch_size=8, collate_fn=collate_fn)
test = DataLoader(MyDataset("test"), shuffle=False, batch_size=8, collate_fn=collate_fn)
## source training
from torch.optim.adamw import AdamW
from transformers import get_cosine_schedule_with_warmup
from tqdm import tqdm
from sklearn.metrics import f1_score
POSITIVE_TOKEN_ID = tokenizer(" positive", add_special_tokens=False)["input_ids"][0]
NEGATIVE_TOKEN_ID = tokenizer(" negative", add_special_tokens=False)["input_ids"][0]
def classify(batch):
    batch = send_to_device(batch)
    # we pass labels here since we need to generate and peft doesn't support generation yet.
    # No clue how to get around this
    scores = model(**batch).logits
    preds = []
    for i in range(scores.shape[0]):
        if scores[i, 0, POSITIVE_TOKEN_ID] > scores[i, 0, NEGATIVE_TOKEN_ID]:
            preds.append(POSITIVE_TOKEN_ID)
        else:
            preds.append(NEGATIVE_TOKEN_ID)
    return preds
@torch.inference_mode()
def evaluate(model, data):
    loss = 0
    preds = []
    golds = []
    for batch in tqdm(data):
        batch = send_to_device(batch)
        loss += model(**batch).loss
        golds.extend(batch["labels"][:, 0].tolist())
        preds.extend(classify(batch))
    # average the loss over the dataloader being evaluated (not over `val` specifically)
    return loss / len(data), f1_score(golds, preds, pos_label=POSITIVE_TOKEN_ID)
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))
n = 1000
step = 0
train_ = tqdm(train)
val_loss, f1 = evaluate(model, val)
print(
    f"""
before source training
val loss = {val_loss}
f1 = {f1}"""
)
for batch in train_:
    if step % n == 0:
        val_loss, f1 = evaluate(model, val)
        print(
            f"""
step = {step}
val loss = {val_loss}
f1 = {f1}"""
        )
        model.save_pretrained(f"checkpoints_source/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()  # reset gradients after each update so they do not accumulate across steps
    train_.set_postfix(train_loss=loss.item())
## target training
train = DataLoader(MyDataset("train", "target"), shuffle=True, batch_size=8, collate_fn=collate_fn)
val = DataLoader(MyDataset("val", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn)
test = DataLoader(MyDataset("test", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn)
#### create a fresh model
peft_config = MultitaskPromptTuningConfig(
    tokenizer_name_or_path=model_name,
    num_tasks=1,
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=MultitaskPromptTuningInit.EXACT_SOURCE_TASK,
    prompt_tuning_init_state_dict_path="checkpoints_source/50000/adapter_model.bin",
    num_virtual_tokens=50,
    num_transformer_submodules=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = get_peft_model(model, peft_config)
model = model.cuda()
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))
n = 1000
step = 0
train_ = tqdm(train)
val_loss, f1 = evaluate(model, val)
print(
    f"""
before target training
val loss = {val_loss}
f1 = {f1}"""
)
for batch in train_:
    if step % n == 0:
        val_loss, f1 = evaluate(model, val)
        print(
            f"""
step = {step}
val loss = {val_loss}
f1 = {f1}"""
        )
        model.save_pretrained(f"checkpoints_target/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()  # reset gradients after each update
    train_.set_postfix(train_loss=loss.item())
from peft import set_peft_model_state_dict
sd_6000 = torch.load("checkpoints_target/6000/adapter_model.bin")
set_peft_model_state_dict(model, sd_6000)
val_loss, f1 = evaluate(model, val)
print(
    f"""
final
val loss = {val_loss}
f1 = {f1}"""
)
test_loss, f1 = evaluate(model, test)
print(
    f"""
final
test loss = {test_loss}
f1 = {f1}"""
)
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
def parse_llama_config(megatron_lm_plugin, model, batch_data): model_type_name = "gpt" num_layers = model.config.num_hidden_layers pretraining_flag = True hidden_size = model.config.hidden_size num_attention_heads = model.config.num_attention_heads orig_vocab_size = model.config.vocab_size max_position_embeddings = model.config.max_position_embeddings seq_length = getattr(model.config, "max_sequence_length", None) if megatron_lm_plugin.seq_length is None: if seq_length is not None: megatron_lm_plugin.seq_length = seq_length elif megatron_lm_plugin.decoder_seq_length is not None: megatron_lm_plugin.seq_length = megatron_lm_plugin.decoder_seq_length elif batch_data is not None: megatron_lm_plugin.seq_length = batch_data["input_ids"].shape[1] else: megatron_lm_plugin.seq_length = max_position_embeddings megatron_lm_plugin.megatron_lm_default_args["return_logits"] = megatron_lm_plugin.return_logits megatron_lm_plugin.megatron_lm_default_args["tokenizer_type"] = "Llama2Tokenizer" megatron_lm_plugin.megatron_lm_default_args["model_type_name"] = model_type_name megatron_lm_plugin.megatron_lm_default_args["num_layers"] = num_layers megatron_lm_plugin.megatron_lm_default_args["pretraining_flag"] = pretraining_flag megatron_lm_plugin.megatron_lm_default_args["hidden_size"] = hidden_size megatron_lm_plugin.megatron_lm_default_args["num_attention_heads"] = num_attention_heads megatron_lm_plugin.megatron_lm_default_args["orig_vocab_size"] = orig_vocab_size megatron_lm_plugin.megatron_lm_default_args["max_position_embeddings"] = max_position_embeddings megatron_lm_plugin.megatron_lm_default_args["seq_length"] = megatron_lm_plugin.seq_length megatron_lm_plugin.megatron_lm_default_args["model_return_dict"] = model.config.return_dict
continue with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps }" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss # New Code # For Megatron-LM, the losses are already averaged across the data parallel group if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses.append(loss) else: losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) try: if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses = torch.tensor(losses) else: losses = torch.cat(losses) eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) # this is causing some issue with Megatron-LM when using `wandb` at the end of the main function. # Everything works fine inspite of commenting this out. (wandb finishes/closes the run without error) # if args.with_tracking: # accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() # New Code # For Megatron-LM, we need to save the model using `accelerator.save_state` if accelerator.distributed_type == DistributedType.MEGATRON_LM: accelerator.save_state(args.output_dir) else: unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
Important features are directly supported via the `accelerate config` command. An example of the corresponding questions for using Megatron-LM features is shown below:
:~$ accelerate config --config_file "megatron_gpt_config.yaml"
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]: yes
What is the Tensor Parallelism degree/size? [1]:2
Do you want to enable Sequence Parallelism? [YES/no]:
What is the Pipeline Parallelism degree/size? [1]:2
What is the number of micro-batches? [1]:2
Do you want to enable selective activation recomputation? [YES/no]:
Do you want to use distributed optimizer which shards optimizer state and gradients across data parallel ranks? [YES/no]:
What is the gradient clipping value based on global L2 Norm (0 to disable)? [1.0]:
How many GPU(s) should be used for distributed training? [1]:4
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16
The resulting config is shown below:
~$ cat megatron_gpt_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config:
megatron_lm_gradient_clipping: 1.0
megatron_lm_num_micro_batches: 2
megatron_lm_pp_degree: 2
megatron_lm_recompute_activations: true
megatron_lm_sequence_parallelism: true
megatron_lm_tp_degree: 2
megatron_lm_use_distributed_optimizer: true
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
We will take the example of GPT pre-training. The minimal changes required to the official `run_clm_no_trainer.py` to use Megatron-LM are as follows:

1. Since Megatron-LM uses its own optimizer implementation, the matching scheduler must be used; replace the regular learning-rate scheduler with `accelerate.utils.MegatronLMDummyScheduler`. An example is given below:

from accelerate.utils import MegatronLMDummyScheduler

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    lr_scheduler = MegatronLMDummyScheduler(
        optimizer=optimizer,
        total_num_steps=args.max_train_steps,
        warmup_num_steps=args.num_warmup_steps,
    )
else:
    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps,
        num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
    )
2. The effective total batch size must account for the tensor and pipeline parallel setup, so read it from the Megatron-LM plugin's global batch size:

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    total_batch_size = accelerator.state.megatron_lm_plugin.global_batch_size
else:
    total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
3. When gathering evaluation losses, Megatron-LM has already averaged the loss across the data parallel group, so the loss can be appended directly instead of being gathered across processes:

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    losses.append(loss)
else:
    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    losses = torch.tensor(losses)
else:
    losses = torch.cat(losses)
4. When using Megatron-LM, save the model with `accelerator.save_state` instead of unwrapping it and calling `save_pretrained`:

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    accelerator.save_state(args.output_dir)
else:
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save
    )
That's it! We are good to go 🚀. Please find the example script in the examples folder at the path `accelerate/examples/by_feature/megatron_lm_gpt_pretraining.py`. Let's run it for the `gpt-large` model architecture using 4 A100-80GB GPUs.
accelerate launch --config_file megatron_gpt_config.yaml \
examples/by_feature/megatron_lm_gpt_pretraining.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--block_size 1024 \
--learning_rate 5e-5 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--num_train_epochs 5 \
--with_tracking \
--report_to "wandb" \
--output_dir "awesome_model"
Below are some important excerpts from the output logs:
Loading extension module fused_dense_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 3.569 seconds
 > padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
Building gpt model in the pre-training mode.
The Megatron LM model weights are initialized at random in `accelerator.prepare`. Please use `accelerator.load_checkpoint` to load a pre-trained checkpoint matching the distributed setup.
Preparing dataloader
Preparing dataloader
Preparing model
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 210753280
 > number of parameters on (tensor, pipeline) model parallel rank (1, 1): 209445120
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 210753280
 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 209445120
Preparing optimizer
Preparing scheduler
> learning rate decay style: linear
10/10/2022 22:57:22 - INFO - __main__ - ***** Running training *****
10/10/2022 22:57:22 - INFO - __main__ - Num examples = 2318
10/10/2022 22:57:22 - INFO - __main__ - Num Epochs = 5
10/10/2022 22:57:22 - INFO - __main__ - Instantaneous batch size per device = 24
10/10/2022 22:57:22 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 48
10/10/2022 22:57:22 - INFO - __main__ - Gradient Accumulation steps = 1
10/10/2022 22:57:22 - INFO - __main__ - Total optimization steps = 245
 20%|████████████▍ | 49/245 [01:04<04:09, 1.27s/it]
10/10/2022 22:58:29 - INFO - __main__ - epoch 0: perplexity: 1222.1594275215962 eval_loss: 7.10837459564209
 40%|████████████████████████▊ | 98/245 [02:10<03:07, 1.28s/it]
10/10/2022 22:59:35 - INFO - __main__ - epoch 1: perplexity: 894.5236583794557 eval_loss: 6.796291351318359
 60%|████████████████████████████████████▌ | 147/245 [03:16<02:05, 1.28s/it]
10/10/2022 23:00:40 - INFO - __main__ - epoch 2: perplexity: 702.8458788508042 eval_loss: 6.555137634277344
 80%|████████████████████████████████████████████████▊ | 196/245 [04:22<01:02, 1.28s/it]
10/10/2022 23:01:46 - INFO - __main__ - epoch 3: perplexity: 600.3220028695281 eval_loss: 6.39746618270874
100%|█████████████████████████████████████████████████████████████| 245/245 [05:27<00:00, 1.28s/it]
There are a large number of other options/features that one can set using `accelerate.utils.MegatronLMPlugin`.
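As a rough, hedged sketch (not taken verbatim from the documentation above), such a plugin can also be constructed in code and handed to the `Accelerator`. The keyword names below are assumptions that mirror the `megatron_lm_*` config keys shown earlier, and the `megatron_lm_plugin` argument to `Accelerator` is likewise an assumption; verify both against the `MegatronLMPlugin` API reference for your installed accelerate version.

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Hypothetical sketch: these keyword names mirror the megatron_lm_tp_degree,
# megatron_lm_pp_degree, megatron_lm_num_micro_batches and
# megatron_lm_gradient_clipping keys from the config above -- check them
# against your accelerate version before relying on them.
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,
    pp_degree=2,
    num_micro_batches=2,
    gradient_clipping=1.0,
)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)  # assumed keyword argument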
def parse_args(): parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task") parser.add_argument( "--dataset_name", type=str, default=None, help="The name of the dataset to use (via the datasets library).", ) parser.add_argument( "--dataset_config_name", type=str, default=None, help="The configuration name of the dataset to use (via the datasets library).", ) parser.add_argument( "--train_file", type=str, default=None, help="A csv or a json file containing the training data." ) parser.add_argument( "--validation_file", type=str, default=None, help="A csv or a json file containing the validation data." ) parser.add_argument( "--validation_split_percentage", default=5, help="The percentage of the train set used as validation set in case there's no validation split", ) parser.add_argument( "--model_name_or_path", type=str, help="Path to pretrained model or model identifier from huggingface.co/models.", required=False, ) parser.add_argument( "--config_name", type=str, default=None, help="Pretrained config name or path if not the same as model_name", ) parser.add_argument( "--tokenizer_name", type=str, default=None, help="Pretrained tokenizer name or path if not the same as model_name", ) parser.add_argument( "--use_slow_tokenizer", action="store_true", help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", ) parser.add_argument( "--per_device_train_batch_size", type=int, default=8, help="Batch size (per device) for the training dataloader.", ) parser.add_argument( "--per_device_eval_batch_size", type=int, default=8, help="Batch size (per device) for the evaluation dataloader.", ) parser.add_argument( "--learning_rate", type=float, default=5e-5, help="Initial learning rate (after the potential warmup period) to use.", ) parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.") parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") parser.add_argument( "--max_train_steps", type=int, default=None, help="Total number of training steps to perform. If provided, overrides num_train_epochs.", ) parser.add_argument( "--gradient_accumulation_steps", type=int, default=1, help="Number of updates steps to accumulate before performing a backward/update pass.", ) parser.add_argument( "--lr_scheduler_type", type=SchedulerType, default="linear", help="The scheduler type to use.", choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], ) parser.add_argument( "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler." ) parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.") parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.") parser.add_argument( "--model_type", type=str, default=None, help="Model type to use if training from scratch.", choices=MODEL_TYPES, ) parser.add_argument( "--block_size", type=int, default=None, help=( "Optional input sequence length after tokenization. The training dataset will be truncated in block of" " this size for training. Default to the model max input length for single sentence inputs (take into" " account special tokens)." 
), ) parser.add_argument( "--preprocessing_num_workers", type=int, default=None, help="The number of processes to use for the preprocessing.", ) parser.add_argument( "--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets" ) parser.add_argument( "--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files." ) parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.") parser.add_argument( "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") parser.add_argument( "--checkpointing_steps", type=str, default=None, help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.", ) parser.add_argument( "--resume_from_checkpoint", type=str, default=None, help="If the training should continue from a checkpoint folder.", ) # New Code # # Whether to load the best model at the end of training parser.add_argument( "--load_best_model", action="store_true", help="Whether to load the best model at the end of training", ) parser.add_argument( "--with_tracking", action="store_true", help="Whether to enable experiment trackers for logging.", ) parser.add_argument( "--report_to", type=str, default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' ' `"wandb"`, `"comet_ml"`, and `"dvclive"`. Use `"all"` (default) to report to all integrations.' "Only applicable when `--with_tracking` is passed." ), ) args = parser.parse_args() # Sanity checks if args.dataset_name is None and args.train_file is None and args.validation_file is None: raise ValueError("Need either a dataset name or a training/validation file.") else: if args.train_file is not None: extension = args.train_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file." if args.validation_file is not None: extension = args.validation_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file." if args.push_to_hub: assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." return args
def get_cluster_input(): distributed_type = _ask_options( "Which type of machine are you using?", [ "No distributed training", "multi-CPU", "multi-XPU", "multi-GPU", "multi-NPU", "multi-MLU", "multi-MUSA", "TPU", ], _convert_distributed_mode, ) machine_rank = 0 num_machines = 1 num_processes = 1 gpu_ids = None main_process_ip = None main_process_port = None rdzv_backend = "static" same_network = True debug = False if distributed_type in [ DistributedType.MULTI_GPU, DistributedType.MULTI_MLU, DistributedType.MULTI_MUSA, DistributedType.MULTI_NPU, DistributedType.MULTI_XPU, DistributedType.MULTI_CPU, ]: num_machines = _ask_field( "How many different machines will you use (use more than 1 for multi-node training)? [1]: ", int, default=1, ) if num_machines > 1: machine_rank = _ask_options( "What is the rank of this machine?", list(range(num_machines)), int, ) main_process_ip = _ask_field( "What is the IP address of the machine that will host the main process? ", ) main_process_port = _ask_field( "What is the port you will use to communicate with the main process? ", int, ) same_network = _ask_field( "Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: ", _convert_yes_no_to_bool, default=True, error_message="Please enter yes or no.", ) if not same_network: rdzv_backend = _ask_field( "What rendezvous backend will you use? ('static', 'c10d', ...): ", default="static" ) debug = _ask_field( "Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if distributed_type == DistributedType.NO: use_cpu = _ask_field( "Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) elif distributed_type == DistributedType.MULTI_CPU: use_cpu = True else: use_cpu = False ipex_config = {} mpirun_config = {} if use_cpu: ipex_config["ipex"] = _ask_field( "Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if distributed_type == DistributedType.MULTI_CPU: use_mpirun = _ask_field( "Do you want accelerate to launch mpirun? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_mpirun: mpirun_hostfile = _ask_field( "Please enter the path to the hostfile to use with mpirun [~/hostfile]: ", str, default="~/hostfile", ) mpirun_config["mpirun_hostfile"] = os.path.expanduser(mpirun_hostfile.strip()) mpirun_config["mpirun_ccl"] = _ask_field("Enter the number of oneCCL worker threads [1]: ", default=1) if ( not use_cpu and is_xpu_available() and distributed_type not in [ DistributedType.MULTI_GPU, DistributedType.MULTI_NPU, DistributedType.MULTI_MLU, DistributedType.XLA, DistributedType.MULTI_MUSA, ] ): ipex_config["use_xpu"] = _ask_field( "Do you want to use XPU plugin to speed up training on XPU? 
[yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) dynamo_config = {} use_dynamo = _ask_field( "Do you wish to optimize your script with torch dynamo?[yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_dynamo: prefix = "dynamo_" dynamo_config[prefix + "backend"] = _ask_options( "Which dynamo backend would you like to use?", [x.lower() for x in DYNAMO_BACKENDS], _convert_dynamo_backend, default=2, ) use_custom_options = _ask_field( "Do you want to customize the defaults sent to torch.compile? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_custom_options: dynamo_config[prefix + "mode"] = _ask_options( "Which mode do you want to use?", TORCH_DYNAMO_MODES, lambda x: TORCH_DYNAMO_MODES[int(x)], default=0, ) dynamo_config[prefix + "use_fullgraph"] = _ask_field( "Do you want the fullgraph mode or it is ok to break model into several subgraphs? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) dynamo_config[prefix + "use_dynamic"] = _ask_field( "Do you want to enable dynamic shape tracing? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) use_mps = not use_cpu and is_mps_available() deepspeed_config = {} if ( distributed_type in [ DistributedType.MULTI_GPU, DistributedType.MULTI_XPU, DistributedType.MULTI_NPU, DistributedType.MULTI_MLU, DistributedType.MULTI_MUSA, DistributedType.NO, ] and not use_mps ): use_deepspeed = _ask_field( "Do you want to use DeepSpeed? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_deepspeed: distributed_type = DistributedType.DEEPSPEED assert ( is_deepspeed_available() ), "DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source" if distributed_type == DistributedType.DEEPSPEED: use_deepspeed_config = _ask_field( "Do you want to specify a json file to a DeepSpeed config? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_deepspeed_config: deepspeed_config["deepspeed_config_file"] = _ask_field( "Please enter the path to the json DeepSpeed config file: ", str, default="none", ) else: deepspeed_config["zero_stage"] = _ask_options( "What should be your DeepSpeed's ZeRO optimization stage?", [0, 1, 2, 3], int, default=2, ) deepspeed_devices = ["none", "cpu", "nvme"] if deepspeed_config["zero_stage"] >= 2: deepspeed_config["offload_optimizer_device"] = _ask_options( "Where to offload optimizer states?", deepspeed_devices, lambda x: deepspeed_devices[int(x)] ) deepspeed_config["offload_param_device"] = _ask_options( "Where to offload parameters?", deepspeed_devices, lambda x: deepspeed_devices[int(x)] ) if deepspeed_config["offload_param_device"] == "nvme": deepspeed_config["offload_param_nvme_path"] = _ask_field( "Nvme Path to offload parameters?", str, default="/nvme", ) if deepspeed_config["offload_optimizer_device"] == "nvme": deepspeed_config["offload_optimizer_nvme_path"] = _ask_field( "Nvme Path to offload optimizer states?", str, default="/nvme", ) deepspeed_config["gradient_accumulation_steps"] = _ask_field( "How many gradient accumulation steps you're passing in your script? [1]: ", int, default=1, ) use_gradient_clipping = _ask_field( "Do you want to use gradient clipping? 
[yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_gradient_clipping: deepspeed_config["gradient_clipping"] = _ask_field( "What is the gradient clipping value? [1.0]: ", float, default=1.0, ) if deepspeed_config["zero_stage"] == 3: deepspeed_config["zero3_save_16bit_model"] = _ask_field( "Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) deepspeed_config["zero3_init_flag"] = _ask_field( "Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if deepspeed_config["zero3_init_flag"]: if not is_transformers_available(): raise Exception( "When `zero3_init_flag` is set, it requires Transformers to be installed. " "Please run `pip3 install transformers`." ) use_moe = _ask_field( "Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_moe: deepspeed_config["deepspeed_moe_layer_cls_names"] = _ask_field( "Specify the comma-separated list of transformers MoE layer class names (case-sensitive), e.g : " " `MixtralSparseMoeBlock`, `Qwen2MoeSparseMoeBlock`, `JetMoEAttention,JetMoEBlock` ... : ", str, ) if num_machines > 1: launcher_query = "Which Type of launcher do you want to use?" deepspeed_config["deepspeed_multinode_launcher"] = _ask_options( launcher_query, DEEPSPEED_MULTINODE_LAUNCHERS, lambda x: DEEPSPEED_MULTINODE_LAUNCHERS[int(x)], ) if deepspeed_config["deepspeed_multinode_launcher"] != DEEPSPEED_MULTINODE_LAUNCHERS[1]: deepspeed_config["deepspeed_hostfile"] = _ask_field( "DeepSpeed configures multi-node compute resources with hostfile. " "Each row is of the format `hostname slots=[num_gpus]`, e.g., `localhost slots=2`; " "for more information please refer official [documentation]" "(https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). " "Please specify the location of hostfile: ", str, ) is_exclusion_filter = _ask_field( "Do you want to specify exclusion filter string? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if is_exclusion_filter: deepspeed_config["deepspeed_exclusion_filter"] = _ask_field( "DeepSpeed exclusion filter string: ", str, ) is_inclusion_filter = _ask_field( "Do you want to specify inclusion filter string? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if is_inclusion_filter: deepspeed_config["deepspeed_inclusion_filter"] = _ask_field( "DeepSpeed inclusion filter string: ", str, ) fsdp_config = {} if distributed_type in [ DistributedType.MULTI_GPU, DistributedType.MULTI_NPU, DistributedType.MULTI_MLU, DistributedType.MULTI_MUSA, DistributedType.MULTI_XPU, ]: use_fsdp = _ask_field( "Do you want to use FullyShardedDataParallel? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_fsdp: distributed_type = DistributedType.FSDP if distributed_type == DistributedType.FSDP: sharding_strategy_query = "What should be your sharding strategy?" fsdp_config["fsdp_sharding_strategy"] = _ask_options( sharding_strategy_query, FSDP_SHARDING_STRATEGY, lambda x: FSDP_SHARDING_STRATEGY[int(x)], ) fsdp_config["fsdp_offload_params"] = _ask_field( "Do you want to offload parameters and gradients to CPU? 
[yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) fsdp_wrap_query = "What should be your auto wrap policy?" fsdp_config["fsdp_auto_wrap_policy"] = _ask_options( fsdp_wrap_query, FSDP_AUTO_WRAP_POLICY, lambda x: FSDP_AUTO_WRAP_POLICY[int(x)], ) if fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[0]: use_no_split_modules = _ask_field( "Do you want to use the model's `_no_split_modules` to wrap. Only applicable for 🤗 Transformers [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if not use_no_split_modules: fsdp_config["fsdp_transformer_layer_cls_to_wrap"] = _ask_field( "Specify the comma-separated list of transformer layer class names (case-sensitive) to wrap ,e.g, :" "`BertLayer`, `GPTJBlock`, `T5Block`, `BertLayer,BertEmbeddings,BertSelfOutput` ...? : ", str, ) elif fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[1]: fsdp_config["fsdp_min_num_params"] = _ask_field( "What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ", int, default=100000000, ) fsdp_backward_prefetch_query = "What should be your FSDP's backward prefetch policy?" fsdp_config["fsdp_backward_prefetch"] = _ask_options( fsdp_backward_prefetch_query, FSDP_BACKWARD_PREFETCH, lambda x: FSDP_BACKWARD_PREFETCH[int(x)],