base_model: meta-llama/Meta-Llama-3-8B
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: "Replete-AI/Sandevistan"
type: sharegpt
conversation: llama3
field_human: human
field_model: pneuma
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/out
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
wandb_project: bab687a65317f91849c9af35a924b78fe525ec37
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
pad_token:
Yes, there are a few issues and potential improvements in the provided configuration file:

1. Unusual dataset configuration: the `datasets` section includes fields like `conversation`, `field_human`, and `field_model`, which do not appear in the standard configuration examples in the OpenAccess-AI-Collective/axolotl codebase. If these fields are meant for a custom dataset processing script, make sure that script actually interprets them; otherwise they may be unnecessary or incorrectly named.

2. Token stored as plain text in `wandb_project`: the `wandb_project` field appears to contain a token or key (`bab687a65317f91849c9af35a924b78fe525ec37`). This field is normally the name of the Weights & Biases project. If the value is a sensitive token, it should not be committed in plain text; use environment variables or wandb's login mechanism to handle authentication securely.

3. Empty or unset fields: `wandb_entity`, `wandb_watch`, `wandb_name`, `wandb_log_model`, `early_stopping_patience`, `resume_from_checkpoint`, `xformers_attention`, `eval_table_size`, `deepspeed`, `fsdp`, and `fsdp_config` are left empty. If they are not needed for this run, consider removing them to keep the file tidy; if they are needed, set them explicitly.

4. `special_tokens` configuration: the `special_tokens` section lists `pad_token` without assigning it a value. If the model or tokenizer requires a padding token, specify it explicitly; otherwise drop the section.

5. Mixed precision (`bf16`, `fp16`, `tf32`): `bf16` is set to `auto`, `fp16` is empty, and `tf32` is `false`. Make sure this matches your hardware capabilities and training requirements; typically you pick one precision mode for your hardware (for example, `bf16` on Ampere-class GPUs or `fp16` on older NVIDIA GPUs with Tensor Cores).

6. Gradient accumulation and micro batch size: with `gradient_accumulation_steps: 8` and `micro_batch_size: 1`, the effective batch size per GPU is 8. Check that this matches your memory constraints and throughput targets; adjusting these two values can significantly change training speed and memory usage.

7. `load_in_8bit` and `load_in_4bit`: both are set to `false`. If you are exploring memory-efficient training techniques, these quantized loading modes are worth investigating; if your setup does not support them, leaving them as `false` is appropriate.

8. `gradient_checkpointing_kwargs`: this block only contains `use_reentrant: false`. If no other keyword arguments are needed for gradient checkpointing, you can either keep it as is or drop the block; just make sure any arguments you do need are included.

Review and adjust the configuration based on these observations and the specific requirements of your training setup and model architecture. A sketch of the relevant changes is shown below.
base_model: state-spaces/mamba-2.8b model_type: MambaLMHeadModel tokenizer_type: AutoTokenizer tokenizer_config: EleutherAI/gpt-neox-20b load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.0 output_dir: ./outputs/out sequence_len: 2048 sample_packing: false pad_to_sequence_len: false wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 1 num_epochs: 2 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 5e-5 train_on_inputs: false group_by_length: true bf16: auto fp16: tf32: true gradient_checkpointing: false early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: tokens: save_safetensors: False
base_model: meta-llama/Meta-Llama-3-8B model_type: LlamaForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: tatsu-lab/alpaca type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 8192 sample_packing: true pad_to_sequence_len: true wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 8 micro_batch_size: 1 num_epochs: 1 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 2e-5 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: false early_stopping_patience: resume_from_checkpoint: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 100 evals_per_epoch: 2 eval_table_size: saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: <|end_of_text|>
base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: lora_alpha: lora_dropout: lora_target_modules: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/openllama-out gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false float16: true bf16: false fp16: false tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
base_model: togethercomputer/RedPajama-INCITE-Chat-3B-v1 model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: load_in_8bit: false datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 8 lora_alpha: 16 lora_dropout: 0.05 lora_target_modules: - q_proj - v_proj lora_fan_in_fan_out: false wandb_project: redpajama-alpaca-3b wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/redpajama-alpaca-3b batch_size: 4 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0000002 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 5 xformers_attention: flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0001 fsdp: fsdp_config: tokens: pad_token: "<|padding|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>" unk_token: "<|endoftext|>"
base_model: tiiuae/falcon-7b trust_remote_code: true model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca:chat dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/falcon-7b batch_size: 2 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.00003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: true flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 40 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: "<|endoftext|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>"
base_model: EleutherAI/pythia-12b-deduped base_model_ignore_patterns: pytorch* # prefer safetensors model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false device_map: auto datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: 2048 lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: true # pythia/GPTNeoX lora specific wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/pythia-12b gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 5 learning_rate: 0.00003 optimizer: adamw_bnb_8bit lr_scheduler: cosine train_on_inputs: false group_by_length: false bf16: false fp16: false float16: true tf32: true flash_optimum: true early_stopping_patience: resume_from_checkpoint: local_rank: gradient_checkpointing: true fsdp: fsdp_config:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
IDX=$1 PROMPT_IDX=$((IDX % 25)) CLASS_IDX=$((IDX % 30)) # Define the UNIQUE_TOKEN, CLASS_TOKENs, and SUBJECT_NAMES UNIQUE_TOKEN="qwe" SUBJECT_NAMES=( "backpack" "backpack_dog" "bear_plushie" "berry_bowl" "can" "candle" "cat" "cat2" "clock" "colorful_sneaker" "dog" "dog2" "dog3" "dog5" "dog6" "dog7" "dog8" "duck_toy" "fancy_boot" "grey_sloth_plushie" "monster_toy" "pink_sunglasses" "poop_emoji" "rc_car" "red_cartoon" "robot_toy" "shiny_sneaker" "teapot" "vase" "wolf_plushie" ) CLASS_TOKENs=( "backpack" "backpack" "stuffed animal" "bowl" "can" "candle" "cat" "cat" "clock" "sneaker" "dog" "dog" "dog" "dog" "dog" "dog" "dog" "toy" "boot" "stuffed animal" "toy" "glasses" "toy" "toy" "cartoon" "toy" "sneaker" "teapot" "vase" "stuffed animal" ) CLASS_TOKEN=${CLASS_TOKENs[$CLASS_IDX]} SELECTED_SUBJECT=${SUBJECT_NAMES[$CLASS_IDX]} if [[ $CLASS_IDX =~ ^(0|1|2|3|4|5|8|9|17|18|19|20|21|22|23|24|25|26|27|28|29)$ ]]; then PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a wheat field in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a tree and autumn leaves in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with the Eiffel Tower in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating on top of water." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating in an ocean of milk." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of green grass with sunflowers around it." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a mirror." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of the sidewalk in a crowded street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a dirt road." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a white rug." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} with a wheat field in the background" "a ${CLASS_TOKEN} with a tree and autumn leaves in the background" "a ${CLASS_TOKEN} with the Eiffel Tower in the background" "a ${CLASS_TOKEN} floating on top of water" "a ${CLASS_TOKEN} floating in an ocean of milk" "a ${CLASS_TOKEN} on top of green grass with sunflowers around it" "a ${CLASS_TOKEN} on top of a mirror" "a ${CLASS_TOKEN} on top of the sidewalk in a crowded street" "a ${CLASS_TOKEN} on top of a dirt road" "a ${CLASS_TOKEN} on top of a white rug" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) else PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a red hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a santa hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a rainbow scarf." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a black top hat and a monocle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a chef outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a firefighter outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a police outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing pink glasses." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a yellow shirt." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a purple wizard outfit." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} wearing a red hat" "a ${CLASS_TOKEN} wearing a santa hat" "a ${CLASS_TOKEN} wearing a rainbow scarf" "a ${CLASS_TOKEN} wearing a black top hat and a monocle" "a ${CLASS_TOKEN} in a chef outfit" "a ${CLASS_TOKEN} in a firefighter outfit" "a ${CLASS_TOKEN} in a police outfit" "a ${CLASS_TOKEN} wearing pink glasses" "a ${CLASS_TOKEN} wearing a yellow shirt" "a ${CLASS_TOKEN} in a purple wizard outfit" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) fi VALIDATION_PROMPT=${PROMPT_LIST[@]} INSTANCE_PROMPT="a photo of ${UNIQUE_TOKEN} ${CLASS_TOKEN}" CLASS_PROMPT="a photo of ${CLASS_TOKEN}" export MODEL_NAME="stabilityai/stable-diffusion-2-1" # export MODEL_NAME="runwayml/stable-diffusion-v1-5" PEFT_TYPE="boft" BLOCK_NUM=8 BLOCK_SIZE=0 N_BUTTERFLY_FACTOR=1 export PROJECT_NAME="dreambooth_${PEFT_TYPE}" export RUN_NAME="${SELECTED_SUBJECT}_${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}" export INSTANCE_DIR="./data/dreambooth/dataset/${SELECTED_SUBJECT}" export CLASS_DIR="./data/class_data/${CLASS_TOKEN}" export OUTPUT_DIR="./data/output/${PEFT_TYPE}" accelerate launch train_dreambooth.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --class_data_dir="$CLASS_DIR" \ --output_dir=$OUTPUT_DIR \ --wandb_project_name=$PROJECT_NAME \ --wandb_run_name=$RUN_NAME \ --with_prior_preservation --prior_loss_weight=1.0 \ --instance_prompt="$INSTANCE_PROMPT" \ --validation_prompt="$VALIDATION_PROMPT" \ --class_prompt="$CLASS_PROMPT" \ --resolution=512 \ --train_batch_size=1 \ --num_dataloader_workers=2 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --num_class_images=200 \ --use_boft \ --boft_block_num=$BLOCK_NUM \ --boft_block_size=$BLOCK_SIZE \ --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \ --boft_dropout=0.1 \ --boft_bias="boft_only" \ --learning_rate=3e-5 \ --max_train_steps=1010 \ --checkpointing_steps=200 \ --validation_steps=200 \ --enable_xformers_memory_efficient_attention \ --report_to="wandb" \
SPECIAL_CASES_TO_ALLOW = { # 'max_position_embeddings' is not used in modeling file, but needed for eval frameworks like Huggingface's lighteval (https://github.com/huggingface/lighteval/blob/af24080ea4f16eaf1683e353042a2dfc9099f038/src/lighteval/models/base_model.py#L264). # periods and offsers are not used in modeling file, but used in the configuration file to define `layers_block_type` and `layers_num_experts`. "JambaConfig": [ "max_position_embeddings", "attn_layer_offset", "attn_layer_period", "expert_layer_offset", "expert_layer_period", ], # used to compute the property `self.chunk_length` "EncodecConfig": ["overlap"], # used to compute the property `self.layers_block_type` "RecurrentGemmaConfig": ["block_types"], # used as in the config to define `intermediate_size` "MambaConfig": ["expand"], # used as `self.bert_model = BertModel(config, ...)` "DPRConfig": True, "FuyuConfig": True, # not used in modeling files, but it's an important information "FSMTConfig": ["langs"], # used internally in the configuration class file "GPTNeoConfig": ["attention_types"], # used internally in the configuration class file "EsmConfig": ["is_folding_model"], # used during training (despite we don't have training script for these models yet) "Mask2FormerConfig": ["ignore_value"], # `ignore_value` used during training (despite we don't have training script for these models yet) # `norm` used in conversion script (despite not using in the modeling file) "OneFormerConfig": ["ignore_value", "norm"], # used during preprocessing and collation, see `collating_graphormer.py` "GraphormerConfig": ["spatial_pos_max"], # used internally in the configuration class file "T5Config": ["feed_forward_proj"], # used internally in the configuration class file # `tokenizer_class` get default value `T5Tokenizer` intentionally "MT5Config": ["feed_forward_proj", "tokenizer_class"], "UMT5Config": ["feed_forward_proj", "tokenizer_class"], # used internally in the configuration class file "LongT5Config": ["feed_forward_proj"], # used internally in the configuration class file "Pop2PianoConfig": ["feed_forward_proj"], # used internally in the configuration class file "SwitchTransformersConfig": ["feed_forward_proj"], # having default values other than `1e-5` - we can't fix them without breaking "BioGptConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "GLPNConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "SegformerConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "CvtConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "PerceiverConfig": ["layer_norm_eps"], # used internally to calculate the feature size "InformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "TimeSeriesTransformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "AutoformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate `mlp_dim` "SamVisionConfig": ["mlp_ratio"], # For (head) training, but so far not implemented "ClapAudioConfig": ["num_classes"], # Not used, but providing useful information to users "SpeechT5HifiGanConfig": ["sampling_rate"], # used internally in the configuration class file "UdopConfig": ["feed_forward_proj"], # Actually used in the config or generation 
config, in that case necessary for the sub-components generation "SeamlessM4TConfig": [ "max_new_tokens", "t2u_max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", ], # Actually used in the config or generation config, in that case necessary for the sub-components generation "SeamlessM4Tv2Config": [ "max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", "t2u_variance_pred_dropout", "t2u_variance_predictor_embed_dim", "t2u_variance_predictor_hidden_dim", "t2u_variance_predictor_kernel_size", ], }
from datasets import load_dataset
from transformers import set_seed, AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_model, MultitaskPromptTuningConfig, TaskType, MultitaskPromptTuningInit
set_seed(42)
model_name = "google/flan-t5-base"
peft_config = MultitaskPromptTuningConfig(
    tokenizer_name_or_path=model_name,
    num_tasks=2,
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=MultitaskPromptTuningInit.TEXT,
    num_virtual_tokens=50,
    num_transformer_submodules=1,
    prompt_tuning_init_text="classify the following into either positive or negative, or entailment, neutral or contradiction:",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = get_peft_model(model, peft_config)
model = model.cuda()
def send_to_device(batch):
    for i in batch:
        batch[i] = batch[i].cuda()
    return batch
def get_sst2(split: str):
    examples = load_dataset("sst2")[split]
    result_examples = []
    for example in examples:
        result_examples.append({})
        result_examples[-1]["input"] = example["sentence"].strip() + "</s>"
        result_examples[-1]["output"] = (
            f"positive{tokenizer.eos_token}" if example["label"] == 1 else f"negative{tokenizer.eos_token}"
        )
        result_examples[-1]["task_id"] = 0
    return result_examples
def get_mnli(split: str):
    examples = load_dataset("multi_nli")[split]
    result_examples = []
    for example in examples:
        result_examples.append({})
        result_examples[-1]["input"] = example["premise"].strip() + " " + example["hypothesis"].strip() + "</s>"
        if example["label"] == 0:
            result_examples[-1]["output"] = f"entailment{tokenizer.eos_token}"
        elif example["label"] == 1:
            result_examples[-1]["output"] = f"neutral{tokenizer.eos_token}"
        else:
            result_examples[-1]["output"] = f"contradiction{tokenizer.eos_token}"
        result_examples[-1]["task_id"] = 1
    return result_examples
from typing import Tuple

from torch.utils.data import Dataset, DataLoader
import torch
class MyDataset(Dataset):
    def __init__(self, split: str, mode: str = "source") -> None:
        super().__init__()
        if split == "train":
            if mode == "source":
                self.examples = get_sst2(split) + get_mnli(split)
            elif mode == "target":
                self.examples = get_sst2(split)
        if split == "val":
            self.examples = get_sst2("validation")
        if split == "test":
            self.examples = get_sst2("validation")

    def __getitem__(self, index) -> dict:
        return self.examples[index]

    def __len__(self) -> int:
        return len(self.examples)
def collate_fn(batch: list) -> dict:
    input = [i["input"] for i in batch]
    input = tokenizer(input, add_special_tokens=False, return_tensors="pt", padding=True)
    output = [i["output"] for i in batch]
    output = tokenizer(output, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
    output[output == tokenizer.pad_token_id] = -100
    task_ids = [i["task_id"] for i in batch]
    task_ids = torch.tensor(task_ids)
    return {
        "input_ids": input.input_ids,
        "attention_mask": input.attention_mask,
        "labels": output,
        "task_ids": task_ids,
    }
train = DataLoader(MyDataset("train"), shuffle=True, batch_size=8, collate_fn=collate_fn)
val = DataLoader(MyDataset("val"), shuffle=False, batch_size=8, collate_fn=collate_fn)
test = DataLoader(MyDataset("test"), shuffle=False, batch_size=8, collate_fn=collate_fn)
## source training
from torch.optim.adamw import AdamW
from transformers import get_cosine_schedule_with_warmup
from tqdm import tqdm
from sklearn.metrics import f1_score
POSITIVE_TOKEN_ID = tokenizer(" positive", add_special_tokens=False)["input_ids"][0]
NEGATIVE_TOKEN_ID = tokenizer(" negative", add_special_tokens=False)["input_ids"][0]
def classify(batch):
    batch = send_to_device(batch)
    # we pass labels here since we need to generate and peft doesn't support generation yet.
    # No clue how to get around this
    scores = model(**batch).logits
    preds = []
    for i in range(scores.shape[0]):
        if scores[i, 0, POSITIVE_TOKEN_ID] > scores[i, 0, NEGATIVE_TOKEN_ID]:
            preds.append(POSITIVE_TOKEN_ID)
        else:
            preds.append(NEGATIVE_TOKEN_ID)
    return preds
@torch.inference_mode()
def evaluate(model, data):
    loss = 0
    preds = []
    golds = []
    for batch in tqdm(data):
        batch = send_to_device(batch)
        loss += model(**batch).loss
        golds.extend(batch["labels"][:, 0].tolist())
        preds.extend(classify(batch))
    # average the loss over the dataloader being evaluated (not over `val` specifically)
    return loss / len(data), f1_score(golds, preds, pos_label=POSITIVE_TOKEN_ID)
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))
n = 1000
step = 0
train_ = tqdm(train)
val_loss, f1 = evaluate(model, val)
print(
    f"""
before source training
val loss = {val_loss}
f1 = {f1}"""
)
for batch in train_:
    if step % n == 0:
        val_loss, f1 = evaluate(model, val)
        print(
            f"""
step = {step}
val loss = {val_loss}
f1 = {f1}"""
        )
        model.save_pretrained(f"checkpoints_source/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()  # reset gradients after each update so they do not accumulate across steps
    train_.set_postfix(train_loss=loss.item())
## target training
train = DataLoader(MyDataset("train", "target"), shuffle=True, batch_size=8, collate_fn=collate_fn)
val = DataLoader(MyDataset("val", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn)
test = DataLoader(MyDataset("test", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn)
#### create a fresh model
peft_config = MultitaskPromptTuningConfig(
    tokenizer_name_or_path=model_name,
    num_tasks=1,
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=MultitaskPromptTuningInit.EXACT_SOURCE_TASK,
    prompt_tuning_init_state_dict_path="checkpoints_source/50000/adapter_model.bin",
    num_virtual_tokens=50,
    num_transformer_submodules=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = get_peft_model(model, peft_config)
model = model.cuda()
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))
n = 1000
step = 0
train_ = tqdm(train)
val_loss, f1 = evaluate(model, val)
print(
    f"""
before target training
val loss = {val_loss}
f1 = {f1}"""
)
for batch in train_:
    if step % n == 0:
        val_loss, f1 = evaluate(model, val)
        print(
            f"""
step = {step}
val loss = {val_loss}
f1 = {f1}"""
        )
        model.save_pretrained(f"checkpoints_target/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()  # reset gradients after each update
    train_.set_postfix(train_loss=loss.item())
from peft import set_peft_model_state_dict
sd_6000 = torch.load("checkpoints_target/6000/adapter_model.bin")
set_peft_model_state_dict(model, sd_6000)
val_loss, f1 = evaluate(model, val)
print(
    f"""
final
val loss = {val_loss}
f1 = {f1}"""
)
test_loss, f1 = evaluate(model, test)
print(
    f"""
final
test loss = {test_loss}
f1 = {f1}"""
)
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
def parse_llama_config(megatron_lm_plugin, model, batch_data): model_type_name = "gpt" num_layers = model.config.num_hidden_layers pretraining_flag = True hidden_size = model.config.hidden_size num_attention_heads = model.config.num_attention_heads orig_vocab_size = model.config.vocab_size max_position_embeddings = model.config.max_position_embeddings seq_length = getattr(model.config, "max_sequence_length", None) if megatron_lm_plugin.seq_length is None: if seq_length is not None: megatron_lm_plugin.seq_length = seq_length elif megatron_lm_plugin.decoder_seq_length is not None: megatron_lm_plugin.seq_length = megatron_lm_plugin.decoder_seq_length elif batch_data is not None: megatron_lm_plugin.seq_length = batch_data["input_ids"].shape[1] else: megatron_lm_plugin.seq_length = max_position_embeddings megatron_lm_plugin.megatron_lm_default_args["return_logits"] = megatron_lm_plugin.return_logits megatron_lm_plugin.megatron_lm_default_args["tokenizer_type"] = "Llama2Tokenizer" megatron_lm_plugin.megatron_lm_default_args["model_type_name"] = model_type_name megatron_lm_plugin.megatron_lm_default_args["num_layers"] = num_layers megatron_lm_plugin.megatron_lm_default_args["pretraining_flag"] = pretraining_flag megatron_lm_plugin.megatron_lm_default_args["hidden_size"] = hidden_size megatron_lm_plugin.megatron_lm_default_args["num_attention_heads"] = num_attention_heads megatron_lm_plugin.megatron_lm_default_args["orig_vocab_size"] = orig_vocab_size megatron_lm_plugin.megatron_lm_default_args["max_position_embeddings"] = max_position_embeddings megatron_lm_plugin.megatron_lm_default_args["seq_length"] = megatron_lm_plugin.seq_length megatron_lm_plugin.megatron_lm_default_args["model_return_dict"] = model.config.return_dict
continue with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps }" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss # New Code # For Megatron-LM, the losses are already averaged across the data parallel group if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses.append(loss) else: losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) try: if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses = torch.tensor(losses) else: losses = torch.cat(losses) eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) # this is causing some issue with Megatron-LM when using `wandb` at the end of the main function. # Everything works fine inspite of commenting this out. (wandb finishes/closes the run without error) # if args.with_tracking: # accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() # New Code # For Megatron-LM, we need to save the model using `accelerator.save_state` if accelerator.distributed_type == DistributedType.MEGATRON_LM: accelerator.save_state(args.output_dir) else: unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
Important features are directly supported via the `accelerate config` command. An example of the corresponding questions for using Megatron-LM features is shown below:
:~$ accelerate config --config_file "megatron_gpt_config.yaml"
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]: yes
What is the Tensor Parallelism degree/size? [1]:2
Do you want to enable Sequence Parallelism? [YES/no]:
What is the Pipeline Parallelism degree/size? [1]:2
What is the number of micro-batches? [1]:2
Do you want to enable selective activation recomputation? [YES/no]:
Do you want to use distributed optimizer which shards optimizer state and gradients across data parallel ranks? [YES/no]:
What is the gradient clipping value based on global L2 Norm (0 to disable)? [1.0]:
How many GPU(s) should be used for distributed training? [1]:4
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16
The resulting config is shown below:
~$ cat megatron_gpt_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config:
megatron_lm_gradient_clipping: 1.0
megatron_lm_num_micro_batches: 2
megatron_lm_pp_degree: 2
megatron_lm_recompute_activations: true
megatron_lm_sequence_parallelism: true
megatron_lm_tp_degree: 2
megatron_lm_use_distributed_optimizer: true
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
We will take the example of GPT pre-training. The minimal changes required to the official `run_clm_no_trainer.py` to use Megatron-LM are as follows:

1. Since Megatron-LM uses its own optimizer implementation, the matching scheduler must be used; replace the regular learning-rate scheduler with `accelerate.utils.MegatronLMDummyScheduler`. An example is given below:

from accelerate.utils import MegatronLMDummyScheduler

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    lr_scheduler = MegatronLMDummyScheduler(
        optimizer=optimizer,
        total_num_steps=args.max_train_steps,
        warmup_num_steps=args.num_warmup_steps,
    )
else:
    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps,
        num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
    )
2. The effective total batch size must account for the tensor and pipeline parallel setup, so read it from the Megatron-LM plugin's global batch size:

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    total_batch_size = accelerator.state.megatron_lm_plugin.global_batch_size
else:
    total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
3. When gathering evaluation losses, Megatron-LM has already averaged the loss across the data parallel group, so the loss can be appended directly instead of being gathered across processes:

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    losses.append(loss)
else:
    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    losses = torch.tensor(losses)
else:
    losses = torch.cat(losses)
4. When using Megatron-LM, save the model with `accelerator.save_state` instead of unwrapping it and calling `save_pretrained`:

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    accelerator.save_state(args.output_dir)
else:
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save
    )
That's it! We are good to go 🚀. Please find the example script in the examples folder at the path `accelerate/examples/by_feature/megatron_lm_gpt_pretraining.py`. Let's run it for the `gpt-large` model architecture using 4 A100-80GB GPUs.
accelerate launch --config_file megatron_gpt_config.yaml \
examples/by_feature/megatron_lm_gpt_pretraining.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--block_size 1024 \
--learning_rate 5e-5 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--num_train_epochs 5 \
--with_tracking \
--report_to "wandb" \
--output_dir "awesome_model"
Below are some important excerpts from the output logs:
Loading extension module fused_dense_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 3.569 seconds
 > padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
Building gpt model in the pre-training mode.
The Megatron LM model weights are initialized at random in `accelerator.prepare`. Please use `accelerator.load_checkpoint` to load a pre-trained checkpoint matching the distributed setup.
Preparing dataloader
Preparing dataloader
Preparing model
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 210753280
 > number of parameters on (tensor, pipeline) model parallel rank (1, 1): 209445120
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 210753280
 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 209445120
Preparing optimizer
Preparing scheduler
> learning rate decay style: linear
10/10/2022 22:57:22 - INFO - __main__ - ***** Running training *****
10/10/2022 22:57:22 - INFO - __main__ - Num examples = 2318
10/10/2022 22:57:22 - INFO - __main__ - Num Epochs = 5
10/10/2022 22:57:22 - INFO - __main__ - Instantaneous batch size per device = 24
10/10/2022 22:57:22 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 48
10/10/2022 22:57:22 - INFO - __main__ - Gradient Accumulation steps = 1
10/10/2022 22:57:22 - INFO - __main__ - Total optimization steps = 245
 20%|████████████▍ | 49/245 [01:04<04:09, 1.27s/it]
10/10/2022 22:58:29 - INFO - __main__ - epoch 0: perplexity: 1222.1594275215962 eval_loss: 7.10837459564209
 40%|████████████████████████▊ | 98/245 [02:10<03:07, 1.28s/it]
10/10/2022 22:59:35 - INFO - __main__ - epoch 1: perplexity: 894.5236583794557 eval_loss: 6.796291351318359
 60%|████████████████████████████████████▌ | 147/245 [03:16<02:05, 1.28s/it]
10/10/2022 23:00:40 - INFO - __main__ - epoch 2: perplexity: 702.8458788508042 eval_loss: 6.555137634277344
 80%|████████████████████████████████████████████████▊ | 196/245 [04:22<01:02, 1.28s/it]
10/10/2022 23:01:46 - INFO - __main__ - epoch 3: perplexity: 600.3220028695281 eval_loss: 6.39746618270874
100%|█████████████████████████████████████████████████████████████| 245/245 [05:27<00:00, 1.28s/it]
There are a large number of other options/features that one can set using `accelerate.utils.MegatronLMPlugin`.
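As a rough, hedged sketch (not taken verbatim from the documentation above), such a plugin can also be constructed in code and handed to the `Accelerator`. The keyword names below are assumptions that mirror the `megatron_lm_*` config keys shown earlier, and the `megatron_lm_plugin` argument to `Accelerator` is likewise an assumption; verify both against the `MegatronLMPlugin` API reference for your installed accelerate version.

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Hypothetical sketch: these keyword names mirror the megatron_lm_tp_degree,
# megatron_lm_pp_degree, megatron_lm_num_micro_batches and
# megatron_lm_gradient_clipping keys from the config above -- check them
# against your accelerate version before relying on them.
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,
    pp_degree=2,
    num_micro_batches=2,
    gradient_clipping=1.0,
)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)  # assumed keyword argument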
def parse_args(): parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task") parser.add_argument( "--dataset_name", type=str, default=None, help="The name of the dataset to use (via the datasets library).", ) parser.add_argument( "--dataset_config_name", type=str, default=None, help="The configuration name of the dataset to use (via the datasets library).", ) parser.add_argument( "--train_file", type=str, default=None, help="A csv or a json file containing the training data." ) parser.add_argument( "--validation_file", type=str, default=None, help="A csv or a json file containing the validation data." ) parser.add_argument( "--validation_split_percentage", default=5, help="The percentage of the train set used as validation set in case there's no validation split", ) parser.add_argument( "--model_name_or_path", type=str, help="Path to pretrained model or model identifier from huggingface.co/models.", required=False, ) parser.add_argument( "--config_name", type=str, default=None, help="Pretrained config name or path if not the same as model_name", ) parser.add_argument( "--tokenizer_name", type=str, default=None, help="Pretrained tokenizer name or path if not the same as model_name", ) parser.add_argument( "--use_slow_tokenizer", action="store_true", help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", ) parser.add_argument( "--per_device_train_batch_size", type=int, default=8, help="Batch size (per device) for the training dataloader.", ) parser.add_argument( "--per_device_eval_batch_size", type=int, default=8, help="Batch size (per device) for the evaluation dataloader.", ) parser.add_argument( "--learning_rate", type=float, default=5e-5, help="Initial learning rate (after the potential warmup period) to use.", ) parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.") parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") parser.add_argument( "--max_train_steps", type=int, default=None, help="Total number of training steps to perform. If provided, overrides num_train_epochs.", ) parser.add_argument( "--gradient_accumulation_steps", type=int, default=1, help="Number of updates steps to accumulate before performing a backward/update pass.", ) parser.add_argument( "--lr_scheduler_type", type=SchedulerType, default="linear", help="The scheduler type to use.", choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], ) parser.add_argument( "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler." ) parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.") parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.") parser.add_argument( "--model_type", type=str, default=None, help="Model type to use if training from scratch.", choices=MODEL_TYPES, ) parser.add_argument( "--block_size", type=int, default=None, help=( "Optional input sequence length after tokenization. The training dataset will be truncated in block of" " this size for training. Default to the model max input length for single sentence inputs (take into" " account special tokens)." 
), ) parser.add_argument( "--preprocessing_num_workers", type=int, default=None, help="The number of processes to use for the preprocessing.", ) parser.add_argument( "--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets" ) parser.add_argument( "--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files." ) parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.") parser.add_argument( "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") parser.add_argument( "--checkpointing_steps", type=str, default=None, help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.", ) parser.add_argument( "--resume_from_checkpoint", type=str, default=None, help="If the training should continue from a checkpoint folder.", ) # New Code # # Whether to load the best model at the end of training parser.add_argument( "--load_best_model", action="store_true", help="Whether to load the best model at the end of training", ) parser.add_argument( "--with_tracking", action="store_true", help="Whether to enable experiment trackers for logging.", ) parser.add_argument( "--report_to", type=str, default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' ' `"wandb"`, `"comet_ml"`, and `"dvclive"`. Use `"all"` (default) to report to all integrations.' "Only applicable when `--with_tracking` is passed." ), ) args = parser.parse_args() # Sanity checks if args.dataset_name is None and args.train_file is None and args.validation_file is None: raise ValueError("Need either a dataset name or a training/validation file.") else: if args.train_file is not None: extension = args.train_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file." if args.validation_file is not None: extension = args.validation_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file." if args.push_to_hub: assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." return args
def get_cluster_input(): distributed_type = _ask_options( "Which type of machine are you using?", [ "No distributed training", "multi-CPU", "multi-XPU", "multi-GPU", "multi-NPU", "multi-MLU", "multi-MUSA", "TPU", ], _convert_distributed_mode, ) machine_rank = 0 num_machines = 1 num_processes = 1 gpu_ids = None main_process_ip = None main_process_port = None rdzv_backend = "static" same_network = True debug = False if distributed_type in [ DistributedType.MULTI_GPU, DistributedType.MULTI_MLU, DistributedType.MULTI_MUSA, DistributedType.MULTI_NPU, DistributedType.MULTI_XPU, DistributedType.MULTI_CPU, ]: num_machines = _ask_field( "How many different machines will you use (use more than 1 for multi-node training)? [1]: ", int, default=1, ) if num_machines > 1: machine_rank = _ask_options( "What is the rank of this machine?", list(range(num_machines)), int, ) main_process_ip = _ask_field( "What is the IP address of the machine that will host the main process? ", ) main_process_port = _ask_field( "What is the port you will use to communicate with the main process? ", int, ) same_network = _ask_field( "Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: ", _convert_yes_no_to_bool, default=True, error_message="Please enter yes or no.", ) if not same_network: rdzv_backend = _ask_field( "What rendezvous backend will you use? ('static', 'c10d', ...): ", default="static" ) debug = _ask_field( "Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if distributed_type == DistributedType.NO: use_cpu = _ask_field( "Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) elif distributed_type == DistributedType.MULTI_CPU: use_cpu = True else: use_cpu = False ipex_config = {} mpirun_config = {} if use_cpu: ipex_config["ipex"] = _ask_field( "Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if distributed_type == DistributedType.MULTI_CPU: use_mpirun = _ask_field( "Do you want accelerate to launch mpirun? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_mpirun: mpirun_hostfile = _ask_field( "Please enter the path to the hostfile to use with mpirun [~/hostfile]: ", str, default="~/hostfile", ) mpirun_config["mpirun_hostfile"] = os.path.expanduser(mpirun_hostfile.strip()) mpirun_config["mpirun_ccl"] = _ask_field("Enter the number of oneCCL worker threads [1]: ", default=1) if ( not use_cpu and is_xpu_available() and distributed_type not in [ DistributedType.MULTI_GPU, DistributedType.MULTI_NPU, DistributedType.MULTI_MLU, DistributedType.XLA, DistributedType.MULTI_MUSA, ] ): ipex_config["use_xpu"] = _ask_field( "Do you want to use XPU plugin to speed up training on XPU? 
[yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) dynamo_config = {} use_dynamo = _ask_field( "Do you wish to optimize your script with torch dynamo?[yes/NO]:", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_dynamo: prefix = "dynamo_" dynamo_config[prefix + "backend"] = _ask_options( "Which dynamo backend would you like to use?", [x.lower() for x in DYNAMO_BACKENDS], _convert_dynamo_backend, default=2, ) use_custom_options = _ask_field( "Do you want to customize the defaults sent to torch.compile? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_custom_options: dynamo_config[prefix + "mode"] = _ask_options( "Which mode do you want to use?", TORCH_DYNAMO_MODES, lambda x: TORCH_DYNAMO_MODES[int(x)], default=0, ) dynamo_config[prefix + "use_fullgraph"] = _ask_field( "Do you want the fullgraph mode or it is ok to break model into several subgraphs? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) dynamo_config[prefix + "use_dynamic"] = _ask_field( "Do you want to enable dynamic shape tracing? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) use_mps = not use_cpu and is_mps_available() deepspeed_config = {} if ( distributed_type in [ DistributedType.MULTI_GPU, DistributedType.MULTI_XPU, DistributedType.MULTI_NPU, DistributedType.MULTI_MLU, DistributedType.MULTI_MUSA, DistributedType.NO, ] and not use_mps ): use_deepspeed = _ask_field( "Do you want to use DeepSpeed? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_deepspeed: distributed_type = DistributedType.DEEPSPEED assert ( is_deepspeed_available() ), "DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source" if distributed_type == DistributedType.DEEPSPEED: use_deepspeed_config = _ask_field( "Do you want to specify a json file to a DeepSpeed config? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_deepspeed_config: deepspeed_config["deepspeed_config_file"] = _ask_field( "Please enter the path to the json DeepSpeed config file: ", str, default="none", ) else: deepspeed_config["zero_stage"] = _ask_options( "What should be your DeepSpeed's ZeRO optimization stage?", [0, 1, 2, 3], int, default=2, ) deepspeed_devices = ["none", "cpu", "nvme"] if deepspeed_config["zero_stage"] >= 2: deepspeed_config["offload_optimizer_device"] = _ask_options( "Where to offload optimizer states?", deepspeed_devices, lambda x: deepspeed_devices[int(x)] ) deepspeed_config["offload_param_device"] = _ask_options( "Where to offload parameters?", deepspeed_devices, lambda x: deepspeed_devices[int(x)] ) if deepspeed_config["offload_param_device"] == "nvme": deepspeed_config["offload_param_nvme_path"] = _ask_field( "Nvme Path to offload parameters?", str, default="/nvme", ) if deepspeed_config["offload_optimizer_device"] == "nvme": deepspeed_config["offload_optimizer_nvme_path"] = _ask_field( "Nvme Path to offload optimizer states?", str, default="/nvme", ) deepspeed_config["gradient_accumulation_steps"] = _ask_field( "How many gradient accumulation steps you're passing in your script? [1]: ", int, default=1, ) use_gradient_clipping = _ask_field( "Do you want to use gradient clipping? 
[yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_gradient_clipping: deepspeed_config["gradient_clipping"] = _ask_field( "What is the gradient clipping value? [1.0]: ", float, default=1.0, ) if deepspeed_config["zero_stage"] == 3: deepspeed_config["zero3_save_16bit_model"] = _ask_field( "Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) deepspeed_config["zero3_init_flag"] = _ask_field( "Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if deepspeed_config["zero3_init_flag"]: if not is_transformers_available(): raise Exception( "When `zero3_init_flag` is set, it requires Transformers to be installed. " "Please run `pip3 install transformers`." ) use_moe = _ask_field( "Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_moe: deepspeed_config["deepspeed_moe_layer_cls_names"] = _ask_field( "Specify the comma-separated list of transformers MoE layer class names (case-sensitive), e.g : " " `MixtralSparseMoeBlock`, `Qwen2MoeSparseMoeBlock`, `JetMoEAttention,JetMoEBlock` ... : ", str, ) if num_machines > 1: launcher_query = "Which Type of launcher do you want to use?" deepspeed_config["deepspeed_multinode_launcher"] = _ask_options( launcher_query, DEEPSPEED_MULTINODE_LAUNCHERS, lambda x: DEEPSPEED_MULTINODE_LAUNCHERS[int(x)], ) if deepspeed_config["deepspeed_multinode_launcher"] != DEEPSPEED_MULTINODE_LAUNCHERS[1]: deepspeed_config["deepspeed_hostfile"] = _ask_field( "DeepSpeed configures multi-node compute resources with hostfile. " "Each row is of the format `hostname slots=[num_gpus]`, e.g., `localhost slots=2`; " "for more information please refer official [documentation]" "(https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). " "Please specify the location of hostfile: ", str, ) is_exclusion_filter = _ask_field( "Do you want to specify exclusion filter string? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if is_exclusion_filter: deepspeed_config["deepspeed_exclusion_filter"] = _ask_field( "DeepSpeed exclusion filter string: ", str, ) is_inclusion_filter = _ask_field( "Do you want to specify inclusion filter string? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if is_inclusion_filter: deepspeed_config["deepspeed_inclusion_filter"] = _ask_field( "DeepSpeed inclusion filter string: ", str, ) fsdp_config = {} if distributed_type in [ DistributedType.MULTI_GPU, DistributedType.MULTI_NPU, DistributedType.MULTI_MLU, DistributedType.MULTI_MUSA, DistributedType.MULTI_XPU, ]: use_fsdp = _ask_field( "Do you want to use FullyShardedDataParallel? [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if use_fsdp: distributed_type = DistributedType.FSDP if distributed_type == DistributedType.FSDP: sharding_strategy_query = "What should be your sharding strategy?" fsdp_config["fsdp_sharding_strategy"] = _ask_options( sharding_strategy_query, FSDP_SHARDING_STRATEGY, lambda x: FSDP_SHARDING_STRATEGY[int(x)], ) fsdp_config["fsdp_offload_params"] = _ask_field( "Do you want to offload parameters and gradients to CPU? 
[yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) fsdp_wrap_query = "What should be your auto wrap policy?" fsdp_config["fsdp_auto_wrap_policy"] = _ask_options( fsdp_wrap_query, FSDP_AUTO_WRAP_POLICY, lambda x: FSDP_AUTO_WRAP_POLICY[int(x)], ) if fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[0]: use_no_split_modules = _ask_field( "Do you want to use the model's `_no_split_modules` to wrap. Only applicable for 🤗 Transformers [yes/NO]: ", _convert_yes_no_to_bool, default=False, error_message="Please enter yes or no.", ) if not use_no_split_modules: fsdp_config["fsdp_transformer_layer_cls_to_wrap"] = _ask_field( "Specify the comma-separated list of transformer layer class names (case-sensitive) to wrap ,e.g, :" "`BertLayer`, `GPTJBlock`, `T5Block`, `BertLayer,BertEmbeddings,BertSelfOutput` ...? : ", str, ) elif fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[1]: fsdp_config["fsdp_min_num_params"] = _ask_field( "What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ", int, default=100000000, ) fsdp_backward_prefetch_query = "What should be your FSDP's backward prefetch policy?" fsdp_config["fsdp_backward_prefetch"] = _ask_options( fsdp_backward_prefetch_query, FSDP_BACKWARD_PREFETCH, lambda x: FSDP_BACKWARD_PREFETCH[int(x)],