base_model: Qwen/Qwen2.5-32B
base_model_config: Qwen/Qwen2.5-32B
model_type: AutoModel
tokenizer_type: AutoTokenizer
trust_remote_code: true
shuffle_merged_datasets: true

datasets:
  - path: /workspace/finetuning/outputs/orion-v1.0.0.jsonl
    ds_type: json
    type: input_output

output_dir: /workspace/orion
warmup_steps: 50
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
wandb_project: lunalust
wandb_entity: orionsoftware
gradient_accumulation_steps: 1
micro_batch_size: 1
eval_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: constant
learning_rate: 0.00002
bf16: true
gradient_checkpointing: true
logging_steps: 1
flash_attention: true
saves_per_epoch: 1
deepspeed: /workspace/axolotl/deepspeed_configs/zero3.json
weight_decay: 0.01

special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|endoftext|>"
  sep_token: "<sep>"
  unk_token: "<unk>"
  additional_special_tokens: ["<|im_start|>", "<sep>"]

tokens:
  - "<|im_start|>"
  - "_[username]"
  - "_[moanshort]"
  - "_[moanlong]"
The configuration file you've provided looks mostly correct for a typical fine-tuning run of a Qwen model with your dataset, tokenizer, and training parameters. However, there are a few potential issues or points that may need clarification or correction based on common practice:
model_type and tokenizer_type: The values AutoModel and AutoTokenizer are generic. The Auto classes infer the concrete model/tokenizer classes from the configuration files in the model repository, so they only work if the repository ships a proper config. For fine-tuning a causal language model you normally want AutoModelForCausalLM (or the architecture-specific class, such as GPT2LMHeadModel for a GPT-2 model); the bare AutoModel returns the transformer body without a language-modeling head.
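If it helps to see the distinction concretely, here is a minimal sketch (not part of the original config; it assumes the Qwen repository named above is reachable and, for the model load, that you are fine downloading the full weights) showing how the Auto classes resolve the concrete architecture and why the causal-LM variant is the one you want for next-token training:

# Minimal sketch (illustrative only): how the Auto* classes resolve concrete
# classes from the repo's config.json, and why the task-specific variant matters.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-32B"  # repo from the config above

config = AutoConfig.from_pretrained(model_id)  # reads config.json only
print(config.architectures)                    # e.g. the causal-LM class name

tokenizer = AutoTokenizer.from_pretrained(model_id)
# AutoModelForCausalLM attaches the language-modeling head; plain AutoModel
# would return only the bare transformer, with nothing to train on LM loss.
# Note: this downloads and loads the full 32B checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")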
base_model_config: This field is not typically required unless you're pointing at a custom configuration that differs from the default configuration of base_model. Since Qwen/Qwen2.5-32B already includes a configuration file, you likely don't need it unless you're overriding default settings.
trust_remote_code: Setting this to true allows execution of custom modeling code from the model repository (if any exists). This is fine as long as you trust the source, but it's worth flagging for security reasons.
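For reference, this is roughly what that setting maps to when loading the model directly in Python (a sketch, not taken from your training code):

# Minimal sketch (illustrative): trust_remote_code controls whether
# from_pretrained may download and execute modeling code that lives in the
# model repo instead of in the transformers package itself.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B",
    trust_remote_code=True,  # only enable this for repositories you trust
)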
ds_type and type: In the datasets section, ds_type describes the on-disk format of the dataset file (e.g., json, csv), while type selects the prompt/formatting strategy (here input_output). Both fields are valid; just confirm that the values match your data file and the prompt format you intend to train with.
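A quick, optional sanity check (illustrative; the path comes from the config above, and the column names you see depend on how the input_output records were written, so treat any particular key name as an assumption) is to load the JSONL with the datasets library before training:

# Minimal pre-flight check (illustrative): confirm the JSONL file parses as a
# "json" dataset and inspect one record's fields before kicking off training.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="/workspace/finetuning/outputs/orion-v1.0.0.jsonl",
    split="train",
)
print(ds)            # row count and column names
print(ds[0].keys())  # fields expected by the chosen prompt format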
deepspeed configuration path: The deepspeed field expects a path to a DeepSpeed configuration file. Ensure that /workspace/axolotl/deepspeed_configs/zero3.json exists and is correctly configured for your training environment; this file controls DeepSpeed's optimizations such as the ZeRO stage, offloading, etc.
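One cheap pre-flight check (a sketch under the assumption that the path above is correct) is to load the JSON and print the settings that matter most for memory:

# Minimal sanity check (illustrative): confirm the DeepSpeed config exists and
# inspect the ZeRO stage and offload devices it will apply.
import json
from pathlib import Path

ds_config_path = Path("/workspace/axolotl/deepspeed_configs/zero3.json")
assert ds_config_path.exists(), f"missing DeepSpeed config: {ds_config_path}"

ds_config = json.loads(ds_config_path.read_text())
zero = ds_config.get("zero_optimization", {})
print("ZeRO stage:", zero.get("stage"))
print("optimizer offload:", zero.get("offload_optimizer", {}).get("device"))
print("param offload:", zero.get("offload_param", {}).get("device"))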
optimizer: The value adamw_torch selects PyTorch's native AdamW implementation. Make sure this matches your intent; in DeepSpeed setups the optimizer can also be specified inside the DeepSpeed config JSON, so check that the two settings don't conflict.
lr_scheduler: You've specified constant, which means the learning rate stays fixed for the entire run. Make sure this is intended, as alternatives like linear or cosine decay are often better suited to fine-tuning.
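To make the difference concrete, here is a small sketch using the transformers scheduler helper (illustrative only; the parameter group and step counts are made up):

# Minimal sketch (illustrative): what "constant" vs "cosine" means in practice.
# get_scheduler is the transformers helper that Trainer-style stacks use, and
# the names mirror the lr_scheduler values in the YAML.
import torch
from transformers import get_scheduler

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2e-5)

constant = get_scheduler("constant", optimizer, num_warmup_steps=0, num_training_steps=1000)
cosine = get_scheduler("cosine", optimizer, num_warmup_steps=50, num_training_steps=1000)
# "constant" keeps lr at 2e-5 for every step; "cosine" warms up over the first
# 50 steps, then decays smoothly toward 0 by step 1000.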
bf16: You've enabled bfloat16 mixed-precision training. Ensure your GPUs support BF16 (Ampere or newer for NVIDIA hardware) for efficient training.
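A quick way to verify this on the training machine (illustrative sketch):

# Minimal check (illustrative): verify the GPUs can do bf16 before enabling it.
import torch

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())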
flash_attention: This option is set to true. Verify that the flash-attn package is installed in your environment and that your GPUs and sequence lengths are supported, so attention actually runs through the fused FlashAttention kernels.
special_tokens and tokens: You've defined custom special tokens and additional tokens. Ensure these are correctly added to the tokenizer and that the model's embedding matrix is resized to cover them, so they are recognized during training and inference.
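A minimal sketch of what that integration looks like in plain transformers code (illustrative only; Axolotl normally handles this for you when the tokens are listed in the config, and loading the 32B checkpoint here is just for completeness):

# Minimal sketch (illustrative, mirroring the special_tokens/tokens entries in
# the YAML): add the tokens to the tokenizer and resize the model embeddings.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

tokenizer.add_special_tokens({
    "eos_token": "<|im_end|>",
    "pad_token": "<|endoftext|>",
    "sep_token": "<sep>",
    "unk_token": "<unk>",
    "additional_special_tokens": ["<|im_start|>", "<sep>"],
})
tokenizer.add_tokens(["_[username]", "_[moanshort]", "_[moanlong]"])

# New token ids need rows in the embedding table, or training will index
# out of range.
model.resize_token_embeddings(len(tokenizer))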
Overall, the configuration seems tailored for a specific fine-tuning task with DeepSpeed optimization. Ensure that all paths, model types, and custom settings align with your project's requirements and the capabilities of your training environment.
base_model: togethercomputer/RedPajama-INCITE-Chat-3B-v1 model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: load_in_8bit: false datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 8 lora_alpha: 16 lora_dropout: 0.05 lora_target_modules: - q_proj - v_proj lora_fan_in_fan_out: false wandb_project: redpajama-alpaca-3b wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/redpajama-alpaca-3b batch_size: 4 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0000002 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 5 xformers_attention: flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0001 fsdp: fsdp_config: tokens: pad_token: "<|padding|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>" unk_token: "<|endoftext|>"
base_model: tiiuae/falcon-7b trust_remote_code: true model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca:chat dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/falcon-7b batch_size: 2 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.00003 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: true flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 40 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: "<|endoftext|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>"
base_model: state-spaces/mamba-2.8b model_type: MambaLMHeadModel tokenizer_type: AutoTokenizer tokenizer_config: EleutherAI/gpt-neox-20b load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: val_set_size: 0.0 output_dir: ./outputs/out sequence_len: 2048 sample_packing: false pad_to_sequence_len: false wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 1 num_epochs: 2 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 5e-5 train_on_inputs: false group_by_length: true bf16: auto fp16: tf32: true gradient_checkpointing: false early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: tokens: save_safetensors: False
base_model: EleutherAI/pythia-12b-deduped base_model_ignore_patterns: pytorch* # prefer safetensors model_type: GPTNeoXForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: false gptq: false device_map: auto datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: 2048 lora_r: 64 lora_alpha: 32 lora_dropout: 0.0 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: true # pythia/GPTNeoX lora specific wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/pythia-12b gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 5 learning_rate: 0.00003 optimizer: adamw_bnb_8bit lr_scheduler: cosine train_on_inputs: false group_by_length: false bf16: false fp16: false float16: true tf32: true flash_optimum: true early_stopping_patience: resume_from_checkpoint: local_rank: gradient_checkpointing: true fsdp: fsdp_config:
base_model: mosaicml/mpt-7b tokenizer_type: AutoTokenizer trust_remote_code: true # required for mpt as their model class is not merged into transformers yet load_in_8bit: false datasets: - path: vicgalle/alpaca-gpt4 type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 2048 max_packed_sequence_len: lora_r: 8 lora_alpha: 16 lora_dropout: 0.05 lora_target_modules: - q_proj - v_proj lora_fan_in_fan_out: false wandb_project: mpt-alpaca-7b wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/mpt-alpaca-7b gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0000002 train_on_inputs: false group_by_length: false bf16: auto tf32: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 5 xformers_attention: flash_attention: gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0001 fsdp: fsdp_config: tokens: pad_token: "<|padding|>" bos_token: "<|endoftext|>" eos_token: "<|endoftext|>" unk_token: "<|endoftext|>"
base_model: stabilityai/stablelm-2-1_6b model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer trust_remote_code: true load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: lora_alpha: lora_dropout: lora_target_modules: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/openllama-out gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false float16: true bf16: false fp16: false tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
from datasets import load_dataset
from transformers import set_seed, AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_model, MultitaskPromptTuningConfig, TaskType, MultitaskPromptTuningInit
set_seed(42)
model_name = "google/flan-t5-base"
peft_config = MultitaskPromptTuningConfig(
    tokenizer_name_or_path=model_name,
    num_tasks=2,
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=MultitaskPromptTuningInit.TEXT,
    num_virtual_tokens=50,
    num_transformer_submodules=1,
    prompt_tuning_init_text="classify the following into either positive or negative, or entailment, neutral or contradiction:",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = get_peft_model(model, peft_config)
model = model.cuda()
def send_to_device(batch):
    for i in batch:
        batch[i] = batch[i].cuda()
    return batch
def get_sst2(split: str):
    examples = load_dataset("sst2")[split]
    result_examples = []
    for example in examples:
        result_examples.append({})

        result_examples[-1]["input"] = example["sentence"].strip() + "</s>"
        result_examples[-1]["output"] = (
            f"positive{tokenizer.eos_token}" if example["label"] == 1 else f"negative{tokenizer.eos_token}"
        )
        result_examples[-1]["task_id"] = 0
    return result_examples
def get_mnli(split: str):
    examples = load_dataset("multi_nli")[split]
    result_examples = []
    for example in examples:
        result_examples.append({})

        result_examples[-1]["input"] = example["premise"].strip() + " " + example["hypothesis"].strip() + "</s>"

        if example["label"] == 0:
            result_examples[-1]["output"] = f"entailment{tokenizer.eos_token}"
        elif example["label"] == 1:
            result_examples[-1]["output"] = f"neutral{tokenizer.eos_token}"
        else:
            result_examples[-1]["output"] = f"contradiction{tokenizer.eos_token}"

        result_examples[-1]["task_id"] = 1
    return result_examples
from typing import Tuple

import torch
from torch.utils.data import Dataset, DataLoader
class MyDataset(Dataset):
    def __init__(self, split: str, mode: str = "source") -> None:
        super().__init__()

        if split == "train":
            if mode == "source":
                self.examples = get_sst2(split) + get_mnli(split)
            elif mode == "target":
                self.examples = get_sst2(split)
        if split == "val":
            self.examples = get_sst2("validation")
        if split == "test":
            self.examples = get_sst2("validation")

    def __getitem__(self, index) -> dict:
        return self.examples[index]

    def __len__(self) -> int:
        return len(self.examples)
def collate_fn(batch: dict) -> Tuple[torch.Tensor, torch.Tensor]:
    input = [i["input"] for i in batch]
    input = tokenizer(input, add_special_tokens=False, return_tensors="pt", padding=True)

    output = [i["output"] for i in batch]
    output = tokenizer(output, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
    output[output == tokenizer.pad_token_id] = -100

    task_ids = [i["task_id"] for i in batch]
    task_ids = torch.tensor(task_ids)

    return {
        "input_ids": input.input_ids,
        "attention_mask": input.attention_mask,
        "labels": output,
        "task_ids": task_ids,
    }
train = DataLoader(MyDataset("train"), shuffle=True, batch_size=8, collate_fn=collate_fn) val = DataLoader(MyDataset("val"), shuffle=False, batch_size=8, collate_fn=collate_fn) test = DataLoader(MyDataset("test"), shuffle=False, batch_size=8, collate_fn=collate_fn)
## source training
from torch.optim.adamw import AdamW
from transformers import get_cosine_schedule_with_warmup
from tqdm import tqdm
from sklearn.metrics import f1_score
POSITIVE_TOKEN_ID = tokenizer(" positive", add_special_tokens=False)["input_ids"][0]
NEGATIVE_TOKEN_ID = tokenizer(" negative", add_special_tokens=False)["input_ids"][0]
def classify(batch):
    batch = send_to_device(batch)
    # we pass labels here since we need to generate and peft doesn't support generation yet.
    # No clue how to get around this
    scores = model(**batch).logits
    preds = []
    for i in range(scores.shape[0]):
        if scores[i, 0, POSITIVE_TOKEN_ID] > scores[i, 0, NEGATIVE_TOKEN_ID]:
            preds.append(POSITIVE_TOKEN_ID)
        else:
            preds.append(NEGATIVE_TOKEN_ID)
    return preds
@torch.inference_mode()
def evaluate(model, data):
    loss = 0
    preds = []
    golds = []

    for batch in tqdm(data):
        batch = send_to_device(batch)
        loss += model(**batch).loss
        golds.extend(batch["labels"][:, 0].tolist())
        preds.extend(classify(batch))

    # normalize by the dataset actually being evaluated, not by the val loader
    return loss / len(data), f1_score(golds, preds, pos_label=POSITIVE_TOKEN_ID)
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))
n = 1000
step = 0
train_ = tqdm(train)
val_loss, f1 = evaluate(model, val)
print(
    f"""
before source training
val loss = {val_loss}
f1 = {f1}"""
)
for batch in train_:
    if step % n == 0:
        val_loss, f1 = evaluate(model, val)
        print(
            f"""
step = {step}
val loss = {val_loss}
f1 = {f1}"""
        )
        model.save_pretrained(f"checkpoints_source/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    train_.set_postfix(train_loss=loss)
## target training
train = DataLoader(MyDataset("train", "target"), shuffle=True, batch_size=8, collate_fn=collate_fn) val = DataLoader(MyDataset("val", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn) test = DataLoader(MyDataset("test", "target"), shuffle=False, batch_size=8, collate_fn=collate_fn)
#### create a fresh model
peft_config = MultitaskPromptTuningConfig(
    tokenizer_name_or_path=model_name,
    num_tasks=1,
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=MultitaskPromptTuningInit.EXACT_SOURCE_TASK,
    prompt_tuning_init_state_dict_path="checkpoints_source/50000/adapter_model.bin",
    num_virtual_tokens=50,
    num_transformer_submodules=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = get_peft_model(model, peft_config)
model = model.cuda()
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, 200, len(train))
n = 1000
step = 0
train_ = tqdm(train)
val_loss, f1 = evaluate(model, val)
print(
    f"""
before target training
val loss = {val_loss}
f1 = {f1}"""
)
for batch in train_:
    if step % n == 0:
        val_loss, f1 = evaluate(model, val)
        print(
            f"""
step = {step}
val loss = {val_loss}
f1 = {f1}"""
        )
        model.save_pretrained(f"checkpoints_target/{step}")

    step += 1
    batch = send_to_device(batch)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    train_.set_postfix(train_loss=loss)
from peft import set_peft_model_state_dict
sd_6000 = torch.load("checkpoints_target/6000/adapter_model.bin")
set_peft_model_state_dict(model, sd_6000)
val_loss, f1 = evaluate(model, val)
print(
    f"""
final
val loss = {val_loss}
f1 = {f1}"""
)

test_loss, f1 = evaluate(model, test)
print(
    f"""
final
test loss = {test_loss}
f1 = {f1}"""
)
SPECIAL_CASES_TO_ALLOW = { # 'max_position_embeddings' is not used in modeling file, but needed for eval frameworks like Huggingface's lighteval (https://github.com/huggingface/lighteval/blob/af24080ea4f16eaf1683e353042a2dfc9099f038/src/lighteval/models/base_model.py#L264). # periods and offsers are not used in modeling file, but used in the configuration file to define `layers_block_type` and `layers_num_experts`. "JambaConfig": [ "max_position_embeddings", "attn_layer_offset", "attn_layer_period", "expert_layer_offset", "expert_layer_period", ], # used to compute the property `self.chunk_length` "EncodecConfig": ["overlap"], # used to compute the property `self.layers_block_type` "RecurrentGemmaConfig": ["block_types"], # used as in the config to define `intermediate_size` "MambaConfig": ["expand"], # used as `self.bert_model = BertModel(config, ...)` "DPRConfig": True, "FuyuConfig": True, # not used in modeling files, but it's an important information "FSMTConfig": ["langs"], # used internally in the configuration class file "GPTNeoConfig": ["attention_types"], # used internally in the configuration class file "EsmConfig": ["is_folding_model"], # used during training (despite we don't have training script for these models yet) "Mask2FormerConfig": ["ignore_value"], # `ignore_value` used during training (despite we don't have training script for these models yet) # `norm` used in conversion script (despite not using in the modeling file) "OneFormerConfig": ["ignore_value", "norm"], # used during preprocessing and collation, see `collating_graphormer.py` "GraphormerConfig": ["spatial_pos_max"], # used internally in the configuration class file "T5Config": ["feed_forward_proj"], # used internally in the configuration class file # `tokenizer_class` get default value `T5Tokenizer` intentionally "MT5Config": ["feed_forward_proj", "tokenizer_class"], "UMT5Config": ["feed_forward_proj", "tokenizer_class"], # used internally in the configuration class file "LongT5Config": ["feed_forward_proj"], # used internally in the configuration class file "Pop2PianoConfig": ["feed_forward_proj"], # used internally in the configuration class file "SwitchTransformersConfig": ["feed_forward_proj"], # having default values other than `1e-5` - we can't fix them without breaking "BioGptConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "GLPNConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "SegformerConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "CvtConfig": ["layer_norm_eps"], # having default values other than `1e-5` - we can't fix them without breaking "PerceiverConfig": ["layer_norm_eps"], # used internally to calculate the feature size "InformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "TimeSeriesTransformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate the feature size "AutoformerConfig": ["num_static_real_features", "num_time_features"], # used internally to calculate `mlp_dim` "SamVisionConfig": ["mlp_ratio"], # For (head) training, but so far not implemented "ClapAudioConfig": ["num_classes"], # Not used, but providing useful information to users "SpeechT5HifiGanConfig": ["sampling_rate"], # used internally in the configuration class file "UdopConfig": ["feed_forward_proj"], # Actually used in the config or generation 
config, in that case necessary for the sub-components generation "SeamlessM4TConfig": [ "max_new_tokens", "t2u_max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", ], # Actually used in the config or generation config, in that case necessary for the sub-components generation "SeamlessM4Tv2Config": [ "max_new_tokens", "t2u_decoder_attention_heads", "t2u_decoder_ffn_dim", "t2u_decoder_layers", "t2u_encoder_attention_heads", "t2u_encoder_ffn_dim", "t2u_encoder_layers", "t2u_max_position_embeddings", "t2u_variance_pred_dropout", "t2u_variance_predictor_embed_dim", "t2u_variance_predictor_hidden_dim", "t2u_variance_predictor_kernel_size", ], }
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType
import torch
from datasets import load_dataset
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import default_data_collator, get_linear_schedule_with_warmup
from tqdm import tqdm
from datasets import load_dataset
device = "cuda" model_name_or_path = "t5-large" tokenizer_name_or_path = "t5-large"
checkpoint_name = "financial_sentiment_analysis_prefix_tuning_v1.pt" text_column = "sentence" label_column = "text_label" max_length = 128 lr = 1e-2 num_epochs = 5 batch_size = 8
peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() model
dataset = load_dataset("financial_phrasebank", "sentences_allagree") dataset = dataset["train"].train_test_split(test_size=0.1) dataset["validation"] = dataset["test"] del dataset["test"]
classes = dataset["train"].features["label"].names dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["label"]]}, batched=True, num_proc=1, )
dataset["train"][0]
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=False, desc="Running tokenizer on dataset", )
train_dataset = processed_datasets["train"] eval_dataset = processed_datasets["validation"]
train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr) lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), )
model = model.to(device)
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
correct = 0
total = 0
for pred, true in zip(eval_preds, dataset["validation"]["text_label"]):
    if pred.strip() == true.strip():
        correct += 1
    total += 1
accuracy = correct / total * 100
print(f"{accuracy=} % on the evaluation dataset")
print(f"{eval_preds[:10]=}")
print(f"{dataset['validation']['text_label'][:10]=}")
peft_model_id = f"{model_name_or_path}{peft_config.peft_type}{peft_config.task_type}" model.save_pretrained(peft_model_id)
ckpt = f"{peft_model_id}/adapter_model.bin" !du -h $ckpt
from peft import PeftModel, PeftConfig
peft_model_id = f"{model_name_or_path}{peft_config.peft_type}{peft_config.task_type}"
config = PeftConfig.from_pretrained(peft_model_id) model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path) model = PeftModel.from_pretrained(model, peft_model_id)
model.eval()
i = 107
inputs = tokenizer(dataset["validation"][text_column][i], return_tensors="pt")
print(dataset["validation"][text_column][i])
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
IDX=$1 PROMPT_IDX=$((IDX % 25)) CLASS_IDX=$((IDX % 30)) # Define the UNIQUE_TOKEN, CLASS_TOKENs, and SUBJECT_NAMES UNIQUE_TOKEN="qwe" SUBJECT_NAMES=( "backpack" "backpack_dog" "bear_plushie" "berry_bowl" "can" "candle" "cat" "cat2" "clock" "colorful_sneaker" "dog" "dog2" "dog3" "dog5" "dog6" "dog7" "dog8" "duck_toy" "fancy_boot" "grey_sloth_plushie" "monster_toy" "pink_sunglasses" "poop_emoji" "rc_car" "red_cartoon" "robot_toy" "shiny_sneaker" "teapot" "vase" "wolf_plushie" ) CLASS_TOKENs=( "backpack" "backpack" "stuffed animal" "bowl" "can" "candle" "cat" "cat" "clock" "sneaker" "dog" "dog" "dog" "dog" "dog" "dog" "dog" "toy" "boot" "stuffed animal" "toy" "glasses" "toy" "toy" "cartoon" "toy" "sneaker" "teapot" "vase" "stuffed animal" ) CLASS_TOKEN=${CLASS_TOKENs[$CLASS_IDX]} SELECTED_SUBJECT=${SUBJECT_NAMES[$CLASS_IDX]} if [[ $CLASS_IDX =~ ^(0|1|2|3|4|5|8|9|17|18|19|20|21|22|23|24|25|26|27|28|29)$ ]]; then PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a wheat field in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a tree and autumn leaves in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with the Eiffel Tower in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating on top of water." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} floating in an ocean of milk." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of green grass with sunflowers around it." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a mirror." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of the sidewalk in a crowded street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a dirt road." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a white rug." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} with a wheat field in the background" "a ${CLASS_TOKEN} with a tree and autumn leaves in the background" "a ${CLASS_TOKEN} with the Eiffel Tower in the background" "a ${CLASS_TOKEN} floating on top of water" "a ${CLASS_TOKEN} floating in an ocean of milk" "a ${CLASS_TOKEN} on top of green grass with sunflowers around it" "a ${CLASS_TOKEN} on top of a mirror" "a ${CLASS_TOKEN} on top of the sidewalk in a crowded street" "a ${CLASS_TOKEN} on top of a dirt road" "a ${CLASS_TOKEN} on top of a white rug" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) else PROMPT_LIST=( "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the jungle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in the snow." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on the beach." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on a cobblestone street." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of pink fabric." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a wooden floor." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a city in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a mountain in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} with a blue house in the background." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} on top of a purple rug in a forest." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a red hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a santa hat." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a rainbow scarf." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a black top hat and a monocle." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a chef outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a firefighter outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a police outfit." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing pink glasses." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} wearing a yellow shirt." "a ${UNIQUE_TOKEN} ${CLASS_TOKEN} in a purple wizard outfit." "a red ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a purple ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a shiny ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a wet ${UNIQUE_TOKEN} ${CLASS_TOKEN}." "a cube shaped ${UNIQUE_TOKEN} ${CLASS_TOKEN}." 
) prompt_test_list=( "a ${CLASS_TOKEN} in the jungle" "a ${CLASS_TOKEN} in the snow" "a ${CLASS_TOKEN} on the beach" "a ${CLASS_TOKEN} on a cobblestone street" "a ${CLASS_TOKEN} on top of pink fabric" "a ${CLASS_TOKEN} on top of a wooden floor" "a ${CLASS_TOKEN} with a city in the background" "a ${CLASS_TOKEN} with a mountain in the background" "a ${CLASS_TOKEN} with a blue house in the background" "a ${CLASS_TOKEN} on top of a purple rug in a forest" "a ${CLASS_TOKEN} wearing a red hat" "a ${CLASS_TOKEN} wearing a santa hat" "a ${CLASS_TOKEN} wearing a rainbow scarf" "a ${CLASS_TOKEN} wearing a black top hat and a monocle" "a ${CLASS_TOKEN} in a chef outfit" "a ${CLASS_TOKEN} in a firefighter outfit" "a ${CLASS_TOKEN} in a police outfit" "a ${CLASS_TOKEN} wearing pink glasses" "a ${CLASS_TOKEN} wearing a yellow shirt" "a ${CLASS_TOKEN} in a purple wizard outfit" "a red ${CLASS_TOKEN}" "a purple ${CLASS_TOKEN}" "a shiny ${CLASS_TOKEN}" "a wet ${CLASS_TOKEN}" "a cube shaped ${CLASS_TOKEN}" ) fi VALIDATION_PROMPT=${PROMPT_LIST[@]} INSTANCE_PROMPT="a photo of ${UNIQUE_TOKEN} ${CLASS_TOKEN}" CLASS_PROMPT="a photo of ${CLASS_TOKEN}" export MODEL_NAME="stabilityai/stable-diffusion-2-1" # export MODEL_NAME="runwayml/stable-diffusion-v1-5" PEFT_TYPE="boft" BLOCK_NUM=8 BLOCK_SIZE=0 N_BUTTERFLY_FACTOR=1 export PROJECT_NAME="dreambooth_${PEFT_TYPE}" export RUN_NAME="${SELECTED_SUBJECT}_${PEFT_TYPE}_${BLOCK_NUM}${BLOCK_SIZE}${N_BUTTERFLY_FACTOR}" export INSTANCE_DIR="./data/dreambooth/dataset/${SELECTED_SUBJECT}" export CLASS_DIR="./data/class_data/${CLASS_TOKEN}" export OUTPUT_DIR="./data/output/${PEFT_TYPE}" accelerate launch train_dreambooth.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --class_data_dir="$CLASS_DIR" \ --output_dir=$OUTPUT_DIR \ --wandb_project_name=$PROJECT_NAME \ --wandb_run_name=$RUN_NAME \ --with_prior_preservation --prior_loss_weight=1.0 \ --instance_prompt="$INSTANCE_PROMPT" \ --validation_prompt="$VALIDATION_PROMPT" \ --class_prompt="$CLASS_PROMPT" \ --resolution=512 \ --train_batch_size=1 \ --num_dataloader_workers=2 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --num_class_images=200 \ --use_boft \ --boft_block_num=$BLOCK_NUM \ --boft_block_size=$BLOCK_SIZE \ --boft_n_butterfly_factor=$N_BUTTERFLY_FACTOR \ --boft_dropout=0.1 \ --boft_bias="boft_only" \ --learning_rate=3e-5 \ --max_train_steps=1010 \ --checkpointing_steps=200 \ --validation_steps=200 \ --enable_xformers_memory_efficient_attention \ --report_to="wandb" \
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
import os
import torch from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit from torch.utils.data import DataLoader from tqdm import tqdm from datasets import load_dataset
os.environ["TOKENIZERS_PARALLELISM"] = "false"
device = "cuda" model_name_or_path = "t5-large" tokenizer_name_or_path = "t5-large"
checkpoint_name = "financial_sentiment_analysis_prompt_tuning_v1.pt" text_column = "sentence" label_column = "text_label" max_length = 128 lr = 1 num_epochs = 5 batch_size = 8
peft_config = PromptTuningConfig( task_type=TaskType.SEQ_2_SEQ_LM, prompt_tuning_init=PromptTuningInit.TEXT, num_virtual_tokens=20, prompt_tuning_init_text="What is the sentiment of this article?\n", inference_mode=False, tokenizer_name_or_path=model_name_or_path, )
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() model
dataset = load_dataset("financial_phrasebank", "sentences_allagree") dataset = dataset["train"].train_test_split(test_size=0.1) dataset["validation"] = dataset["test"] del dataset["test"]
classes = dataset["train"].features["label"].names dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["label"]]}, batched=True, num_proc=1, )
dataset["train"][0]
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(
        targets, max_length=target_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=False, desc="Running tokenizer on dataset", )
train_dataset = processed_datasets["train"] eval_dataset = processed_datasets["validation"]
train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr) lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), )
model = model.to(device)
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
correct = 0
total = 0
for pred, true in zip(eval_preds, dataset["validation"]["text_label"]):
    if pred.strip() == true.strip():
        correct += 1
    total += 1
accuracy = correct / total * 100
print(f"{accuracy=} % on the evaluation dataset")
print(f"{eval_preds[:10]=}")
print(f"{dataset['validation']['text_label'][:10]=}")
peft_model_id = f"{model_name_or_path}{peft_config.peft_type}{peft_config.task_type}" model.save_pretrained(peft_model_id)
ckpt = f"{peft_model_id}/adapter_model.bin" !du -h $ckpt
from peft import PeftModel, PeftConfig
peft_model_id = f"{model_name_or_path}{peft_config.peft_type}{peft_config.task_type}"
config = PeftConfig.from_pretrained(peft_model_id) model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path) model = PeftModel.from_pretrained(model, peft_model_id)
model.eval()
i = 107
input_ids = tokenizer(dataset["validation"][text_column][i], return_tensors="pt").input_ids
print(dataset["validation"][text_column][i])
print(input_ids)

with torch.no_grad():
    outputs = model.generate(input_ids=input_ids, max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
continue with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch if args.with_tracking: total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps }" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(**batch) loss = outputs.loss # New Code # For Megatron-LM, the losses are already averaged across the data parallel group if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses.append(loss) else: losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) try: if accelerator.distributed_type == DistributedType.MEGATRON_LM: losses = torch.tensor(losses) else: losses = torch.cat(losses) eval_loss = torch.mean(losses) perplexity = math.exp(eval_loss) except OverflowError: perplexity = float("inf") logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( { "perplexity": perplexity, "eval_loss": eval_loss, "train_loss": total_loss.item() / len(train_dataloader), "epoch": epoch, "step": completed_steps, }, step=completed_steps, ) if args.push_to_hub and epoch < args.num_train_epochs - 1: accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=f"Training in progress epoch {epoch}", run_as_future=True, ) if args.checkpointing_steps == "epoch": output_dir = f"epoch_{epoch}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) # this is causing some issue with Megatron-LM when using `wandb` at the end of the main function. # Everything works fine inspite of commenting this out. (wandb finishes/closes the run without error) # if args.with_tracking: # accelerator.end_training() if args.output_dir is not None: accelerator.wait_for_everyone() # New Code # For Megatron-LM, we need to save the model using `accelerator.save_state` if accelerator.distributed_type == DistributedType.MEGATRON_LM: accelerator.save_state(args.output_dir) else: unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained( args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save ) if accelerator.is_main_process: tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message="End of training", ) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump({"perplexity": perplexity}, f)
def parse_args(): parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task") parser.add_argument( "--dataset_name", type=str, default=None, help="The name of the dataset to use (via the datasets library).", ) parser.add_argument( "--dataset_config_name", type=str, default=None, help="The configuration name of the dataset to use (via the datasets library).", ) parser.add_argument( "--train_file", type=str, default=None, help="A csv or a json file containing the training data." ) parser.add_argument( "--validation_file", type=str, default=None, help="A csv or a json file containing the validation data." ) parser.add_argument( "--validation_split_percentage", default=5, help="The percentage of the train set used as validation set in case there's no validation split", ) parser.add_argument( "--model_name_or_path", type=str, help="Path to pretrained model or model identifier from huggingface.co/models.", required=False, ) parser.add_argument( "--config_name", type=str, default=None, help="Pretrained config name or path if not the same as model_name", ) parser.add_argument( "--tokenizer_name", type=str, default=None, help="Pretrained tokenizer name or path if not the same as model_name", ) parser.add_argument( "--use_slow_tokenizer", action="store_true", help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", ) parser.add_argument( "--per_device_train_batch_size", type=int, default=8, help="Batch size (per device) for the training dataloader.", ) parser.add_argument( "--per_device_eval_batch_size", type=int, default=8, help="Batch size (per device) for the evaluation dataloader.", ) parser.add_argument( "--learning_rate", type=float, default=5e-5, help="Initial learning rate (after the potential warmup period) to use.", ) parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.") parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") parser.add_argument( "--max_train_steps", type=int, default=None, help="Total number of training steps to perform. If provided, overrides num_train_epochs.", ) parser.add_argument( "--gradient_accumulation_steps", type=int, default=1, help="Number of updates steps to accumulate before performing a backward/update pass.", ) parser.add_argument( "--lr_scheduler_type", type=SchedulerType, default="linear", help="The scheduler type to use.", choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], ) parser.add_argument( "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler." ) parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.") parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.") parser.add_argument( "--model_type", type=str, default=None, help="Model type to use if training from scratch.", choices=MODEL_TYPES, ) parser.add_argument( "--block_size", type=int, default=None, help=( "Optional input sequence length after tokenization. The training dataset will be truncated in block of" " this size for training. Default to the model max input length for single sentence inputs (take into" " account special tokens)." 
), ) parser.add_argument( "--preprocessing_num_workers", type=int, default=None, help="The number of processes to use for the preprocessing.", ) parser.add_argument( "--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets" ) parser.add_argument( "--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files." ) parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.") parser.add_argument( "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") parser.add_argument( "--checkpointing_steps", type=str, default=None, help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.", ) parser.add_argument( "--resume_from_checkpoint", type=str, default=None, help="If the training should continue from a checkpoint folder.", ) # New Code # # Whether to load the best model at the end of training parser.add_argument( "--load_best_model", action="store_true", help="Whether to load the best model at the end of training", ) parser.add_argument( "--with_tracking", action="store_true", help="Whether to enable experiment trackers for logging.", ) parser.add_argument( "--report_to", type=str, default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' ' `"wandb"`, `"comet_ml"`, and `"dvclive"`. Use `"all"` (default) to report to all integrations.' "Only applicable when `--with_tracking` is passed." ), ) args = parser.parse_args() # Sanity checks if args.dataset_name is None and args.train_file is None and args.validation_file is None: raise ValueError("Need either a dataset name or a training/validation file.") else: if args.train_file is not None: extension = args.train_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file." if args.validation_file is not None: extension = args.validation_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file." if args.push_to_hub: assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." return args
Pre-Requisites: Install DeepSpeed version >=0.6.5. Please refer to the DeepSpeed Installation details for more information.
We will first look at the easy-to-use integration via accelerate config, followed by the more flexible and feature-rich deepspeed config file integration.
On your machine(s) just run:
accelerate config
and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed to which you should answer no. Then answer the following questions to generate a basic DeepSpeed config. This will generate a config file that will be used automatically to properly set the default options when doing
accelerate launch my_script.py --args_to_my_script
For instance, here is how you would run the NLP example examples/nlp_example.py
(from the root of the repo) with DeepSpeed Plugin:
ZeRO Stage-2 DeepSpeed Plugin Example
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
accelerate launch examples/nlp_example.py --mixed_precision fp16
ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
accelerate launch examples/nlp_example.py --mixed_precision fp16
Currently, Accelerate supports the following config options through the CLI:
`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_optimizer_nvme_path`: Decides the NVMe path to offload optimizer states to. If unspecified, will default to 'none'.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`offload_param_nvme_path`: Decides the NVMe path to offload parameters to. If unspecified, will default to 'none'.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.
`deepspeed_moe_layer_cls_names`: Comma-separated list of transformer Mixture-of-Experts (MoE) layer class names (case-sensitive) to wrap, e.g. `MixtralSparseMoeBlock`, `Qwen2MoeSparseMoeBlock`, `JetMoEAttention,JetMoEBlock` ...
`deepspeed_hostfile`: DeepSpeed hostfile for configuring multi-node compute resources.
`deepspeed_exclusion_filter`: DeepSpeed exclusion filter string when using a multi-node setup.
`deepspeed_inclusion_filter`: DeepSpeed inclusion filter string when using a multi-node setup.
`deepspeed_multinode_launcher`: DeepSpeed multi-node launcher to use. If unspecified, will default to `pdsh`.
`deepspeed_config_file`: Path to the DeepSpeed config file in `json` format. See the next section for more details on this.
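If you prefer to set these plugin options in code rather than answering the `accelerate config` questionnaire, a minimal sketch (illustrative only; it mirrors the ZeRO Stage-2 options above rather than any example shipped with the repo) looks like this:

# Minimal sketch (illustrative): the same DeepSpeed plugin options expressed in
# code instead of via `accelerate config`.
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
    offload_optimizer_device="none",
    offload_param_device="none",
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)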
To be able to tweak more options, you will need to use a DeepSpeed config file.
On your machine(s) just run:
accelerate config
and answer the questions asked. It will ask whether you want to use a config file for deepspeed to which you answer yes and provide the path to the deepspeed config file. This will generate a config file that will be used automatically to properly set the default options when doing
accelerate launch my_script.py --args_to_my_script
For instance, here is how you would run the NLP example examples/by_feature/deepspeed_with_config_support.py
(from the root of the repo) with DeepSpeed Config File:
ZeRO Stage-2 DeepSpeed Config File Example
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage2_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
with the contents of `zero_stage2_config.json` being:
{ "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "weight_decay": "auto", "torch_adam": true, "adam_w_mode": true } }, "scheduler": { "type": "WarmupDecayLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto", "total_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": "auto", "contiguous_gradients": true }, "gradient_accumulation_steps": 1, "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
```bash
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
  --config_name "gpt2-large" \
  --tokenizer_name "gpt2-large" \
  --dataset_name "wikitext" \
  --dataset_config_name "wikitext-2-raw-v1" \
  --block_size 128 \
  --output_dir "./clm/clm_deepspeed_stage2_accelerate" \
  --learning_rate 5e-4 \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --num_train_epochs 3 \
  --with_tracking \
  --report_to "wandb"
```
ZeRO Stage-3 with CPU offload DeepSpeed Config File Example
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage3_offload_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
with the contents of `zero_stage3_offload_config.json` being:
{ "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupDecayLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto", "total_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "sub_group_size": 1e9, "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": "auto" }, "gradient_accumulation_steps": 1, "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
```bash
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
  --config_name "gpt2-large" \
  --tokenizer_name "gpt2-large" \
  --dataset_name "wikitext" \
  --dataset_config_name "wikitext-2-raw-v1" \
  --block_size 128 \
  --output_dir "./clm/clm_deepspeed_stage3_offload_accelerate" \
  --learning_rate 5e-4 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --num_train_epochs 3 \
  --with_tracking \
  --report_to "wandb"
```
ZeRO++ Config Example
You can use the features of ZeRO++ by setting the appropriate config parameters. Note that ZeRO++ is an extension of ZeRO Stage 3. Here is how the config file can be modified, following DeepSpeed's ZeRO++ tutorial:
{ "zero_optimization": { "stage": 3, "reduce_bucket_size": "auto", "zero_quantized_weights": true, "zero_hpz_partition_size": 8, "zero_quantized_gradients": true, "contiguous_gradients": true, "overlap_comm": true } }
For hierarchical partitioning, the partition size `zero_hpz_partition_size` should ideally be set to the number of GPUs per node (for example, the config file above assumes 8 GPUs per node).
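As a rough sketch, you could patch the partition size into an existing ZeRO++ JSON before launching. This assumes all processes run on a single node, and the file name below is illustrative:

```python
# Sketch: set `zero_hpz_partition_size` to the number of GPUs visible on this node.
# Assumes a single-node run; "zero_pp_config.json" is a hypothetical file name.
import json
import torch

config_path = "zero_pp_config.json"

with open(config_path) as f:
    ds_config = json.load(f)

# Hierarchical partitioning works best when the partition size equals the
# number of GPUs per node.
ds_config["zero_optimization"]["zero_hpz_partition_size"] = torch.cuda.device_count()

with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)
```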
Important code changes when using DeepSpeed Config File
DeepSpeed Optimizers and Schedulers. For more information on these, see the DeepSpeed Optimizers and DeepSpeed Schedulers documentation. We will look at the changes needed in the code when using these.
a. DS Optim + DS Scheduler: The case when both `optimizer` and `scheduler` keys are present in the DeepSpeed config file.
In this situation, those will be used, and the user has to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler` to replace the PyTorch/custom optimizers and schedulers in their code.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
# Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
optimizer_cls = (
    torch.optim.AdamW
    if accelerator.state.deepspeed_plugin is None
    or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
    else DummyOptim
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)

# Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
if (
    accelerator.state.deepspeed_plugin is None
    or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
):
    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=args.max_train_steps,
    )
else:
    lr_scheduler = DummyScheduler(
        optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
    )
```
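For completeness, here is a sketch of what typically follows: whether the objects are real or dummy placeholders, they are passed through `accelerator.prepare()` in the same way (variable names are assumed from the example script):

```python
# Sketch: the (possibly dummy) optimizer and scheduler go through prepare() exactly
# like their regular PyTorch counterparts; DeepSpeed then substitutes the real
# objects defined in the config file.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
```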
b. Custom Optim + Custom Scheduler: The case when both `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.
In this situation, no code changes are needed from the user; this is the case when using the integration via the DeepSpeed Plugin.
In the above example, we can see that the code remains unchanged if the `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.
c. Custom Optim + DS Scheduler: The case when only the `scheduler` key is present in the DeepSpeed config file.
In this situation, the user has to use `accelerate.utils.DummyScheduler` to replace the PyTorch/custom scheduler in their code (a minimal sketch of this case is shown after case d below).
d. DS Optim + Custom Scheduler: The case when only the `optimizer` key is present in the DeepSpeed config file.
This will result in an error because you can only use a DS scheduler when using a DS optimizer.
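A minimal sketch of case (c), assuming the DeepSpeed config file contains only a `scheduler` entry; the optimizer choice, learning rate, and step counts are illustrative, and `model`, `accelerator`, `max_train_steps` and `num_warmup_steps` are assumed to exist in the surrounding training script:

```python
# Case (c): a regular PyTorch optimizer combined with DummyScheduler, because the
# actual scheduler comes from the `scheduler` entry of the DeepSpeed config file.
import torch
from accelerate.utils import DummyScheduler

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # custom optimizer (illustrative lr)
lr_scheduler = DummyScheduler(
    optimizer,
    total_num_steps=max_train_steps,    # assumed defined in the training script
    warmup_num_steps=num_warmup_steps,  # assumed defined in the training script
)
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)
```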
Notice the `auto` values in the above example DeepSpeed config files. These are handled automatically by the `prepare` method based on the model, dataloaders, dummy optimizer and dummy scheduler provided to it.
Only the `auto` fields specified in the above examples are handled by the `prepare` method; the rest have to be specified explicitly by the user.
The `auto` values are calculated as:
- `reduce_bucket_size`: `hidden_size * hidden_size`
- `stage3_prefetch_bucket_size`: `int(0.9 * hidden_size * hidden_size)`
- `stage3_param_persistence_threshold`: `10 * hidden_size`
For the `auto` feature to work for these 3 config entries, Accelerate will use `model.config.hidden_size` or `max(model.config.hidden_sizes)` as `hidden_size`. If neither of these is available, the launch will fail and you will have to set these 3 config entries manually. Remember that the first 2 config entries are the communication buffers: the larger they are, the more efficient the communication will be, but the more GPU memory they will consume, so this is a tunable performance trade-off.
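As a quick worked example (the hidden size below is illustrative, not tied to any particular model):

```python
# Worked example of the `auto` formulas for an illustrative hidden size.
hidden_size = 4096  # taken from model.config.hidden_size (or max(model.config.hidden_sizes))

reduce_bucket_size = hidden_size * hidden_size                      # 16_777_216
stage3_prefetch_bucket_size = int(0.9 * hidden_size * hidden_size)  # 15_099_494
stage3_param_persistence_threshold = 10 * hidden_size               # 40_960
```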
Things to note when using DeepSpeed Config File
Below is a sample script using `deepspeed_config_file` in different scenarios.

Code `test.py`:
```python
from accelerate import Accelerator
from accelerate.state import AcceleratorState


def main():
    accelerator = Accelerator()
    accelerator.print(f"{AcceleratorState()}")


if __name__ == "__main__":
    main()
```
Scenario 1: A manually edited accelerate config file that contains `deepspeed_config_file` along with other DeepSpeed entries.

`accelerate` config:

```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: 'cpu'
  offload_param_device: 'cpu'
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  deepspeed_config_file: 'ds_config.json'
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
`ds_config.json`:

```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": false,
    "offload_optimizer": {
      "device": "none"
    },
    "offload_param": {
      "device": "none"
    }
  },
  "gradient_clipping": 1.0,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": 10,
  "steps_per_print": 2000000
}
```
Output of `accelerate launch test.py`:

```
ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be ignored:
['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 'offload_optimizer_device', 'offload_param_device',
'zero3_save_16bit_model', 'mixed_precision'].
Please specify them appropriately in the DeepSpeed config file.
If you are using an accelerate config file, remove other config variables mentioned in the above specified list.
The easiest method is to create a new config following the questionnaire via `accelerate config`.
It will only ask for the necessary config variables when using `deepspeed_config_file`.
```
Scenario 2: Use the solution suggested by the error message to create a new accelerate config, and check that no ambiguity error is thrown this time.
Run of `accelerate config`:

```
$ accelerate config
-------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: ds_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:4
accelerate configuration saved at ds_config_sample.yaml
```
Content of the `accelerate` config saved at `ds_config_sample.yaml`:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: ds_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
```

Output of `accelerate launch test.py`:

```
Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': False, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 10, 'steps_per_print': inf, 'fp16': {'enabled': False}}
```
Scenario 3: Setting the `accelerate launch` command arguments related to DeepSpeed to `"auto"` in the DeepSpeed configuration file and checking that things work as expected.
`ds_config.json` with `"auto"` for the `accelerate launch` DeepSpeed command arguments:

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": "auto",
    "stage3_gather_16bit_weights_on_model_save": "auto",
    "offload_optimizer": {
      "device": "auto"
    },
    "offload_param": {
      "device": "auto"
    }
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 2000000
}
```
`accelerate launch` with the DeepSpeed-related arguments set explicitly:

```bash
accelerate launch --mixed_precision="fp16" --zero_stage=3 --gradient_accumulation_steps=5 --gradient_clipping=1.0 \
  --offload_param_device="cpu" --offload_optimizer_device="nvme" --zero3_save_16bit_model="true" test.py
```

Output:

```
Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
ds_config: {'bf16': {'enabled': False}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': True, 'offload_optimizer': {'device': 'nvme'}, 'offload_param': {'device': 'cpu'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 5, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
```
Note:
- The `"auto"` values are handled in the `accelerator.prepare()` call, as explained in point 2 of "Important code changes when using DeepSpeed Config File".
- When `gradient_accumulation_steps` is `auto`, the value passed while creating the `Accelerator` object via `Accelerator(gradient_accumulation_steps=k)` will be used (a sketch follows below). When using the DeepSpeed Plugin, the value from the plugin will be used instead and it will overwrite the value passed while creating the `Accelerator` object.
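For illustration, a sketch of the second point, assuming the DeepSpeed config file sets `"gradient_accumulation_steps": "auto"` and no DeepSpeed Plugin is used (the value 4 is illustrative):

```python
# Sketch: with "gradient_accumulation_steps": "auto" in the DeepSpeed config file
# and no DeepSpeed Plugin, the value passed here fills in the `auto` entry.
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
```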