Phorm.ai
OpenAccess-AI-Collective/axolotl

    [2024-06-12 12:03:44,694] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 4 (pid: 3830) of binary: /root/miniconda3/envs/py3.10/bin/python

    Traceback (most recent call last):
      File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
        args.func(args)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
        multi_gpu_launcher(args)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
        distrib_run.run(args)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
        elastic_launch(
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

    help

    The error message you're encountering indicates that a child process started by PyTorch's distributed launcher has failed. In your log, `exitcode: -7` means the worker (local_rank 4) was killed by signal 7 (SIGBUS), which commonly points to exhausted shared memory (for example, a Docker container with a small `/dev/shm`), but it can also be caused by problems with the environment setup, an incorrect distributed configuration, or an error inside the script being executed in a distributed manner.

    Here are some steps to troubleshoot and potentially resolve the issue:

    1. Check Environment Setup: Ensure that your environment is correctly set up for distributed training. This includes verifying that all necessary dependencies are installed and compatible with each other.

    2. Verify Distributed Configuration: Make sure that the configuration passed to the distributed launcher is correct. This includes the number of nodes, the number of GPUs per node, and any other relevant distributed settings.

    3. Inspect the Script for Errors: The issue might be within the script you're trying to run in a distributed manner. Run the script in a non-distributed mode (if possible) to check for any errors that might not be related to the distributed setup.

    4. Increase Shared Memory Size: If your script involves a significant amount of data transfer between processes, you might need to increase the shared memory size available to Docker containers or your system. This is a common issue when using PyTorch's DataLoader with num_workers set to a value greater than 0 (see the example below, after this list).

    5. Debug Mode and Logging: Utilize the debug mode and logging to get more detailed error messages. For accelerate, you can enable debug mode as mentioned in the Hanging code and timeout errors section of the Accelerate documentation. This might provide more insights into what's causing the child process to fail.

    6. Check for Deadlocks: Ensure that your code does not have any potential deadlocks, especially when using custom synchronization primitives or when performing collective operations (e.g., torch.distributed.barrier()).

    7. System Resources: Verify that your system has enough resources (CPU, GPU, memory) to run the distributed job. Insufficient resources can lead to various issues, including process termination.

    8. Update PyTorch and Accelerate: Ensure that you're using the latest versions of PyTorch and Accelerate, as bugs in earlier versions might have been fixed in newer releases.

    9. Consult Documentation and Community: If the issue persists, consult the official documentation of PyTorch's distributed package and Accelerate. You might also find it helpful to search for or ask questions on community forums such as the PyTorch Discussion Forum or GitHub issues for the Accelerate repository.

    If you're able to provide more specific details about your setup and the script you're trying to run, I might be able to offer more targeted advice.
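
    As a concrete starting point for steps 3-5, here is a minimal sketch; `train.py` and `--your-args` are placeholders for your own entry point and arguments, and the Docker image is the stock axolotl image referenced further below:

    ```bash
    # 1. Rule out plain script errors by running outside the distributed launcher
    #    (assuming your script can run on a single GPU):
    CUDA_VISIBLE_DEVICES=0 python train.py --your-args

    # 2. Relaunch with Accelerate's operational debug mode for a clearer traceback:
    accelerate launch --debug train.py --your-args

    # 3. exitcode -7 corresponds to SIGBUS (signal 7), which is frequently a
    #    too-small /dev/shm inside Docker; give the container more shared memory:
    docker run --gpus '"all"' --shm-size 10g --ipc=host --rm -it winglian/axolotl:main-latest
    ```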

    Search results
    [huggingface/accelerate] src/accelerate/commands/launch.py
    def multi_gpu_launcher(args): import torch.distributed.run as distrib_run current_env = prepare_multi_gpu_env(args) if not check_cuda_p2p_ib_support(): message = "Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled." warn = False if "NCCL_P2P_DISABLE" not in current_env: current_env["NCCL_P2P_DISABLE"] = "1" warn = True if "NCCL_IB_DISABLE" not in current_env: current_env["NCCL_IB_DISABLE"] = "1" warn = True if warn: logger.warning(message) debug = getattr(args, "debug", False) args = _filter_args( args, distrib_run.get_args_parser(), ["--training_script", args.training_script, "--training_script_args", args.training_script_args], ) with patch_environment(**current_env): try: distrib_run.run(args) except Exception: if is_rich_available() and debug: console = get_console() console.print("\n[bold red]Using --debug, `torch.distributed` Stack Trace:[/bold red]") console.print_exception(suppress=[__file__], show_locals=False) else: raise
    [huggingface/transformers] tests/sagemaker/scripts/pytorch/run_ddp.py
    def main(): args = parse_args() port = 8888 num_gpus = int(os.environ["SM_NUM_GPUS"]) hosts = json.loads(os.environ["SM_HOSTS"]) num_nodes = len(hosts) current_host = os.environ["SM_CURRENT_HOST"] rank = hosts.index(current_host) os.environ["NCCL_DEBUG"] = "INFO" if num_nodes > 1: cmd = f"""python -m torch.distributed.launch \ --nnodes={num_nodes} \ --node_rank={rank} \ --nproc_per_node={num_gpus} \ --master_addr={hosts[0]} \ --master_port={port} \ ./run_glue.py \ {"".join([f" --{parameter} {value}" for parameter,value in args.__dict__.items()])}""" else: cmd = f"""python -m torch.distributed.launch \ --nproc_per_node={num_gpus} \ ./run_glue.py \ {"".join([f" --{parameter} {value}" for parameter,value in args.__dict__.items()])}""" try: subprocess.run(cmd, shell=True) except Exception as e: logger.info(e)
    [huggingface/accelerate] src/accelerate/utils/constants.py
    # These are the args for `torch.distributed.launch` for pytorch < 1.9 TORCH_LAUNCH_PARAMS = [ "nnodes", "nproc_per_node", "rdzv_backend", "rdzv_endpoint", "rdzv_id", "rdzv_conf", "standalone", "max_restarts", "monitor_interval", "start_method", "role", "module", "m", "no_python", "run_path", "log_dir", "r", "redirects", "t", "tee", "node_rank", "master_addr", "master_port", ]
    [huggingface/accelerate] docs/source/basic_tutorials/troubleshooting.md

    Hanging code and timeout errors

    There can be many reasons why your code is hanging. Let's take a look at how to solve some of the most common issues that can cause your code to hang.

    Mismatched tensor shapes

    Mismatched tensor shapes are a common issue that can cause your code to hang for a significant amount of time on a distributed setup.

    When running scripts in a distributed setup, functions such as [Accelerator.gather] and [Accelerator.reduce] are necessary to grab tensors across devices to collectively perform operations on them. These (and other) functions rely on torch.distributed to perform a gather operation, which requires tensors to have the exact same shape across all processes. When the tensor shapes don't match, your code hangs and you'll eventually hit a timeout exception.

    You can use Accelerate's operational debug mode to immediately catch this issue. We recommend enabling this mode during the accelerate config setup, but you can also enable it from the CLI, as an environment variable, or by manually editing the config.yaml file.

    <hfoptions id="mismatch"> <hfoption id="CLI">
    accelerate launch --debug {my_script.py} --arg1 --arg2
    </hfoption> <hfoption id="environment variable">

    If enabling debug mode as an environment variable, you don't need to call accelerate launch.

    ACCELERATE_DEBUG_MODE="1" torchrun {my_script.py} --arg1 --arg2
    </hfoption> <hfoption id="config.yaml">

    Add debug: true to your config.yaml file.

    compute_environment: LOCAL_MACHINE
    debug: true
    </hfoption> </hfoptions>

    Once you enable debug mode, you should get a traceback that points to the tensor shape mismatch issue.

    Traceback (most recent call last):
      File "/home/zach_mueller_huggingface_co/test.py", line 18, in <module>
        main()
      File "/home/zach_mueller_huggingface_co/test.py", line 15, in main
        broadcast_tensor = broadcast(tensor)
      File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py", line 303, in wrapper
    accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

    Operation: `accelerate.utils.operations.broadcast`
    Input shapes:
      - Process 0: [1, 5]
      - Process 1: [1, 2, 5]
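
    For reference, here is a minimal sketch of a script that produces this kind of traceback when launched with `accelerate launch --debug` on two processes (an illustration, not the exact `test.py` from the log above):

    ```python
    import torch
    from accelerate import Accelerator
    from accelerate.utils import broadcast


    def main():
        accelerator = Accelerator()
        if accelerator.process_index == 0:
            tensor = torch.ones(1, 5, device=accelerator.device)
        else:
            # A different shape on the other process triggers the
            # DistributedOperationException above instead of a silent hang.
            tensor = torch.ones(1, 2, 5, device=accelerator.device)
        broadcast_tensor = broadcast(tensor)
        accelerator.print(broadcast_tensor.shape)


    if __name__ == "__main__":
        main()
    ```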

    Early stopping

    For early stopping in distributed training, if each process has a specific stopping condition (e.g. validation loss), it may not be synchronized across all processes. As a result, a break can happen on process 0 but not on process 1, which will cause your code to hang indefinitely until a timeout occurs.

    If you have early stopping conditionals, use the set_breakpoint and check_breakpoint methods to make sure all the processes are ended correctly.

    # Assume `should_do_breakpoint` is a custom defined function that returns a conditional,
    # and that conditional might be true only on process 1
    if should_do_breakpoint(loss):
        accelerator.set_breakpoint()

    # Later in the training script when we need to check for the breakpoint
    if accelerator.check_breakpoint():
        break

    Low kernel versions on Linux

    On Linux with kernel version < 5.5, hanging processes have been reported. To avoid this problem, upgrade your system to a later kernel version.

    MPI

    If your distributed CPU training job using MPI is hanging, ensure that you have passwordless SSH set up (using keys) between the nodes. This means that for all nodes in your hostfile, you should be able to SSH from one node to another without being prompted for a password.

    Next, try to run the mpirun command as a sanity check. For example, the command below should print out the hostnames for each of the nodes.

    mpirun -f hostfile -n {number of nodes} -ppn 1 hostname
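
    For reference, the hostfile here is typically just one reachable hostname or IP address per line (`node-1` and `node-2` below are example names):

    ```text
    node-1
    node-2
    ```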
    [huggingface/accelerate] src/accelerate/utils/launch.py
    def __call__(self, index, *args): if self.debug: world_size = int(os.environ.get("WORLD_SIZE")) rdv_file = os.environ.get("ACCELERATE_DEBUG_RDV_FILE") torch.distributed.init_process_group( "gloo", rank=index, store=torch.distributed.FileStore(rdv_file, world_size), world_size=world_size, ) elif self.distributed_type in ( DistributedType.MULTI_GPU, DistributedType.MULTI_MLU, DistributedType.MULTI_NPU, DistributedType.MULTI_XPU, DistributedType.MULTI_CPU, ): # Prepare the environment for torch.distributed os.environ["LOCAL_RANK"] = str(index) nproc = int(os.environ.get("NPROC", 1)) node_rank = int(os.environ.get("NODE_RANK", 0)) os.environ["RANK"] = str(nproc * node_rank + index) os.environ["FORK_LAUNCHED"] = str(1) self.launcher(*args)
    [huggingface/accelerate] src/accelerate/commands/launch.py
    def deepspeed_launcher(args): import torch.distributed.run as distrib_run if not is_deepspeed_available(): raise ImportError("DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.") else: from deepspeed.launcher.runner import DEEPSPEED_ENVIRONMENT_NAME cmd, current_env = prepare_deepspeed_cmd_env(args) if not check_cuda_p2p_ib_support(): message = "Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled." warn = False if "NCCL_P2P_DISABLE" not in current_env: current_env["NCCL_P2P_DISABLE"] = "1" warn = True if "NCCL_IB_DISABLE" not in current_env: current_env["NCCL_IB_DISABLE"] = "1" warn = True if warn: logger.warning(message) if args.num_machines > 1 and args.deepspeed_multinode_launcher != DEEPSPEED_MULTINODE_LAUNCHERS[1]: with open(DEEPSPEED_ENVIRONMENT_NAME, "a") as f: valid_env_items = convert_dict_to_env_variables(current_env) if len(valid_env_items) > 1: f.writelines(valid_env_items) process = subprocess.Popen(cmd, env=current_env) process.wait() if process.returncode != 0: if not args.quiet: raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) else: sys.exit(1) else: debug = getattr(args, "debug", False) args = _filter_args( args, distrib_run.get_args_parser(), ["--training_script", args.training_script, "--training_script_args", args.training_script_args], ) with patch_environment(**current_env): try: distrib_run.run(args) except Exception: if is_rich_available() and debug: console = get_console() console.print("\n[bold red]Using --debug, `torch.distributed` Stack Trace:[/bold red]") console.print_exception(suppress=[__file__], show_locals=False) else: raise
    [huggingface/peft] tests/test_lora_megatron.py
    def initialize_distributed(): print(f"Initializing torch.distributed with rank: {rank}, world_size: {world_size}") torch.cuda.set_device(0) init_method = "tcp://" master_ip = os.getenv("MASTER_ADDR", "localhost") master_port = os.getenv("MASTER_PORT", "6001") init_method += master_ip + ":" + master_port torch.distributed.init_process_group(backend="nccl", world_size=world_size, rank=rank, init_method=init_method)
    [huggingface/peft] examples/sft/run_peft_multigpu.sh
    torchrun --nproc_per_node 8 --nnodes 1 train.py \ --seed 100 \ --model_name_or_path "mistralai/Mistral-7B-v0.1" \ --dataset_name "smangrul/ultrachat-10k-chatml" \ --chat_template_format "chatml" \ --add_special_tokens False \ --append_concat_token False \ --splits "train,test" \ --max_seq_len 2048 \ --num_train_epochs 1 \ --logging_steps 5 \ --log_level "info" \ --logging_strategy "steps" \ --evaluation_strategy "epoch" \ --save_strategy "epoch" \ --push_to_hub \ --hub_private_repo True \ --hub_strategy "every_save" \ --bf16 True \ --packing True \ --learning_rate 1e-4 \ --lr_scheduler_type "cosine" \ --weight_decay 1e-4 \ --warmup_ratio 0.0 \ --max_grad_norm 1.0 \ --output_dir "mistral-sft-lora-multigpu" \ --per_device_train_batch_size 8 \ --per_device_eval_batch_size 8 \ --gradient_accumulation_steps 8 \ --gradient_checkpointing True \ --use_reentrant False \ --dataset_text_field "content" \ --use_peft_lora True \ --lora_r 8 \ --lora_alpha 16 \ --lora_dropout 0.1 \ --lora_target_modules "all-linear" \ --use_4bit_quantization True \ --use_nested_quant True \ --bnb_4bit_compute_dtype "bfloat16" \ --use_flash_attn True
    [huggingface/transformers] docs/source/en/debugging.md

    Multi-GPU Network Issues Debug

    When training or running inference with DistributedDataParallel and multiple GPUs, if you run into issues with inter-communication between processes and/or nodes, you can use the following script to diagnose network issues.

    wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py

    For example to test how 2 GPUs interact do:

    python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

    If both processes can talk to each other and allocate GPU memory, each will print an OK status.

    For more GPUs or nodes adjust the arguments in the script.

    You will find a lot more details inside the diagnostics script, including a recipe for how to run it in a SLURM environment.

    An additional level of debugging is to add the NCCL_DEBUG=INFO environment variable as follows:

    NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

    This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported. If you're not sure how to interpret the output, you can share the log file in an Issue.

    [huggingface/transformers] src/transformers/commands/env.py
    def run(self): safetensors_version = "not installed" if is_safetensors_available(): import safetensors safetensors_version = safetensors.__version__ elif importlib.util.find_spec("safetensors") is not None: import safetensors safetensors_version = f"{safetensors.__version__} but is ignored because of PyTorch version too old." accelerate_version = "not installed" accelerate_config = accelerate_config_str = "not found" if is_accelerate_available(): import accelerate from accelerate.commands.config import default_config_file, load_config_from_file accelerate_version = accelerate.__version__ # Get the default from the config file. if self._accelerate_config_file is not None or os.path.isfile(default_config_file): accelerate_config = load_config_from_file(self._accelerate_config_file).to_dict() accelerate_config_str = ( "\n".join([f"\t- {prop}: {val}" for prop, val in accelerate_config.items()]) if isinstance(accelerate_config, dict) else f"\t{accelerate_config}" ) pt_version = "not installed" pt_cuda_available = "NA" if is_torch_available(): import torch pt_version = torch.__version__ pt_cuda_available = torch.cuda.is_available() tf_version = "not installed" tf_cuda_available = "NA" if is_tf_available(): import tensorflow as tf tf_version = tf.__version__ try: # deprecated in v2.1 tf_cuda_available = tf.test.is_gpu_available() except AttributeError: # returns list of devices, convert to bool tf_cuda_available = bool(tf.config.list_physical_devices("GPU")) flax_version = "not installed" jax_version = "not installed" jaxlib_version = "not installed" jax_backend = "NA" if is_flax_available(): import flax import jax import jaxlib flax_version = flax.__version__ jax_version = jax.__version__ jaxlib_version = jaxlib.__version__ jax_backend = jax.lib.xla_bridge.get_backend().platform info = { "`transformers` version": version, "Platform": platform.platform(), "Python version": platform.python_version(), "Huggingface_hub version": huggingface_hub.__version__, "Safetensors version": f"{safetensors_version}", "Accelerate version": f"{accelerate_version}", "Accelerate config": f"{accelerate_config_str}", "PyTorch version (GPU?)": f"{pt_version} ({pt_cuda_available})", "Tensorflow version (GPU?)": f"{tf_version} ({tf_cuda_available})", "Flax version (CPU?/GPU?/TPU?)": f"{flax_version} ({jax_backend})", "Jax version": f"{jax_version}", "JaxLib version": f"{jaxlib_version}", "Using GPU in script?": "<fill in>", "Using distributed or parallel set-up in script?": "<fill in>", } print("\nCopy-and-paste the text below in your GitHub issue and FILL OUT the two last points.\n") print(self.format_dict(info)) return info
    [huggingface/peft] examples/loftq_finetuning/train_gsm8k_llama.py
    def main(): args = parse_args() # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_clm_no_trainer", args) # Initialize the accelerator. We will let the accelerator handle device placement for us in this example. # If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers # in the environment accelerator_log_kwargs = {} if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) # Make one log on every process with the configuration for debugging. logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO, ) logger.info(accelerator.state, main_process_only=False) if accelerator.is_local_main_process: datasets.utils.logging.set_verbosity_warning() transformers.utils.logging.set_verbosity_info() else: datasets.utils.logging.set_verbosity_error() transformers.utils.logging.set_verbosity_error() # If passed along, set the training seed now. if args.seed is not None: set_seed(args.seed) # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: api = HfApi(token=args.hub_token) # Create repo (repo_name from args or inferred) repo_name = args.hub_model_id if repo_name is None: repo_name = Path(args.output_dir).absolute().name repo_id = api.create_repo(repo_name, exist_ok=True).repo_id with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: gitignore.write("step_*\n") if "epoch_*" not in gitignore: gitignore.write("epoch_*\n") elif args.output_dir is not None: os.makedirs(args.output_dir, exist_ok=True) accelerator.wait_for_everyone() # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ # (the dataset will be downloaded automatically from the datasets Hub). # # For CSV/JSON files, this script will use the column called 'text' or the first column if no column called # 'text' is found. You can easily tweak this behavior (see below). # # In distributed training, the load_dataset function guarantee that only one local process can concurrently # download the dataset. if args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( args.dataset_name, args.dataset_config_name, split=f"train[:{args.validation_split_percentage}%]", ) raw_datasets["train"] = load_dataset( args.dataset_name, args.dataset_config_name, split=f"train[{args.validation_split_percentage}%:]", ) else: data_files = {} dataset_args = {} if args.train_file is not None: data_files["train"] = args.train_file if args.validation_file is not None: data_files["validation"] = args.validation_file extension = args.train_file.split(".")[-1] if extension == "txt": extension = "text" dataset_args["keep_linebreaks"] = not args.no_keep_linebreaks raw_datasets = load_dataset(extension, data_files=data_files, **dataset_args) # If no validation data is there, validation_split_percentage will be used to divide the dataset. if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( extension, data_files=data_files, split=f"train[:{args.validation_split_percentage}%]", **dataset_args, ) raw_datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{args.validation_split_percentage}%:]", **dataset_args, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. # Load pretrained model and tokenizer # # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if args.config_name: config = AutoConfig.from_pretrained( args.config_name, trust_remote_code=args.trust_remote_code, ) elif args.model_name_or_path: config = AutoConfig.from_pretrained( args.model_name_or_path, trust_remote_code=args.trust_remote_code, ) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained( args.tokenizer_name, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code ) elif args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code, ) else: raise ValueError( "You are instantiating a new tokenizer from scratch. This is not supported by this script." "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) ########################## # Tokenizer # ########################## tokenizer.pad_token_id = 0 # unk. 
we want this to be different from the eos token tokenizer.padding_side = "left" # Allow batched inference tokenizer.truncation_side = "left" if args.model_name_or_path: model = AutoModelForCausalLM.from_pretrained( args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, low_cpu_mem_usage=True, quantization_config=BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=config.torch_dtype, ), ) else: logger.info("Training new model from scratch") model = AutoModelForCausalLM.from_config(config, trust_remote_code=args.trust_remote_code) ########################## # Peft Model # ########################## if args.adapter_name_or_path is None: model = PeftModel.from_pretrained(model, args.model_name_or_path, subfolder="loftq_init", is_trainable=True) else: model = PeftModel.from_pretrained(model, args.adapter_name_or_path, is_trainable=True) model.print_trainable_parameters() # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. embedding_size = model.get_input_embeddings().weight.shape[0] if len(tokenizer) > embedding_size: model.resize_token_embeddings(len(tokenizer)) # Preprocessing the datasets. # First we tokenize all the texts. ########################## # GSM8K dataset # ########################## # Preprocessing the datasets. # First we tokenize all the texts. column_names = raw_datasets["train"].column_names # Get the column names for source/target. source_column, target_column = "question", "answer" # Temporarily set max_target_length for training. padding = "max_length" if args.pad_to_max_length else False task_prompt = "\nAnswer the above question. First think step by step and then answer the final number.\n" def prompt_process(sent_1, sent_2, prompt_1="", prompt_2="", prompt_3=""): sent_2 = sent_2.replace("####", "The final answer is") return prompt_1 + sent_1 + prompt_2 + sent_2 + prompt_3 def preprocess_function_train(examples): sources = examples[source_column] targets = examples[target_column] inputs = [prompt_process(source, target, prompt_2=task_prompt) for (source, target) in zip(sources, targets)] model_inputs = tokenizer( inputs, max_length=args.max_source_length + args.max_target_length, padding=padding, truncation=True, return_tensors="pt", ) labels = copy.deepcopy(model_inputs) # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore # padding in the loss. if padding == "max_length" and args.ignore_pad_token_for_loss: # get the length of the target tokens. 
-1 to kick out the <BOS> token target_tokens = tokenizer(targets, padding=False) target_len = [len(label) - 1 for label in target_tokens["input_ids"]] # don't calculate the loss from source and padding (left padding) for i in range(len(labels["input_ids"])): labels["input_ids"][i, : -target_len[i]] = -100 model_inputs["labels"] = labels["input_ids"] return model_inputs def preprocess_function_test(examples): sources = examples[source_column] labels = examples[target_column] inputs = [source + task_prompt for source in sources] model_inputs = tokenizer(inputs, max_length=args.max_source_length, padding=padding, truncation=True) labels = tokenizer(labels, max_length=args.max_target_length, padding=padding, truncation=True) model_inputs["labels"] = labels["input_ids"] return model_inputs with accelerator.main_process_first(): train_dataset = raw_datasets["train"].map( preprocess_function_train, batched=True, num_proc=args.preprocessing_num_workers, remove_columns=column_names, load_from_cache_file=not args.overwrite_cache, desc="Running tokenizer on training dataset", ) eval_dataset = raw_datasets["test"].map( preprocess_function_test, batched=True, num_proc=args.preprocessing_num_workers, remove_columns=column_names, load_from_cache_file=not args.overwrite_cache, desc="Running tokenizer on test dataset", ) # Log a few random samples from the set: for index in random.sample(range(len(train_dataset)), 2): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") for index in random.sample(range(len(eval_dataset)), 2): logger.info(f"Sample {index} of the validation set: {eval_dataset[index]}.") # DataLoaders creation: train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=args.per_device_train_batch_size ) eval_dataloader = DataLoader( eval_dataset, collate_fn=default_data_collator, batch_size=args.per_device_eval_batch_size ) # Optimizer # Split weights in two groups, one with weight decay and the other not. no_decay = ["bias", "layer_norm.weight"] optimizer_grouped_parameters = [ { "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay) and "lora" in n], "weight_decay": args.weight_decay, }, { "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0, }, ] optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate) # Scheduler and math around the number of training steps. overrode_max_train_steps = False num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) if args.max_train_steps is None: args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch overrode_max_train_steps = True lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, ) # Prepare everything with our `accelerator`. model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler ) # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties. if accelerator.distributed_type == DistributedType.TPU: model.tie_weights() # We need to recalculate our total training steps as the size of the training dataloader may have changed. 
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) if overrode_max_train_steps: args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch # Afterwards we recalculate our number of training epochs args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch) # Figure out how many steps we should save the Accelerator states checkpointing_steps = args.checkpointing_steps if checkpointing_steps is not None and checkpointing_steps.isdigit(): checkpointing_steps = int(checkpointing_steps) # We need to initialize the trackers we use, and also store our configuration. # The trackers initializes automatically on the main process. if args.with_tracking: experiment_config = vars(args) # TensorBoard cannot log Enums, need the raw value experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"].value accelerator.init_trackers("clm_no_trainer", experiment_config) # Train! total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps logger.info("***** Running training *****") logger.info(f" Num examples = {len(train_dataset)}") logger.info(f" Num Epochs = {args.num_train_epochs}") logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}") logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}") logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") logger.info(f" Total optimization steps = {args.max_train_steps}") # Only show the progress bar once on each machine. progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process) completed_steps = 0 starting_epoch = 0 # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last checkpoint_path = path path = os.path.basename(checkpoint_path) accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") accelerator.load_state(path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None
    [huggingface/peft] examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
    def main(): accelerator = Accelerator() # model_name_or_path = "bigscience/T0_3B" model_name_or_path = "facebook/bart-large" dataset_name = "twitter_complaints" peft_config = LoraConfig( task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1 ) text_column = "Tweet text" label_column = "text_label" lr = 3e-3 num_epochs = 5 batch_size = 8 seed = 42 do_test = False set_seed(seed) dataset = load_dataset("ought/raft", dataset_name) classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names] dataset = dataset.map( lambda x: {"text_label": [classes[label] for label in x["Label"]]}, batched=True, num_proc=1, ) tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes]) def preprocess_function(examples): inputs = examples[text_column] targets = examples[label_column] model_inputs = tokenizer(inputs, truncation=True) labels = tokenizer( targets, max_length=target_max_length, padding="max_length", truncation=True, return_tensors="pt" ) labels = labels["input_ids"] labels[labels == tokenizer.pad_token_id] = -100 model_inputs["labels"] = labels return model_inputs with accelerator.main_process_first(): processed_datasets = dataset.map( preprocess_function, batched=True, num_proc=1, remove_columns=dataset["train"].column_names, load_from_cache_file=True, desc="Running tokenizer on dataset", ) accelerator.wait_for_everyone() train_dataset = processed_datasets["train"] eval_dataset = processed_datasets["train"] test_dataset = processed_datasets["test"] def collate_fn(examples): return tokenizer.pad(examples, padding="longest", return_tensors="pt") train_dataloader = DataLoader( train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True ) eval_dataloader = DataLoader(eval_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True) test_dataloader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True) # creating model model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters() # optimizer optimizer = torch.optim.AdamW(model.parameters(), lr=lr) # lr scheduler lr_scheduler = get_linear_schedule_with_warmup( optimizer=optimizer, num_warmup_steps=0, num_training_steps=(len(train_dataloader) * num_epochs), ) model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare( model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler ) accelerator.print(model) is_ds_zero_3 = False if getattr(accelerator.state, "deepspeed_plugin", None): is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3 for epoch in range(num_epochs): with TorchTracemalloc() as tracemalloc: model.train() total_loss = 0 for step, batch in enumerate(tqdm(train_dataloader)): outputs = model(**batch) loss = outputs.loss total_loss += loss.detach().float() accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak 
Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the train (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the train (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the train (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) train_epoch_loss = total_loss / len(train_dataloader) train_ppl = torch.exp(train_epoch_loss) accelerator.print(f"{epoch=}: {train_ppl=} {train_epoch_loss=}") model.eval() eval_preds = [] with TorchTracemalloc() as tracemalloc: for _, batch in enumerate(tqdm(eval_dataloader)): batch = {k: v for k, v in batch.items() if k != "labels"} with torch.no_grad(): outputs = accelerator.unwrap_model(model).generate( **batch, synced_gpus=is_ds_zero_3 ) # synced_gpus=True for DS-stage 3 outputs = accelerator.pad_across_processes(outputs, dim=1, pad_index=tokenizer.pad_token_id) preds = accelerator.gather_for_metrics(outputs).detach().cpu().numpy() eval_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True)) # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage accelerator.print(f"GPU Memory before entering the eval : {b2mb(tracemalloc.begin)}") accelerator.print(f"GPU Memory consumed at the end of the eval (end-begin): {tracemalloc.used}") accelerator.print(f"GPU Peak Memory consumed during the eval (max-begin): {tracemalloc.peaked}") accelerator.print( f"GPU Total Peak Memory consumed during the eval (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}" ) accelerator.print(f"CPU Memory before entering the eval : {b2mb(tracemalloc.cpu_begin)}") accelerator.print(f"CPU Memory consumed at the end of the eval (end-begin): {tracemalloc.cpu_used}") accelerator.print(f"CPU Peak Memory consumed during the eval (max-begin): {tracemalloc.cpu_peaked}") accelerator.print( f"CPU Total Peak Memory consumed during the eval (max): {tracemalloc.cpu_peaked + b2mb(tracemalloc.cpu_begin)}" ) correct = 0 total = 0 assert len(eval_preds) == len( dataset["train"][label_column] ), f"{len(eval_preds)} != {len(dataset['train'][label_column])}" for pred, true in zip(eval_preds, dataset["train"][label_column]): if pred.strip() == true.strip(): correct += 1 total += 1 accuracy = correct / total * 100 accelerator.print(f"{accuracy=}") accelerator.print(f"{eval_preds[:10]=}") accelerator.print(f"{dataset['train'][label_column][:10]=}") if do_test: model.eval() test_preds = [] for _, batch in enumerate(tqdm(test_dataloader)): batch = {k: v for k, v in batch.items() if k != "labels"} with torch.no_grad(): outputs = accelerator.unwrap_model(model).generate( **batch, synced_gpus=is_ds_zero_3 ) # synced_gpus=True for DS-stage 3 outputs = accelerator.pad_across_processes(outputs, dim=1, pad_index=tokenizer.pad_token_id) preds = accelerator.gather(outputs).detach().cpu().numpy() test_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True)) test_preds_cleaned = [] for _, pred in enumerate(test_preds): test_preds_cleaned.append(get_closest_label(pred, classes)) test_df = dataset["test"].to_pandas() assert len(test_preds_cleaned) == len(test_df), f"{len(test_preds_cleaned)} != {len(test_df)}" test_df[label_column] = test_preds_cleaned test_df["text_labels_orig"] = test_preds accelerator.print(test_df[[text_column, label_column]].sample(20)) pred_df = 
test_df[["ID", label_column]] pred_df.columns = ["ID", "Label"] os.makedirs(f"data/{dataset_name}", exist_ok=True) pred_df.to_csv(f"data/{dataset_name}/predictions.csv", index=False) accelerator.wait_for_everyone() # Option1: Pushing the model to Hugging Face Hub # model.push_to_hub( # f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}".replace("/", "_"), # token = "hf_..." # ) # token (`bool` or `str`, *optional*): # `token` is to be used for HTTP Bearer authorization when accessing remote files. If `True`, will use the token generated # when running `huggingface-cli login` (stored in `~/.huggingface`). Will default to `True` if `repo_url` # is not specified. # Or you can get your token from https://huggingface.co/settings/token # Option2: Saving the model locally peft_model_id = f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}".replace( "/", "_" ) model.save_pretrained(peft_model_id) accelerator.wait_for_everyone()
    [huggingface/peft] examples/feature_extraction/peft_lora_embedding_semantic_search.py
    def main(): args = parse_args() accelerator_kwargs = {"gradient_accumulation_steps": args.gradient_accumulation_steps} if args.with_tracking: accelerator_kwargs["log_with"] = args.report_to accelerator_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(**accelerator_kwargs) # Make one log on every process with the configuration for debugging. logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO, ) logger.info(accelerator.state, main_process_only=False) if accelerator.is_local_main_process: datasets.utils.logging.set_verbosity_warning() transformers.utils.logging.set_verbosity_info() else: datasets.utils.logging.set_verbosity_error() transformers.utils.logging.set_verbosity_error() # If passed along, set the training seed now. if args.seed is not None: set_seed(args.seed) # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: api = HfApi(token=args.hub_token) # Create repo (repo_name from args or inferred) repo_name = args.hub_model_id if repo_name is None: repo_name = Path(args.output_dir).absolute().name repo_id = api.create_repo(repo_name, exist_ok=True).repo_id with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: gitignore.write("step_*\n") if "epoch_*" not in gitignore: gitignore.write("epoch_*\n") elif args.output_dir is not None: os.makedirs(args.output_dir, exist_ok=True) accelerator.wait_for_everyone() # get the tokenizer tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) # dataset download and preprocessing if args.sanity_test: train_dataset = load_dataset("smangrul/amazon_esci", split="train[:1024]") val_dataset = load_dataset("smangrul/amazon_esci", split="validation[:1024]") dataset = DatasetDict({"train": train_dataset, "validation": val_dataset}) else: dataset = load_dataset(args.dataset_name) def preprocess_function(examples): queries = examples["query"] result = tokenizer(queries, padding="max_length", max_length=70, truncation=True) result = {f"query_{k}": v for k, v in result.items()} products = examples["product_title"] result_products = tokenizer(products, padding="max_length", max_length=70, truncation=True) for k, v in result_products.items(): result[f"product_{k}"] = v result["labels"] = examples["relevance_label"] return result processed_datasets = dataset.map( preprocess_function, batched=True, remove_columns=dataset["train"].column_names, desc="Running tokenizer on dataset", ) # Log a few random samples from the training set: for index in random.sample(range(len(processed_datasets["train"])), 3): logger.info(f"Sample {index} of the training set: {processed_datasets['train'][index]}.") # base model model = AutoModelForSentenceEmbedding(args.model_name_or_path, tokenizer) if args.use_peft: # peft config and wrapping peft_config = LoraConfig( r=8, lora_alpha=16, bias="none", task_type=TaskType.FEATURE_EXTRACTION, target_modules=["key", "query", "value"], ) model = get_peft_model(model, peft_config) model.print_trainable_parameters() accelerator.print(model) # get dataloaders train_dataloader = DataLoader( processed_datasets["train"], shuffle=True, collate_fn=default_data_collator, batch_size=args.per_device_train_batch_size, pin_memory=True, ) eval_dataloader = DataLoader( processed_datasets["validation"], shuffle=False, collate_fn=default_data_collator, batch_size=args.per_device_eval_batch_size, pin_memory=True, ) optimizer = torch.optim.Adam(model.parameters(), 
lr=args.learning_rate) # Scheduler and math around the number of training steps. overrode_max_train_steps = False num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) if args.max_train_steps is None: args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch overrode_max_train_steps = True lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, num_warmup_steps=args.num_warmup_steps, num_training_steps=args.max_train_steps, ) # Prepare everything with our `accelerator`. model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler ) # We need to recalculate our total training steps as the size of the training dataloader may have changed num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) if overrode_max_train_steps: args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch # Afterwards we recalculate our number of training epochs args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch) # Figure out how many steps we should save the Accelerator states checkpointing_steps = args.checkpointing_steps if checkpointing_steps is not None and checkpointing_steps.isdigit(): checkpointing_steps = int(checkpointing_steps) # We need to initialize the trackers we use, and also store our configuration. # The trackers initializes automatically on the main process. if args.with_tracking: experiment_config = vars(args) # TensorBoard cannot log Enums, need the raw value experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"].value accelerator.init_trackers("peft_semantic_search", experiment_config) metric = evaluate.load("roc_auc") total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps if args.use_peft: # saving and loading checkpoints for resuming training accelerator.register_save_state_pre_hook(save_model_hook) accelerator.register_load_state_pre_hook(load_model_hook) logger.info("***** Running training *****") logger.info(f" Num examples = {len(processed_datasets['train'])}") logger.info(f" Num Epochs = {args.num_train_epochs}") logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}") logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}") logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") logger.info(f" Total optimization steps = {args.max_train_steps}") # Only show the progress bar once on each machine. 
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process) completed_steps = 0 starting_epoch = 0 # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") accelerator.load_state(args.resume_from_checkpoint) path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) resume_step -= starting_epoch * len(train_dataloader) completed_steps = resume_step // args.gradient_accumulation_steps # update the progress_bar if load from checkpoint progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: # We skip the first `n` batches in the dataloader when resuming from a checkpoint active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) else: active_dataloader = train_dataloader for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): query_embs = model(**{k.replace("query_", ""): v for k, v in batch.items() if "query" in k}) product_embs = model(**{k.replace("product_", ""): v for k, v in batch.items() if "product" in k}) loss = get_loss(get_cosing_embeddings(query_embs, product_embs), batch["labels"]) total_loss += accelerator.reduce(loss.detach().float(), reduction="sum") accelerator.backward(loss) optimizer.step() lr_scheduler.step() model.zero_grad() # Checks if the accelerator has performed an optimization step behind the scenes if accelerator.sync_gradients: progress_bar.update(1) completed_steps += 1 if (step + 1) % 100 == 0: logger.info(f"Step: {step+1}, Loss: {total_loss/(step+1)}") if args.with_tracking: accelerator.log({"train/loss": total_loss / (step + 1)}, step=completed_steps) if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: output_dir = f"step_{completed_steps }" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) if completed_steps >= args.max_train_steps: break model.eval() for step, batch in enumerate(eval_dataloader): with torch.no_grad(): query_embs = model(**{k.replace("query_", ""): v for k, v in batch.items() if "query" in k}) product_embs = model(**{k.replace("product_", ""): v for k, v in batch.items() if "product" in k}) prediction_scores = get_cosing_embeddings(query_embs, product_embs) prediction_scores, references = accelerator.gather_for_metrics((prediction_scores, batch["labels"])) metric.add_batch( prediction_scores=prediction_scores, references=references, ) result = metric.compute() result = {f"eval/{k}": v 
for k, v in result.items()} # Use accelerator.print to print only on the main process. accelerator.print(f"epoch {epoch}:", result) if args.with_tracking: result["train/epoch_loss"] = total_loss.item() / len(train_dataloader) accelerator.log(result, step=completed_steps) if args.output_dir is not None: accelerator.wait_for_everyone() if accelerator.is_main_process: if isinstance(checkpointing_steps, str): accelerator.save_state(os.path.join(args.output_dir, f"epoch_{epoch}")) accelerator.unwrap_model(model).save_pretrained( args.output_dir, state_dict=accelerator.get_state_dict(accelerator.unwrap_model(model)) ) tokenizer.save_pretrained(args.output_dir) if args.push_to_hub: commit_message = ( f"Training in progress epoch {epoch}" if epoch < args.num_train_epochs - 1 else "End of training" ) api.upload_folder( repo_id=repo_id, folder_path=args.output_dir, commit_message=commit_message, run_as_future=True, ) accelerator.wait_for_everyone() accelerator.end_training()
    [openaccess-ai-collective/axolotl] docs/nccl.qmd
    ---
    title: NCCL
    description: Troubleshooting NCCL issues
    ---
    
    NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is configured via several [environment variables](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html). A common NCCL-related problem occurs when a long-running operation times out causing the training process to abort:
    
    ```text
    Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
    ```

    Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. Nvidia recommends disabling PCI access control services (ACS) as a possible solution if this is available to you.

    Forcing cross-GPU communication via NVLink may help without increasing timeouts. To verify that your configuration is leveraging NVLink run the following command:

    nvidia-smi nvlink --status

    To force NCCL to use NVLink, simply set this in the environment:

    export NCCL_P2P_LEVEL=NVL

    If NVLink is not available in your environment there are other options for NCCL_P2P_LEVEL in the table below:

    | NCCL_P2P_LEVEL | Description |
    | -------------- | ----------- |
    | PIX | P2P data transfers go through at most a single PCIe bridge. Faster than paths involving multiple bridges, but slower than direct GPU-to-GPU communication. |
    | PXB | P2P data transfers go through multiple PCIe bridges but not through the PCIe Host Bridge; this routing is more complex and can incur a moderate amount of latency. |
    | PHB | P2P data transfers go over PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but may add latency compared to more direct paths (e.g. PIX, NVL). |

    To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:

    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
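
    The `all_reduce_perf` binary above comes from NVIDIA's nccl-tests repository; a typical way to obtain it (assuming a working CUDA/NCCL toolchain) is:

    ```bash
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    make
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
    ```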

    It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:

    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=ALL
    export TORCH_DISTRIBUTED_DEBUG=INFO
    export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log

    Finally, if you believe your training job needs more time you can increase the timeout past 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See PyTorch init_process_group for documentation on this value.
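
    For example, here is a sketch of the relevant entry in your axolotl YAML config; the value is in seconds (the default 1800 matches the 30-minute timeout above), and 7200 is just an illustrative choice:

    ```yaml
    ddp_timeout: 7200  # allow collectives up to 2 hours before the watchdog fires
    ```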

    [openaccess-ai-collective/axolotl] README.md

    Environment

    Docker

    docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest

    Or run on the current files for development:

    docker compose up -d

    [!Tip] If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.

    <details> <summary>Docker advanced</summary>

    A more powerful Docker command to run would be this:

    docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

    It additionally:

    • Prevents memory issues when running e.g. deepspeed (you could otherwise hit a SIGBUS/signal 7 error) through the --ipc and --ulimit args.
    • Persists the downloaded HF data (models etc.) and your modifications to axolotl code through --mount/-v args.
    • The --name argument simply makes it easier to refer to the container in vscode (Dev Containers: Attach to Running Container...) or in your terminal.
    • The --privileged flag gives all capabilities to the container.
    • The --shm-size 10g argument increases the shared memory size. Use this if you see exitcode: -7 errors using deepspeed.

    More information on nvidia website

    </details>

    Conda/Pip venv

    1. Install python >=3.10

    2. Install pytorch stable https://pytorch.org/get-started/locally/

    3. Install Axolotl along with python dependencies

      pip3 install packaging
      pip3 install -e '.[flash-attn,deepspeed]'
    4. (Optional) Login to Huggingface to use gated models/datasets.

      huggingface-cli login

      Get the token at huggingface.co/settings/tokens

    Cloud GPU

    For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest

    Bare Metal Cloud GPU

    LambdaLabs
    <details> <summary>Click to Expand</summary>
    1. Install Python

    sudo apt update
    sudo apt install -y python3.10
    sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
    sudo update-alternatives --config python # pick 3.10 if given the option
    python -V # should be 3.10

    2. Install pip

    wget https://bootstrap.pypa.io/get-pip.py
    python get-pip.py

    3. Install PyTorch https://pytorch.org/get-started/locally/

    4. Follow the instructions in the quickstart.

    5. Run

    pip3 install protobuf==3.20.3
    pip3 install -U --ignore-installed requests Pillow psutil scipy

    6. Set the path

    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
    </details>
    GCP
    <details> <summary>Click to Expand</summary>

    Use a Deep Learning Linux OS with CUDA and PyTorch installed, then follow the instructions in the quickstart.

    Make sure to run the command below to uninstall XLA.

    pip uninstall -y torch_xla[tpu]
    </details>

    Windows

    Please use WSL or Docker!

    Mac

    Use the command below instead of the install method in the Quickstart.

    pip3 install -e '.'
    

    More info: mac.md

    Google Colab

    Please use this example notebook.

    Launching on public clouds via SkyPilot

    To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:

    pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
    sky check

    Get the example YAMLs for using Axolotl to finetune mistralai/Mistral-7B-v0.1:

    git clone https://github.com/skypilot-org/skypilot.git
    cd skypilot/llm/axolotl
    

    Use one command to launch:

    # On-demand
    HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

    # Managed spot (auto-recovery on preemption)
    HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
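    After launching, you can keep an eye on provisioned clusters with SkyPilot's status command:

    sky status

    For managed spot jobs, SkyPilot also provides a queue command to watch job state and auto-recovery (sky spot queue in older releases, sky jobs queue in newer ones).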
    [openaccess-ai-collective/axolotl] FAQS.md

    FAQs

    • Can you train StableLM with this? Yes, but only with a single GPU atm. Multi GPU support is coming soon! Just waiting on this PR
    • Will this work with Deepspeed? That's still a WIP, but setting export ACCELERATE_USE_DEEPSPEED=true should work in some cases
    • Error invalid argument at line 359 in file /workspace/bitsandbytes/csrc/pythonInterface.c /arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit. Try reinstalling bitsandbytes and transformers from source.
    [openaccess-ai-collective/axolotl] src/axolotl/train.py
    def train( *, cfg: DictDefault, cli_args: TrainerCliArgs, dataset_meta: TrainDatasetMeta ) -> Tuple[Union[PeftModel, PreTrainedModel], PreTrainedTokenizer]: # load the tokenizer first LOG.debug( f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}", main_process_only=True, ) tokenizer = load_tokenizer(cfg) train_dataset = dataset_meta.train_dataset eval_dataset = dataset_meta.eval_dataset total_num_steps = dataset_meta.total_num_steps if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints: possible_checkpoints = [ str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*") ] if len(possible_checkpoints) > 0: sorted_paths = sorted( possible_checkpoints, key=lambda path: int(path.split("-")[-1]), ) cfg.resume_from_checkpoint = sorted_paths[-1] LOG.info( f"Using Auto-resume functionality to start with checkpoint at {cfg.resume_from_checkpoint}" ) resume_from_checkpoint = cfg.resume_from_checkpoint # Load the model and tokenizer msg = "loading model" if cfg.adapter: msg += " and peft_config..." LOG.debug(msg) # we wait unitl the last possible moment to setup Accelerator Accelerator() model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference) model.generation_config.do_sample = True model_ref = None if cfg.rl and cfg.rl != "orpo": if cfg.adapter and not cfg.rl_adapter_ref_model: # use built-in trl autounwrap LOG.debug("Passing model_ref: None to RL trainer") model_ref = None # explicit setting to None else: # load the model again for model_ref/baseline model_ref, _ = load_model( cfg, tokenizer, inference=cli_args.inference, reference_model=True ) safe_serialization = cfg.save_safetensors is True if cfg.unfrozen_parameters: freeze_layers_except(model, cfg.unfrozen_parameters) trainer = setup_trainer( cfg, train_dataset, eval_dataset, (model, model_ref, peft_config), tokenizer, total_num_steps, ) # go ahead and presave, so we have the adapter config available to inspect if peft_config: LOG.info(f"Pre-saving adapter config to {cfg.output_dir}") peft_config.save_pretrained(cfg.output_dir) # additionally presave the tokenizer and model configs if not Path(cfg.output_dir).is_dir(): os.makedirs(cfg.output_dir, exist_ok=True) tokenizer.save_pretrained(str(Path(cfg.output_dir))) if hasattr(model, "config"): model.config.save_pretrained(str(Path(cfg.output_dir))) # In case we want to stop early with ctrl+c, this is a nice to have to save the pretrained model if cfg.local_rank == 0: def terminate_handler(_, __, model_weakref): if model_weakref() is not None: _model = model_weakref() if cfg.flash_optimum and BetterTransformer: _model = BetterTransformer.reverse(_model) _model.save_pretrained( cfg.output_dir, safe_serialization=safe_serialization ) sys.exit(0) _model_weakref = weakref.ref(model) signal.signal( signal.SIGINT, lambda signum, frame: terminate_handler(signum, frame, _model_weakref), ) badge_markdown = """[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)""" transformers.modelcard.AUTOGENERATED_TRAINER_COMMENT += f"\n{badge_markdown}" if getattr(cfg, "axolotl_config_path"): raw_axolotl_cfg = Path(cfg.axolotl_config_path) version = get_distribution("axolotl").version if raw_axolotl_cfg.is_file(): transformers.modelcard.AUTOGENERATED_TRAINER_COMMENT += f"\n<details><summary>See axolotl config</summary>\n\naxolotl version: 
`{version}`\n```yaml\n{raw_axolotl_cfg.read_text(encoding='utf-8')}\n```\n\n</details><br>\n" LOG.info("Starting trainer...") if cfg.group_by_length: LOG.info("hang tight... sorting dataset for group_by_length") pretrain_hooks(cfg, trainer) if cfg.flash_optimum: with torch.backends.cuda.sdp_kernel( # TODO configure these from the YAML w/ sdp_kernel_kwargs: ... enable_flash=True, enable_math=True, enable_mem_efficient=True, ): trainer.train(resume_from_checkpoint=resume_from_checkpoint) else: trainer.train(resume_from_checkpoint=resume_from_checkpoint) post_train_hooks(cfg, trainer) LOG.info(f"Training Completed!!! Saving pre-trained model to {cfg.output_dir}") # post training for name, module in model.named_modules(): if hasattr(module, "_post_training"): module._post_training(model, name) # pylint: disable=protected-access if trainer.is_fsdp_enabled: trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT") LOG.info("Set FSDP state dict type to FULL_STATE_DICT for saving.") if cfg.relora_steps: if cfg.adapter == "lora" and not (cfg.load_in_4bit or cfg.load_in_8bit): model = model.merge_and_unload() else: # final model weights have already been saved by `ReLoRACallback.on_train_end` return model, tokenizer # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file if cfg.fsdp: trainer.save_model(cfg.output_dir) elif cfg.deepspeed and is_deepspeed_zero3_enabled(): # Copied over from: https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading trainer.accelerator.wait_for_everyone() unwrapped_model = trainer.accelerator.unwrap_model(trainer.model_wrapped) # Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if # `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or # `zero3_save_16bit_model` is True in DeepSpeed Plugin. # For Zero Stages 1 and 2, models are saved as usual in the output directory. # The model name saved is `pytorch_model.bin` unwrapped_model.save_pretrained( cfg.output_dir, is_main_process=trainer.accelerator.is_main_process, save_function=trainer.accelerator.save, state_dict=trainer.accelerator.get_state_dict(trainer.model_wrapped), ) elif cfg.local_rank == 0: if cfg.flash_optimum and BetterTransformer: model = BetterTransformer.reverse(model) model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization) if not cfg.hub_model_id: try: trainer.create_model_card(model_name=cfg.output_dir.lstrip("./")) except AttributeError: pass elif cfg.hub_model_id: # defensively push to the hub to ensure the model card is updated trainer.push_to_hub() return model, tokenizer
    [openaccess-ai-collective/axolotl] docs/multi-node.qmd
    ---
    title: Multi Node
    description: How to use Axolotl on multiple machines
    ---
    
    You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using one of the presets below:
    
    ~/.cache/huggingface/accelerate/default_config.yaml
    ```yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: FSDP
    downcast_bf16: 'no'
    machine_rank: 0 # Set to 0 for the main machine, increment by one for other machines
    main_process_ip: 10.0.0.4 # Set to main machine's IP
    main_process_port: 5000
    main_training_function: main
    mixed_precision: bf16
    num_machines: 2 # Change to the number of machines
    num_processes: 4 # The total number of GPUs across all machines (for example: if you have 2 machines with 4 GPUs each, put 8)
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    ```
    

    Configure your model to use FSDP, for example:

    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

    Machine configuration

    On each machine you need a copy of Axolotl; we suggest using the same commit on all machines to ensure compatibility.

    You will also need to have the same configuration file for your model on each machine.

    On the main machine only, make sure the port you set as main_process_port is open for TCP and reachable by the other machines.

    All you have to do now is launch with accelerate on each machine as you normally would; the processes will start once accelerate has been launched on every machine.
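    For example, with the preset above on two machines (machine_rank 0 on the main node and 1 on the other), the per-machine launch is just the usual command; the config filename below is a placeholder:

    # on the main machine (machine_rank: 0)
    accelerate launch -m axolotl.cli.train your_model_config.yml

    # on the second machine (machine_rank: 1)
    accelerate launch -m axolotl.cli.train your_model_config.yml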

    [openaccess-ai-collective/axolotl] src/axolotl/utils/config/__init__.py
    def choose_device(cfg):
        def get_device():
            try:
                if torch.cuda.is_available():
                    return f"cuda:{cfg.local_rank}"
                if torch.backends.mps.is_available():
                    return "mps"
                raise SystemError("No CUDA/mps device found")
            except Exception:  # pylint: disable=broad-exception-caught
                return "cpu"

        cfg.device = get_device()

        if cfg.world_size == 1:
            cfg.device_map = cfg.device_map or "auto"
        else:
            if cfg.device.startswith("cuda"):
                cfg.device_map = {"": torch.cuda.current_device()}
            else:
                cfg.device_map = {"": cfg.device}

        # in `accelerate launch`, we need to not pass through any device map and let
        # accelerate figure out which parts of the model to put on which gpu
        accelerate_vars = [var for var in os.environ if var.startswith("ACCELERATE_USE_")]
        if accelerate_vars:
            cfg.device_map = None
    [openaccess-ai-collective/axolotl] README.md

    Quickstart ⚡

    Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.

    Requirements: Python >=3.10 and PyTorch >=2.1.1.

    git clone https://github.com/OpenAccess-AI-Collective/axolotl
    cd axolotl

    pip3 install packaging ninja
    pip3 install -e '.[flash-attn,deepspeed]'

    Usage

    # preprocess datasets - optional but recommended
    CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

    # finetune lora
    accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

    # inference
    accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
        --lora_model_dir="./lora-out"

    # gradio
    accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
        --lora_model_dir="./lora-out" --gradio

    # remote yaml files - the yaml config can be hosted on a public URL
    # Note: the yaml config must directly link to the **raw** yaml
    accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml