Phorm.ai
OpenAccess-AI-Collective/axolotl

    how can I use pre-tokenized data

    To use pre-tokenized data with Axolotl, follow these steps:

    1. Ensure Your Data Format: Your dataset must have columns named exactly `input_ids`, `attention_mask`, and `labels`.

    2. Configuration: Do not specify a `type:` for that dataset in your Axolotl configuration file. Omitting `type:` tells Axolotl the data is already tokenized, so no prompt template or further tokenization is applied.

    Here's an example of how you might set up your configuration file (config.yml):

    datasets:
      - path: path_to_your_pre_tokenized_data
        # no `type:` is specified, since the data is already tokenized

    3. Loading the Dataset: Axolotl expects the dataset to be in the format described above. You can load it with the standard methods of the `datasets` library, or any other method you prefer, as long as the final columns match.

    4. Using the Dataset: Once loaded, the dataset can be used directly for training or evaluation; no further tokenization is needed because the data is already in the required format.

    Here's a simple code example illustrating how you might load and use a pre-tokenized dataset:

    from datasets import load_dataset

    # Assuming your dataset is stored in a format the `datasets` library can read
    dataset = load_dataset('path/to/dataset', split='train')

    # Each example should look like this. Note that `labels` has the same length as
    # `input_ids`, with -100 in any position that should be ignored by the loss.
    print(dataset[0])
    # {'input_ids': [101, 2054, 2003, 102, ...],
    #  'attention_mask': [1, 1, 1, 1, ...],
    #  'labels': [-100, 2054, 2003, 102, ...]}

    # The dataset can now be used directly for training or evaluation with Axolotl.

    Remember, the key requirement is that your dataset is already tokenized and structured correctly: you should have already converted your text into `input_ids` and generated a matching `attention_mask` and `labels` for each example (use -100 in `labels` for any positions that should not contribute to the loss).
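    If you still need to produce such a dataset, here is a minimal sketch of one way to do it with the `datasets` library. The data file, tokenizer checkpoint, and sequence length below are placeholders; the only hard requirement is that the saved dataset ends up with the three columns above, and `save_to_disk` writes a folder that the `path:` in the config can point at.

    ```python
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Placeholder source data and tokenizer; substitute your own.
    tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
    raw = load_dataset("json", data_files="my_texts.jsonl", split="train")

    def tokenize(example):
        # Tokenize the raw text; train on every token (labels mirror input_ids).
        enc = tokenizer(example["text"], truncation=True, max_length=2048)
        enc["labels"] = enc["input_ids"].copy()
        return enc

    tokenized = raw.map(tokenize, remove_columns=raw.column_names)

    # Keep only the three columns Axolotl expects, then write to disk.
    tokenized = tokenized.select_columns(["input_ids", "attention_mask", "labels"])
    tokenized.save_to_disk("path_to_your_pre_tokenized_data")
    ```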

    Search results
    [openaccess-ai-collective/axolotl] docs/dataset-formats/tokenized.qmd
    ---
    title: Custom Pre-Tokenized Dataset
    description: How to use a custom pre-tokenized dataset.
    order: 5
    ---
    
    - Do not pass a `type:` in your axolotl config.
    - Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
    
    ```{.yaml filename="config.yml"}
    - path: ...
    ```

    [huggingface/transformers] examples/research_projects/performer/run_mlm_performer.py
    def tokenize_function(examples):
        # Remove empty lines
        examples = [line for line in examples if len(line) > 0 and not line.isspace()]
        return tokenizer(
            examples,
            return_special_tokens_mask=True,
            padding=padding,
            truncation=True,
            max_length=data_args.max_seq_length,
        )
    [huggingface/transformers] examples/research_projects/codeparrot/scripts/pretokenizing.py
    def tokenize(example):
        output = {}
        output["input_ids"] = tokenizer(example["content"], truncation=False)["input_ids"]
        output["ratio_char_token"] = len(example["content"]) / len(output["input_ids"])
        return output
    [huggingface/transformers] examples/tensorflow/language-modeling/run_mlm.py
    def tokenize_function(examples):
        # Remove empty lines
        examples[text_column_name] = [
            line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
        ]
        return tokenizer(
            examples[text_column_name],
            padding=padding,
            truncation=True,
            max_length=max_seq_length,
            # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
            # receives the `special_tokens_mask`.
            return_special_tokens_mask=True,
        )
    [huggingface/transformers] tests/trainer/test_trainer_seq2seq.py
    def prepare_data(examples):
        # Remove pairs where at least one record is none
        inputs = examples[INPUT_COLUMN]
        targets = examples[TARGET_COLUMN]
        model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
        labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    [huggingface/transformers] docs/source/en/main_classes/tokenizer.md

    Tokenizer

    A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" implementations allow:

    1. a significant speed-up in particular when doing batched tokenization and
    2. additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).

    The base classes [PreTrainedTokenizer] and [PreTrainedTokenizerFast] implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository). They both rely on [~tokenization_utils_base.PreTrainedTokenizerBase] that contains the common methods, and [~tokenization_utils_base.SpecialTokensMixin].

    [PreTrainedTokenizer] and [PreTrainedTokenizerFast] thus implement the main methods for using all the tokenizers:

    • Tokenizing (splitting strings in sub-word token strings), converting tokens strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
    • Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...).
    • Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.

    [BatchEncoding] holds the output of the [~tokenization_utils_base.PreTrainedTokenizerBase]'s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
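    As a quick, hedged illustration of the above (the checkpoint name is just an example), a fast tokenizer returns a `BatchEncoding` that behaves like a dict but also exposes alignment helpers:

    ```python
    from transformers import AutoTokenizer

    # Any "Fast" tokenizer works; bert-base-uncased is only an example checkpoint.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = tok("Hello world", return_offsets_mapping=True)

    print(enc["input_ids"])       # token ids, e.g. [101, 7592, 2088, 102]
    print(enc["attention_mask"])  # [1, 1, 1, 1]
    print(enc.char_to_token(0))   # index of the token covering character 0 ("H")
    print(enc["offset_mapping"])  # character span for each token
    ```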

    PreTrainedTokenizer

    [[autodoc]] PreTrainedTokenizer - call - add_tokens - add_special_tokens - apply_chat_template - batch_decode - decode - encode - push_to_hub - all

    PreTrainedTokenizerFast

    The [PreTrainedTokenizerFast] depends on the tokenizers library. The tokenizers obtained from the 🤗 tokenizers library can be loaded very simply into 🤗 transformers. Take a look at the Using tokenizers from 🤗 tokenizers page to understand how this is done.

    [[autodoc]] PreTrainedTokenizerFast - call - add_tokens - add_special_tokens - apply_chat_template - batch_decode - decode - encode - push_to_hub - all

    BatchEncoding

    [[autodoc]] BatchEncoding

    [huggingface/transformers] docs/source/en/internal/tokenization_utils.md
    <!--Copyright 2020 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/transformers] examples/research_projects/codeparrot/scripts/pretokenizing.py
    ds = ds.map(
        tokenize,
        num_proc=args.num_workers,
        remove_columns=[
            "repo_name", "path", "copies", "size", "content", "license",
            "hash", "line_mean", "line_max", "alpha_frac", "autogenerated",
        ],
    )
    print(f"Dataset tokenized in {time.time()-t_start:.2f}s")

    t_start = time.time()
    ds.push_to_hub(args.tokenized_data_repo)
    print(f"Data pushed to the hub in {time.time()-t_start:.2f}s")
    [huggingface/transformers] docs/source/en/fast_tokenizers.md

    Use tokenizers from 🤗 Tokenizers

    The [PreTrainedTokenizerFast] depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

    Before getting in the specifics, let's first start by creating a dummy tokenizer in a few lines:

    >>> from tokenizers import Tokenizer
    >>> from tokenizers.models import BPE
    >>> from tokenizers.trainers import BpeTrainer
    >>> from tokenizers.pre_tokenizers import Whitespace

    >>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    >>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    >>> tokenizer.pre_tokenizer = Whitespace()

    >>> files = [...]
    >>> tokenizer.train(files, trainer)

    We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future re-use.

    Loading directly from the tokenizer object

    Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [PreTrainedTokenizerFast] class allows for easy instantiation, by accepting the instantiated tokenizer object as an argument:

    >>> from transformers import PreTrainedTokenizerFast
    >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

    This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

    Loading from a JSON file

    In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer:

    >>> tokenizer.save("tokenizer.json")

    The path to which we saved this file can be passed to the [PreTrainedTokenizerFast] initialization method using the tokenizer_file parameter:

    >>> from transformers import PreTrainedTokenizerFast
    >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

    This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

    [huggingface/transformers] src/transformers/models/perceiver/tokenization_perceiver.py
    # coding=utf-8 # Copyright 2021 The HuggingFace Inc. team. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ Tokenization class for Perceiver."""
    [huggingface/transformers] docs/source/zh/internal/tokenization_utils.md
    <!--Copyright 2020 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/transformers] src/transformers/models/cpm/tokenization_cpm.py
    # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Tokenization classes."""
    [huggingface/accelerate] examples/by_feature/early_stopping.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/peft] examples/conditional_generation/peft_adalora_seq2seq.py
    # data preprocessing
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    [huggingface/accelerate] examples/by_feature/fsdp_with_peak_mem_tracking.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/cross_validation.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/peft] examples/sequence_classification/peft_no_lora_accelerate.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/peft] examples/sft/utils.py
    def preprocess(samples):
        batch = []
        for conversation in samples["messages"]:
            batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
        return {"content": batch}
    [huggingface/accelerate] examples/by_feature/memory.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/schedule_free.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/checkpointing.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/megatron_lm_gpt_pretraining.py
    def tokenize_function(examples): return tokenizer(examples[text_column_name])
    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/pretraining.py
    def wrap_pretraining_dataset( dataset, tokenizer, cfg, ds_wrapper_fn, max_tokens=2048, batch_size=1, seed=42, buffer_size=10_000, ): if cfg.sample_packing: collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq( tokenizer, return_tensors="pt", padding=True, pad_to_multiple_of=max_tokens * batch_size, multipack_attn=cfg.pretrain_multipack_attn, ) encode = functools.partial( encode_packed_pretraining, collate_fn, ds_wrapper_fn, max_seq_length=max_tokens, batch_size=batch_size, multipack_attn=cfg.pretrain_multipack_attn, ) # set this to 1 so downstream data_loader doesn't try to increase the batch again cfg.micro_batch_size = 1 else: encode = functools.partial(encode_pretraining, tokenizer, max_tokens) if cfg.shuffle_merged_datasets: dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size) else: LOG.debug("NOT shuffling merged pretraining datasets") # remove all the existing columns after mapping since they end up having # a different length than the encoded/tokenized column # this is empty during streaming/pretraining remove_columns = [] if dataset.features is None: for first_row in dataset: remove_columns = first_row.keys() break else: remove_columns = dataset.features.keys() dataset = dataset.map( encode, batched=True, batch_size=buffer_size, # input_columns="text", remove_columns=remove_columns, ) return dataset
    [huggingface/peft] examples/fp4_finetuning/finetune_fp4_opt_bnb_peft.py
    # import torch # from peft import PeftModel, PeftConfig # from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig # # peft_model_id = "ybelkada/opt-6.7b-lora" # config = PeftConfig.from_pretrained(peft_model_id) # model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map='auto') # tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path) # ## Load the Lora model # model = PeftModel.from_pretrained(model, peft_model_id) # # """## Inference # # You can then directly use the trained model or the model that you have loaded from the πŸ€— Hub for inference as you would do it usually in `transformers`. # """ # batch = tokenizer("Two things are infinite: ", return_tensors="pt")
    [huggingface/peft] examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py
    def preprocess_function(examples):
        inputs = examples[text_column]
        targets = examples[label_column]
        model_inputs = tokenizer(
            inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
        )
        labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt")
        labels = labels["input_ids"]
        labels[labels == tokenizer.pad_token_id] = -100
        model_inputs["labels"] = labels
        return model_inputs
    [huggingface/peft] examples/conditional_generation/peft_adalora_seq2seq.py
    def preprocess_function(examples):
        inputs = examples[text_column]
        targets = examples[label_column]
        model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
        labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
        labels = labels["input_ids"]
        labels[labels == tokenizer.pad_token_id] = -100
        model_inputs["labels"] = labels
        return model_inputs
    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/sft.py
    def load_prepare_datasets( tokenizer: PreTrainedTokenizerBase, cfg, default_dataset_prepared_path, split="train", ) -> Tuple[Dataset, Dataset, List[Prompter]]: dataset, prompters = load_tokenized_prepared_datasets( tokenizer, cfg, default_dataset_prepared_path, split=split ) if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None: LOG.info( f"Using index #{cfg.dataset_shard_idx} of {cfg.dataset_shard_num} shards" ) dataset = dataset.shard( num_shards=cfg.dataset_shard_num, index=cfg.dataset_shard_idx, ) if split == "train" and cfg.val_set_size: # ensure we end up with the same fingerprint by doing rank0 first and being able to cache to_hash_train = ( dataset._fingerprint # pylint: disable=protected-access + "|" + str(cfg.val_set_size) + "|" + "train" + "|" + str(cfg.seed or 42) ) to_hash_test = ( dataset._fingerprint # pylint: disable=protected-access + "|" + str(cfg.val_set_size) + "|" + "test" + "|" + str(cfg.seed or 42) ) train_fingerprint = md5(to_hash_train) test_fingerprint = md5(to_hash_test) dataset = dataset.train_test_split( test_size=cfg.val_set_size, shuffle=False, seed=cfg.seed or 42, train_new_fingerprint=train_fingerprint, test_new_fingerprint=test_fingerprint, ) train_dataset = dataset["train"] eval_dataset = dataset["test"] elif split == "test": train_dataset = None eval_dataset = dataset else: train_dataset = dataset eval_dataset = None return train_dataset, eval_dataset, prompters
    [openaccess-ai-collective/axolotl] docs/dataset-formats/pretraining.qmd
    ---
    title: Pre-training
    description: Data format for a pre-training completion task.
    order: 1
    ---
    
    For pretraining, there is no prompt template or roles.  The only required field is `text`:
    
    ```{.json filename="data.jsonl"}
    {"text": "first row"}
    {"text": "second row"}
    ...
    ```


    :::{.callout-note}

    Streaming is recommended for large datasets

    Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:

    pretraining_dataset: # hf path only
    ...
    

    :::
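
    Purely as an illustration (the dataset id below is a placeholder, and the numbers are arbitrary), a streaming pretraining config might look like:

    ```yaml
    pretraining_dataset: hf_org/large_text_corpus  # placeholder HF dataset id; rows are streamed rather than loaded into memory
    sequence_len: 2048
    max_steps: 10000  # streaming datasets have no precomputed length, so give the trainer an explicit step budget
    ```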

    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/sft.py
    def load_tokenized_prepared_datasets( tokenizer, cfg, default_dataset_prepared_path, split="train", ) -> Tuple[DatasetDict, List[Prompter]]: cfg_datasets = cfg.test_datasets if split == "test" else cfg.datasets tokenizer_name = cfg.tokenizer_config ds_hash = str( md5( ( str(cfg.sequence_len) + "@" + str(cfg.sample_packing) + "@" + str(cfg.eval_sample_packing) + "@" + str(cfg.group_by_length) + "@" + "|".join( sorted( [ f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}" for d in cfg_datasets ] ) ) + "|" + tokenizer_name ) ) ) prepared_ds_path = ( Path(cfg.dataset_prepared_path) / ds_hash if cfg.dataset_prepared_path else Path(default_dataset_prepared_path) / ds_hash ) dataset = None prompters = [] use_auth_token = cfg.hf_use_auth_token try: if cfg.push_dataset_to_hub: dataset = load_dataset( f"{cfg.push_dataset_to_hub}/{ds_hash}", token=use_auth_token, ) dataset = dataset[split] except Exception: # pylint: disable=broad-except # nosec pass # pylint: disable=duplicate-code if dataset: ... elif ( cfg.dataset_prepared_path and any(prepared_ds_path.glob("*")) and not cfg.is_preprocess ): LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...") dataset = load_from_disk(str(prepared_ds_path)) LOG.info("Prepared dataset loaded from disk...") else: LOG.info(f"Unable to find prepared dataset in {prepared_ds_path}") LOG.info("Loading raw datasets...") if not cfg.is_preprocess: LOG.warning( "Processing datasets during training can lead to VRAM instability. Please pre-process your dataset." ) if cfg.seed: seed = cfg.seed else: LOG.info("No seed provided, using default seed of 42") seed = 42 datasets = [] def for_d_in_datasets(dataset_configs): for dataset in dataset_configs: if dataset.name and isinstance(dataset.name, list): for name in dataset.name: yield DictDefault({**dataset, "name": name}) else: yield dataset # pylint: disable=invalid-name for config_dataset in for_d_in_datasets(cfg_datasets): ds: Optional[Union[Dataset, DatasetDict]] = None ds_from_hub = False try: load_dataset( config_dataset.path, name=config_dataset.name, streaming=True, token=use_auth_token, ) ds_from_hub = True except (FileNotFoundError, ConnectionError, HFValidationError, ValueError): pass ds_from_cloud = False storage_options = {} remote_file_system = None if config_dataset.path.startswith("s3://"): try: import aiobotocore.session # type: ignore import s3fs # type: ignore except ImportError as exc: raise ImportError( "s3:// paths require aiobotocore and s3fs to be installed" ) from exc # Takes credentials from ~/.aws/credentials for default profile s3_session = aiobotocore.session.AioSession(profile="default") storage_options = {"session": s3_session} remote_file_system = s3fs.S3FileSystem(**storage_options) elif config_dataset.path.startswith( "gs://" ) or config_dataset.path.startswith("gcs://"): try: import gcsfs # type: ignore except ImportError as exc: raise ImportError( "gs:// or gcs:// paths require gcsfs to be installed" ) from exc # gcsfs will use default credentials from the environment else anon # https://gcsfs.readthedocs.io/en/latest/#credentials storage_options = {"token": None} remote_file_system = gcsfs.GCSFileSystem(**storage_options) # TODO: Figure out how to get auth creds passed # elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"): # try: # import adlfs # except ImportError as exc: # raise ImportError( # "adl:// or abfs:// paths require adlfs to be installed" # ) from exc # # Gen 1 # storage_options = { # "tenant_id": TENANT_ID, # 
"client_id": CLIENT_ID, # "client_secret": CLIENT_SECRET, # } # # Gen 2 # storage_options = { # "account_name": ACCOUNT_NAME, # "account_key": ACCOUNT_KEY, # } # remote_file_system = adlfs.AzureBlobFileSystem(**storage_options) try: if remote_file_system and remote_file_system.exists( config_dataset.path ): ds_from_cloud = True except (FileNotFoundError, ConnectionError): pass # prefer local dataset, even if hub exists local_path = Path(config_dataset.path) if local_path.exists(): if local_path.is_dir(): if config_dataset.data_files: ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.data_files, streaming=False, split=None, ) else: ds = load_from_disk(config_dataset.path) elif local_path.is_file(): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, ) else: raise ValueError( "unhandled dataset load: local path exists, but is neither a directory or a file" ) elif ds_from_hub: ds = load_dataset( config_dataset.path, name=config_dataset.name, streaming=False, data_files=config_dataset.data_files, token=use_auth_token, ) elif ds_from_cloud and remote_file_system: if remote_file_system.isdir(config_dataset.path): ds = load_from_disk( config_dataset.path, storage_options=storage_options, ) elif remote_file_system.isfile(config_dataset.path): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, storage_options=storage_options, ) elif config_dataset.path.startswith("https://"): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, storage_options=storage_options, ) else: if isinstance(config_dataset.data_files, str): fp = hf_hub_download( repo_id=config_dataset.path, repo_type="dataset", filename=config_dataset.data_files, ) elif isinstance(config_dataset.data_files, list): fp = [] for file in config_dataset.data_files: fp.append( hf_hub_download( repo_id=config_dataset.path, repo_type="dataset", filename=file, ) ) else: raise ValueError( "data_files must be either a string or list of strings" ) ds = load_dataset( "json", name=config_dataset.name, data_files=fp, streaming=False, split=None, ) if not ds: raise ValueError("unhandled dataset load") d_base_type = d_prompt_style = None d_type = config_dataset.type if isinstance(d_type, str): d_type_split = d_type.split(":") d_base_type = d_type_split[0] d_prompt_style = d_type_split[1] if len(d_type_split) > 1 else None if isinstance(ds, DatasetDict): if config_dataset.split and config_dataset.split in ds: ds = ds[config_dataset.split] elif split in ds: ds = ds[split] else: raise ValueError( f"no {split} split found for dataset {config_dataset.path}, you may specify a split with 'split: `" ) # support for using a subset of the data if config_dataset.shards: shards_idx = config_dataset.get("shards_idx", 0) ds = ds.shuffle(seed=seed).shard( num_shards=config_dataset.shards, index=shards_idx ) dataset_wrapper, dataset_prompter = get_dataset_wrapper( config_dataset=config_dataset, tokenizer=tokenizer, cfg=cfg, dataset=ds, d_base_type=d_base_type, d_prompt_style=d_prompt_style, ) datasets.append(dataset_wrapper) prompters.append(dataset_prompter) LOG.info("merging datasets") dataset = concatenate_datasets(datasets) if len(datasets) > 1: if cfg.shuffle_merged_datasets: LOG.debug("shuffle merged 
datasets") dataset = dataset.shuffle(seed=seed) else: LOG.debug("NOT shuffling merged datasets") dataset, _ = process_datasets_for_packing(cfg, dataset, None) if cfg.local_rank == 0: LOG.info(f"Saving merged prepared dataset to disk... {prepared_ds_path}") dataset.save_to_disk(str(prepared_ds_path)) if cfg.push_dataset_to_hub: LOG.info( f"Saving merged prepared dataset with push_to_hub... {cfg.push_dataset_to_hub}/{ds_hash}" ) dataset.push_to_hub( f"{cfg.push_dataset_to_hub}/{ds_hash}", private=True ) return dataset, prompters
    [huggingface/peft] examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py
    def preprocess_function(examples): batch_size = len(examples[text_column]) inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]] targets = [str(x) for x in examples[label_column]] model_inputs = tokenizer(inputs) labels = tokenizer(targets, add_special_tokens=False) # don't add bos token because we concatenate with inputs for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id] model_inputs["input_ids"][i] = sample_input_ids + label_input_ids labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i]) for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * ( max_length - len(sample_input_ids) ) + sample_input_ids model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[ "attention_mask" ][i] labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length]) model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length]) labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length]) model_inputs["labels"] = labels["input_ids"] return model_inputs
    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/pretraining.py
    def encode_pretraining( tokenizer: PreTrainedTokenizerBase, max_tokens: int, examples: List[str] ) -> Dict[str, List]: res = tokenizer( examples, truncation=True, max_length=max_tokens - 2, add_special_tokens=True, ) # Convert to PyTorch tensors input_ids = [torch.tensor(seq) for seq in res["input_ids"]] attention_mask = [torch.tensor(seq) for seq in res["attention_mask"]] new_input_ids = [] new_attention_mask = [] # Append EOS and PAD tokens to input_ids, and correct attention_mask for i, _ in enumerate(input_ids): input_ids[i] = torch.cat( ( input_ids[i], torch.tensor([tokenizer.eos_token_id, tokenizer.pad_token_id]), ), dim=0, ) attention_mask[i] = torch.cat((attention_mask[i], torch.tensor([1, 0])), dim=0) # Concatenate tokens so that their lengths are less than max_tokens buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) for ids, mask in zip(input_ids, attention_mask): if buffer_input_ids.numel() == max_tokens: new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) elif buffer_input_ids.numel() + ids.numel() <= max_tokens: buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) else: buffer_input_ids = torch.cat( ( buffer_input_ids, torch.full( (max_tokens - buffer_input_ids.numel(),), tokenizer.pad_token_id, dtype=torch.long, ), ), dim=0, ) buffer_attention_mask = torch.cat( ( buffer_attention_mask, torch.full( (max_tokens - buffer_attention_mask.numel(),), 0, dtype=torch.long, ), ), dim=0, ) new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) if buffer_input_ids.numel() > 0: # for any leftover tokens while buffer_input_ids.numel() < max_tokens: # make all sequences equal in size buffer_input_ids = torch.cat( ( buffer_input_ids, torch.full( (max_tokens - buffer_input_ids.numel(),), tokenizer.pad_token_id, dtype=torch.long, ), ), dim=0, ) buffer_attention_mask = torch.cat( ( buffer_attention_mask, torch.full( (max_tokens - buffer_attention_mask.numel(),), 0, dtype=torch.long, ), ), dim=0, ) new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) ret = { "input_ids": [seq.tolist() for seq in new_input_ids], "labels": [seq.tolist() for seq in new_input_ids], "attention_mask": [seq.tolist() for seq in new_attention_mask], } LOG.debug(len(ret["input_ids"])) return ret
    [openaccess-ai-collective/axolotl] README.md

    Train

    Run

    accelerate launch -m axolotl.cli.train your_config.yml

    [!TIP] You can also reference a config file that is hosted on a public URL, for example accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml

    Preprocess dataset

    You can optionally pre-tokenize your dataset with the following before finetuning. This is recommended for large datasets; a minimal config sketch follows the command below.

    • Set dataset_prepared_path: to a local folder for saving and loading pre-tokenized dataset.
    • (Optional): Set push_dataset_to_hub: hf_user/repo to push it to Huggingface.
    • (Optional): Use --debug to see preprocessed examples.
    python -m axolotl.cli.preprocess your_config.yml
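
    As a minimal sketch, the corresponding keys in your config might look like this (the Hub repo name is a placeholder):

    ```yaml
    dataset_prepared_path: ./last_run_prepared   # local folder for saving/loading the pre-tokenized dataset
    push_dataset_to_hub: hf_user/repo            # optional: also push the prepared dataset to the Hub
    ```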

    Multi-GPU

    Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.

    DeepSpeed

    Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated

    We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.

    deepspeed: deepspeed_configs/zero1.json
    accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
    FSDP
    • llama FSDP
    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
    FSDP + QLoRA

    Axolotl supports training with FSDP and QLoRA, see these docs for more information.

    Weights & Biases Logging

    Make sure your WANDB_API_KEY environment variable is set (recommended) or you login to wandb with wandb login.

    • wandb options
    wandb_mode:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    Special Tokens

    It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:

    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    tokens: # these are delimiters
      - "<|im_start|>"
      - "<|im_end|>"

    When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
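
    If you want to verify this after a run, a hedged sketch of a quick check (the path below is an example; point it at wherever Axolotl saved the tokenizer, typically your output directory) is to load the saved tokenizer and confirm the added tokens resolve to real ids:

    ```python
    from transformers import AutoTokenizer

    # Example path: the output_dir where the trained tokenizer was saved.
    tok = AutoTokenizer.from_pretrained("./lora-out")
    print(tok.convert_tokens_to_ids("<|im_start|>"))  # should be a real id, not the unk token id
    print(len(tok))                                   # vocabulary size including the added tokens
    ```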

    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/pretraining.py
    """data handling specific to pretraining"""
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/pretrain.py
    def _tokenize( self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False ) -> BatchEncoding: res = self.tokenizer( prompt, truncation=True, max_length=self.max_length - 1, add_special_tokens=True, return_overflowing_tokens=True, stride=256, ) res["input_ids"] = [ seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"] ] res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]] return res
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/pretrain.py
    class PretrainTokenizationStrategy(PromptTokenizingStrategy): """handles tokenization for pretraining with strides""" @property def supports_batched(self): return True def __init__(self, *args, max_length=None, text_column="text", **kwargs): super().__init__(*args, **kwargs) if max_length: self.max_length = max_length self.text_column = text_column def _tokenize( self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False ) -> BatchEncoding: res = self.tokenizer( prompt, truncation=True, max_length=self.max_length - 1, add_special_tokens=True, return_overflowing_tokens=True, stride=256, ) res["input_ids"] = [ seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"] ] res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]] return res def tokenize_prompt(self, prompt): return self._tokenize(prompt[self.text_column])
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/pretrain.py
    def load(tokenizer, cfg): strat = PretrainTokenizationStrategy( PretrainTokenizer(), tokenizer, cfg.train_on_inputs, cfg.sequence_len, text_column=cfg.pretraining_dataset[0]["text_column"] or "text", max_length=cfg.sequence_len * 64, ) return strat
    [openaccess-ai-collective/axolotl] src/axolotl/utils/tokenization.py
    """Module for tokenization utilities"""
    [openaccess-ai-collective/axolotl] README.md

    Axolotl

    Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.

    Features:

    • Train various Huggingface models such as llama, pythia, falcon, mpt
    • Supports fullfinetune, lora, qlora, relora, and gptq
    • Customize configurations using a simple yaml file or CLI overwrite
    • Load different dataset formats, use custom formats, or bring your own tokenized datasets
    • Integrated with xformer, flash attention, rope scaling, and multipacking
    • Works with single GPU or multiple GPUs via FSDP or Deepspeed
    • Easily run with Docker locally or on the cloud
    • Log results and optionally checkpoints to wandb or mlflow
    • And more!
    Table of Contents

    Axolotl provides a unified repository for fine-tuning a variety of AI models with ease.

    Axolotl supports

    |             | fp16/fp32 | lora | qlora | gptq | gptq w/flash attn | flash attn | xformers attn |
    |-------------|:----------|:-----|-------|------|-------------------|------------|---------------|
    | llama       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Mistral     | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Mixtral-MoE | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | Mixtral8X22 | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | Pythia      | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | cerebras    | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | btlm        | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | mpt         | ✅ | ❌ | ❓ | ❌ | ❌ | ❌ | ❓ |
    | falcon      | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
    | gpt-j       | ✅ | ✅ | ✅ | ❌ | ❌ | ❓ | ❓ |
    | XGen        | ✅ | ❓ | ✅ | ❓ | ❓ | ❓ | ✅ |
    | phi         | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | RWKV        | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
    | Qwen        | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
    | Gemma       | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |

    ✅: supported ❌: not supported ❓: untested

    Quickstart ⚑

    Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.

    Requirements: Python >=3.10 and Pytorch >=2.1.1.

    git clone https://github.com/OpenAccess-AI-Collective/axolotl
    cd axolotl
    pip3 install packaging ninja
    pip3 install -e '.[flash-attn,deepspeed]'

    Usage

    # preprocess datasets - optional but recommended
    CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

    # finetune lora
    accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

    # inference
    accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
        --lora_model_dir="./lora-out"

    # gradio
    accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
        --lora_model_dir="./lora-out" --gradio

    # remote yaml files - the yaml config can be hosted on a public URL
    # Note: the yaml config must directly link to the **raw** yaml
    accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml

    Advanced Setup

    Environment

    Docker

    docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest

    Or run on the current files for development:

    docker compose up -d

    [!Tip] If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.

    <details> <summary>Docker advanced</summary>

    A more powerful Docker command to run would be this:

    docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

    It additionally:

    • Prevents memory issues when running e.g. deepspeed (e.g. you could hit SIGBUS/signal 7 error) through --ipc and --ulimit args.
    • Persists the downloaded HF data (models etc.) and your modifications to axolotl code through --mount/-v args.
    • The --name argument simply makes it easier to refer to the container in vscode (Dev Containers: Attach to Running Container...) or in your terminal.
    • The --privileged flag gives all capabilities to the container.
    • The --shm-size 10g argument increases the shared memory size. Use this if you see exitcode: -7 errors using deepspeed.

    More information on nvidia website

    </details>

    Conda/Pip venv

    1. Install python >=3.10

    2. Install pytorch stable https://pytorch.org/get-started/locally/

    3. Install Axolotl along with python dependencies

      pip3 install packaging
      pip3 install -e '.[flash-attn,deepspeed]'
    4. (Optional) Login to Huggingface to use gated models/datasets.

      huggingface-cli login

      Get the token at huggingface.co/settings/tokens

    Cloud GPU

    For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest

    Bare Metal Cloud GPU

    LambdaLabs
    <details> <summary>Click to Expand</summary>
    1. Install python
    sudo apt update
    sudo apt install -y python3.10
    sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
    sudo update-alternatives --config python # pick 3.10 if given option
    python -V # should be 3.10
    2. Install pip
    wget https://bootstrap.pypa.io/get-pip.py
    python get-pip.py
    3. Install Pytorch https://pytorch.org/get-started/locally/

    4. Follow instructions on quickstart.

    5. Run

    pip3 install protobuf==3.20.3
    pip3 install -U --ignore-installed requests Pillow psutil scipy
    6. Set path
    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
    </details>
    GCP
    <details> <summary>Click to Expand</summary>

    Use a Deeplearning linux OS with cuda and pytorch installed. Then follow instructions on quickstart.

    Make sure to run the below to uninstall xla.

    pip uninstall -y torch_xla[tpu]
    </details>

    Windows

    Please use WSL or Docker!

    Mac

    Use the below instead of the install method in QuickStart.

    pip3 install -e '.'
    

    More info: mac.md

    Google Colab

    Please use this example notebook.

    Launching on public clouds via SkyPilot

    To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:

    pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds sky check

    Get the example YAMLs of using Axolotl to finetune mistralai/Mistral-7B-v0.1:

    git clone https://github.com/skypilot-org/skypilot.git
    cd skypilot/llm/axolotl
    

    Use one command to launch:

    # On-demand
    HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

    # Managed spot (auto-recovery on preemption)
    HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET

    Dataset

    Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.

    See these docs for more information on how to use different dataset formats.
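
    For example, a dataset intended for `type: alpaca` is typically a JSONL whose rows look something like this (the rows below are illustrative only):

    ```json
    {"instruction": "Summarize the following text.", "input": "Axolotl streamlines fine-tuning for many model architectures.", "output": "Axolotl makes fine-tuning easier."}
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
    ```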

    Config

    See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:

    • model

      base_model: ./llama-7b-hf # local or huggingface repo

      Note: The code will load the right architecture.

    • dataset

      datasets:
        # huggingface repo
        - path: vicgalle/alpaca-gpt4
          type: alpaca

        # huggingface repo with specific configuration/subset
        - path: EleutherAI/pile
          name: enron_emails
          type: completion # format from earlier
          field: text # Optional[str] default: text, field to use for completion data

        # huggingface repo with multiple named configurations/subsets
        - path: bigcode/commitpackft
          name:
            - ruby
            - python
            - typescript
          type: ... # unimplemented custom format

        # fastchat conversation
        # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
        - path: ...
          type: sharegpt
          conversation: chatml # default: vicuna_v1.1

        # local
        - path: data.jsonl # or json
          ds_type: json # see other options below
          type: alpaca

        # dataset with splits, but no train split
        - path: knowrohit07/know_sql
          type: context_qa.load_v2
          train_on_split: validation

        # loading from s3 or gcs
        # s3 creds will be loaded from the system default and gcs only supports public access
        - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
          ...

        # Loading Data From a Public URL
        # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
        - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
          ds_type: json # this is the default, see other options below.
    • loading

      load_in_4bit: true
      load_in_8bit: true
      bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
      fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
      tf32: true # require >=ampere
      bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
      float16: true # use instead of fp16 when you don't want AMP

      Note: Repo does not do 4-bit quantization.

    • lora

      adapter: lora # 'qlora' or leave blank for full finetune
      lora_r: 8
      lora_alpha: 16
      lora_dropout: 0.05
      lora_target_modules:
        - q_proj
        - v_proj

    All Config Options

    See these docs for all config options.

    Train

    Run

    accelerate launch -m axolotl.cli.train your_config.yml

    [!TIP] You can also reference a config file that is hosted on a public URL, for example accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml

    Preprocess dataset

    You can optionally pre-tokenize dataset with the following before finetuning. This is recommended for large datasets.

    • Set dataset_prepared_path: to a local folder for saving and loading pre-tokenized dataset.
    • (Optional): Set push_dataset_to_hub: hf_user/repo to push it to Huggingface.
    • (Optional): Use --debug to see preprocessed examples.
    python -m axolotl.cli.preprocess your_config.yml

    Multi-GPU

    Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.

    DeepSpeed

    Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated

    We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.

    deepspeed: deepspeed_configs/zero1.json
    accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
    FSDP
    • llama FSDP
    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
    FSDP + QLoRA

    Axolotl supports training with FSDP and QLoRA, see these docs for more information.

    Weights & Biases Logging

    Make sure your WANDB_API_KEY environment variable is set (recommended) or you login to wandb with wandb login.

    • wandb options
    wandb_mode:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    Special Tokens

    It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:

    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    tokens: # these are delimiters
      - "<|im_start|>"
    [openaccess-ai-collective/axolotl] docs/dataset_preprocessing.qmd
    ---
    title: Dataset Preprocessing
    description: How datasets are processed
    ---
    
    Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
    the [dataset format](../dataset-formats/) and prompt strategies to:
     - parse the dataset based on the *dataset format*
     - transform the dataset to how you would interact with the model based on the *prompt strategy*
     - tokenize the dataset based on the configured model & tokenizer
     - shuffle and merge multiple datasets together if using more than one
    
    The processing of the datasets can happen one of two ways:
    
    1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
    2. When training is started
    
    What are the benefits of pre-processing? When training interactively or for sweeps
    (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
    slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
    training parameters so that it will intelligently pull from its cache when possible.
    
    The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
    YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
    
    If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
    default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
    setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
    data is in the cache.
    
    What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
    prompt template. Because the trainer cannot readily detect these changes, we cannot change the
    calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
    and change your prompt templating logic, it may not pick up the changes you made and you will be
    training over the old prompt.
    
    
    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/sft.py
    def prepare_dataset(cfg, tokenizer): prompters = [] if not cfg.pretraining_dataset: with zero_first(is_main_process()): if cfg.test_datasets: train_dataset, _, prompters = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="train" ) _, eval_dataset, _ = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="test" ) else: train_dataset, eval_dataset, prompters = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH ) else: path = cfg.pretraining_dataset split = "train" name = None if isinstance(cfg.pretraining_dataset, list) and isinstance( cfg.pretraining_dataset[0], dict ): path = cfg.pretraining_dataset[0]["path"] name = cfg.pretraining_dataset[0]["name"] if "split" in cfg.pretraining_dataset[0]: split = cfg.pretraining_dataset[0]["split"] ds_wrapper_partial = functools.partial( get_dataset_wrapper, cfg.pretraining_dataset[0], tokenizer, cfg, cfg.pretraining_dataset[0]["type"] or "pretrain", ) train_dataset = wrap_pretraining_dataset( load_dataset(path, streaming=True, split=split, name=name), tokenizer, cfg, ds_wrapper_partial, max_tokens=cfg.sequence_len, batch_size=cfg.micro_batch_size, seed=cfg.seed or 42, buffer_size=cfg.pretrain_multipack_buffer_size or 10_000, ) # https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230 train_dataset = train_dataset.with_format("torch") eval_dataset = None return train_dataset, eval_dataset, cfg.max_steps, prompters if eval_dataset and cfg.sample_packing and cfg.eval_sample_packing is not False: total_eval_steps = calculate_total_num_steps(cfg, eval_dataset, update=False) if total_eval_steps == 0: raise ValueError( "eval dataset split is too small for sample_packing. You should set `eval_sample_packing: False`. " ) if cfg.max_steps: total_num_steps = min( calculate_total_num_steps(cfg, train_dataset), cfg.max_steps ) LOG.info(f"Maximum number of steps set at {total_num_steps}") else: total_num_steps = calculate_total_num_steps(cfg, train_dataset) return train_dataset, eval_dataset, total_num_steps, prompters
    [openaccess-ai-collective/axolotl] docs/input_output.qmd
    ---
    title: Template-free prompt construction
    description: "Template-free prompt construction with the `input_output` format"
    ---
    
    <!-- TOC -->
    
    - [Background](#background)
        - [Masking Inputs](#masking-inputs)
        - [You may not want prompt templates](#you-may-not-want-prompt-templates)
        - [The `input_output` format](#the-input_output-format)
    - [Usage](#usage)
        - [1. Prepare Data](#1-prepare-data)
        - [2. Use `type: input_output`](#2-use-type-input_output)
        - [3. Check the prompts](#3-check-the-prompts)
    
    <!-- /TOC -->
    
    <a id="markdown-background" name="background"></a>
    
    ## Background
    
    <a id="markdown-masking-inputs" name="masking-inputs"></a>
    
    ### Masking Inputs
    
    One of the most popular features of
    [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
    setting the following configuration value:
    
    
    ```yaml
    train_on_inputs: false
    ```


    If you declare a dataset format such as alpaca or chatml, axolotl knows what is an input (i.e. human) vs. an output (i.e. the assistant) and masks the input labels so that your model can focus on predicting the outputs only.
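
    Conceptually (this is a hedged sketch, not Axolotl's internal code), the masking simply sets the label of every prompt token to -100 so the loss ignores it:

    ```python
    # Illustrative token ids only; with train_on_inputs: false, prompt tokens get
    # label -100 and only the response tokens contribute to the loss.
    prompt_ids = [1, 22557, 13]      # e.g. "<s> Hello\n"
    response_ids = [12014, 736, 2]   # e.g. "hi there</s>"

    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids
    attention_mask = [1] * len(input_ids)

    print(input_ids)  # [1, 22557, 13, 12014, 736, 2]
    print(labels)     # [-100, -100, -100, 12014, 736, 2]
    ```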

    <a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>

    You may not want prompt templates

    However, there are many situations where you don't want to use one of these formats or templates. This is because they can:

    • Add unnecessary boilerplate to your prompts.
    • Create artifacts like special delimiters <|im_start|> that can quickly become footguns if you don't include them correctly at inference time.
    • Enforce a chat interface when you do not want one. Sometimes you just want to fine-tune a model to a very specific task and do NOT want multi-turn conversations, roles, etc.
    • Limit you to only certain roles that the template allows.

    <a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>

    The input_output format

    You can construct your prompts without a template by using the input_output format: set type: input_output in your configuration file like this:

    config.yml

    train_on_inputs: false  # Mask segments of your data
    datasets:
      - path: output.jsonl
        type: input_output  # use template free prompt construction

    Unlike type: completion, which is also template-free, type: input_output allows you to mask segments of your text. More details on how this works are described below.

    <a id="markdown-usage" name="usage"></a>

    Usage

    This is how you can use the input_output format:

    <a id="markdown-1-prepare-data" name="1-prepare-data"></a>

    1. Prepare Data

    To use the input_output format, collect your data in the following format into a jsonl file (below is the first row from the file `output.jsonl`, pretty printed):

    $ head -n1 output.jsonl | python -m json.tool

    :::{.cell-output .cell-output-stdout}
    {
      "segments": [
        { "label": true, "text": "<s>Hello\n" },
        { "label": true, "text": "hi there!. " },
        { "label": false, "text": "goodbye " },
        { "label": true, "text": "farewell</s>" }
      ]
    }
    :::

    Set label:false when you want to mask a segment of text so that the model isn't trained on it. Some things to keep in mind:

    [!IMPORTANT]

    1. EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl concatenates all the segments as-is. The tokenizer doesn't add anything additional. Notice how I added spaces, newlines, <s> (BOS), and </s> (EOS) myself.
    2. Make sure you check the materialized output to validate that the prompt is getting assembled how you like.

    <a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>

    2. Use type: input_output

    Let's materialize data with our output.jsonl file by setting type: input_output in our axolotl config:

    # training_config.yaml
    base_model: mistralai/Mistral-7B-v0.1
    data_seed: 49
    seed: 49

    datasets:
      - path: output.jsonl
        type: input_output

    val_set_size: 0.1
    sequence_len: 896
    sample_packing: false
    micro_batch_size: 2
    gradient_accumulation_steps: 3
    eval_batch_size: 2
    num_epochs: 1
    learning_rate: 0.0002
    train_on_inputs: false

    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"

    You can use the following command to materialize your data. The --debug flag will print the tokens, along with the labels so you can verify that the correct items are being ignored:

    $ python -m axolotl.cli.preprocess training_config.yaml --debug

    ...
    [2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0]
    <s>(1, 1) Hello(22557, 22557) (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)

    The format is decoded_token(label, token_id), for example, <s>(1, 1) means that the token is <s>, the label is 1 and the token_id is 1. When the label is -100 then that token is ignored for training.

    <a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>

    3. Check the prompts

    Here is another way to check the materialized output:

    from transformers import AutoTokenizer
    from datasets import load_from_disk
    import yaml

    directory = !ls last_run_prepared/
    with open('training_config.yaml', 'r') as f:
        cfg = yaml.safe_load(f)
    model_id = cfg['base_model']
    tok = AutoTokenizer.from_pretrained(model_id)
    ds = load_from_disk(f'last_run_prepared/{directory[0]}/')

    >>> row = ds[0]
    >>> print(tok.decode(row['input_ids']))
    <s> Hello
    hi there!. goodbye farewell</s>

    We can check that the right tokens are ignored by comparing the labels to each token:

    import pandas as pd
    pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i} for i, l in zip(row['input_ids'], row['labels'])])

    |    | token | label |    id |
    |---:|:------|------:|------:|
    |  0 | <s>   |     1 |     1 |
    |  1 | Hello | 22557 | 22557 |
    |  2 | \n    |    13 |    13 |
    |  3 | hi    | 12014 | 12014 |
    |  4 | there |   736 |   736 |
    |  5 | !     | 28808 | 28808 |
    |  6 | .     | 28723 | 28723 |
    |  7 |       | 28705 | 28705 |
    |  8 | good  |  -100 |  1179 |
    |  9 | bye   |  -100 | 17664 |
    | 10 |       |  -100 | 28705 |
    | 11 | fare  | 19111 | 19111 |
    | 12 | well  |  5458 |  5458 |
    | 13 | </s>  |     2 |     2 |

    If we look at the input data, the above table seems correct! (The jsonl version is repeated below for reference):

    $ head -n1 output.jsonl | python -m json.tool

    :::{.cell-output .cell-output-stdout}
    {
      "segments": [
        { "label": true, "text": "<s>Hello\n" },
        { "label": true, "text": "hi there!. " },
        { "label": false, "text": "goodbye " },
        { "label": true, "text": "farewell</s>" }
      ]
    }
    :::
