To use pre-tokenized data with Axolotl, follow these steps:
1. **Ensure your data format**: Your dataset must have columns named exactly `input_ids`, `attention_mask`, and `labels`.
2. **Configuration**: Do not specify a `type:` in your Axolotl configuration file. This indicates that you are using a custom dataset format, in this case pre-tokenized data.
Here's an example of how you might set up your configuration file (`config.yml`):

```{.yaml filename="config.yml"}
datasets:
  - path: path_to_your_pre_tokenized_data
```
3. **Loading the dataset**: Axolotl will expect the dataset to be in the format mentioned above. You can load your dataset using the standard methods provided by the `datasets` library, or any other method you prefer, as long as the final format matches.
4. **Using the dataset**: Once loaded, you can directly use this dataset for training or evaluation without any further tokenization steps, as the data is already in the required format.
Here's a simple code example illustrating how you might load and use a pre-tokenized dataset:
```python
from datasets import load_dataset

# Assuming your dataset is in a format recognized by the `datasets` library
dataset = load_dataset('path/to/dataset', split='train')

# Example of what the dataset should look like
print(dataset[0])
# Output: {'input_ids': [101, 2054, 2003, 102, ...], 'attention_mask': [1, 1, 1, 1, ...], 'labels': [1]}

# Now, you can directly use this dataset for training or evaluation with Axolotl
```
Remember, the key here is that your dataset must already be tokenized and structured correctly. This means you should have already converted your text into `input_ids` and generated the appropriate `attention_mask` and `labels` for each example in your dataset.
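To make that requirement concrete, here is a minimal sketch of how such a dataset could be produced ahead of time. It assumes a causal-LM setup where `labels` simply mirror `input_ids`; the model name, file paths, and `max_length` are placeholders, not values Axolotl requires.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_base_model")               # placeholder model id
raw = load_dataset("json", data_files="raw_text.jsonl", split="train")     # placeholder path

def pretokenize(example):
    enc = tokenizer(example["text"], truncation=True, max_length=2048)
    # For a plain causal-LM objective the labels are a copy of input_ids;
    # positions you do not want to train on would be set to -100 instead.
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = raw.map(pretokenize, remove_columns=raw.column_names)
tokenized.save_to_disk("path_to_your_pre_tokenized_data")
```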
---
title: Custom Pre-Tokenized Dataset
description: How to use a custom pre-tokenized dataset.
order: 5
---
- Do not pass a `type:` in your axolotl config.
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
```{.yaml filename="config.yml"}
- path: ...
```
```python
def tokenize_function(examples):
    # Remove empty lines
    examples = [line for line in examples if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples,
        return_special_tokens_mask=True,
        padding=padding,
        truncation=True,
        max_length=data_args.max_seq_length,
    )
```
```python
def tokenize(example):
    output = {}
    output["input_ids"] = tokenizer(example["content"], truncation=False)["input_ids"]
    output["ratio_char_token"] = len(example["content"]) / len(output["input_ids"])
    return output
```
```python
def tokenize_function(examples):
    # Remove empty lines
    examples[text_column_name] = [
        line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples[text_column_name],
        padding=padding,
        truncation=True,
        max_length=max_seq_length,
        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
        # receives the `special_tokens_mask`.
        return_special_tokens_mask=True,
    )
```
```python
def prepare_data(examples):
    # Remove pairs where at least one record is none
    inputs = examples[INPUT_COLUMN]
    targets = examples[TARGET_COLUMN]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
    labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
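A batched preprocessing function like the one above is typically applied via `datasets.Dataset.map`. The snippet below is only an illustrative sketch; `raw_ds`, `INPUT_COLUMN`, and `TARGET_COLUMN` are assumed to be defined elsewhere.

```python
# Hypothetical usage of prepare_data over a datasets.Dataset named raw_ds
tokenized_ds = raw_ds.map(
    prepare_data,
    batched=True,
    remove_columns=raw_ds.column_names,  # keep only input_ids / attention_mask / labels
)
print(tokenized_ds[0].keys())
```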
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" implementations allow a significant speed-up, in particular when doing batched tokenization, as well as additional methods to map between the original string (characters and words) and the token space.
The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository). They both rely on [`~tokenization_utils_base.PreTrainedTokenizerBase`], which contains the common methods, and [`~tokenization_utils_base.SpecialTokensMixin`].
[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] thus implement the main methods for using all the tokenizers:
[`BatchEncoding`] holds the output of the [`~tokenization_utils_base.PreTrainedTokenizerBase`]'s encoding methods (`__call__`, `encode_plus` and `batch_encode_plus`) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (`input_ids`, `attention_mask`, ...). When the tokenizer is a "Fast" tokenizer (i.e., backed by the HuggingFace 🤗 Tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
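For example, with a fast tokenizer the returned [`BatchEncoding`] exposes these alignment helpers directly. The snippet below is illustrative only; `bert-base-uncased` is just an arbitrary checkpoint that ships a fast tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Hello world!")      # returns a BatchEncoding

print(encoding["input_ids"])              # dict-style access to the model inputs
print(encoding.tokens())                  # the token strings
print(encoding.word_ids())                # word index per token (None for special tokens)
print(encoding.char_to_token(6))          # token covering the character at position 6 ("w")
print(encoding.token_to_chars(2))         # character span covered by token 2
```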
[[autodoc]] PreTrainedTokenizer - call - add_tokens - add_special_tokens - apply_chat_template - batch_decode - decode - encode - push_to_hub - all
The [`PreTrainedTokenizerFast`] depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers. Take a look at the Using tokenizers from 🤗 Tokenizers page to understand how this is done.
[[autodoc]] PreTrainedTokenizerFast - call - add_tokens - add_special_tokens - apply_chat_template - batch_decode - decode - encode - push_to_hub - all
[[autodoc]] BatchEncoding
```python
ds = ds.map(
    tokenize,
    num_proc=args.num_workers,
    remove_columns=[
        "repo_name", "path", "copies", "size", "content", "license", "hash",
        "line_mean", "line_max", "alpha_frac", "autogenerated",
    ],
)
print(f"Dataset tokenized in {time.time()-t_start:.2f}s")

t_start = time.time()
ds.push_to_hub(args.tokenized_data_repo)
print(f"Data pushed to the hub in {time.time()-t_start:.2f}s")
```
The [`PreTrainedTokenizerFast`] depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.
Before getting into the specifics, let's first start by creating a dummy tokenizer in a few lines:
```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)
```
We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future re-use.
Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [`PreTrainedTokenizerFast`] class allows for easy instantiation, by accepting the instantiated tokenizer object as an argument:
```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```
This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.
In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer:
```python
>>> tokenizer.save("tokenizer.json")
```
The path to which we saved this file can be passed to the [`PreTrainedTokenizerFast`] initialization method using the `tokenizer_file` parameter:
```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```
This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.
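As a quick sanity check (illustrative only, reusing the tokenizer trained above), the loaded object supports the usual encode/decode calls:

```python
enc = fast_tokenizer("Hello, world!")
print(enc["input_ids"])
print(fast_tokenizer.decode(enc["input_ids"]))
# Note: padding/truncation require a pad token to be set first, e.g.
# fast_tokenizer.pad_token = "[PAD]"
```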
```python
def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs
```
```python
# data preprocessing
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
```
```python
def preprocess(samples):
    batch = []
    for conversation in samples["messages"]:
        batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"content": batch}
```
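A batched `map` call is one way such a function might be applied; this sketch assumes `ds` has a `messages` column holding chat-format conversations.

```python
# Hypothetical usage: turn chat conversations into plain text via the chat template
ds_with_text = ds.map(preprocess, batched=True, remove_columns=ds.column_names)
print(ds_with_text[0]["content"])
```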
```python
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])
```
def wrap_pretraining_dataset( dataset, tokenizer, cfg, ds_wrapper_fn, max_tokens=2048, batch_size=1, seed=42, buffer_size=10_000, ): if cfg.sample_packing: collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq( tokenizer, return_tensors="pt", padding=True, pad_to_multiple_of=max_tokens * batch_size, multipack_attn=cfg.pretrain_multipack_attn, ) encode = functools.partial( encode_packed_pretraining, collate_fn, ds_wrapper_fn, max_seq_length=max_tokens, batch_size=batch_size, multipack_attn=cfg.pretrain_multipack_attn, ) # set this to 1 so downstream data_loader doesn't try to increase the batch again cfg.micro_batch_size = 1 else: encode = functools.partial(encode_pretraining, tokenizer, max_tokens) if cfg.shuffle_merged_datasets: dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size) else: LOG.debug("NOT shuffling merged pretraining datasets") # remove all the existing columns after mapping since they end up having # a different length than the encoded/tokenized column # this is empty during streaming/pretraining remove_columns = [] if dataset.features is None: for first_row in dataset: remove_columns = first_row.keys() break else: remove_columns = dataset.features.keys() dataset = dataset.map( encode, batched=True, batch_size=buffer_size, # input_columns="text", remove_columns=remove_columns, ) return dataset
```python
# import torch
# from peft import PeftModel, PeftConfig
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
#
# peft_model_id = "ybelkada/opt-6.7b-lora"
# config = PeftConfig.from_pretrained(peft_model_id)
# model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map='auto')
# tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
#
# ## Load the Lora model
# model = PeftModel.from_pretrained(model, peft_model_id)
#
# """## Inference
#
# You can then directly use the trained model or the model that you have loaded from the 🤗 Hub for inference as you would do it usually in `transformers`.
# """
# batch = tokenizer("Two things are infinite: ", return_tensors="pt")
```
```python
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(
        inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
```

```python
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
```
def load_prepare_datasets( tokenizer: PreTrainedTokenizerBase, cfg, default_dataset_prepared_path, split="train", ) -> Tuple[Dataset, Dataset, List[Prompter]]: dataset, prompters = load_tokenized_prepared_datasets( tokenizer, cfg, default_dataset_prepared_path, split=split ) if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None: LOG.info( f"Using index #{cfg.dataset_shard_idx} of {cfg.dataset_shard_num} shards" ) dataset = dataset.shard( num_shards=cfg.dataset_shard_num, index=cfg.dataset_shard_idx, ) if split == "train" and cfg.val_set_size: # ensure we end up with the same fingerprint by doing rank0 first and being able to cache to_hash_train = ( dataset._fingerprint # pylint: disable=protected-access + "|" + str(cfg.val_set_size) + "|" + "train" + "|" + str(cfg.seed or 42) ) to_hash_test = ( dataset._fingerprint # pylint: disable=protected-access + "|" + str(cfg.val_set_size) + "|" + "test" + "|" + str(cfg.seed or 42) ) train_fingerprint = md5(to_hash_train) test_fingerprint = md5(to_hash_test) dataset = dataset.train_test_split( test_size=cfg.val_set_size, shuffle=False, seed=cfg.seed or 42, train_new_fingerprint=train_fingerprint, test_new_fingerprint=test_fingerprint, ) train_dataset = dataset["train"] eval_dataset = dataset["test"] elif split == "test": train_dataset = None eval_dataset = dataset else: train_dataset = dataset eval_dataset = None return train_dataset, eval_dataset, prompters
---
title: Pre-training
description: Data format for a pre-training completion task.
order: 1
---
For pretraining, there is no prompt template or roles. The only required field is `text`:
```{.json filename="data.jsonl"}
{"text": "first row"}
{"text": "second row"}
...
```
:::{.callout-note}
Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
```yaml
pretraining_dataset: # hf path only
...
```
:::
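Under the hood this corresponds to streaming mode in the `datasets` library, where rows are fetched lazily instead of being materialized in memory. A rough sketch (the dataset id and the `text` column are placeholders):

```python
from datasets import load_dataset

ds = load_dataset("hf_user/large_pretraining_corpus", split="train", streaming=True)
for example in ds.take(2):   # IterableDataset: rows are pulled lazily, not loaded into memory
    print(example["text"][:80])
```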
def load_tokenized_prepared_datasets( tokenizer, cfg, default_dataset_prepared_path, split="train", ) -> Tuple[DatasetDict, List[Prompter]]: cfg_datasets = cfg.test_datasets if split == "test" else cfg.datasets tokenizer_name = cfg.tokenizer_config ds_hash = str( md5( ( str(cfg.sequence_len) + "@" + str(cfg.sample_packing) + "@" + str(cfg.eval_sample_packing) + "@" + str(cfg.group_by_length) + "@" + "|".join( sorted( [ f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}" for d in cfg_datasets ] ) ) + "|" + tokenizer_name ) ) ) prepared_ds_path = ( Path(cfg.dataset_prepared_path) / ds_hash if cfg.dataset_prepared_path else Path(default_dataset_prepared_path) / ds_hash ) dataset = None prompters = [] use_auth_token = cfg.hf_use_auth_token try: if cfg.push_dataset_to_hub: dataset = load_dataset( f"{cfg.push_dataset_to_hub}/{ds_hash}", token=use_auth_token, ) dataset = dataset[split] except Exception: # pylint: disable=broad-except # nosec pass # pylint: disable=duplicate-code if dataset: ... elif ( cfg.dataset_prepared_path and any(prepared_ds_path.glob("*")) and not cfg.is_preprocess ): LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...") dataset = load_from_disk(str(prepared_ds_path)) LOG.info("Prepared dataset loaded from disk...") else: LOG.info(f"Unable to find prepared dataset in {prepared_ds_path}") LOG.info("Loading raw datasets...") if not cfg.is_preprocess: LOG.warning( "Processing datasets during training can lead to VRAM instability. Please pre-process your dataset." ) if cfg.seed: seed = cfg.seed else: LOG.info("No seed provided, using default seed of 42") seed = 42 datasets = [] def for_d_in_datasets(dataset_configs): for dataset in dataset_configs: if dataset.name and isinstance(dataset.name, list): for name in dataset.name: yield DictDefault({**dataset, "name": name}) else: yield dataset # pylint: disable=invalid-name for config_dataset in for_d_in_datasets(cfg_datasets): ds: Optional[Union[Dataset, DatasetDict]] = None ds_from_hub = False try: load_dataset( config_dataset.path, name=config_dataset.name, streaming=True, token=use_auth_token, ) ds_from_hub = True except (FileNotFoundError, ConnectionError, HFValidationError, ValueError): pass ds_from_cloud = False storage_options = {} remote_file_system = None if config_dataset.path.startswith("s3://"): try: import aiobotocore.session # type: ignore import s3fs # type: ignore except ImportError as exc: raise ImportError( "s3:// paths require aiobotocore and s3fs to be installed" ) from exc # Takes credentials from ~/.aws/credentials for default profile s3_session = aiobotocore.session.AioSession(profile="default") storage_options = {"session": s3_session} remote_file_system = s3fs.S3FileSystem(**storage_options) elif config_dataset.path.startswith( "gs://" ) or config_dataset.path.startswith("gcs://"): try: import gcsfs # type: ignore except ImportError as exc: raise ImportError( "gs:// or gcs:// paths require gcsfs to be installed" ) from exc # gcsfs will use default credentials from the environment else anon # https://gcsfs.readthedocs.io/en/latest/#credentials storage_options = {"token": None} remote_file_system = gcsfs.GCSFileSystem(**storage_options) # TODO: Figure out how to get auth creds passed # elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"): # try: # import adlfs # except ImportError as exc: # raise ImportError( # "adl:// or abfs:// paths require adlfs to be installed" # ) from exc # # Gen 1 # storage_options = { # "tenant_id": TENANT_ID, # 
"client_id": CLIENT_ID, # "client_secret": CLIENT_SECRET, # } # # Gen 2 # storage_options = { # "account_name": ACCOUNT_NAME, # "account_key": ACCOUNT_KEY, # } # remote_file_system = adlfs.AzureBlobFileSystem(**storage_options) try: if remote_file_system and remote_file_system.exists( config_dataset.path ): ds_from_cloud = True except (FileNotFoundError, ConnectionError): pass # prefer local dataset, even if hub exists local_path = Path(config_dataset.path) if local_path.exists(): if local_path.is_dir(): if config_dataset.data_files: ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.data_files, streaming=False, split=None, ) else: ds = load_from_disk(config_dataset.path) elif local_path.is_file(): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, ) else: raise ValueError( "unhandled dataset load: local path exists, but is neither a directory or a file" ) elif ds_from_hub: ds = load_dataset( config_dataset.path, name=config_dataset.name, streaming=False, data_files=config_dataset.data_files, token=use_auth_token, ) elif ds_from_cloud and remote_file_system: if remote_file_system.isdir(config_dataset.path): ds = load_from_disk( config_dataset.path, storage_options=storage_options, ) elif remote_file_system.isfile(config_dataset.path): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, storage_options=storage_options, ) elif config_dataset.path.startswith("https://"): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, storage_options=storage_options, ) else: if isinstance(config_dataset.data_files, str): fp = hf_hub_download( repo_id=config_dataset.path, repo_type="dataset", filename=config_dataset.data_files, ) elif isinstance(config_dataset.data_files, list): fp = [] for file in config_dataset.data_files: fp.append( hf_hub_download( repo_id=config_dataset.path, repo_type="dataset", filename=file, ) ) else: raise ValueError( "data_files must be either a string or list of strings" ) ds = load_dataset( "json", name=config_dataset.name, data_files=fp, streaming=False, split=None, ) if not ds: raise ValueError("unhandled dataset load") d_base_type = d_prompt_style = None d_type = config_dataset.type if isinstance(d_type, str): d_type_split = d_type.split(":") d_base_type = d_type_split[0] d_prompt_style = d_type_split[1] if len(d_type_split) > 1 else None if isinstance(ds, DatasetDict): if config_dataset.split and config_dataset.split in ds: ds = ds[config_dataset.split] elif split in ds: ds = ds[split] else: raise ValueError( f"no {split} split found for dataset {config_dataset.path}, you may specify a split with 'split: `" ) # support for using a subset of the data if config_dataset.shards: shards_idx = config_dataset.get("shards_idx", 0) ds = ds.shuffle(seed=seed).shard( num_shards=config_dataset.shards, index=shards_idx ) dataset_wrapper, dataset_prompter = get_dataset_wrapper( config_dataset=config_dataset, tokenizer=tokenizer, cfg=cfg, dataset=ds, d_base_type=d_base_type, d_prompt_style=d_prompt_style, ) datasets.append(dataset_wrapper) prompters.append(dataset_prompter) LOG.info("merging datasets") dataset = concatenate_datasets(datasets) if len(datasets) > 1: if cfg.shuffle_merged_datasets: LOG.debug("shuffle merged 
datasets") dataset = dataset.shuffle(seed=seed) else: LOG.debug("NOT shuffling merged datasets") dataset, _ = process_datasets_for_packing(cfg, dataset, None) if cfg.local_rank == 0: LOG.info(f"Saving merged prepared dataset to disk... {prepared_ds_path}") dataset.save_to_disk(str(prepared_ds_path)) if cfg.push_dataset_to_hub: LOG.info( f"Saving merged prepared dataset with push_to_hub... {cfg.push_dataset_to_hub}/{ds_hash}" ) dataset.push_to_hub( f"{cfg.push_dataset_to_hub}/{ds_hash}", private=True ) return dataset, prompters
def preprocess_function(examples): batch_size = len(examples[text_column]) inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]] targets = [str(x) for x in examples[label_column]] model_inputs = tokenizer(inputs) labels = tokenizer(targets, add_special_tokens=False) # don't add bos token because we concatenate with inputs for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id] model_inputs["input_ids"][i] = sample_input_ids + label_input_ids labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i]) for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * ( max_length - len(sample_input_ids) ) + sample_input_ids model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[ "attention_mask" ][i] labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length]) model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length]) labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length]) model_inputs["labels"] = labels["input_ids"] return model_inputs
def encode_pretraining( tokenizer: PreTrainedTokenizerBase, max_tokens: int, examples: List[str] ) -> Dict[str, List]: res = tokenizer( examples, truncation=True, max_length=max_tokens - 2, add_special_tokens=True, ) # Convert to PyTorch tensors input_ids = [torch.tensor(seq) for seq in res["input_ids"]] attention_mask = [torch.tensor(seq) for seq in res["attention_mask"]] new_input_ids = [] new_attention_mask = [] # Append EOS and PAD tokens to input_ids, and correct attention_mask for i, _ in enumerate(input_ids): input_ids[i] = torch.cat( ( input_ids[i], torch.tensor([tokenizer.eos_token_id, tokenizer.pad_token_id]), ), dim=0, ) attention_mask[i] = torch.cat((attention_mask[i], torch.tensor([1, 0])), dim=0) # Concatenate tokens so that their lengths are less than max_tokens buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) for ids, mask in zip(input_ids, attention_mask): if buffer_input_ids.numel() == max_tokens: new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) elif buffer_input_ids.numel() + ids.numel() <= max_tokens: buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) else: buffer_input_ids = torch.cat( ( buffer_input_ids, torch.full( (max_tokens - buffer_input_ids.numel(),), tokenizer.pad_token_id, dtype=torch.long, ), ), dim=0, ) buffer_attention_mask = torch.cat( ( buffer_attention_mask, torch.full( (max_tokens - buffer_attention_mask.numel(),), 0, dtype=torch.long, ), ), dim=0, ) new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) if buffer_input_ids.numel() > 0: # for any leftover tokens while buffer_input_ids.numel() < max_tokens: # make all sequences equal in size buffer_input_ids = torch.cat( ( buffer_input_ids, torch.full( (max_tokens - buffer_input_ids.numel(),), tokenizer.pad_token_id, dtype=torch.long, ), ), dim=0, ) buffer_attention_mask = torch.cat( ( buffer_attention_mask, torch.full( (max_tokens - buffer_attention_mask.numel(),), 0, dtype=torch.long, ), ), dim=0, ) new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) ret = { "input_ids": [seq.tolist() for seq in new_input_ids], "labels": [seq.tolist() for seq in new_input_ids], "attention_mask": [seq.tolist() for seq in new_attention_mask], } LOG.debug(len(ret["input_ids"])) return ret
Run
```bash
accelerate launch -m axolotl.cli.train your_config.yml
```
> [!TIP]
> You can also reference a config file that is hosted on a public URL, for example
```bash
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
```
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.

- Set `dataset_prepared_path:` to a local folder for saving and loading the pre-tokenized dataset.
- (Optional): Set `push_dataset_to_hub: hf_user/repo` to push it to Huggingface.
- (Optional): Use `--debug` to see preprocessed examples.

```bash
python -m axolotl.cli.preprocess your_config.yml
```
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
```yaml
deepspeed: deepspeed_configs/zero1.json
```
```bash
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
```
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
```yaml
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
```
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
```
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
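For reference, the rough `transformers` equivalent of what happens is sketched below; the model id is a placeholder and this is not Axolotl's internal code.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your_base_model")   # placeholder model id
model = AutoModelForCausalLM.from_pretrained("your_base_model")

tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])   # delimiter tokens
model.resize_token_embeddings(len(tokenizer))          # embedding table must match the new vocab size
```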
"""data handling specific to pretraining"""
def _tokenize( self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False ) -> BatchEncoding: res = self.tokenizer( prompt, truncation=True, max_length=self.max_length - 1, add_special_tokens=True, return_overflowing_tokens=True, stride=256, ) res["input_ids"] = [ seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"] ] res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]] return res
class PretrainTokenizationStrategy(PromptTokenizingStrategy): """handles tokenization for pretraining with strides""" @property def supports_batched(self): return True def __init__(self, *args, max_length=None, text_column="text", **kwargs): super().__init__(*args, **kwargs) if max_length: self.max_length = max_length self.text_column = text_column def _tokenize( self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False ) -> BatchEncoding: res = self.tokenizer( prompt, truncation=True, max_length=self.max_length - 1, add_special_tokens=True, return_overflowing_tokens=True, stride=256, ) res["input_ids"] = [ seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"] ] res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]] return res def tokenize_prompt(self, prompt): return self._tokenize(prompt[self.text_column])
def load(tokenizer, cfg): strat = PretrainTokenizationStrategy( PretrainTokenizer(), tokenizer, cfg.train_on_inputs, cfg.sequence_len, text_column=cfg.pretraining_dataset[0]["text_column"] or "text", max_length=cfg.sequence_len * 64, ) return strat
"""Module for tokenization utilities"""
Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.
Features:
| | fp16/fp32 | lora | qlora | gptq | gptq w/flash attn | flash attn | xformers attn |
|-------------|:----------|:-----|-------|------|-------------------|------------|--------------|
| llama | β | β | β | β | β | β | β |
| Mistral | β | β | β | β | β | β | β |
| Mixtral-MoE | β | β | β | β | β | β | β |
| Mixtral8X22 | β | β | β | β | β | β | β |
| Pythia | β | β | β | β | β | β | β |
| cerebras | β | β | β | β | β | β | β |
| btlm | β | β | β | β | β | β | β |
| mpt | β | β | β | β | β | β | β |
| falcon | β | β | β | β | β | β | β |
| gpt-j | β | β | β | β | β | β | β |
| XGen | β | β | β | β | β | β | β |
| phi | β | β | β | β | β | β | β |
| RWKV | β | β | β | β | β | β | β |
| Qwen | β | β | β | β | β | β | β |
| Gemma | β | β | β | β | β | β | β |
β : supported β: not supported β: untested
Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.
Requirements: Python >=3.10 and Pytorch >=2.1.1.
```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
```
```bash
# preprocess datasets - optional but recommended
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

# inference
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out"

# gradio
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out" --gradio

# remote yaml files - the yaml config can be hosted on a public URL
# Note: the yaml config must directly link to the **raw** yaml
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
```
docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest
Or run on the current files for development:
docker compose up -d
<details>

<summary>Docker advanced</summary>

> [!Tip]
> If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.
A more powerful Docker command to run would be this:
```bash
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
```
It additionally:

- Prevents memory issues when running e.g. deepspeed through the `--ipc` and `--ulimit` args.
- Persists the downloaded HF data (models etc.) and your modifications to the axolotl code through the `--mount`/`-v` args.
- The `--name` argument simply makes it easier to refer to the container in vscode (`Dev Containers: Attach to Running Container...`) or in your terminal.
- The `--privileged` flag gives all capabilities to the container.
- The `--shm-size 10g` argument increases the shared memory size. Use this if you see `exitcode: -7` errors using deepspeed.

More information on nvidia website
</details>

Install python >=3.10
Install pytorch stable https://pytorch.org/get-started/locally/
Install Axolotl along with python dependencies
```bash
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
```
(Optional) Login to Huggingface to use gated models/datasets.
huggingface-cli login
Get the token at huggingface.co/settings/tokens
For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest
```bash
sudo apt update
sudo apt install -y python3.10

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --config python # pick 3.10 if given option
python -V # should be 3.10
```
```bash
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
```
Install Pytorch https://pytorch.org/get-started/locally/
Follow instructions on quickstart.
Run
```bash
pip3 install protobuf==3.20.3
pip3 install -U --ignore-installed requests Pillow psutil scipy
```

</details>

```bash
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```
Use a Deeplearning linux OS with cuda and pytorch installed. Then follow instructions on quickstart.
Make sure to run the below to uninstall xla.
</details>

```bash
pip uninstall -y torch_xla[tpu]
```
Please use WSL or Docker!
Use the below instead of the install method in QuickStart.
pip3 install -e '.'
More info: mac.md
Please use this example notebook.
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds sky check
Get the example YAMLs of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
```bash
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/axolotl
```
Use one command to launch:
```bash
# On-demand
HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

# Managed spot (auto-recovery on preemption)
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
```
Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
See these docs for more information on how to use different dataset formats.
See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:
model
```yaml
base_model: ./llama-7b-hf # local or huggingface repo
```
Note: The code will load the right architecture.
dataset
```yaml
datasets:
  # huggingface repo
  - path: vicgalle/alpaca-gpt4
    type: alpaca

  # huggingface repo with specific configuration/subset
  - path: EleutherAI/pile
    name: enron_emails
    type: completion # format from earlier
    field: text # Optional[str] default: text, field to use for completion data

  # huggingface repo with multiple named configurations/subsets
  - path: bigcode/commitpackft
    name:
      - ruby
      - python
      - typescript
    type: ... # unimplemented custom format

  # fastchat conversation
  # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
  - path: ...
    type: sharegpt
    conversation: chatml # default: vicuna_v1.1

  # local
  - path: data.jsonl # or json
    ds_type: json # see other options below
    type: alpaca

  # dataset with splits, but no train split
  - path: knowrohit07/know_sql
    type: context_qa.load_v2
    train_on_split: validation

  # loading from s3 or gcs
  # s3 creds will be loaded from the system default and gcs only supports public access
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
    ...

  # Loading Data From a Public URL
  # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
  - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
    ds_type: json # this is the default, see other options below.
```
loading
```yaml
load_in_4bit: true
load_in_8bit: true

bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
tf32: true # require >=ampere

bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
float16: true # use instead of fp16 when you don't want AMP
```
Note: Repo does not do 4-bit quantization.
lora
```yaml
adapter: lora # 'qlora' or leave blank for full finetune
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
```
See these docs for all config options.
---
title: Dataset Preprocessing
description: How datasets are processed
---
Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the [dataset format](../dataset-formats/) and prompt strategies to:
- parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
The processing of the datasets can happen one of two ways:
1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
2. When training is started
What are the benefits of pre-processing? When training interactively or for sweeps
(e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
training parameters so that it will intelligently pull from its cache when possible.
The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
data is in the cache.
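Conceptually, the cache key works something like the sketch below (illustrative only, not Axolotl's exact hashing code): the directory name is an md5 hash of the settings the tokenized data depends on, so changing any of them produces a fresh cache entry.

```python
import hashlib

def prepared_ds_dir(sequence_len, sample_packing, dataset_specs, tokenizer_name,
                    base_path="./last_run_prepared"):
    # Combine the settings the tokenized data depends on into a single cache key.
    key = f"{sequence_len}@{sample_packing}@" + "|".join(sorted(dataset_specs)) + "|" + tokenizer_name
    return f"{base_path}/{hashlib.md5(key.encode()).hexdigest()}"

print(prepared_ds_dir(2048, True, ["vicgalle/alpaca-gpt4:alpaca"], "meta-llama/Llama-2-7b-hf"))
```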
What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined prompt template. Because the trainer cannot readily detect these changes, the calculated hash value for the pre-processed dataset will not change. If you have `dataset_prepared_path: ...` set and change your prompt templating logic, it may not pick up the changes you made and you will be training over the old prompt.
def prepare_dataset(cfg, tokenizer): prompters = [] if not cfg.pretraining_dataset: with zero_first(is_main_process()): if cfg.test_datasets: train_dataset, _, prompters = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="train" ) _, eval_dataset, _ = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="test" ) else: train_dataset, eval_dataset, prompters = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH ) else: path = cfg.pretraining_dataset split = "train" name = None if isinstance(cfg.pretraining_dataset, list) and isinstance( cfg.pretraining_dataset[0], dict ): path = cfg.pretraining_dataset[0]["path"] name = cfg.pretraining_dataset[0]["name"] if "split" in cfg.pretraining_dataset[0]: split = cfg.pretraining_dataset[0]["split"] ds_wrapper_partial = functools.partial( get_dataset_wrapper, cfg.pretraining_dataset[0], tokenizer, cfg, cfg.pretraining_dataset[0]["type"] or "pretrain", ) train_dataset = wrap_pretraining_dataset( load_dataset(path, streaming=True, split=split, name=name), tokenizer, cfg, ds_wrapper_partial, max_tokens=cfg.sequence_len, batch_size=cfg.micro_batch_size, seed=cfg.seed or 42, buffer_size=cfg.pretrain_multipack_buffer_size or 10_000, ) # https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230 train_dataset = train_dataset.with_format("torch") eval_dataset = None return train_dataset, eval_dataset, cfg.max_steps, prompters if eval_dataset and cfg.sample_packing and cfg.eval_sample_packing is not False: total_eval_steps = calculate_total_num_steps(cfg, eval_dataset, update=False) if total_eval_steps == 0: raise ValueError( "eval dataset split is too small for sample_packing. You should set `eval_sample_packing: False`. " ) if cfg.max_steps: total_num_steps = min( calculate_total_num_steps(cfg, train_dataset), cfg.max_steps ) LOG.info(f"Maximum number of steps set at {total_num_steps}") else: total_num_steps = calculate_total_num_steps(cfg, train_dataset) return train_dataset, eval_dataset, total_num_steps, prompters
---
title: Template-free prompt construction
description: "Template-free prompt construction with the `input_output` format"
---
<!-- TOC -->
- [Background](#background)
- [Masking Inputs](#masking-inputs)
- [You may not want prompt templates](#you-may-not-want-prompt-templates)
- [The `input_output` format](#the-input_output-format)
- [Usage](#usage)
- [1. Prepare Data](#1-prepare-data)
- [2. Use `type: input_output`](#2-use-type-input_output)
- [3. Check the prompts](#3-check-the-prompts)
<!-- /TOC -->
<a id="markdown-background" name="background"></a>
## Background
<a id="markdown-masking-inputs" name="masking-inputs"></a>
### Masking Inputs
One of the most popular features of
[axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
setting the following configuration value:
```yaml
train_on_inputs: false
```
If you declare a dataset format such as `alpaca` or `chatml`, axolotl knows what is an input (i.e. human) vs. an output (i.e. the assistant) and masks the input labels so that your model can focus on predicting the outputs only.
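In terms of the tokenized example, "masking the input labels" just means replacing them with `-100` so the loss ignores those positions. A simplified sketch, assuming `tokenizer`, `prompt_text`, and `response_text` are already defined:

```python
prompt_ids = tokenizer(prompt_text)["input_ids"]                                 # the "human" part
response_ids = tokenizer(response_text, add_special_tokens=False)["input_ids"]   # the assistant part

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids    # -100 positions are ignored by the loss
attention_mask = [1] * len(input_ids)
```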
<a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>
However, there are many situations where you don't want to use one of these formats or templates. This is because they can:

- add special delimiter tokens like `<|im_start|>` that can quickly become footguns if you don't include them correctly at inference time.

<a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>
## The `input_output` format

You can construct your prompts without a template by using the `input_output` format, by setting `type: input_output` in your configuration file like this:
```{.yaml filename="config.yml"}
train_on_inputs: false # Mask segments of your data
datasets:
  - path: output.jsonl
    type: input_output # use template free prompt construction
```
Unlike `type: completion`, which is also template-free, `type: input_output` allows you to mask segments of your text. More details on how this works are described below.
<a id="markdown-usage" name="usage"></a>
This is how you can use the input_output
format:
<a id="markdown-1-prepare-data" name="1-prepare-data"></a>
To use the input_output
format, collect your data in the following
format into a jsonl file (below is the first row from the file
output
.jsonl` pretty printed):
```bash
$ head -n1 output.jsonl | python -m json.tool
```
:::{.cell-output .cell-output-stdout}
```json
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
:::
Set `label: false` when you want to mask a segment of text so that the model isn't trained on it. Some things to keep in mind:
> [!IMPORTANT]
>
> - EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl concatenates all the segments as-is. The tokenizer doesn't add anything additional. Notice how I added spaces, newlines, `<s>` (BOS), and `</s>` (EOS) myself.
> - Make sure you check the materialized output to validate that the prompt is getting assembled how you like.
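Conceptually, the segments are assembled something like the sketch below (illustrative only, not Axolotl's actual implementation): every segment's tokens go into `input_ids`, while segments marked `"label": false` contribute `-100` labels so the loss skips them.

```python
def tokenize_segments(segments, tokenizer):
    input_ids, labels = [], []
    for seg in segments:
        ids = tokenizer(seg["text"], add_special_tokens=False)["input_ids"]
        input_ids += ids
        labels += ids if seg["label"] else [-100] * len(ids)   # mask segments with "label": false
    return {"input_ids": input_ids, "labels": labels, "attention_mask": [1] * len(input_ids)}
```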
<a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>
type: input_output
Let's materialize data with our output.jsonl
file by setting
type: input_output
in our axolotl config:
```yaml
# training_config.yaml
base_model: mistralai/Mistral-7B-v0.1
data_seed: 49
seed: 49
datasets:
  - path: output.jsonl
    type: input_output
val_set_size: 0.1
sequence_len: 896
sample_packing: false
micro_batch_size: 2
gradient_accumulation_steps: 3
eval_batch_size: 2
num_epochs: 1
learning_rate: 0.0002
train_on_inputs: false
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```
You can use the following command to materialize your data. The `--debug` flag will print the tokens, along with the labels, so you can verify that the correct items are being ignored:
```bash
$ python -m axolotl.cli.preprocess training_config.yaml --debug

...
[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557) (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
```
The format is `decoded_token`(`label`, `token_id`), for example, `<s>(1, 1)` means that the token is `<s>`, the label is `1` and the token_id is `1`. When the label is `-100`, that token is ignored for training.
<a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>
Here is another way to check the materialized output:
```python
from transformers import AutoTokenizer
from datasets import load_from_disk
import yaml

directory = !ls last_run_prepared/  # IPython/Jupyter shell magic

with open('training_config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
```
```python
>>> row = ds[0]
>>> print(tok.decode(row['input_ids']))
<s> Hello hi there!. goodbye farewell</s>
```
We can check that the right tokens are ignored by comparing the labels to each token:
```python
import pandas as pd

pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
              for i, l in zip(row['input_ids'], row['labels'])])
```
| | token | label | id |
|----|-------|-------|-------|
| 0 | `<s>` | 1 | 1 |
| 1 | Hello | 22557 | 22557 |
| 2 | \n | 13 | 13 |
| 3 | hi | 12014 | 12014 |
| 4 | there | 736 | 736 |
| 5 | ! | 28808 | 28808 |
| 6 | . | 28723 | 28723 |
| 7 | | 28705 | 28705 |
| 8 | good | -100 | 1179 |
| 9 | bye | -100 | 17664 |
| 10 | | -100 | 28705 |
| 11 | fare | 19111 | 19111 |
| 12 | well | 5458 | 5458 |
| 13 | `</s>` | 2 | 2 |
If we look at the input data, the above table seems correct! (The jsonl version is repeated below for reference):
```bash
$ head -n1 output.jsonl | python -m json.tool
```
:::{.cell-output .cell-output-stdout}
```json
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
:::