
    how can I use pre-tokenized data

    To use pre-tokenized data with Axolotl, follow these steps:

    1. Ensure Your Data Format: Your dataset must have columns exactly named input_ids, attention_mask, and labels.

    2. Configuration: Do not specify a type: in your Axolotl configuration file. This indicates that you are using a custom dataset format, in this case, pre-tokenized data.

    Here's an example of how you might set up your configuration file (config.yml):

    datasets:
      - path: path_to_your_pre_tokenized_data

    3. Loading the Dataset: Axolotl expects the dataset to be in the format described above. You can produce it with the datasets library or with any other tool, as long as the final columns match.

    4. Using the Dataset: Once loaded, the dataset is used for training or evaluation as-is, with no further tokenization, because the data is already in the required format.

    Here's a simple code example illustrating how you might load and use a pre-tokenized dataset:

    from datasets import load_dataset

    # Assuming your dataset is in a format recognized by the `datasets` library
    dataset = load_dataset('path/to/dataset', split='train')

    # Example of what a row should look like. `labels` has one entry per token in
    # `input_ids`; use -100 for positions that should be ignored during training.
    print(dataset[0])
    # Output: {'input_ids': [101, 2054, 2003, 102, ...], 'attention_mask': [1, 1, 1, 1, ...], 'labels': [101, 2054, 2003, 102, ...]}

    # Now, you can directly use this dataset for training or evaluation with Axolotl

    Remember, the key here is that your dataset must already be tokenized and structured correctly. This means you should have already converted your text into input_ids and generated the appropriate attention_mask and labels for each example in your dataset.
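The structural requirement above can be checked before training. The following is a minimal sketch using only the standard library; the file name `pretokenized.jsonl`, the helpers `validate_row` and `validate_file`, and the rule that all three columns have equal length are assumptions made for illustration (the usual convention for causal language modeling), not part of Axolotl's API.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = ("input_ids", "attention_mask", "labels")


def validate_row(row: dict) -> None:
    """Raise ValueError if a row is not a well-formed pre-tokenized example."""
    missing = [name for name in REQUIRED_COLUMNS if name not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(row[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")


def validate_file(path: Path) -> int:
    """Validate every non-empty line of a JSONL file and return the row count."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                validate_row(json.loads(line))
                count += 1
    return count


# Hypothetical example row; -100 marks a position excluded from the loss.
example = {
    "input_ids": [101, 2054, 2003, 102],
    "attention_mask": [1, 1, 1, 1],
    "labels": [-100, 2054, 2003, 102],
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "pretokenized.jsonl"
    sample_path.write_text(json.dumps(example) + "\n", encoding="utf-8")
    row_count = validate_file(sample_path)

print(row_count)
```

Running a check like this on the full file catches a missing column or a length mismatch before any GPU time is spent.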

    A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" implementations allow:

    1. a significant speed-up in particular when doing batched tokenization and
    2. additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).

    The base classes [PreTrainedTokenizer] and [PreTrainedTokenizerFast] implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository). They both rely on [~tokenization_utils_base.PreTrainedTokenizerBase] that contains the common methods, and [~tokenization_utils_base.SpecialTokensMixin].

    [PreTrainedTokenizer] and [PreTrainedTokenizerFast] thus implement the main methods for using all the tokenizers:

    • Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
    • Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...).
    • Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.

    [BatchEncoding] holds the output of the [~tokenization_utils_base.PreTrainedTokenizerBase]'s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
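To make the alignment idea concrete, here is a toy sketch in plain Python. It is not the Transformers implementation: `ToyEncoding` and `toy_encode` are hypothetical names, and the whitespace splitting stands in for a real sub-word tokenizer. The two methods only mimic the role played by `char_to_token` and `token_to_chars` on a [BatchEncoding] produced by a "Fast" tokenizer.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyEncoding:
    """Tokens plus the character span each token covers in the original text."""

    tokens: List[str]
    offsets: List[Tuple[int, int]]  # (start, end), end exclusive

    def char_to_token(self, char_index: int) -> Optional[int]:
        """Index of the token containing the character, or None for whitespace."""
        for token_index, (start, end) in enumerate(self.offsets):
            if start <= char_index < end:
                return token_index
        return None

    def token_to_chars(self, token_index: int) -> Tuple[int, int]:
        """Character span covered by the given token."""
        return self.offsets[token_index]


def toy_encode(text: str) -> ToyEncoding:
    """Split on whitespace while recording where each token came from."""
    matches = list(re.finditer(r"\S+", text))
    return ToyEncoding(
        tokens=[m.group() for m in matches],
        offsets=[m.span() for m in matches],
    )


encoding = toy_encode("Hello there world")
print(encoding.tokens)
print(encoding.char_to_token(7))   # character 7 is inside "there"
print(encoding.token_to_chars(2))  # span of "world"
```

Keeping the offsets next to the tokens is what makes both lookups possible; a pure python tokenizer that only returns ids has no such record to consult.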




    The [PreTrainedTokenizerFast] class depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers. Take a look at the "Using tokenizers from 🤗 Tokenizers" page to understand how this is done.




    [huggingface/transformers] docs/source/en/internal/
    Use tokenizers from 🤗 Tokenizers

    The [PreTrainedTokenizerFast] depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

    Before getting into the specifics, let's first start by creating a dummy tokenizer in a few lines:

    >>> from tokenizers import Tokenizer
    >>> from tokenizers.models import BPE
    >>> from tokenizers.trainers import BpeTrainer
    >>> from tokenizers.pre_tokenizers import Whitespace

    >>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    >>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    >>> tokenizer.pre_tokenizer = Whitespace()
    >>> files = [...]
    >>> tokenizer.train(files, trainer)

    We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future re-use.

    Loading directly from the tokenizer object

    Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [PreTrainedTokenizerFast] class allows for easy instantiation, by accepting the instantiated tokenizer object as an argument:

    >>> from transformers import PreTrainedTokenizerFast

    >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

    This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

    Loading from a JSON file

    In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer:

    >>> tokenizer.save("tokenizer.json")

    The path to which we saved this file can be passed to the [PreTrainedTokenizerFast] initialization method using the tokenizer_file parameter:

    >>> from transformers import PreTrainedTokenizerFast

    >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

    This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

    [openaccess-ai-collective/axolotl] docs/dataset-formats/pretraining.qmd
    title: Pre-training
    description: Data format for a pre-training completion task.
    order: 1
    For pretraining, there is no prompt template or roles. The only required field is `text`:
    ```{.json filename="data.jsonl"}
    {"text": "first row"}
    {"text": "second row"}
    ```
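A file in this shape can be produced with a few lines of standard-library Python. This is only a sketch: the `documents` list is placeholder content, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path

# Placeholder documents; in practice these would be your raw training texts.
documents = ["first row", "second row"]

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "data.jsonl"

    # One JSON object per line, each carrying only the required `text` field.
    with data_path.open("w", encoding="utf-8") as handle:
        for document in documents:
            handle.write(json.dumps({"text": document}) + "\n")

    # Read the file back to confirm that every line parses on its own.
    rows = [
        json.loads(line)
        for line in data_path.read_text(encoding="utf-8").splitlines()
    ]

print(rows)
```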


    Streaming is recommended for large datasets

    Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:

    pretraining_dataset: # hf path only


    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/
    def load_tokenized_prepared_datasets(
        tokenizer,
        cfg,
        default_dataset_prepared_path,
        split="train",
    ) -> Tuple[DatasetDict, List[Prompter]]:
        cfg_datasets = cfg.test_datasets if split == "test" else cfg.datasets
        tokenizer_name = cfg.tokenizer_config
        ds_hash = str(
            md5(
                (
                    str(cfg.sequence_len)
                    + "@"
                    + str(cfg.sample_packing)
                    + "@"
                    + str(cfg.eval_sample_packing)
                    + "@"
                    + str(cfg.group_by_length)
                    + "@"
                    + "|".join(
                        sorted(
                            [
                                f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"
                                for d in cfg_datasets
                            ]
                        )
                    )
                    + "|"
                    + tokenizer_name
                )
            )
        )
        prepared_ds_path = (
            Path(cfg.dataset_prepared_path) / ds_hash
            if cfg.dataset_prepared_path
            else Path(default_dataset_prepared_path) / ds_hash
        )
        dataset = None
        prompters = []
        use_auth_token = cfg.hf_use_auth_token
        try:
            if cfg.push_dataset_to_hub:
                dataset = load_dataset(
                    f"{cfg.push_dataset_to_hub}/{ds_hash}",
                    token=use_auth_token,
                )
                dataset = dataset[split]
        except Exception:  # pylint: disable=broad-except # nosec
            pass

        # pylint: disable=duplicate-code
        if dataset:
            ...
        elif (
            cfg.dataset_prepared_path
            and any(prepared_ds_path.glob("*"))
            and not cfg.is_preprocess
        ):
            LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
            dataset = load_from_disk(str(prepared_ds_path))
            LOG.info("Prepared dataset loaded from disk...")
        else:
            LOG.info(f"Unable to find prepared dataset in {prepared_ds_path}")
            LOG.info("Loading raw datasets...")
            if not cfg.is_preprocess:
                LOG.warning(
                    "Processing datasets during training can lead to VRAM instability. Please pre-process your dataset."
                )

            if cfg.seed:
                seed = cfg.seed
            else:
                LOG.info("No seed provided, using default seed of 42")
                seed = 42

            datasets = []

            def for_d_in_datasets(dataset_configs):
                for dataset in dataset_configs:
                    if dataset.name and isinstance(dataset.name, list):
                        for name in dataset.name:
                            yield DictDefault({**dataset, "name": name})
                    else:
                        yield dataset

            # pylint: disable=invalid-name
            for config_dataset in for_d_in_datasets(cfg_datasets):
                ds: Optional[Union[Dataset, DatasetDict]] = None
                ds_from_hub = False
                try:
                    load_dataset(
                        config_dataset.path,
                        name=config_dataset.name,
                        streaming=True,
                        token=use_auth_token,
                    )
                    ds_from_hub = True
                except (FileNotFoundError, ConnectionError, HFValidationError, ValueError):
                    pass

                ds_from_cloud = False
                storage_options = {}
                remote_file_system = None
                if config_dataset.path.startswith("s3://"):
                    try:
                        import aiobotocore.session  # type: ignore
                        import s3fs  # type: ignore
                    except ImportError as exc:
                        raise ImportError(
                            "s3:// paths require aiobotocore and s3fs to be installed"
                        ) from exc

                    # Takes credentials from ~/.aws/credentials for default profile
                    s3_session = aiobotocore.session.AioSession(profile="default")
                    storage_options = {"session": s3_session}
                    remote_file_system = s3fs.S3FileSystem(**storage_options)
                elif config_dataset.path.startswith(
                    "gs://"
                ) or config_dataset.path.startswith("gcs://"):
                    try:
                        import gcsfs  # type: ignore
                    except ImportError as exc:
                        raise ImportError(
                            "gs:// or gcs:// paths require gcsfs to be installed"
                        ) from exc

                    # gcsfs will use default credentials from the environment else anon
                    # storage_options = {"token": None}
                    remote_file_system = gcsfs.GCSFileSystem(**storage_options)
                # TODO: Figure out how to get auth creds passed
                # elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"):
                #     try:
                #         import adlfs
                #     except ImportError as exc:
                #         raise ImportError(
                #             "adl:// or abfs:// paths require adlfs to be installed"
                #         ) from exc
                #     # Gen 1
                #     storage_options = {
                #         "tenant_id": TENANT_ID,
                #         "client_id": CLIENT_ID,
                #         "client_secret": CLIENT_SECRET,
                #     }
                #     # Gen 2
                #     storage_options = {
                #         "account_name": ACCOUNT_NAME,
                #         "account_key": ACCOUNT_KEY,
                #     }
                #     remote_file_system = adlfs.AzureBlobFileSystem(**storage_options)
                try:
                    if remote_file_system and remote_file_system.exists(
                        config_dataset.path
                    ):
                        ds_from_cloud = True
                except (FileNotFoundError, ConnectionError):
                    pass

                # prefer local dataset, even if hub exists
                local_path = Path(config_dataset.path)
                if local_path.exists():
                    if local_path.is_dir():
                        if config_dataset.data_files:
                            ds_type = get_ds_type(config_dataset)
                            ds = load_dataset(
                                ds_type,
                                name=config_dataset.name,
                                data_files=config_dataset.data_files,
                                streaming=False,
                                split=None,
                            )
                        else:
                            ds = load_from_disk(config_dataset.path)
                    elif local_path.is_file():
                        ds_type = get_ds_type(config_dataset)
                        ds = load_dataset(
                            ds_type,
                            name=config_dataset.name,
                            data_files=config_dataset.path,
                            streaming=False,
                            split=None,
                        )
                    else:
                        raise ValueError(
                            "unhandled dataset load: local path exists, but is neither a directory or a file"
                        )
                elif ds_from_hub:
                    ds = load_dataset(
                        config_dataset.path,
                        name=config_dataset.name,
                        streaming=False,
                        data_files=config_dataset.data_files,
                        token=use_auth_token,
                    )
                elif ds_from_cloud and remote_file_system:
                    if remote_file_system.isdir(config_dataset.path):
                        ds = load_from_disk(
                            config_dataset.path,
                            storage_options=storage_options,
                        )
                    elif remote_file_system.isfile(config_dataset.path):
                        ds_type = get_ds_type(config_dataset)
                        ds = load_dataset(
                            ds_type,
                            name=config_dataset.name,
                            data_files=config_dataset.path,
                            streaming=False,
                            split=None,
                            storage_options=storage_options,
                        )
                elif config_dataset.path.startswith("https://"):
                    ds_type = get_ds_type(config_dataset)
                    ds = load_dataset(
                        ds_type,
                        name=config_dataset.name,
                        data_files=config_dataset.path,
                        streaming=False,
                        split=None,
                        storage_options=storage_options,
                    )
                else:
                    if isinstance(config_dataset.data_files, str):
                        fp = hf_hub_download(
                            repo_id=config_dataset.path,
                            repo_type="dataset",
                            filename=config_dataset.data_files,
                        )
                    elif isinstance(config_dataset.data_files, list):
                        fp = []
                        for file in config_dataset.data_files:
                            fp.append(
                                hf_hub_download(
                                    repo_id=config_dataset.path,
                                    repo_type="dataset",
                                    filename=file,
                                )
                            )
                    else:
                        raise ValueError(
                            "data_files must be either a string or list of strings"
                        )
                    ds = load_dataset(
                        "json",
                        name=config_dataset.name,
                        data_files=fp,
                        streaming=False,
                        split=None,
                    )
                if not ds:
                    raise ValueError("unhandled dataset load")

                d_base_type = d_prompt_style = None
                d_type = config_dataset.type
                if isinstance(d_type, str):
                    d_type_split = d_type.split(":")
                    d_base_type = d_type_split[0]
                    d_prompt_style = d_type_split[1] if len(d_type_split) > 1 else None

                if isinstance(ds, DatasetDict):
                    if config_dataset.split and config_dataset.split in ds:
                        ds = ds[config_dataset.split]
                    elif split in ds:
                        ds = ds[split]
                    else:
                        raise ValueError(
                            f"no {split} split found for dataset {config_dataset.path}, you may specify a split with 'split: `"
                        )

                # support for using a subset of the data
                if config_dataset.shards:
                    shards_idx = config_dataset.get("shards_idx", 0)
                    ds = ds.shuffle(seed=seed).shard(
                        num_shards=config_dataset.shards, index=shards_idx
                    )

                dataset_wrapper, dataset_prompter = get_dataset_wrapper(
                    config_dataset=config_dataset,
                    tokenizer=tokenizer,
                    cfg=cfg,
                    dataset=ds,
                    d_base_type=d_base_type,
                    d_prompt_style=d_prompt_style,
                )
                datasets.append(dataset_wrapper)
                prompters.append(dataset_prompter)

            LOG.info("merging datasets")
            dataset = concatenate_datasets(datasets)

            if len(datasets) > 1:
                if cfg.shuffle_merged_datasets:
                    LOG.debug("shuffle merged datasets")
                    dataset = dataset.shuffle(seed=seed)
                else:
                    LOG.debug("NOT shuffling merged datasets")

            dataset, _ = process_datasets_for_packing(cfg, dataset, None)

            if cfg.local_rank == 0:
                LOG.info(f"Saving merged prepared dataset to disk... {prepared_ds_path}")
                dataset.save_to_disk(str(prepared_ds_path))
                if cfg.push_dataset_to_hub:
                    LOG.info(
                        f"Saving merged prepared dataset with push_to_hub... {cfg.push_dataset_to_hub}/{ds_hash}"
                    )
                    dataset.push_to_hub(
                        f"{cfg.push_dataset_to_hub}/{ds_hash}", private=True
                    )

        return dataset, prompters
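The cache key computed at the top of this function can be reproduced in isolation, which helps when working out why a prepared dataset was or was not reused. The sketch below is an approximation under stated assumptions: `DatasetEntry` and `prepared_dataset_hash` are hypothetical names, and `hashlib.md5` over the UTF-8 encoded string stands in for the `md5` helper the function calls, whose exact behavior is not shown in this excerpt.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for one item of the `datasets:` config list."""

    path: str
    type: Optional[str] = None
    shards: Optional[int] = None
    conversation: Optional[str] = None
    split: Optional[str] = None


def prepared_dataset_hash(
    sequence_len: int,
    sample_packing: Optional[bool],
    eval_sample_packing: Optional[bool],
    group_by_length: Optional[bool],
    datasets: List[DatasetEntry],
    tokenizer_name: str,
) -> str:
    """Join the same fields, in the same order, as the function above."""
    dataset_part = "|".join(
        sorted(
            f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"
            for d in datasets
        )
    )
    settings = (sequence_len, sample_packing, eval_sample_packing, group_by_length)
    key = "@".join(str(value) for value in settings)
    key += "@" + dataset_part + "|" + tokenizer_name
    # Assumption: the real helper hashes the UTF-8 bytes and returns hex digits.
    return hashlib.md5(key.encode("utf-8")).hexdigest()


entries = [DatasetEntry(path="path_to_your_pre_tokenized_data")]
base = prepared_dataset_hash(2048, True, None, False, entries, "my-tokenizer")
same = prepared_dataset_hash(2048, True, None, False, entries, "my-tokenizer")
longer = prepared_dataset_hash(4096, True, None, False, entries, "my-tokenizer")

print(base == same)    # identical settings map to the same cache directory
print(base == longer)  # changing sequence_len selects a different one
```

Note that a dataset entry with no `type:` simply contributes the string `None` to the key, so pre-tokenized datasets are cached like any other.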
    [openaccess-ai-collective/axolotl] docs/dataset_preprocessing.qmd
    title: Dataset Preprocessing
    description: How datasets are processed
    Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
    the [dataset format](../dataset-formats/) and prompt strategies to:
     - parse the dataset based on the *dataset format*
     - transform the dataset to how you would interact with the model based on the *prompt strategy*
     - tokenize the dataset based on the configured model & tokenizer
     - shuffle and merge multiple datasets together if using more than one
    The processing of the datasets can happen in one of two ways:
    1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
    2. When training is started
    What are the benefits of pre-processing? When training interactively or for sweeps
    (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
    slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
    training parameters so that it will intelligently pull from its cache when possible.
    The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
    YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
    If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
    default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
    setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
    data is in the cache.
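The rule described in this paragraph can be written as a small predicate. This is a sketch modeled on the branch condition in `load_tokenized_prepared_datasets`; the function name `should_load_from_cache` and the directory name `abc123` are made up for the example, and the extra `is_dir()` guard is an addition for clarity.

```python
import tempfile
from pathlib import Path
from typing import Optional


def should_load_from_cache(
    dataset_prepared_path: Optional[str],
    ds_hash: str,
    is_preprocess: bool,
) -> bool:
    """True only if a path is set explicitly, its hash directory has files,
    and we are not currently running the preprocess command."""
    if not dataset_prepared_path:
        # Left empty: data is written to the default path but never read back.
        return False
    prepared_ds_path = Path(dataset_prepared_path) / ds_hash
    has_files = prepared_ds_path.is_dir() and any(prepared_ds_path.glob("*"))
    return has_files and not is_preprocess


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "abc123"  # hypothetical hash directory
    cache_dir.mkdir()
    empty_cache = should_load_from_cache(tmp, "abc123", is_preprocess=False)

    (cache_dir / "data.arrow").write_bytes(b"")
    filled_cache = should_load_from_cache(tmp, "abc123", is_preprocess=False)
    while_preprocessing = should_load_from_cache(tmp, "abc123", is_preprocess=True)
    unset_path = should_load_from_cache(None, "abc123", is_preprocess=False)

print(empty_cache, filled_cache, while_preprocessing, unset_path)
```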
    What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
    prompt template. Because the trainer cannot readily detect these changes, we cannot change the
    calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
    and change your prompt templating logic, it may not pick up the changes you made and you will be
    training over the old prompt.
    [openaccess-ai-collective/axolotl] src/axolotl/utils/data/
    def prepare_dataset(cfg, tokenizer):
        prompters = []
        if not cfg.pretraining_dataset:
            with zero_first(is_main_process()):
                if cfg.test_datasets:
                    train_dataset, _, prompters = load_prepare_datasets(
                        tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="train"
                    )
                    _, eval_dataset, _ = load_prepare_datasets(
                        tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="test"
                    )
                else:
                    train_dataset, eval_dataset, prompters = load_prepare_datasets(
                        tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH
                    )
        else:
            path = cfg.pretraining_dataset
            split = "train"
            name = None
            if isinstance(cfg.pretraining_dataset, list) and isinstance(
                cfg.pretraining_dataset[0], dict
            ):
                path = cfg.pretraining_dataset[0]["path"]
                name = cfg.pretraining_dataset[0]["name"]
                if "split" in cfg.pretraining_dataset[0]:
                    split = cfg.pretraining_dataset[0]["split"]

            ds_wrapper_partial = functools.partial(
                get_dataset_wrapper,
                cfg.pretraining_dataset[0],
                tokenizer,
                cfg,
                cfg.pretraining_dataset[0]["type"] or "pretrain",
            )

            train_dataset = wrap_pretraining_dataset(
                load_dataset(path, streaming=True, split=split, name=name),
                tokenizer,
                cfg,
                ds_wrapper_partial,
                max_tokens=cfg.sequence_len,
                batch_size=cfg.micro_batch_size,
                seed=cfg.seed or 42,
                buffer_size=cfg.pretrain_multipack_buffer_size or 10_000,
            )
            # train_dataset = train_dataset.with_format("torch")
            eval_dataset = None
            return train_dataset, eval_dataset, cfg.max_steps, prompters

        if eval_dataset and cfg.sample_packing and cfg.eval_sample_packing is not False:
            total_eval_steps = calculate_total_num_steps(cfg, eval_dataset, update=False)
            if total_eval_steps == 0:
                raise ValueError(
                    "eval dataset split is too small for sample_packing. You should set `eval_sample_packing: False`. "
                )

        if cfg.max_steps:
            total_num_steps = min(
                calculate_total_num_steps(cfg, train_dataset), cfg.max_steps
            )
            LOG.info(f"Maximum number of steps set at {total_num_steps}")
        else:
            total_num_steps = calculate_total_num_steps(cfg, train_dataset)
        return train_dataset, eval_dataset, total_num_steps, prompters
    [openaccess-ai-collective/axolotl] docs/input_output.qmd
    title: Template-free prompt construction
    description: "Template-free prompt construction with the `input_output` format"
    <!-- TOC -->
    - [Background](#background)
        - [Masking Inputs](#masking-inputs)
        - [You may not want prompt templates](#you-may-not-want-prompt-templates)
        - [The `input_output` format](#the-input_output-format)
    - [Usage](#usage)
        - [1. Prepare Data](#1-prepare-data)
        - [2. Use `type: input_output`](#2-use-type-input_output)
        - [3. Check the prompts](#3-check-the-prompts)
    <!-- /TOC -->
    <a id="markdown-background" name="background"></a>
    ## Background
    <a id="markdown-masking-inputs" name="masking-inputs"></a>
    ### Masking Inputs
    One of the most popular features of axolotl is
    setting the following configuration value:
    train_on_inputs: false

    If you declare a dataset format such as alpaca or chatml, axolotl knows what is an input (i.e. human) vs. an output (i.e. the assistant) and masks the input labels so that your model can focus on predicting the outputs only.
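As a rough illustration of what masking the input labels means, the sketch below replaces the labels of input tokens with -100, the value that marks a token as ignored for training. The function `mask_input_labels`, the `is_input` flags, and the token ids are made up for the example and do not reflect axolotl's internal code.

```python
from typing import List

IGNORE_INDEX = -100  # label value that marks a token as ignored for training


def mask_input_labels(
    input_ids: List[int],
    is_input: List[bool],
    train_on_inputs: bool = False,
) -> List[int]:
    """Return labels where input (prompt) tokens are replaced by IGNORE_INDEX."""
    if len(input_ids) != len(is_input):
        raise ValueError("input_ids and is_input must have the same length")
    if train_on_inputs:
        # Nothing is masked: the model also learns to predict the inputs.
        return list(input_ids)
    return [
        IGNORE_INDEX if flag else token_id
        for token_id, flag in zip(input_ids, is_input)
    ]


# Hypothetical example: the first three tokens belong to the human turn.
input_ids = [1, 22557, 13, 1179, 17664, 2]
is_input = [True, True, True, False, False, False]
labels = mask_input_labels(input_ids, is_input)
print(labels)
```

With `train_on_inputs: true` the labels would simply equal `input_ids`, which is what the `train_on_inputs=True` branch returns.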

    <a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>

    ### You may not want prompt templates

    However, there are many situations where you don't want to use one of these formats or templates. This is because they can:

    • Add unnecessary boilerplate to your prompts.
    • Create artifacts like special delimiters <|im_start|> that can quickly become footguns if you don't include them correctly at inference time.
    • Enforce a chat interface when you do not want one. Sometimes you just want to fine-tune a model to a very specific task and do NOT want multi-turn conversations, roles, etc.
    • Limit you to only certain roles that the template allows.

    <a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>

    ### The `input_output` format

    You can construct your prompts without a template by using the input_output format, by setting type: input_output in your configuration file like this:


    train_on_inputs: false # Mask segments of your data
    datasets:
      - path: output.jsonl
        type: input_output  # use template free prompt construction

    Unlike type: completion, which is also template-free, type: input_output allows you to mask segments of your text. More details on how this works are described below.

    <a id="markdown-usage" name="usage"></a>


    ## Usage

    This is how you can use the input_output format:

    <a id="markdown-1-prepare-data" name="1-prepare-data"></a>

    ### 1. Prepare Data

    To use the input_output format, collect your data in the following format into a jsonl file (below is the first row from the file output.jsonl, pretty printed):

    $ head -n1 output.jsonl | python -m json.tool

    :::{.cell-output .cell-output-stdout}
    {
        "segments": [
            {
                "label": true,
                "text": "<s>Hello\n"
            },
            {
                "label": true,
                "text": "hi there!. "
            },
            {
                "label": false,
                "text": "goodbye "
            },
            {
                "label": true,
                "text": "farewell</s>"
            }
        ]
    }
    :::

    Set label:false when you want to mask a segment of text so that the model isn't trained on it. Some things to keep in mind:


    1. EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl concatenates all the segments as-is. The tokenizer doesn't add anything additional. Notice how I added spaces, newlines, <s> (BOS), and </s> (EOS) myself.
    2. Make sure you check the materialized output to validate that the prompt is getting assembled how you like.
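A row like the one above can be assembled in Python before it is written to output.jsonl. The sketch below uses only the standard library and the segment values from the example; it also shows the plain concatenation that point 1 describes, with nothing added between segments.

```python
import json

segments = [
    {"label": True, "text": "<s>Hello\n"},
    {"label": True, "text": "hi there!. "},
    {"label": False, "text": "goodbye "},
    {"label": True, "text": "farewell</s>"},
]

# One JSON object per line, which is what a row of output.jsonl looks like.
jsonl_line = json.dumps({"segments": segments})

# Segments are joined as-is: no separators, BOS, or EOS are inserted for you.
full_text = "".join(segment["text"] for segment in segments)

# Text that is excluded from training because its label is false.
masked_text = "".join(s["text"] for s in segments if not s["label"])

print(repr(full_text))
print(repr(masked_text))
```

Printing with `repr` makes the spaces and the newline visible, which is useful because those characters are entirely under your control in this format.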

    <a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>

    ### 2. Use `type: input_output`

    Let's materialize data with our output.jsonl file by setting type: input_output in our axolotl config:

    # training_config.yaml
    base_model: mistralai/Mistral-7B-v0.1
    data_seed: 49
    seed: 49

    datasets:
      - path: output.jsonl
        type: input_output
    val_set_size: 0.1

    sequence_len: 896
    sample_packing: false

    micro_batch_size: 2
    gradient_accumulation_steps: 3
    eval_batch_size: 2
    num_epochs: 1
    learning_rate: 0.0002

    train_on_inputs: false
    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"

    You can use the following command to materialize your data. The --debug flag will print the tokens, along with the labels so you can verify that the correct items are being ignored:

    $ python -m axolotl.cli.preprocess training_config.yaml --debug
    ...
    [2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557) (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)

    The format is decoded_token(label, token_id), for example, <s>(1, 1) means that the token is <s>, the label is 1 and the token_id is 1. When the label is -100 then that token is ignored for training.
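For longer examples it can be easier to parse this output than to read it by eye. The following sketch uses a regular expression to pull out each `decoded_token(label, token_id)` triple and list the masked ones. The pattern and the names `LabeledToken` and `parse_debug_tokens` are assumptions for illustration; the pattern expects entries separated by single spaces and decoded tokens that contain no whitespace, so whitespace-only tokens come back as empty strings.

```python
import re
from typing import List, NamedTuple


class LabeledToken(NamedTuple):
    token: str
    label: int
    token_id: int


# decoded_token(label, token_id); the token part may be empty for whitespace.
TOKEN_PATTERN = re.compile(r"(\S*?)\((-?\d+), (-?\d+)\)")


def parse_debug_tokens(text: str) -> List[LabeledToken]:
    """Extract every (token, label, token_id) triple from a debug line."""
    return [
        LabeledToken(token, int(label), int(token_id))
        for token, label, token_id in TOKEN_PATTERN.findall(text)
    ]


debug_output = (
    "<s>(1, 1) Hello(22557, 22557) (13, 13) hi(12014, 12014) there(736, 736) "
    "!(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) "
    "bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)"
)

parsed = parse_debug_tokens(debug_output)
masked = [entry for entry in parsed if entry.label == -100]
print(len(parsed), [entry.token_id for entry in masked])
```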

    <a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>

    ### 3. Check the prompts

    Here is another way to check the materialized output:

    from transformers import AutoTokenizer
    from datasets import load_from_disk
    import yaml

    # `!ls` is IPython/Jupyter shell syntax, so run this cell in a notebook
    directory = !ls last_run_prepared/
    with open('training_config.yaml', 'r') as f:
        cfg = yaml.safe_load(f)
    model_id = cfg['base_model']
    tok = AutoTokenizer.from_pretrained(model_id)
    ds = load_from_disk(f'last_run_prepared/{directory[0]}/')

    >>> row = ds[0]
    >>> print(tok.decode(row['input_ids']))
    <s> Hello
    hi there!. goodbye farewell</s>

    We can check that the right tokens are ignored by comparing the labels to each token:

    import pandas as pd
    pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
                  for i, l in zip(row['input_ids'], row['labels'])])

    |    | token | label | id    |
    |----|-------|-------|-------|
    | 0  | <s>   | 1     | 1     |
    | 1  | Hello | 22557 | 22557 |
    | 2  | \n    | 13    | 13    |
    | 3  | hi    | 12014 | 12014 |
    | 4  | there | 736   | 736   |
    | 5  | !     | 28808 | 28808 |
    | 6  | .     | 28723 | 28723 |
    | 7  |       | 28705 | 28705 |
    | 8  | good  | -100  | 1179  |
    | 9  | bye   | -100  | 17664 |
    | 10 |       | -100  | 28705 |
    | 11 | fare  | 19111 | 19111 |
    | 12 | well  | 5458  | 5458  |
    | 13 | </s>  | 2     | 2     |

    If we look at the input data, the above table seems correct! (The jsonl version is repeated below for reference):

    $ head -n1 output.jsonl | python -m json.tool

    :::{.cell-output .cell-output-stdout}
    {
        "segments": [
            {
                "label": true,
                "text": "<s>Hello\n"
            },
            {
                "label": true,
                "text": "hi there!. "
            },
            {
                "label": false,
                "text": "goodbye "
            },
            {
                "label": true,
                "text": "farewell</s>"
            }
        ]
    }
    :::
