To use pre-tokenized data with Axolotl, follow these steps:
1. **Ensure your data format**: Your dataset must have columns named exactly `input_ids`, `attention_mask`, and `labels`.
2. **Configuration**: Do not specify a `type:` in your Axolotl configuration file. This indicates that you are using a custom dataset format, in this case pre-tokenized data.
Here's an example of how you might set up your configuration file (`config.yml`):

```{.yaml filename="config.yml"}
datasets:
  - path: path_to_your_pre_tokenized_data
```
3. **Loading the dataset**: Axolotl will expect the dataset to be in the format mentioned above. You can load your dataset using the standard methods provided by the `datasets` library, or any other method you prefer, as long as the final format matches.
4. **Using the dataset**: Once loaded, you can directly use this dataset for training or evaluation without any further tokenization steps, as the data is already in the required format.
Here's a simple code example illustrating how you might load and use a pre-tokenized dataset:
```python
from datasets import load_dataset

# Assuming your dataset is in a format recognized by the `datasets` library
dataset = load_dataset('path/to/dataset', split='train')

# Example of what the dataset should look like
print(dataset[0])
# Output: {'input_ids': [101, 2054, 2003, 102, ...], 'attention_mask': [1, 1, 1, 1, ...], 'labels': [1]}

# Now, you can directly use this dataset for training or evaluation with Axolotl
```
Remember, the key here is that your dataset must already be tokenized and structured correctly. This means you should have already converted your text into `input_ids` and generated the appropriate `attention_mask` and `labels` for each example in your dataset.
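To make that requirement concrete, here is a minimal sketch of how such a dataset could be produced ahead of time. It assumes a causal-LM setup where `labels` simply mirror `input_ids`; the model name, file paths, and `max_length` are placeholders, not values Axolotl requires.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_base_model")               # placeholder model id
raw = load_dataset("json", data_files="raw_text.jsonl", split="train")     # placeholder path

def pretokenize(example):
    enc = tokenizer(example["text"], truncation=True, max_length=2048)
    # For a plain causal-LM objective the labels are a copy of input_ids;
    # positions you do not want to train on would be set to -100 instead.
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = raw.map(pretokenize, remove_columns=raw.column_names)
tokenized.save_to_disk("path_to_your_pre_tokenized_data")
```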
---
title: Custom Pre-Tokenized Dataset
description: How to use a custom pre-tokenized dataset.
order: 5
---
- Do not pass a `type:` in your axolotl config.
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
```{.yaml filename="config.yml"}
- path: ...
```
```python
def tokenize_function(examples):
    # Remove empty lines
    examples = [line for line in examples if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples,
        return_special_tokens_mask=True,
        padding=padding,
        truncation=True,
        max_length=data_args.max_seq_length,
    )
```
```python
def tokenize(example):
    output = {}
    output["input_ids"] = tokenizer(example["content"], truncation=False)["input_ids"]
    output["ratio_char_token"] = len(example["content"]) / len(output["input_ids"])
    return output
```
```python
def tokenize_function(examples):
    # Remove empty lines
    examples[text_column_name] = [
        line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples[text_column_name],
        padding=padding,
        truncation=True,
        max_length=max_seq_length,
        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
        # receives the `special_tokens_mask`.
        return_special_tokens_mask=True,
    )
```
```python
def prepare_data(examples):
    # Remove pairs where at least one record is none
    inputs = examples[INPUT_COLUMN]
    targets = examples[TARGET_COLUMN]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
    labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
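A batched preprocessing function like the one above is typically applied via `datasets.Dataset.map`. The snippet below is only an illustrative sketch; `raw_ds`, `INPUT_COLUMN`, and `TARGET_COLUMN` are assumed to be defined elsewhere.

```python
# Hypothetical usage of prepare_data over a datasets.Dataset named raw_ds
tokenized_ds = raw_ds.map(
    prepare_data,
    batched=True,
    remove_columns=raw_ds.column_names,  # keep only input_ids / attention_mask / labels
)
print(tokenized_ds[0].keys())
```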
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" implementations allow a significant speed-up, in particular when doing batched tokenization, as well as additional methods to map between the original string (characters and words) and the token space.
The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository). They both rely on [`~tokenization_utils_base.PreTrainedTokenizerBase`], which contains the common methods, and [`~tokenization_utils_base.SpecialTokensMixin`].
[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] thus implement the main methods for using all the tokenizers:
[`BatchEncoding`] holds the output of the [`~tokenization_utils_base.PreTrainedTokenizerBase`]'s encoding methods (`__call__`, `encode_plus` and `batch_encode_plus`) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (`input_ids`, `attention_mask`, ...). When the tokenizer is a "Fast" tokenizer (i.e., backed by the HuggingFace 🤗 Tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
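For example, with a fast tokenizer the returned [`BatchEncoding`] exposes these alignment helpers directly. The snippet below is illustrative only; `bert-base-uncased` is just an arbitrary checkpoint that ships a fast tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Hello world!")      # returns a BatchEncoding

print(encoding["input_ids"])              # dict-style access to the model inputs
print(encoding.tokens())                  # the token strings
print(encoding.word_ids())                # word index per token (None for special tokens)
print(encoding.char_to_token(6))          # token covering the character at position 6 ("w")
print(encoding.token_to_chars(2))         # character span covered by token 2
```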
[[autodoc]] PreTrainedTokenizer - call - add_tokens - add_special_tokens - apply_chat_template - batch_decode - decode - encode - push_to_hub - all
The [`PreTrainedTokenizerFast`] depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers. Take a look at the Using tokenizers from 🤗 Tokenizers page to understand how this is done.
[[autodoc]] PreTrainedTokenizerFast - call - add_tokens - add_special_tokens - apply_chat_template - batch_decode - decode - encode - push_to_hub - all
[[autodoc]] BatchEncoding
```python
ds = ds.map(
    tokenize,
    num_proc=args.num_workers,
    remove_columns=[
        "repo_name", "path", "copies", "size", "content", "license", "hash",
        "line_mean", "line_max", "alpha_frac", "autogenerated",
    ],
)
print(f"Dataset tokenized in {time.time()-t_start:.2f}s")

t_start = time.time()
ds.push_to_hub(args.tokenized_data_repo)
print(f"Data pushed to the hub in {time.time()-t_start:.2f}s")
```
The [`PreTrainedTokenizerFast`] depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.
Before getting into the specifics, let's first start by creating a dummy tokenizer in a few lines:
```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)
```
We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future re-use.
Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [`PreTrainedTokenizerFast`] class allows for easy instantiation, by accepting the instantiated tokenizer object as an argument:
```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```
This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.
In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer:
```python
>>> tokenizer.save("tokenizer.json")
```
The path to which we saved this file can be passed to the [`PreTrainedTokenizerFast`] initialization method using the `tokenizer_file` parameter:
```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```
This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.
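As a quick sanity check (illustrative only, reusing the tokenizer trained above), the loaded object supports the usual encode/decode calls:

```python
enc = fast_tokenizer("Hello, world!")
print(enc["input_ids"])
print(fast_tokenizer.decode(enc["input_ids"]))
# Note: padding/truncation require a pad token to be set first, e.g.
# fast_tokenizer.pad_token = "[PAD]"
```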
```python
def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs
```
```python
# data preprocessing
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
```
```python
def preprocess(samples):
    batch = []
    for conversation in samples["messages"]:
        batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"content": batch}
```
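A batched `map` call is one way such a function might be applied; this sketch assumes `ds` has a `messages` column holding chat-format conversations.

```python
# Hypothetical usage: turn chat conversations into plain text via the chat template
ds_with_text = ds.map(preprocess, batched=True, remove_columns=ds.column_names)
print(ds_with_text[0]["content"])
```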
```python
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])
```
def wrap_pretraining_dataset( dataset, tokenizer, cfg, ds_wrapper_fn, max_tokens=2048, batch_size=1, seed=42, buffer_size=10_000, ): if cfg.sample_packing: collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq( tokenizer, return_tensors="pt", padding=True, pad_to_multiple_of=max_tokens * batch_size, multipack_attn=cfg.pretrain_multipack_attn, ) encode = functools.partial( encode_packed_pretraining, collate_fn, ds_wrapper_fn, max_seq_length=max_tokens, batch_size=batch_size, multipack_attn=cfg.pretrain_multipack_attn, ) # set this to 1 so downstream data_loader doesn't try to increase the batch again cfg.micro_batch_size = 1 else: encode = functools.partial(encode_pretraining, tokenizer, max_tokens) if cfg.shuffle_merged_datasets: dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size) else: LOG.debug("NOT shuffling merged pretraining datasets") # remove all the existing columns after mapping since they end up having # a different length than the encoded/tokenized column # this is empty during streaming/pretraining remove_columns = [] if dataset.features is None: for first_row in dataset: remove_columns = first_row.keys() break else: remove_columns = dataset.features.keys() dataset = dataset.map( encode, batched=True, batch_size=buffer_size, # input_columns="text", remove_columns=remove_columns, ) return dataset
```python
# import torch
# from peft import PeftModel, PeftConfig
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
#
# peft_model_id = "ybelkada/opt-6.7b-lora"
# config = PeftConfig.from_pretrained(peft_model_id)
# model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map='auto')
# tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
#
# ## Load the Lora model
# model = PeftModel.from_pretrained(model, peft_model_id)
#
# """## Inference
#
# You can then directly use the trained model or the model that you have loaded from the 🤗 Hub for inference as you would do it usually in `transformers`.
# """
# batch = tokenizer("Two things are infinite: ", return_tensors="pt")
```
```python
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(
        inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
```

```python
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
```
def load_prepare_datasets( tokenizer: PreTrainedTokenizerBase, cfg, default_dataset_prepared_path, split="train", ) -> Tuple[Dataset, Dataset, List[Prompter]]: dataset, prompters = load_tokenized_prepared_datasets( tokenizer, cfg, default_dataset_prepared_path, split=split ) if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None: LOG.info( f"Using index #{cfg.dataset_shard_idx} of {cfg.dataset_shard_num} shards" ) dataset = dataset.shard( num_shards=cfg.dataset_shard_num, index=cfg.dataset_shard_idx, ) if split == "train" and cfg.val_set_size: # ensure we end up with the same fingerprint by doing rank0 first and being able to cache to_hash_train = ( dataset._fingerprint # pylint: disable=protected-access + "|" + str(cfg.val_set_size) + "|" + "train" + "|" + str(cfg.seed or 42) ) to_hash_test = ( dataset._fingerprint # pylint: disable=protected-access + "|" + str(cfg.val_set_size) + "|" + "test" + "|" + str(cfg.seed or 42) ) train_fingerprint = md5(to_hash_train) test_fingerprint = md5(to_hash_test) dataset = dataset.train_test_split( test_size=cfg.val_set_size, shuffle=False, seed=cfg.seed or 42, train_new_fingerprint=train_fingerprint, test_new_fingerprint=test_fingerprint, ) train_dataset = dataset["train"] eval_dataset = dataset["test"] elif split == "test": train_dataset = None eval_dataset = dataset else: train_dataset = dataset eval_dataset = None return train_dataset, eval_dataset, prompters
---
title: Pre-training
description: Data format for a pre-training completion task.
order: 1
---
For pretraining, there is no prompt template or roles. The only required field is `text`:
```{.json filename="data.jsonl"}
{"text": "first row"}
{"text": "second row"}
...
```
:::{.callout-note}
Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
```yaml
pretraining_dataset: # hf path only
...
```
:::
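Under the hood this corresponds to streaming mode in the `datasets` library, where rows are fetched lazily instead of being materialized in memory. A rough sketch (the dataset id and the `text` column are placeholders):

```python
from datasets import load_dataset

ds = load_dataset("hf_user/large_pretraining_corpus", split="train", streaming=True)
for example in ds.take(2):   # IterableDataset: rows are pulled lazily, not loaded into memory
    print(example["text"][:80])
```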
def load_tokenized_prepared_datasets( tokenizer, cfg, default_dataset_prepared_path, split="train", ) -> Tuple[DatasetDict, List[Prompter]]: cfg_datasets = cfg.test_datasets if split == "test" else cfg.datasets tokenizer_name = cfg.tokenizer_config ds_hash = str( md5( ( str(cfg.sequence_len) + "@" + str(cfg.sample_packing) + "@" + str(cfg.eval_sample_packing) + "@" + str(cfg.group_by_length) + "@" + "|".join( sorted( [ f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}" for d in cfg_datasets ] ) ) + "|" + tokenizer_name ) ) ) prepared_ds_path = ( Path(cfg.dataset_prepared_path) / ds_hash if cfg.dataset_prepared_path else Path(default_dataset_prepared_path) / ds_hash ) dataset = None prompters = [] use_auth_token = cfg.hf_use_auth_token try: if cfg.push_dataset_to_hub: dataset = load_dataset( f"{cfg.push_dataset_to_hub}/{ds_hash}", token=use_auth_token, ) dataset = dataset[split] except Exception: # pylint: disable=broad-except # nosec pass # pylint: disable=duplicate-code if dataset: ... elif ( cfg.dataset_prepared_path and any(prepared_ds_path.glob("*")) and not cfg.is_preprocess ): LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...") dataset = load_from_disk(str(prepared_ds_path)) LOG.info("Prepared dataset loaded from disk...") else: LOG.info(f"Unable to find prepared dataset in {prepared_ds_path}") LOG.info("Loading raw datasets...") if not cfg.is_preprocess: LOG.warning( "Processing datasets during training can lead to VRAM instability. Please pre-process your dataset." ) if cfg.seed: seed = cfg.seed else: LOG.info("No seed provided, using default seed of 42") seed = 42 datasets = [] def for_d_in_datasets(dataset_configs): for dataset in dataset_configs: if dataset.name and isinstance(dataset.name, list): for name in dataset.name: yield DictDefault({**dataset, "name": name}) else: yield dataset # pylint: disable=invalid-name for config_dataset in for_d_in_datasets(cfg_datasets): ds: Optional[Union[Dataset, DatasetDict]] = None ds_from_hub = False try: load_dataset( config_dataset.path, name=config_dataset.name, streaming=True, token=use_auth_token, ) ds_from_hub = True except (FileNotFoundError, ConnectionError, HFValidationError, ValueError): pass ds_from_cloud = False storage_options = {} remote_file_system = None if config_dataset.path.startswith("s3://"): try: import aiobotocore.session # type: ignore import s3fs # type: ignore except ImportError as exc: raise ImportError( "s3:// paths require aiobotocore and s3fs to be installed" ) from exc # Takes credentials from ~/.aws/credentials for default profile s3_session = aiobotocore.session.AioSession(profile="default") storage_options = {"session": s3_session} remote_file_system = s3fs.S3FileSystem(**storage_options) elif config_dataset.path.startswith( "gs://" ) or config_dataset.path.startswith("gcs://"): try: import gcsfs # type: ignore except ImportError as exc: raise ImportError( "gs:// or gcs:// paths require gcsfs to be installed" ) from exc # gcsfs will use default credentials from the environment else anon # https://gcsfs.readthedocs.io/en/latest/#credentials storage_options = {"token": None} remote_file_system = gcsfs.GCSFileSystem(**storage_options) # TODO: Figure out how to get auth creds passed # elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"): # try: # import adlfs # except ImportError as exc: # raise ImportError( # "adl:// or abfs:// paths require adlfs to be installed" # ) from exc # # Gen 1 # storage_options = { # "tenant_id": TENANT_ID, # 
"client_id": CLIENT_ID, # "client_secret": CLIENT_SECRET, # } # # Gen 2 # storage_options = { # "account_name": ACCOUNT_NAME, # "account_key": ACCOUNT_KEY, # } # remote_file_system = adlfs.AzureBlobFileSystem(**storage_options) try: if remote_file_system and remote_file_system.exists( config_dataset.path ): ds_from_cloud = True except (FileNotFoundError, ConnectionError): pass # prefer local dataset, even if hub exists local_path = Path(config_dataset.path) if local_path.exists(): if local_path.is_dir(): if config_dataset.data_files: ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.data_files, streaming=False, split=None, ) else: ds = load_from_disk(config_dataset.path) elif local_path.is_file(): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, ) else: raise ValueError( "unhandled dataset load: local path exists, but is neither a directory or a file" ) elif ds_from_hub: ds = load_dataset( config_dataset.path, name=config_dataset.name, streaming=False, data_files=config_dataset.data_files, token=use_auth_token, ) elif ds_from_cloud and remote_file_system: if remote_file_system.isdir(config_dataset.path): ds = load_from_disk( config_dataset.path, storage_options=storage_options, ) elif remote_file_system.isfile(config_dataset.path): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, storage_options=storage_options, ) elif config_dataset.path.startswith("https://"): ds_type = get_ds_type(config_dataset) ds = load_dataset( ds_type, name=config_dataset.name, data_files=config_dataset.path, streaming=False, split=None, storage_options=storage_options, ) else: if isinstance(config_dataset.data_files, str): fp = hf_hub_download( repo_id=config_dataset.path, repo_type="dataset", filename=config_dataset.data_files, ) elif isinstance(config_dataset.data_files, list): fp = [] for file in config_dataset.data_files: fp.append( hf_hub_download( repo_id=config_dataset.path, repo_type="dataset", filename=file, ) ) else: raise ValueError( "data_files must be either a string or list of strings" ) ds = load_dataset( "json", name=config_dataset.name, data_files=fp, streaming=False, split=None, ) if not ds: raise ValueError("unhandled dataset load") d_base_type = d_prompt_style = None d_type = config_dataset.type if isinstance(d_type, str): d_type_split = d_type.split(":") d_base_type = d_type_split[0] d_prompt_style = d_type_split[1] if len(d_type_split) > 1 else None if isinstance(ds, DatasetDict): if config_dataset.split and config_dataset.split in ds: ds = ds[config_dataset.split] elif split in ds: ds = ds[split] else: raise ValueError( f"no {split} split found for dataset {config_dataset.path}, you may specify a split with 'split: `" ) # support for using a subset of the data if config_dataset.shards: shards_idx = config_dataset.get("shards_idx", 0) ds = ds.shuffle(seed=seed).shard( num_shards=config_dataset.shards, index=shards_idx ) dataset_wrapper, dataset_prompter = get_dataset_wrapper( config_dataset=config_dataset, tokenizer=tokenizer, cfg=cfg, dataset=ds, d_base_type=d_base_type, d_prompt_style=d_prompt_style, ) datasets.append(dataset_wrapper) prompters.append(dataset_prompter) LOG.info("merging datasets") dataset = concatenate_datasets(datasets) if len(datasets) > 1: if cfg.shuffle_merged_datasets: LOG.debug("shuffle merged 
datasets") dataset = dataset.shuffle(seed=seed) else: LOG.debug("NOT shuffling merged datasets") dataset, _ = process_datasets_for_packing(cfg, dataset, None) if cfg.local_rank == 0: LOG.info(f"Saving merged prepared dataset to disk... {prepared_ds_path}") dataset.save_to_disk(str(prepared_ds_path)) if cfg.push_dataset_to_hub: LOG.info( f"Saving merged prepared dataset with push_to_hub... {cfg.push_dataset_to_hub}/{ds_hash}" ) dataset.push_to_hub( f"{cfg.push_dataset_to_hub}/{ds_hash}", private=True ) return dataset, prompters
def preprocess_function(examples): batch_size = len(examples[text_column]) inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]] targets = [str(x) for x in examples[label_column]] model_inputs = tokenizer(inputs) labels = tokenizer(targets, add_special_tokens=False) # don't add bos token because we concatenate with inputs for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id] model_inputs["input_ids"][i] = sample_input_ids + label_input_ids labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i]) for i in range(batch_size): sample_input_ids = model_inputs["input_ids"][i] label_input_ids = labels["input_ids"][i] model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * ( max_length - len(sample_input_ids) ) + sample_input_ids model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[ "attention_mask" ][i] labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length]) model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length]) labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length]) model_inputs["labels"] = labels["input_ids"] return model_inputs
def encode_pretraining( tokenizer: PreTrainedTokenizerBase, max_tokens: int, examples: List[str] ) -> Dict[str, List]: res = tokenizer( examples, truncation=True, max_length=max_tokens - 2, add_special_tokens=True, ) # Convert to PyTorch tensors input_ids = [torch.tensor(seq) for seq in res["input_ids"]] attention_mask = [torch.tensor(seq) for seq in res["attention_mask"]] new_input_ids = [] new_attention_mask = [] # Append EOS and PAD tokens to input_ids, and correct attention_mask for i, _ in enumerate(input_ids): input_ids[i] = torch.cat( ( input_ids[i], torch.tensor([tokenizer.eos_token_id, tokenizer.pad_token_id]), ), dim=0, ) attention_mask[i] = torch.cat((attention_mask[i], torch.tensor([1, 0])), dim=0) # Concatenate tokens so that their lengths are less than max_tokens buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) for ids, mask in zip(input_ids, attention_mask): if buffer_input_ids.numel() == max_tokens: new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) elif buffer_input_ids.numel() + ids.numel() <= max_tokens: buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) else: buffer_input_ids = torch.cat( ( buffer_input_ids, torch.full( (max_tokens - buffer_input_ids.numel(),), tokenizer.pad_token_id, dtype=torch.long, ), ), dim=0, ) buffer_attention_mask = torch.cat( ( buffer_attention_mask, torch.full( (max_tokens - buffer_attention_mask.numel(),), 0, dtype=torch.long, ), ), dim=0, ) new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) buffer_input_ids = torch.tensor([], dtype=torch.long) buffer_attention_mask = torch.tensor([], dtype=torch.long) buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0) buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0) if buffer_input_ids.numel() > 0: # for any leftover tokens while buffer_input_ids.numel() < max_tokens: # make all sequences equal in size buffer_input_ids = torch.cat( ( buffer_input_ids, torch.full( (max_tokens - buffer_input_ids.numel(),), tokenizer.pad_token_id, dtype=torch.long, ), ), dim=0, ) buffer_attention_mask = torch.cat( ( buffer_attention_mask, torch.full( (max_tokens - buffer_attention_mask.numel(),), 0, dtype=torch.long, ), ), dim=0, ) new_input_ids.append(buffer_input_ids) new_attention_mask.append(buffer_attention_mask) ret = { "input_ids": [seq.tolist() for seq in new_input_ids], "labels": [seq.tolist() for seq in new_input_ids], "attention_mask": [seq.tolist() for seq in new_attention_mask], } LOG.debug(len(ret["input_ids"])) return ret
Run
```bash
accelerate launch -m axolotl.cli.train your_config.yml
```
> [!TIP]
> You can also reference a config file that is hosted on a public URL, for example
```bash
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
```
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.

- Set `dataset_prepared_path:` to a local folder for saving and loading the pre-tokenized dataset.
- (Optional): Set `push_dataset_to_hub: hf_user/repo` to push it to Huggingface.
- (Optional): Use `--debug` to see preprocessed examples.

```bash
python -m axolotl.cli.preprocess your_config.yml
```
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
```yaml
deepspeed: deepspeed_configs/zero1.json
```
```bash
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
```
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
```yaml
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
```
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
```
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
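For reference, the rough `transformers` equivalent of what happens is sketched below; the model id is a placeholder and this is not Axolotl's internal code.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your_base_model")   # placeholder model id
model = AutoModelForCausalLM.from_pretrained("your_base_model")

tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])   # delimiter tokens
model.resize_token_embeddings(len(tokenizer))          # embedding table must match the new vocab size
```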
"""data handling specific to pretraining"""
def _tokenize( self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False ) -> BatchEncoding: res = self.tokenizer( prompt, truncation=True, max_length=self.max_length - 1, add_special_tokens=True, return_overflowing_tokens=True, stride=256, ) res["input_ids"] = [ seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"] ] res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]] return res
class PretrainTokenizationStrategy(PromptTokenizingStrategy): """handles tokenization for pretraining with strides""" @property def supports_batched(self): return True def __init__(self, *args, max_length=None, text_column="text", **kwargs): super().__init__(*args, **kwargs) if max_length: self.max_length = max_length self.text_column = text_column def _tokenize( self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False ) -> BatchEncoding: res = self.tokenizer( prompt, truncation=True, max_length=self.max_length - 1, add_special_tokens=True, return_overflowing_tokens=True, stride=256, ) res["input_ids"] = [ seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"] ] res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]] return res def tokenize_prompt(self, prompt): return self._tokenize(prompt[self.text_column])
def load(tokenizer, cfg): strat = PretrainTokenizationStrategy( PretrainTokenizer(), tokenizer, cfg.train_on_inputs, cfg.sequence_len, text_column=cfg.pretraining_dataset[0]["text_column"] or "text", max_length=cfg.sequence_len * 64, ) return strat
"""Module for tokenization utilities"""
Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.
Features:
| | fp16/fp32 | lora | qlora | gptq | gptq w/flash attn | flash attn | xformers attn |
|-------------|:----------|:-----|-------|------|-------------------|------------|--------------|
| llama | β | β | β | β | β | β | β |
| Mistral | β | β | β | β | β | β | β |
| Mixtral-MoE | β | β | β | β | β | β | β |
| Mixtral8X22 | β | β | β | β | β | β | β |
| Pythia | β | β | β | β | β | β | β |
| cerebras | β | β | β | β | β | β | β |
| btlm | β | β | β | β | β | β | β |
| mpt | β | β | β | β | β | β | β |
| falcon | β | β | β | β | β | β | β |
| gpt-j | β | β | β | β | β | β | β |
| XGen | β | β | β | β | β | β | β |
| phi | β | β | β | β | β | β | β |
| RWKV | β | β | β | β | β | β | β |
| Qwen | β | β | β | β | β | β | β |
| Gemma | β | β | β | β | β | β | β |
β : supported β: not supported β: untested
Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.
Requirements: Python >=3.10 and Pytorch >=2.1.1.
```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
```
```bash
# preprocess datasets - optional but recommended
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

# inference
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out"

# gradio
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out" --gradio

# remote yaml files - the yaml config can be hosted on a public URL
# Note: the yaml config must directly link to the **raw** yaml
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
```
docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest
Or run on the current files for development:
docker compose up -d
<details>

<summary>Docker advanced</summary>

> [!Tip]
> If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.
A more powerful Docker command to run would be this:
```bash
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
```
It additionally:

- Prevents memory issues when running e.g. deepspeed through the `--ipc` and `--ulimit` args.
- Persists the downloaded HF data (models etc.) and your modifications to the axolotl code through the `--mount`/`-v` args.
- The `--name` argument simply makes it easier to refer to the container in vscode (`Dev Containers: Attach to Running Container...`) or in your terminal.
- The `--privileged` flag gives all capabilities to the container.
- The `--shm-size 10g` argument increases the shared memory size. Use this if you see `exitcode: -7` errors using deepspeed.

More information on nvidia website
</details>

Install python >=3.10
Install pytorch stable https://pytorch.org/get-started/locally/
Install Axolotl along with python dependencies
```bash
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
```
(Optional) Login to Huggingface to use gated models/datasets.
huggingface-cli login
Get the token at huggingface.co/settings/tokens
For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest
```bash
sudo apt update
sudo apt install -y python3.10

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --config python # pick 3.10 if given option
python -V # should be 3.10
```
```bash
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
```
Install Pytorch https://pytorch.org/get-started/locally/
Follow instructions on quickstart.
Run
```bash
pip3 install protobuf==3.20.3
pip3 install -U --ignore-installed requests Pillow psutil scipy
```

</details>

```bash
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```
Use a Deeplearning linux OS with cuda and pytorch installed. Then follow instructions on quickstart.
Make sure to run the below to uninstall xla.
</details>

```bash
pip uninstall -y torch_xla[tpu]
```
Please use WSL or Docker!
Use the below instead of the install method in QuickStart.
pip3 install -e '.'
More info: mac.md
Please use this example notebook.
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds sky check
Get the example YAMLs of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
```bash
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/axolotl
```
Use one command to launch:
```bash
# On-demand
HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

# Managed spot (auto-recovery on preemption)
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
```
Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
See these docs for more information on how to use different dataset formats.
See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:
model
```yaml
base_model: ./llama-7b-hf # local or huggingface repo
```
Note: The code will load the right architecture.
dataset
```yaml
datasets:
  # huggingface repo
  - path: vicgalle/alpaca-gpt4
    type: alpaca

  # huggingface repo with specific configuration/subset
  - path: EleutherAI/pile
    name: enron_emails
    type: completion # format from earlier
    field: text # Optional[str] default: text, field to use for completion data

  # huggingface repo with multiple named configurations/subsets
  - path: bigcode/commitpackft
    name:
      - ruby
      - python
      - typescript
    type: ... # unimplemented custom format

  # fastchat conversation
  # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
  - path: ...
    type: sharegpt
    conversation: chatml # default: vicuna_v1.1

  # local
  - path: data.jsonl # or json
    ds_type: json # see other options below
    type: alpaca

  # dataset with splits, but no train split
  - path: knowrohit07/know_sql
    type: context_qa.load_v2
    train_on_split: validation

  # loading from s3 or gcs
  # s3 creds will be loaded from the system default and gcs only supports public access
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
    ...

  # Loading Data From a Public URL
  # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
  - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
    ds_type: json # this is the default, see other options below.
```
loading
```yaml
load_in_4bit: true
load_in_8bit: true

bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
tf32: true # require >=ampere

bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
float16: true # use instead of fp16 when you don't want AMP
```
Note: Repo does not do 4-bit quantization.
lora
```yaml
adapter: lora # 'qlora' or leave blank for full finetune
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
```
See these docs for all config options.
---
title: Dataset Preprocessing
description: How datasets are processed
---
Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the [dataset format](../dataset-formats/) and prompt strategies to:
- parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
The processing of the datasets can happen one of two ways:
1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
2. When training is started
What are the benefits of pre-processing? When training interactively or for sweeps
(e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
training parameters so that it will intelligently pull from its cache when possible.
The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
data is in the cache.
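Conceptually, the cache key works something like the sketch below (illustrative only, not Axolotl's exact hashing code): the directory name is an md5 hash of the settings the tokenized data depends on, so changing any of them produces a fresh cache entry.

```python
import hashlib

def prepared_ds_dir(sequence_len, sample_packing, dataset_specs, tokenizer_name,
                    base_path="./last_run_prepared"):
    # Combine the settings the tokenized data depends on into a single cache key.
    key = f"{sequence_len}@{sample_packing}@" + "|".join(sorted(dataset_specs)) + "|" + tokenizer_name
    return f"{base_path}/{hashlib.md5(key.encode()).hexdigest()}"

print(prepared_ds_dir(2048, True, ["vicgalle/alpaca-gpt4:alpaca"], "meta-llama/Llama-2-7b-hf"))
```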
What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined prompt template. Because the trainer cannot readily detect these changes, the calculated hash value for the pre-processed dataset will not change. If you have `dataset_prepared_path: ...` set and change your prompt templating logic, it may not pick up the changes you made and you will be training over the old prompt.
def prepare_dataset(cfg, tokenizer): prompters = [] if not cfg.pretraining_dataset: with zero_first(is_main_process()): if cfg.test_datasets: train_dataset, _, prompters = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="train" ) _, eval_dataset, _ = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="test" ) else: train_dataset, eval_dataset, prompters = load_prepare_datasets( tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH ) else: path = cfg.pretraining_dataset split = "train" name = None if isinstance(cfg.pretraining_dataset, list) and isinstance( cfg.pretraining_dataset[0], dict ): path = cfg.pretraining_dataset[0]["path"] name = cfg.pretraining_dataset[0]["name"] if "split" in cfg.pretraining_dataset[0]: split = cfg.pretraining_dataset[0]["split"] ds_wrapper_partial = functools.partial( get_dataset_wrapper, cfg.pretraining_dataset[0], tokenizer, cfg, cfg.pretraining_dataset[0]["type"] or "pretrain", ) train_dataset = wrap_pretraining_dataset( load_dataset(path, streaming=True, split=split, name=name), tokenizer, cfg, ds_wrapper_partial, max_tokens=cfg.sequence_len, batch_size=cfg.micro_batch_size, seed=cfg.seed or 42, buffer_size=cfg.pretrain_multipack_buffer_size or 10_000, ) # https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230 train_dataset = train_dataset.with_format("torch") eval_dataset = None return train_dataset, eval_dataset, cfg.max_steps, prompters if eval_dataset and cfg.sample_packing and cfg.eval_sample_packing is not False: total_eval_steps = calculate_total_num_steps(cfg, eval_dataset, update=False) if total_eval_steps == 0: raise ValueError( "eval dataset split is too small for sample_packing. You should set `eval_sample_packing: False`. " ) if cfg.max_steps: total_num_steps = min( calculate_total_num_steps(cfg, train_dataset), cfg.max_steps ) LOG.info(f"Maximum number of steps set at {total_num_steps}") else: total_num_steps = calculate_total_num_steps(cfg, train_dataset) return train_dataset, eval_dataset, total_num_steps, prompters
---
title: Template-free prompt construction
description: "Template-free prompt construction with the `input_output` format"
---
<!-- TOC -->
- [Background](#background)
- [Masking Inputs](#masking-inputs)
- [You may not want prompt templates](#you-may-not-want-prompt-templates)
- [The `input_output` format](#the-input_output-format)
- [Usage](#usage)
- [1. Prepare Data](#1-prepare-data)
- [2. Use `type: input_output`](#2-use-type-input_output)
- [3. Check the prompts](#3-check-the-prompts)
<!-- /TOC -->
<a id="markdown-background" name="background"></a>
## Background
<a id="markdown-masking-inputs" name="masking-inputs"></a>
### Masking Inputs
One of the most popular features of
[axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
setting the following configuration value:
```yaml
train_on_inputs: false
```
If you declare a dataset format such as `alpaca` or `chatml`, axolotl knows what is an input (i.e. human) vs. an output (i.e. the assistant) and masks the input labels so that your model can focus on predicting the outputs only.
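In terms of the tokenized example, "masking the input labels" just means replacing them with `-100` so the loss ignores those positions. A simplified sketch, assuming `tokenizer`, `prompt_text`, and `response_text` are already defined:

```python
prompt_ids = tokenizer(prompt_text)["input_ids"]                                 # the "human" part
response_ids = tokenizer(response_text, add_special_tokens=False)["input_ids"]   # the assistant part

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids    # -100 positions are ignored by the loss
attention_mask = [1] * len(input_ids)
```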
<a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>
However, there are many situations where you don't want to use one of these formats or templates. This is because they can:

- add special delimiter tokens like `<|im_start|>` that can quickly become footguns if you don't include them correctly at inference time.

<a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>
## The `input_output` format

You can construct your prompts without a template by using the `input_output` format, by setting `type: input_output` in your configuration file like this:
```{.yaml filename="config.yml"}
train_on_inputs: false # Mask segments of your data
datasets:
  - path: output.jsonl
    type: input_output # use template free prompt construction
```
Unlike `type: completion`, which is also template-free, `type: input_output` allows you to mask segments of your text. More details on how this works are described below.
<a id="markdown-usage" name="usage"></a>
This is how you can use the input_output
format:
<a id="markdown-1-prepare-data" name="1-prepare-data"></a>
To use the input_output
format, collect your data in the following
format into a jsonl file (below is the first row from the file
output
.jsonl` pretty printed):
```bash
$ head -n1 output.jsonl | python -m json.tool
```
:::{.cell-output .cell-output-stdout}
```json
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
:::
Set `label: false` when you want to mask a segment of text so that the model isn't trained on it. Some things to keep in mind:
> [!IMPORTANT]
>
> - EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl concatenates all the segments as-is. The tokenizer doesn't add anything additional. Notice how I added spaces, newlines, `<s>` (BOS), and `</s>` (EOS) myself.
> - Make sure you check the materialized output to validate that the prompt is getting assembled how you like.
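Conceptually, the segments are assembled something like the sketch below (illustrative only, not Axolotl's actual implementation): every segment's tokens go into `input_ids`, while segments marked `"label": false` contribute `-100` labels so the loss skips them.

```python
def tokenize_segments(segments, tokenizer):
    input_ids, labels = [], []
    for seg in segments:
        ids = tokenizer(seg["text"], add_special_tokens=False)["input_ids"]
        input_ids += ids
        labels += ids if seg["label"] else [-100] * len(ids)   # mask segments with "label": false
    return {"input_ids": input_ids, "labels": labels, "attention_mask": [1] * len(input_ids)}
```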
<a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>
type: input_output
Let's materialize data with our output.jsonl
file by setting
type: input_output
in our axolotl config:
```yaml
# training_config.yaml
base_model: mistralai/Mistral-7B-v0.1
data_seed: 49
seed: 49
datasets:
  - path: output.jsonl
    type: input_output
val_set_size: 0.1
sequence_len: 896
sample_packing: false
micro_batch_size: 2
gradient_accumulation_steps: 3
eval_batch_size: 2
num_epochs: 1
learning_rate: 0.0002
train_on_inputs: false
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```
You can use the following command to materialize your data. The `--debug` flag will print the tokens, along with the labels, so you can verify that the correct items are being ignored:
```bash
$ python -m axolotl.cli.preprocess training_config.yaml --debug

...
[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557) (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
```
The format is `decoded_token`(`label`, `token_id`), for example, `<s>(1, 1)` means that the token is `<s>`, the label is `1` and the token_id is `1`. When the label is `-100`, that token is ignored for training.
<a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>
Here is another way to check the materialized output:
```python
from transformers import AutoTokenizer
from datasets import load_from_disk
import yaml

directory = !ls last_run_prepared/  # IPython/Jupyter shell magic

with open('training_config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
```
```python
>>> row = ds[0]
>>> print(tok.decode(row['input_ids']))
<s> Hello hi there!. goodbye farewell</s>
```
We can check that the right tokens are ignored by comparing the labels to each token:
```python
import pandas as pd

pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
              for i, l in zip(row['input_ids'], row['labels'])])
```
| | token | label | id |
|----|-------|-------|-------|
| 0 | `<s>` | 1 | 1 |
| 1 | Hello | 22557 | 22557 |
| 2 | \n | 13 | 13 |
| 3 | hi | 12014 | 12014 |
| 4 | there | 736 | 736 |
| 5 | ! | 28808 | 28808 |
| 6 | . | 28723 | 28723 |
| 7 | | 28705 | 28705 |
| 8 | good | -100 | 1179 |
| 9 | bye | -100 | 17664 |
| 10 | | -100 | 28705 |
| 11 | fare | 19111 | 19111 |
| 12 | well | 5458 | 5458 |
| 13 | `</s>` | 2 | 2 |
If we look at the input data, the above table seems correct! (The jsonl version is repeated below for reference):
```bash
$ head -n1 output.jsonl | python -m json.tool
```
:::{.cell-output .cell-output-stdout}
```json
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
:::