The `datasets.exceptions.DatasetGenerationError` typically occurs when there's an issue with generating a dataset using the Hugging Face `datasets` library. This error can happen for several reasons, including but not limited to:
- **Incorrect Dataset Name or Configuration**: If you're trying to load a dataset from the Hugging Face Datasets Hub and you've specified an incorrect dataset name or configuration name, the library won't be able to find and load the dataset, resulting in this error.
- **Custom Dataset Script Errors**: If you're using a custom dataset loading script (a Python script that defines how to download, process, and load a dataset), any errors in this script can lead to a `DatasetGenerationError`. Common issues include syntax errors, logic errors, or issues with how the data is processed and formatted.
- **Data Access Issues**: If the dataset script is trying to access data that is not available or has restricted access (e.g., due to permissions issues or broken URLs), the dataset generation process can fail.
- **Missing Dependencies**: Some datasets require additional Python libraries to process the data (e.g., image processing libraries for datasets containing images). If these dependencies are not installed, the dataset generation can fail.
- **Resource Limitations**: Generating some datasets requires significant computational resources (e.g., memory). If your environment does not have enough resources, the dataset generation process might fail.
To resolve this error, you can try the following steps:
1. **Verify Dataset Name/Configuration**: Ensure that the dataset name and configuration (if applicable) you're using to load the dataset are correct. You can check the available datasets and their configurations on the Hugging Face Datasets Hub.
2. **Check the Dataset Script**: If you're using a custom dataset script, carefully review the script for any errors or issues. Running the script outside of the dataset loading context (e.g., as a standalone Python script) might help identify the problem.
3. **Install Dependencies**: Make sure all required dependencies for processing the dataset are installed in your environment.
4. **Check Data Availability**: Ensure that all data sources the dataset script tries to access are available and that there are no permissions issues.
5. **Increase Resources**: If possible, try running the dataset generation process in an environment with more computational resources (e.g., more memory).
If you're still encountering issues, providing more specific details about the dataset you're trying to load and the exact code you're using can help diagnose the problem more accurately.
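As a minimal sketch of the first two steps, you can wrap the call to `load_dataset` and surface the underlying cause of the error. The dataset name and configuration below are placeholders, not a real example:

```python
from datasets import load_dataset
from datasets.exceptions import DatasetGenerationError

try:
    # Replace with the dataset name/configuration you are actually loading.
    dataset = load_dataset("some_user/some_dataset", "default")
except DatasetGenerationError as err:
    # The chained exception usually points at the real problem
    # (broken URL, missing dependency, bug in the dataset script, ...).
    print("Dataset generation failed:", err)
    print("Underlying cause:", err.__cause__)
    raise
```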
# docstyle-ignore
DATASETS_IMPORT_ERROR = """
{0} requires the 🤗 Datasets library but it was not found in your environment. You can install it with:

pip install datasets

In a notebook or a colab, you can install it by executing a cell with

!pip install datasets

then restarting your kernel.

Note that if you have a local folder named `datasets` or a local python file named `datasets.py` in your current
working directory, python may try to import this instead of the 🤗 Datasets library. You should rename this folder or
that python file if that's the case. Please note that you may need to restart your runtime after installation.
"""
"""Module containing Dataset functionality"""
def _check_dataset_format(self, with_index: bool):
    if not isinstance(self.dataset, Dataset):
        raise ValueError(f"Dataset should be a datasets.Dataset object, but got {type(self.dataset)}")
    if len({"title", "text", "embeddings"} - set(self.dataset.column_names)) > 0:
        raise ValueError(
            "Dataset should be a dataset with the following columns: "
            "title (str), text (str) and embeddings (arrays of dimension vector_size), "
            f"but got columns {self.dataset.column_names}"
        )
    if with_index and "embeddings" not in self.dataset.list_indexes():
        raise ValueError(
            "Missing faiss index in the dataset. Make sure you called `dataset.add_faiss_index` to compute it "
            "or `dataset.load_faiss_index` to load one from the disk."
        )
def get_dataset(self, type_path) -> Seq2SeqDataset:
    n_obs = self.n_obs[type_path]
    max_target_length = self.target_lens[type_path]
    dataset = Seq2SeqDataset(
        self.tokenizer,
        type_path=type_path,
        n_obs=n_obs,
        max_target_length=max_target_length,
        **self.dataset_kwargs,
    )
    return dataset
def test_small_chat_model_with_dataset_pt(self):
    from torch.utils.data import Dataset

    from transformers.pipelines.pt_utils import KeyDataset

    class MyDataset(Dataset):
        data = [
            [
                {"role": "system", "content": "This is a system message."},
                {"role": "user", "content": "This is a test"},
                {"role": "assistant", "content": "This is a reply"},
            ],
        ]

        def __len__(self):
            return 1

        def __getitem__(self, i):
            return {"text": self.data[i]}

    text_generator = pipeline(
        task="text-generation", model="rocketknight1/tiny-gpt2-with-chatml-template", framework="pt"
    )

    dataset = MyDataset()
    key_dataset = KeyDataset(dataset, "text")

    for outputs in text_generator(key_dataset, do_sample=False, max_new_tokens=10):
        expected_chat = dataset.data[0] + [
            {
                "role": "assistant",
                "content": " factors factors factors factors factors factors factors factors factors factors",
            }
        ]
        self.assertEqual(
            outputs,
            [
                {"generated_text": expected_chat},
            ],
        )
def train_dataset(self, dataset): self._train_dataset = dataset
def get_dataset(self):
    self.dataset = self.dataset.apply(tf.data.experimental.assert_cardinality(len(self.features)))
    return self.dataset
def get_dataset():
    data_file = str(self.tests_dir / "fixtures/tests_samples/SQUAD/sample.json")
    data_files = {"train": data_file, "validation": data_file}
    raw_datasets = datasets.load_dataset("json", data_files=data_files, field="data")
    train_dataset = raw_datasets["train"].map(_add_eos_to_examples).map(_convert_to_features, batched=True)
    valid_dataset = deepcopy(train_dataset)
    return train_dataset, valid_dataset
def __init__(self, p_stop=0.01, max_length=1000):
    self.p_stop = p_stop
    self.max_length = max_length
    self.generator = torch.Generator()
def get_dataset(self): return self.dataset
def make_dataset(args, tokenizer, accelerator, split="train"):
    # Get the datasets: you can either provide your own training and evaluation files (see below)
    # or specify a Dataset from the hub (the dataset will be downloaded automatically from the datasets Hub).

    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
    # download the dataset.
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
        )
    else:
        if args.train_data_dir is not None:
            dataset = load_dataset(
                args.train_data_dir,
                cache_dir=args.cache_dir,
            )
        # See more about loading custom images at
        # https://huggingface.co/docs/datasets/v2.0.0/en/dataset_script

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.
    column_names = dataset[split].column_names

    # Get the column names for input/target.
    if args.image_column is None:
        image_column = column_names[0]
    else:
        image_column = args.image_column
        if image_column not in column_names:
            raise ValueError(
                f"`--image_column` value '{args.image_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
            )

    if args.caption_column is None:
        caption_column = column_names[1]
    else:
        caption_column = args.caption_column
        if caption_column not in column_names:
            raise ValueError(
                f"`--caption_column` value '{args.caption_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
            )

    if args.conditioning_image_column is None:
        conditioning_image_column = column_names[2]
    else:
        conditioning_image_column = args.conditioning_image_column
        if conditioning_image_column not in column_names:
            raise ValueError(
                f"`--conditioning_image_column` value '{args.conditioning_image_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
            )

    def tokenize_captions(examples, is_train=True):
        captions = []
        for caption in examples[caption_column]:
            if random.random() < args.proportion_empty_prompts:
                captions.append("")
            elif isinstance(caption, str):
                captions.append(caption)
            elif isinstance(caption, (list, np.ndarray)):
                # take a random caption if there are multiple
                captions.append(random.choice(caption) if is_train else caption[0])
            else:
                raise ValueError(
                    f"Caption column `{caption_column}` should contain either strings or lists of strings."
                )
        inputs = tokenizer(
            captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
        )
        return inputs.input_ids

    image_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ]
    )

    conditioning_image_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution),
            transforms.ToTensor(),
        ]
    )

    def preprocess_train(examples):
        images = [image.convert("RGB") for image in examples[image_column]]
        images = [image_transforms(image) for image in images]

        conditioning_images = [image.convert("RGB") for image in examples[conditioning_image_column]]
        conditioning_images = [conditioning_image_transforms(image) for image in conditioning_images]

        examples["pixel_values"] = images
        examples["conditioning_pixel_values"] = conditioning_images
        examples["input_ids"] = tokenize_captions(examples)

        return examples

    with accelerator.main_process_first():
        if args.max_train_samples is not None:
            dataset[split] = dataset[split].shuffle(seed=args.seed).select(range(args.max_train_samples))
        # Set the training transforms
        split_dataset = dataset[split].with_transform(preprocess_train)

    return split_dataset
def __init__(self, **dataset_kwargs):
    parser = argparse.ArgumentParser()
    parser = _add_data_args(parser)
    parser = _add_validation_args(parser)
    data_args = parser.parse_known_args()
    self.dataset_args = vars(data_args[0])
    self.dataset_args.update(dataset_kwargs)
    self.dataset_args["megatron_dataset_flag"] = True
def eval_dataset(self, dataset): self._eval_dataset = dataset
def check_dataset_or_pretraining_dataset(cls, data):
    if data.get("datasets") is None and data.get("pretraining_dataset") is None:
        raise ValueError("either datasets or pretraining_dataset is required")
    return data
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation"]

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
# loading dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
def __init__(
    self,
    dataset,
    device=None,
    rng_types=None,
    synchronized_generator=None,
    skip_batches=0,
    _drop_last: bool = False,
    _non_blocking: bool = False,
    **kwargs,
):
    super().__init__(dataset, **kwargs)
    self.device = device
    self.rng_types = rng_types
    self.synchronized_generator = synchronized_generator
    self.skip_batches = skip_batches
    self.gradient_state = GradientState()
    self._drop_last = _drop_last
    self._non_blocking = _non_blocking
    self.iteration = 0
def __init__(self, dataset: Mapping, prefix: str):
    self.dataset = dataset
    self.prefix = prefix
def log_validation(val_dataset, text_encoder, unet, controlnet, args, accelerator): pipeline = LightControlNetPipeline.from_pretrained( args.pretrained_model_name_or_path, controlnet=accelerator.unwrap_model(controlnet, keep_fp32_wrapper=True), unet=accelerator.unwrap_model(unet, keep_fp32_wrapper=True).model, text_encoder=accelerator.unwrap_model(text_encoder, keep_fp32_wrapper=True), safety_checker=None, revision=args.revision, ) pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) pipeline = pipeline.to(accelerator.device) pipeline.set_progress_bar_config(disable=True) generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) image_logs = [] for idx in range(args.num_validation_images): data = val_dataset[idx] validation_prompt = data["text"] validation_image = data["conditioning_pixel_values"] image = pipeline( validation_prompt, [validation_image], num_inference_steps=50, generator=generator, )[0][0] image_logs.append( { "validation_image": validation_image, "image": image, "validation_prompt": validation_prompt, } ) for tracker in accelerator.trackers: formatted_images = [] for log in image_logs: image = log["image"] validation_prompt = log["validation_prompt"] validation_image = log["validation_image"] formatted_images.append(wandb.Image(validation_image, caption="Controlnet conditioning")) image = wandb.Image(image, caption=validation_prompt) formatted_images.append(image) tracker.log({"validation": formatted_images}) del pipeline torch.cuda.empty_cache()
def __init__(  # pylint: disable=super-init-not-called
    self,
    prompt_tokenizer: PromptTokenizingStrategy,
    dataset: Dataset,
    process_count: Optional[int] = None,
    keep_in_memory: Optional[bool] = False,
    **kwargs,
):
    self.prompt_tokenizer = prompt_tokenizer
    self.process_count = process_count
    self.keep_in_memory = keep_in_memory
    super().__init__(
        self.process(dataset).data,
        **kwargs,
    )
def fix_sharegpt_datasets(cls, datasets):
    for idx, ds_cfg in enumerate(datasets):
        if not ds_cfg["type"]:
            continue
        if ds_cfg["type"] == "sharegpt:chat":
            LOG.warning(
                PendingDeprecationWarning(
                    "`type: sharegpt:chat` will soon be deprecated. simply use `type: sharegpt` instead."
                )
            )
            datasets[idx]["type"] = "sharegpt"
        if "sharegpt_simple" in ds_cfg["type"]:
            LOG.warning(
                PendingDeprecationWarning(
                    "`type: sharegpt_simple` will soon be deprecated. simply use `type: sharegpt` instead."
                )
            )
            datasets[idx]["type"] = datasets[idx]["type"].replace(
                "sharegpt_simple", "sharegpt"
            )
    return datasets
class PromptDataset(Dataset):
    "A simple dataset to prepare the prompts to generate class images on multiple GPUs."

    def __init__(self, prompt, num_samples):
        self.prompt = prompt
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        example = {}
        example["prompt"] = self.prompt
        example["index"] = index
        return example
def create_datasets(tokenizer, data_args, training_args, apply_chat_template=False):
    def preprocess(samples):
        batch = []
        for conversation in samples["messages"]:
            batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
        return {"content": batch}

    raw_datasets = DatasetDict()
    for split in data_args.splits.split(","):
        try:
            # Try first if dataset on a Hub repo
            dataset = load_dataset(data_args.dataset_name, split=split)
        except DatasetGenerationError:
            # If not, check local dataset
            dataset = load_from_disk(os.path.join(data_args.dataset_name, split))

        if "train" in split:
            raw_datasets["train"] = dataset
        elif "test" in split:
            raw_datasets["test"] = dataset
        else:
            raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if apply_chat_template:
        raw_datasets = raw_datasets.map(
            preprocess,
            batched=True,
            remove_columns=raw_datasets["train"].column_names,
        )

    train_data = raw_datasets["train"]
    valid_data = raw_datasets["test"]
    print(f"Size of the train set: {len(train_data)}. Size of the validation set: {len(valid_data)}")
    print(f"A sample of train dataset: {train_data[0]}")

    return train_data, valid_data
def __init__(self, dataset, processor):
    self.dataset = dataset
    self.processor = processor
def predict_with_generate(): eval_src, eval_pred, eval_ref = [], [], [] for batch in tqdm(eval_dataloader): batch_labels = batch["labels"].to(device) batch_input_ids = batch["input_ids"].to(device) if "position_ids" in batch: batch_pos_ids = batch["position_ids"].tolist() else: batch_pos_ids = [None] * len(batch["input_ids"]) prompt_token_ids_list = [] completion_token_ids_list = [] for input_ids_all, labels_all, pos_ids in zip( batch_input_ids, batch_labels, batch_pos_ids, ): if pos_ids is None: pos_ranges = [(0, len(input_ids_all) - 1)] else: pos_ranges = find_ranges(pos_ids) for pos_range in pos_ranges: start, end = pos_range if start == end: continue input_ids = input_ids_all[start : end + 1] labels = labels_all[start : end + 1] tokens_without_loss = labels == IGNORE_INDEX tokens_with_loss = labels != IGNORE_INDEX tokens_exclude_padding = input_ids != tokenizer.pad_token_id prompt_token_includes = ( tokens_without_loss & tokens_exclude_padding ) prompt_token_ids = input_ids[prompt_token_includes] prompt_token_ids_list.append(prompt_token_ids) completion_token_ids = input_ids[tokens_with_loss] completion_token_ids_list.append(completion_token_ids) prompt_texts = tokenizer.batch_decode( prompt_token_ids_list, skip_special_tokens=True ) completion_texts = tokenizer.batch_decode( completion_token_ids_list, skip_special_tokens=True ) with torch.no_grad(): prompt_encoding = tokenizer( prompt_texts, padding=True, return_tensors="pt" ).to(self.cfg.device) predictions = trainer.model.generate( **prompt_encoding, generation_config=generation_config ) prediction_all_tokens = predictions["sequences"].cpu().tolist() prediction_without_prompt_tokens_list = [] for prompt_token_ids, prediction_tokens in zip( prompt_token_ids_list, prediction_all_tokens ): prediction_without_prompt_tokens = prediction_tokens[ len(prompt_token_ids) : ] prediction_without_prompt_tokens_list.append( prediction_without_prompt_tokens ) predicted_texts = tokenizer.batch_decode( prediction_without_prompt_tokens_list, skip_special_tokens=True ) eval_src.extend(prompt_texts) eval_pred.extend(predicted_texts) eval_ref.extend(completion_texts) return eval_src, eval_pred, eval_ref
def __len__(self): return len(self.dataset)
def process(self, dataset):
    features = dataset.features.keys()
    num_proc = min(64, self.process_count if self.process_count else os.cpu_count())

    map_kwargs = {}
    if self.prompt_tokenizer.supports_batched:
        map_kwargs["batched"] = True
        map_kwargs["batch_size"] = 100

    return dataset.map(
        self.prompt_tokenizer.tokenize_prompt,
        num_proc=num_proc,
        remove_columns=features,
        keep_in_memory=self.keep_in_memory,
        desc="Tokenizing Prompts",
        **map_kwargs,
    )
def get_batch_func(self, accelerator, megatron_dataset_flag):
    def get_batch_megatron(data_iterator):
        """Generate a batch"""
        # Items and their type.
        keys = ["text"]
        datatype = torch.int64

        # Broadcast data.
        if data_iterator is not None:
            data = next(data_iterator)
        else:
            data = None
        data_b = tensor_parallel.broadcast_data(keys, data, datatype)

        # Unpack.
        tokens_ = data_b["text"].long()
        labels = tokens_[:, 1:].contiguous()
        tokens = tokens_[:, :-1].contiguous()

        # Get the masks and position ids.
        attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
            tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, self.eod_mask_loss
        )

        return tokens, labels, loss_mask, attention_mask, position_ids

    def get_batch_transformer(data_iterator):
        data = next(data_iterator)
        data = {"input_ids": data["input_ids"]}
        data = send_to_device(data, torch.cuda.current_device())

        tokens_ = data["input_ids"].long()
        padding = torch.zeros((tokens_.shape[0], 1), dtype=tokens_.dtype, device=tokens_.device) + self.eod_token
        tokens_ = torch.concat([tokens_, padding], dim=1)
        labels = tokens_[:, 1:].contiguous()
        tokens = tokens_[:, :-1].contiguous()

        # Get the masks and position ids.
        attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
            tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, True
        )
        return tokens, labels, loss_mask, attention_mask, position_ids

    if accelerator.state.megatron_lm_plugin.custom_get_batch_function is not None:
        return accelerator.state.megatron_lm_plugin.custom_get_batch_function
    if megatron_dataset_flag:
        try:
            # Use '--no-use-pep517 -e' to pip install nvidia's megatron from source
            from pretrain_gpt import get_batch

            return get_batch
        except ImportError:
            pass
        return get_batch_megatron
    else:
        return get_batch_transformer
def __init__(  # pylint: disable=super-init-not-called
    self,
    tokenizer,
    datasets,
    seq_length=2048,
):
    self.tokenizer = tokenizer
    self.concat_token_id = tokenizer.eos_token_id
    self.datasets: List[IterableDataset] = datasets
    self.seq_length = seq_length

    vocab_size = len(tokenizer.get_vocab())

    if vocab_size <= torch.iinfo(torch.int16).max:
        self.tokens_dtype = torch.int16
    elif vocab_size <= torch.iinfo(torch.int32).max:
        self.tokens_dtype = torch.int32
    else:
        self.tokens_dtype = torch.int64
def construct(self): # The dataset items fill = Rectangle(height=0.46,width=0.46).set_stroke(width=0) columns = [ VGroup(*[Rectangle(height=0.25,width=0.25,color="green") for i in range(8)]).arrange(RIGHT,buff=0) for j in range(4) ] dataset_recs = VGroup(*columns).arrange(UP, buff=0) dataset_text = Text("Dataset", font_size=24) dataset = Group(dataset_recs,dataset_text).arrange(DOWN, buff=0.5, aligned_edge=DOWN) dataset.move_to([-2,0,0]) self.add(dataset) code = Code( code="dataloader = DataLoader(...)\nfor batch in dataloader():\n\t...", tab_width=4, background="window", language="Python", font="Monospace", font_size=14, corner_radius=.2, insert_line_no=False, line_spacing=.75, style=Code.styles_list[1], ) code.move_to([-3.5, 2.5, 0]) self.add(code) # The dataloader itself dataloader = Group( Rectangle(color="red", height=2, width=2), Text("DataLoader", font_size=24) ).arrange(DOWN, buff=.5, aligned_edge=DOWN) sampler = Group( Rectangle(color="blue", height=1, width=1), Text("Sampler", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN) dataloader.move_to([1, 0, 0]) sampler.move_to([.75,.25,0]) self.add(dataloader) self.add(sampler) gpu_1 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 1", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, 2, 0]) gpu_2 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 2", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, .5, 0]) gpu_3 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 3", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, -1, 0]) gpu_4 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 4", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, -2.5, 0]) gpus = [gpu_1[0], gpu_2[0], gpu_3[0], gpu_4[0]] self.add(gpu_1, gpu_2, gpu_3, gpu_4) # Animate their existence self.play( Create(gpu_1[0], run_time=0.5), Create(gpu_2[0], run_time=0.5), Create(gpu_3[0], run_time=0.5), Create(gpu_4[0], run_time=0.5), Create(dataset_recs, run_time=1), Create(sampler[0], run_time=1), Create(dataloader[0], run_time=1) ) step_1 = MarkupText( f"Without any special care, \nthe same data is sent though each sampler, \nand the same samples are spit out on each GPU", font_size=18 ) step_1.move_to([0, -2.5, 0]) self.play( Write(step_1, run_time=4), ) first_animations = [] second_animations = [] colors = ["BLUE_E", "DARK_BROWN", "GOLD_E", "GRAY_A"] current_color = colors[0] buff = 0 lr_buff = .25 old_target = None new_datasets = [] for i,data in enumerate(dataset_recs[-1]): if i % 2 == 0: # current_color = colors[i//2] current_color = "BLUE_E" dataset_target = Rectangle(height=0.46/2,width=0.46/2).set_stroke(width=0.).set_fill(current_color, opacity=0.7) dataset_target.move_to(data) dataset_target.generate_target() aligned_edge = ORIGIN if i % 2 == 0: old_target = dataset_target.target buff -= .25 aligned_edge = LEFT dataset_target.target.next_to( sampler, buff=buff, direction=UP, aligned_edge=LEFT ) else: dataset_target.target.next_to( old_target, direction=RIGHT, buff=0.01, ) new_datasets.append(dataset_target) first_animations.append(data.animate(run_time=0.5).set_stroke(current_color)) second_animations.append(MoveToTarget(dataset_target, run_time=1.5)) self.play(*first_animations) self.play(*second_animations) self.wait() move_animation = [] for j,gpu in enumerate(gpus): buff = 0 for i,data in enumerate(new_datasets): if i % 2 == 0: current_color = colors[i//2] if j != 3: data = data.copy() data.generate_target() aligned_edge 
= ORIGIN if i % 2 == 0: old_target = data.target buff -= .25 aligned_edge = LEFT data.target.next_to( gpu, buff=buff, direction=UP, aligned_edge=LEFT ) else: data.target.next_to( old_target, direction=RIGHT, buff=0.01, ) move_animation.append(MoveToTarget(data, run_time=1.5)) self.play(*move_animation) self.remove(step_1) step_2 = MarkupText( f"This behavior is undesireable, because we want\neach GPU to see different data for efficient training.", font_size=18 ) step_2.move_to([0, -2.5, 0]) self.play( Write(step_2, run_time=2.5), ) self.wait()
def __init__( self, instance_data_root, instance_prompt, tokenizer, class_data_root=None, class_prompt=None, size=512, center_crop=False, ): self.size = size self.center_crop = center_crop self.tokenizer = tokenizer self.instance_data_root = Path(instance_data_root) if not self.instance_data_root.exists(): raise ValueError("Instance images root doesn't exists.") self.instance_images_path = list(Path(instance_data_root).iterdir()) self.num_instance_images = len(self.instance_images_path) self.instance_prompt = instance_prompt self._length = self.num_instance_images if class_data_root is not None: self.class_data_root = Path(class_data_root) self.class_data_root.mkdir(parents=True, exist_ok=True) self.class_images_path = list(self.class_data_root.iterdir()) self.num_class_images = len(self.class_images_path) self._length = max(self.num_class_images, self.num_instance_images) self.class_prompt = class_prompt else: self.class_data_root = None self.image_transforms = transforms.Compose( [ transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR), transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size), transforms.ToTensor(), transforms.Normalize([0.5], [0.5]), ] )
def train_dataset(self): return self._train_dataset
---
title: Dataset Preprocessing
description: How datasets are processed
---
Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the [dataset format](../dataset-formats/) and prompt strategies to:
- parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
The processing of the datasets can happen one of two ways:
1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
2. When training is started
What are the benefits of pre-processing? When training interactively or for sweeps
(e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
training parameters so that it will intelligently pull from its cache when possible.
The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
data is in the cache.
What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
prompt template. Because the trainer cannot readily detect these changes, we cannot change the
calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
and change your prompt templating logic, it may not pick up the changes you made and you will be
training over the old prompt.
class IterableDatasetShard(IterableDataset):
    """
    Wraps a PyTorch `IterableDataset` to generate samples for one of the processes only. Instances of this class will
    always yield a number of samples that is a round multiple of the actual batch size (depending of the value of
    `split_batches`, this is either `batch_size` or `batch_size x num_processes`). Depending on the value of the
    `drop_last` attribute of the batch sampler passed, it will either stop the iteration at the first batch that would
    be too small or loop with indices from the beginning.

    Args:
        dataset (`torch.utils.data.dataset.IterableDataset`):
            The batch sampler to split in several shards.
        batch_size (`int`, *optional*, defaults to 1):
            The size of the batches per shard (if `split_batches=False`) or the size of the batches (if
            `split_batches=True`).
        drop_last (`bool`, *optional*, defaults to `False`):
            Whether or not to drop the last incomplete batch or complete the last batches by using the samples from
            the beginning.
        num_processes (`int`, *optional*, defaults to 1):
            The number of processes running concurrently.
        process_index (`int`, *optional*, defaults to 0):
            The index of the current process.
        split_batches (`bool`, *optional*, defaults to `False`):
            Whether the shards should be created by splitting a batch to give a piece of it on each process, or by
            yielding different full batches on each process.

            On two processes with an iterable dataset yielding of `[0, 1, 2, 3, 4, 5, 6, 7]`, this will result in:

            - the shard on process 0 to yield `[0, 1, 2, 3]` and the shard on process 1 to yield `[4, 5, 6, 7]` if
              this argument is set to `False`.
            - the shard on process 0 to yield `[0, 1, 4, 5]` and the sampler on process 1 to yield `[2, 3, 6, 7]` if
              this argument is set to `True`.
    """

    def __init__(
        self,
        dataset: IterableDataset,
        batch_size: int = 1,
        drop_last: bool = False,
        num_processes: int = 1,
        process_index: int = 0,
        split_batches: bool = False,
    ):
        if split_batches and batch_size > 1 and batch_size % num_processes != 0:
            raise ValueError(
                f"To use `IterableDatasetShard` in `split_batches` mode, the batch size ({batch_size}) "
                f"needs to be a round multiple of the number of processes ({num_processes})."
            )
        self.dataset = dataset
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.num_processes = num_processes
        self.process_index = process_index
        self.split_batches = split_batches

    def set_epoch(self, epoch):
        self.epoch = epoch
        if hasattr(self.dataset, "set_epoch"):
            self.dataset.set_epoch(epoch)

    def __len__(self):
        # We will just raise the downstream error if the underlying dataset is not sized
        if self.drop_last:
            return (len(self.dataset) // (self.batch_size * self.num_processes)) * self.batch_size
        else:
            return math.ceil(len(self.dataset) / (self.batch_size * self.num_processes)) * self.batch_size

    def __iter__(self):
        if (
            not hasattr(self.dataset, "set_epoch")
            and hasattr(self.dataset, "generator")
            and isinstance(self.dataset.generator, torch.Generator)
        ):
            self.dataset.generator.manual_seed(self.epoch)
        real_batch_size = self.batch_size if self.split_batches else (self.batch_size * self.num_processes)
        process_batch_size = (self.batch_size // self.num_processes) if self.split_batches else self.batch_size
        process_slice = range(self.process_index * process_batch_size, (self.process_index + 1) * process_batch_size)

        first_batch = None
        current_batch = []
        for element in self.dataset:
            current_batch.append(element)
            # Wait to have a full batch before yielding elements.
            if len(current_batch) == real_batch_size:
                for i in process_slice:
                    yield current_batch[i]
                if first_batch is None:
                    first_batch = current_batch.copy()
                current_batch = []

        # Finished if drop_last is True, otherwise complete the last batch with elements from the beginning.
        if not self.drop_last and len(current_batch) > 0:
            if first_batch is None:
                first_batch = current_batch.copy()
            while len(current_batch) < real_batch_size:
                current_batch += first_batch
            for i in process_slice:
                yield current_batch[i]
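A short usage sketch, assuming the `IterableDatasetShard` class above is in scope together with its `math` and `torch` imports; it reproduces the sharding behaviour described in the docstring on a toy iterable dataset:

```python
from torch.utils.data import IterableDataset


class RangeDataset(IterableDataset):
    """Toy iterable dataset yielding 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        yield from range(self.n)


# Two simulated processes, batch_size=4, split_batches=False.
for process_index in range(2):
    shard = IterableDatasetShard(
        RangeDataset(8),
        batch_size=4,
        num_processes=2,
        process_index=process_index,
    )
    print(process_index, list(shard))
# Per the docstring, process 0 should yield [0, 1, 2, 3] and process 1 should yield [4, 5, 6, 7].
```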
class PetsDataset(Dataset):
    def __init__(self, file_names, image_transform=None, label_to_id=None):
        self.file_names = file_names
        self.image_transform = image_transform
        self.label_to_id = label_to_id

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        fname = self.file_names[idx]
        raw_image = PIL.Image.open(fname)
        image = raw_image.convert("RGB")
        if self.image_transform is not None:
            image = self.image_transform(image)
        label = extract_label(fname)
        if self.label_to_id is not None:
            label = self.label_to_id[label]
        return {"image": image, "label": label}
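A small usage sketch, assuming the `PetsDataset` class above is in scope with its imports; the file paths, label mapping, and the `extract_label` helper below are stand-ins for whatever your data actually uses:

```python
from pathlib import Path

from torch.utils.data import DataLoader
from torchvision import transforms


def extract_label(fname):
    # Stand-in: derive the label from the file name, e.g. "images/cat_001.jpg" -> "cat".
    return Path(fname).name.split("_")[0]


file_names = ["images/cat_001.jpg", "images/dog_042.jpg"]  # placeholder paths
label_to_id = {"cat": 0, "dog": 1}

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = PetsDataset(file_names, image_transform=image_transform, label_to_id=label_to_id)
loader = DataLoader(dataset, batch_size=2)

for batch in loader:
    print(batch["image"].shape, batch["label"])
```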
---
title: Pre-training
description: Data format for a pre-training completion task.
order: 1
---
For pretraining, there is no prompt template or roles. The only required field is `text`:
```{.json filename="data.jsonl"}
{"text": "first row"}
{"text": "second row"}
...
```
:::{.callout-note}
Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
pretraining_dataset: # hf path only
...
:::
See also the FAQs and the debugging guide.
If you encounter a 'CUDA out of memory' error, it means your GPU ran out of memory during the training process. Here's how to resolve it.

Please reduce any of the following:

- `micro_batch_size`
- `eval_batch_size`
- `gradient_accumulation_steps`
- `sequence_len`

If that does not help, try running without DeepSpeed and without Accelerate (replace "accelerate launch" with "python") in the command.

Using `adamw_bnb_8bit` might also save you some memory.
`failed (exitcode: -9)`

This usually means your system has run out of system memory. Similarly, you should consider reducing the same settings as when you run out of VRAM. Additionally, look into upgrading your system RAM, which should be simpler than GPU upgrades.
`RuntimeError: expected scalar type Float but found Half`

Try setting `fp16: true`.
`NotImplementedError: No operator found for memory_efficient_attention_forward ...`

Try turning off xformers.
accelerate config missing
It's safe to ignore it.
NCCL Timeouts during training
See the NCCL guide.
For many formats, Axolotl constructs prompts by concatenating token ids after tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
If you decode a prompt constructed by axolotl, you might see spaces between tokens (or a lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always run

python -m axolotl.cli.preprocess your_config.yml --debug

and then decode the first few rows with your model's tokenizer. Misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See this blog post for a concrete example.
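A rough sketch of that check, assuming a single prepared dataset exists under `last_run_prepared/` and that your tokenizer is the one configured in your YAML (the model name here is a placeholder):

```python
from pathlib import Path

from datasets import load_from_disk
from transformers import AutoTokenizer

# Placeholder paths/model; use the tokenizer configured in your config file.
prepared_dir = next(Path("last_run_prepared").iterdir())
dataset = load_from_disk(str(prepared_dir))
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Decode the first row and eyeball the spacing around delimiters and special tokens.
row = dataset[0]
print(tokenizer.decode(row["input_ids"]))
```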
---
title: Dataset Formats
description: Supported dataset formats.
listing:
fields: [title, description]
type: table
sort-ui: false
filter-ui: false
max-description-length: 250
---
Axolotl supports a variety of dataset formats. It is recommended to use a JSONL format. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
Below are these various formats organized by task:
---
title: Custom Pre-Tokenized Dataset
description: How to use a custom pre-tokenized dataset.
order: 5
---
- Pass an empty `type:` in your axolotl config.
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
- To indicate that a token should be ignored during training, set its corresponding label to `-100`.
- Do not add BOS/EOS. Axolotl will add them for you based on the default tokenizer for the model you're using.
- For pretraining, do not truncate/pad documents to the context window length.
- For instruction training, documents must be truncated/padded as desired.
Sample config:
```{.yaml filename="config.yml"}
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
    type:
```
Sample jsonl:
{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}
{"input_ids":[87,227,8383,12],"attention_mask":[1,1,1,1],"labels":[87,227,8383,12]}
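A hedged sketch of producing such a file with a Hugging Face tokenizer, masking the prompt portion with `-100`; the model name, field names, and sample data are placeholders, not part of axolotl itself:

```python
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model

samples = [{"prompt": "What is 2+2?", "response": " 4"}]  # placeholder data

with open("file.jsonl", "w") as f:
    for sample in samples:
        # Do not add BOS/EOS here; per the notes above, axolotl adds them for you.
        prompt_ids = tokenizer(sample["prompt"], add_special_tokens=False)["input_ids"]
        response_ids = tokenizer(sample["response"], add_special_tokens=False)["input_ids"]
        input_ids = prompt_ids + response_ids
        row = {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            # -100 masks the prompt tokens so only the response contributes to the loss.
            "labels": [-100] * len(prompt_ids) + response_ids,
        }
        f.write(json.dumps(row) + "\n")
```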
---
title: Instruction Tuning
description: Instruction tuning formats for supervised fine-tuning.
order: 2
---
## alpaca
instruction; input(optional)
```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "output": "..."}
```
question and answer
{"question": "...", "category": "...", "answer": "..."}
instruction
{"INSTRUCTION": "...", "RESPONSE": "..."}
instruction; input(optional)
{"instruction": "...", "input": "...", "response": "..."}
instruction with reflect; input(optional)
{"instruction": "...", "input": "...", "output": "...", "reflection": "...", "corrected": "..."}
question, choices, (solution OR explanation)
{"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
question, choices, (solution OR explanation)
{"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
article and summary
{"article": "...", "summary": "..."}
basic instruct for alpaca chat
{"instruction": "...", "input": "...", "response": "..."}
question and answer for alpaca chat
{"question": "...", "answer": "..."}
question and answer for alpaca chat, for concise answers
{"instruction": "...", "input": "...", "response": "..."}
question and answer for alpaca chat, for load_camel_ai
{"message_1": "...", "message_2": "..."}
support for open orca datasets with included system prompts, instruct
{"system_prompt": "...", "question": "...", "response": "..."}
in context question answering from an article
{"article": "...", "question": "...", "answer": "..."}
in context question answering (alternate)
{"context": "...", "question": "...", "answer": "..."}
in context question answering from an article, with default response for no answer from context
{"article": "...", "unanswerable_question": "..."}
instruction and revision
{"instruction": "...", "revision": "..."}
critique
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "..."}
critique and revise
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
instruction, adds additional eos tokens
{"prompt": "...", "generation": "..."}
For a dataset that is preprocessed for instruction purposes:
{"input": "...", "output": "..."}
You can use this example in your YAML config:
datasets:
  - path: repo
    type:
      system_prompt: ""
      field_system: system
      field_instruction: input
      field_output: output
      format: "[INST] {instruction} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
See the full config options here.
---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 3
---
## sharegpt
conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```
Note: `type: sharegpt` opens special configs:

- `conversation`: enables conversions to many Conversation types. Refer to the 'name' here for options.
- `roles`: allows you to specify the roles for input and output. This is useful for datasets with custom roles such as `tool` etc. to support masking.
- `field_human`: specify the key to use instead of `human` in the conversation.
- `field_model`: specify the key to use instead of `gpt` in the conversation.

datasets:
  - path: ...
    type: sharegpt
    conversation:  # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
    field_human:  # Optional[str]. Human key to use for conversation.
    field_model:  # Optional[str]. Assistant key to use for conversation.
    # Add additional keys from your dataset as input or output roles
    roles:
      input:  # Optional[List[str]]. These will be masked based on train_on_input
      output:  # Optional[List[str]].
conversations where `role` is used instead of `from`

{"conversations": [{"role": "...", "value": "..."}]}

conversations where `from` is `prompter`/`assistant` instead of the default sharegpt roles

{"conversations": [{"from": "...", "value": "..."}]}

creates a chat where the bot is asked to tell a joke, then explain why the joke is funny

{"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
Run
accelerate launch -m axolotl.cli.train your_config.yml
[!TIP] You can also reference a config file that is hosted on a public URL, for example
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.

- Set `dataset_prepared_path:` to a local folder for saving and loading the pre-tokenized dataset.
- Set `push_dataset_to_hub: hf_user/repo` to push it to Huggingface.
- Use `--debug` to see preprocessed examples.

python -m axolotl.cli.preprocess your_config.yml
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
deepspeed: deepspeed_configs/zero1.json
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you log in to wandb with `wandb login`.
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
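Under the hood this maps onto the standard Transformers tokenizer calls; a rough sketch of the equivalent manual steps (the model name is a placeholder, and this is not axolotl's exact code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the special tokens and the extra delimiter tokens.
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])

# The embedding matrix must grow to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
```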