The `datasets.exceptions.DatasetGenerationError` typically occurs when there's an issue with generating a dataset using the Hugging Face `datasets` library. This error can happen for several reasons, including but not limited to:
- **Incorrect Dataset Name or Configuration**: If you're trying to load a dataset from the Hugging Face Datasets Hub and you've specified an incorrect dataset name or configuration name, the library won't be able to find and load the dataset, resulting in this error.
- **Custom Dataset Script Errors**: If you're using a custom dataset loading script (a Python script that defines how to download, process, and load a dataset), any errors in this script can lead to a `DatasetGenerationError`. Common issues include syntax errors, logic errors, or issues with how the data is processed and formatted.
- **Data Access Issues**: If the dataset script is trying to access data that is not available or has restricted access (e.g., due to permissions issues or broken URLs), the dataset generation process can fail.
- **Missing Dependencies**: Some datasets require additional Python libraries to process the data (e.g., image processing libraries for datasets containing images). If these dependencies are not installed, the dataset generation can fail.
- **Resource Limitations**: Generating some datasets requires significant computational resources (e.g., memory). If your environment does not have enough resources, the dataset generation process might fail.
To resolve this error, you can try the following steps:
1. **Verify Dataset Name/Configuration**: Ensure that the dataset name and configuration (if applicable) you're using to load the dataset are correct. You can check the available datasets and their configurations on the Hugging Face Datasets Hub.
2. **Check the Dataset Script**: If you're using a custom dataset script, carefully review the script for any errors or issues. Running the script outside of the dataset loading context (e.g., as a standalone Python script) might help identify the problem.
3. **Install Dependencies**: Make sure all required dependencies for processing the dataset are installed in your environment.
4. **Check Data Availability**: Ensure that all data sources the dataset script tries to access are available and that there are no permissions issues.
5. **Increase Resources**: If possible, try running the dataset generation process in an environment with more computational resources (e.g., more memory).
If you're still encountering issues, providing more specific details about the dataset you're trying to load and the exact code you're using can help diagnose the problem more accurately.
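As a minimal sketch of the first two steps, you can wrap the call to `load_dataset` and surface the underlying cause of the error. The dataset name and configuration below are placeholders, not a real example:

```python
from datasets import load_dataset
from datasets.exceptions import DatasetGenerationError

try:
    # Replace with the dataset name/configuration you are actually loading.
    dataset = load_dataset("some_user/some_dataset", "default")
except DatasetGenerationError as err:
    # The chained exception usually points at the real problem
    # (broken URL, missing dependency, bug in the dataset script, ...).
    print("Dataset generation failed:", err)
    print("Underlying cause:", err.__cause__)
    raise
```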
# docstyle-ignore
DATASETS_IMPORT_ERROR = """
{0} requires the 🤗 Datasets library but it was not found in your environment. You can install it with:

pip install datasets

In a notebook or a colab, you can install it by executing a cell with

!pip install datasets

then restarting your kernel.

Note that if you have a local folder named `datasets` or a local python file named `datasets.py` in your current
working directory, python may try to import this instead of the 🤗 Datasets library. You should rename this folder or
that python file if that's the case. Please note that you may need to restart your runtime after installation.
"""
"""Module containing Dataset functionality"""
def _check_dataset_format(self, with_index: bool):
    if not isinstance(self.dataset, Dataset):
        raise ValueError(f"Dataset should be a datasets.Dataset object, but got {type(self.dataset)}")
    if len({"title", "text", "embeddings"} - set(self.dataset.column_names)) > 0:
        raise ValueError(
            "Dataset should be a dataset with the following columns: "
            "title (str), text (str) and embeddings (arrays of dimension vector_size), "
            f"but got columns {self.dataset.column_names}"
        )
    if with_index and "embeddings" not in self.dataset.list_indexes():
        raise ValueError(
            "Missing faiss index in the dataset. Make sure you called `dataset.add_faiss_index` to compute it "
            "or `dataset.load_faiss_index` to load one from the disk."
        )
def get_dataset(self, type_path) -> Seq2SeqDataset:
    n_obs = self.n_obs[type_path]
    max_target_length = self.target_lens[type_path]
    dataset = Seq2SeqDataset(
        self.tokenizer,
        type_path=type_path,
        n_obs=n_obs,
        max_target_length=max_target_length,
        **self.dataset_kwargs,
    )
    return dataset
def test_small_chat_model_with_dataset_pt(self):
    from torch.utils.data import Dataset

    from transformers.pipelines.pt_utils import KeyDataset

    class MyDataset(Dataset):
        data = [
            [
                {"role": "system", "content": "This is a system message."},
                {"role": "user", "content": "This is a test"},
                {"role": "assistant", "content": "This is a reply"},
            ],
        ]

        def __len__(self):
            return 1

        def __getitem__(self, i):
            return {"text": self.data[i]}

    text_generator = pipeline(
        task="text-generation", model="rocketknight1/tiny-gpt2-with-chatml-template", framework="pt"
    )

    dataset = MyDataset()
    key_dataset = KeyDataset(dataset, "text")

    for outputs in text_generator(key_dataset, do_sample=False, max_new_tokens=10):
        expected_chat = dataset.data[0] + [
            {
                "role": "assistant",
                "content": " factors factors factors factors factors factors factors factors factors factors",
            }
        ]
        self.assertEqual(
            outputs,
            [
                {"generated_text": expected_chat},
            ],
        )
def train_dataset(self, dataset): self._train_dataset = dataset
def get_dataset(self):
    self.dataset = self.dataset.apply(tf.data.experimental.assert_cardinality(len(self.features)))
    return self.dataset
def get_dataset():
    data_file = str(self.tests_dir / "fixtures/tests_samples/SQUAD/sample.json")
    data_files = {"train": data_file, "validation": data_file}
    raw_datasets = datasets.load_dataset("json", data_files=data_files, field="data")
    train_dataset = raw_datasets["train"].map(_add_eos_to_examples).map(_convert_to_features, batched=True)
    valid_dataset = deepcopy(train_dataset)
    return train_dataset, valid_dataset
def __init__(self, p_stop=0.01, max_length=1000):
    self.p_stop = p_stop
    self.max_length = max_length
    self.generator = torch.Generator()
def get_dataset(self): return self.dataset
def make_dataset(args, tokenizer, accelerator, split="train"):
    # Get the datasets: you can either provide your own training and evaluation files (see below)
    # or specify a Dataset from the hub (the dataset will be downloaded automatically from the datasets Hub).

    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
    # download the dataset.
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
        )
    else:
        if args.train_data_dir is not None:
            dataset = load_dataset(
                args.train_data_dir,
                cache_dir=args.cache_dir,
            )
        # See more about loading custom images at
        # https://huggingface.co/docs/datasets/v2.0.0/en/dataset_script

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.
    column_names = dataset[split].column_names

    # Get the column names for input/target.
    if args.image_column is None:
        image_column = column_names[0]
    else:
        image_column = args.image_column
        if image_column not in column_names:
            raise ValueError(
                f"`--image_column` value '{args.image_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
            )

    if args.caption_column is None:
        caption_column = column_names[1]
    else:
        caption_column = args.caption_column
        if caption_column not in column_names:
            raise ValueError(
                f"`--caption_column` value '{args.caption_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
            )

    if args.conditioning_image_column is None:
        conditioning_image_column = column_names[2]
    else:
        conditioning_image_column = args.conditioning_image_column
        if conditioning_image_column not in column_names:
            raise ValueError(
                f"`--conditioning_image_column` value '{args.conditioning_image_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
            )

    def tokenize_captions(examples, is_train=True):
        captions = []
        for caption in examples[caption_column]:
            if random.random() < args.proportion_empty_prompts:
                captions.append("")
            elif isinstance(caption, str):
                captions.append(caption)
            elif isinstance(caption, (list, np.ndarray)):
                # take a random caption if there are multiple
                captions.append(random.choice(caption) if is_train else caption[0])
            else:
                raise ValueError(
                    f"Caption column `{caption_column}` should contain either strings or lists of strings."
                )
        inputs = tokenizer(
            captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
        )
        return inputs.input_ids

    image_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ]
    )

    conditioning_image_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution),
            transforms.ToTensor(),
        ]
    )

    def preprocess_train(examples):
        images = [image.convert("RGB") for image in examples[image_column]]
        images = [image_transforms(image) for image in images]

        conditioning_images = [image.convert("RGB") for image in examples[conditioning_image_column]]
        conditioning_images = [conditioning_image_transforms(image) for image in conditioning_images]

        examples["pixel_values"] = images
        examples["conditioning_pixel_values"] = conditioning_images
        examples["input_ids"] = tokenize_captions(examples)

        return examples

    with accelerator.main_process_first():
        if args.max_train_samples is not None:
            dataset[split] = dataset[split].shuffle(seed=args.seed).select(range(args.max_train_samples))
        # Set the training transforms
        split_dataset = dataset[split].with_transform(preprocess_train)

    return split_dataset
def __init__(self, **dataset_kwargs):
    parser = argparse.ArgumentParser()
    parser = _add_data_args(parser)
    parser = _add_validation_args(parser)
    data_args = parser.parse_known_args()
    self.dataset_args = vars(data_args[0])
    self.dataset_args.update(dataset_kwargs)
    self.dataset_args["megatron_dataset_flag"] = True
def eval_dataset(self, dataset): self._eval_dataset = dataset
def check_dataset_or_pretraining_dataset(cls, data):
    if data.get("datasets") is None and data.get("pretraining_dataset") is None:
        raise ValueError("either datasets or pretraining_dataset is required")
    return data
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation"]

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
# loading dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
def __init__(
    self,
    dataset,
    device=None,
    rng_types=None,
    synchronized_generator=None,
    skip_batches=0,
    _drop_last: bool = False,
    _non_blocking: bool = False,
    **kwargs,
):
    super().__init__(dataset, **kwargs)
    self.device = device
    self.rng_types = rng_types
    self.synchronized_generator = synchronized_generator
    self.skip_batches = skip_batches
    self.gradient_state = GradientState()
    self._drop_last = _drop_last
    self._non_blocking = _non_blocking
    self.iteration = 0
def __init__(self, dataset: Mapping, prefix: str):
    self.dataset = dataset
    self.prefix = prefix
def log_validation(val_dataset, text_encoder, unet, controlnet, args, accelerator): pipeline = LightControlNetPipeline.from_pretrained( args.pretrained_model_name_or_path, controlnet=accelerator.unwrap_model(controlnet, keep_fp32_wrapper=True), unet=accelerator.unwrap_model(unet, keep_fp32_wrapper=True).model, text_encoder=accelerator.unwrap_model(text_encoder, keep_fp32_wrapper=True), safety_checker=None, revision=args.revision, ) pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) pipeline = pipeline.to(accelerator.device) pipeline.set_progress_bar_config(disable=True) generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) image_logs = [] for idx in range(args.num_validation_images): data = val_dataset[idx] validation_prompt = data["text"] validation_image = data["conditioning_pixel_values"] image = pipeline( validation_prompt, [validation_image], num_inference_steps=50, generator=generator, )[0][0] image_logs.append( { "validation_image": validation_image, "image": image, "validation_prompt": validation_prompt, } ) for tracker in accelerator.trackers: formatted_images = [] for log in image_logs: image = log["image"] validation_prompt = log["validation_prompt"] validation_image = log["validation_image"] formatted_images.append(wandb.Image(validation_image, caption="Controlnet conditioning")) image = wandb.Image(image, caption=validation_prompt) formatted_images.append(image) tracker.log({"validation": formatted_images}) del pipeline torch.cuda.empty_cache()
def __init__(  # pylint: disable=super-init-not-called
    self,
    prompt_tokenizer: PromptTokenizingStrategy,
    dataset: Dataset,
    process_count: Optional[int] = None,
    keep_in_memory: Optional[bool] = False,
    **kwargs,
):
    self.prompt_tokenizer = prompt_tokenizer
    self.process_count = process_count
    self.keep_in_memory = keep_in_memory
    super().__init__(
        self.process(dataset).data,
        **kwargs,
    )
def fix_sharegpt_datasets(cls, datasets):
    for idx, ds_cfg in enumerate(datasets):
        if not ds_cfg["type"]:
            continue
        if ds_cfg["type"] == "sharegpt:chat":
            LOG.warning(
                PendingDeprecationWarning(
                    "`type: sharegpt:chat` will soon be deprecated. simply use `type: sharegpt` instead."
                )
            )
            datasets[idx]["type"] = "sharegpt"
        if "sharegpt_simple" in ds_cfg["type"]:
            LOG.warning(
                PendingDeprecationWarning(
                    "`type: sharegpt_simple` will soon be deprecated. simply use `type: sharegpt` instead."
                )
            )
            datasets[idx]["type"] = datasets[idx]["type"].replace(
                "sharegpt_simple", "sharegpt"
            )
    return datasets
class PromptDataset(Dataset):
    "A simple dataset to prepare the prompts to generate class images on multiple GPUs."

    def __init__(self, prompt, num_samples):
        self.prompt = prompt
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        example = {}
        example["prompt"] = self.prompt
        example["index"] = index
        return example
def create_datasets(tokenizer, data_args, training_args, apply_chat_template=False):
    def preprocess(samples):
        batch = []
        for conversation in samples["messages"]:
            batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
        return {"content": batch}

    raw_datasets = DatasetDict()
    for split in data_args.splits.split(","):
        try:
            # Try first if dataset on a Hub repo
            dataset = load_dataset(data_args.dataset_name, split=split)
        except DatasetGenerationError:
            # If not, check local dataset
            dataset = load_from_disk(os.path.join(data_args.dataset_name, split))

        if "train" in split:
            raw_datasets["train"] = dataset
        elif "test" in split:
            raw_datasets["test"] = dataset
        else:
            raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if apply_chat_template:
        raw_datasets = raw_datasets.map(
            preprocess,
            batched=True,
            remove_columns=raw_datasets["train"].column_names,
        )

    train_data = raw_datasets["train"]
    valid_data = raw_datasets["test"]
    print(f"Size of the train set: {len(train_data)}. Size of the validation set: {len(valid_data)}")
    print(f"A sample of train dataset: {train_data[0]}")

    return train_data, valid_data
def __init__(self, dataset, processor):
    self.dataset = dataset
    self.processor = processor
def predict_with_generate(): eval_src, eval_pred, eval_ref = [], [], [] for batch in tqdm(eval_dataloader): batch_labels = batch["labels"].to(device) batch_input_ids = batch["input_ids"].to(device) if "position_ids" in batch: batch_pos_ids = batch["position_ids"].tolist() else: batch_pos_ids = [None] * len(batch["input_ids"]) prompt_token_ids_list = [] completion_token_ids_list = [] for input_ids_all, labels_all, pos_ids in zip( batch_input_ids, batch_labels, batch_pos_ids, ): if pos_ids is None: pos_ranges = [(0, len(input_ids_all) - 1)] else: pos_ranges = find_ranges(pos_ids) for pos_range in pos_ranges: start, end = pos_range if start == end: continue input_ids = input_ids_all[start : end + 1] labels = labels_all[start : end + 1] tokens_without_loss = labels == IGNORE_INDEX tokens_with_loss = labels != IGNORE_INDEX tokens_exclude_padding = input_ids != tokenizer.pad_token_id prompt_token_includes = ( tokens_without_loss & tokens_exclude_padding ) prompt_token_ids = input_ids[prompt_token_includes] prompt_token_ids_list.append(prompt_token_ids) completion_token_ids = input_ids[tokens_with_loss] completion_token_ids_list.append(completion_token_ids) prompt_texts = tokenizer.batch_decode( prompt_token_ids_list, skip_special_tokens=True ) completion_texts = tokenizer.batch_decode( completion_token_ids_list, skip_special_tokens=True ) with torch.no_grad(): prompt_encoding = tokenizer( prompt_texts, padding=True, return_tensors="pt" ).to(self.cfg.device) predictions = trainer.model.generate( **prompt_encoding, generation_config=generation_config ) prediction_all_tokens = predictions["sequences"].cpu().tolist() prediction_without_prompt_tokens_list = [] for prompt_token_ids, prediction_tokens in zip( prompt_token_ids_list, prediction_all_tokens ): prediction_without_prompt_tokens = prediction_tokens[ len(prompt_token_ids) : ] prediction_without_prompt_tokens_list.append( prediction_without_prompt_tokens ) predicted_texts = tokenizer.batch_decode( prediction_without_prompt_tokens_list, skip_special_tokens=True ) eval_src.extend(prompt_texts) eval_pred.extend(predicted_texts) eval_ref.extend(completion_texts) return eval_src, eval_pred, eval_ref
def __len__(self): return len(self.dataset)
def process(self, dataset):
    features = dataset.features.keys()
    num_proc = min(64, self.process_count if self.process_count else os.cpu_count())

    map_kwargs = {}
    if self.prompt_tokenizer.supports_batched:
        map_kwargs["batched"] = True
        map_kwargs["batch_size"] = 100

    return dataset.map(
        self.prompt_tokenizer.tokenize_prompt,
        num_proc=num_proc,
        remove_columns=features,
        keep_in_memory=self.keep_in_memory,
        desc="Tokenizing Prompts",
        **map_kwargs,
    )
def get_batch_func(self, accelerator, megatron_dataset_flag):
    def get_batch_megatron(data_iterator):
        """Generate a batch"""
        # Items and their type.
        keys = ["text"]
        datatype = torch.int64

        # Broadcast data.
        if data_iterator is not None:
            data = next(data_iterator)
        else:
            data = None
        data_b = tensor_parallel.broadcast_data(keys, data, datatype)

        # Unpack.
        tokens_ = data_b["text"].long()
        labels = tokens_[:, 1:].contiguous()
        tokens = tokens_[:, :-1].contiguous()

        # Get the masks and position ids.
        attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
            tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, self.eod_mask_loss
        )

        return tokens, labels, loss_mask, attention_mask, position_ids

    def get_batch_transformer(data_iterator):
        data = next(data_iterator)
        data = {"input_ids": data["input_ids"]}
        data = send_to_device(data, torch.cuda.current_device())

        tokens_ = data["input_ids"].long()
        padding = torch.zeros((tokens_.shape[0], 1), dtype=tokens_.dtype, device=tokens_.device) + self.eod_token
        tokens_ = torch.concat([tokens_, padding], dim=1)
        labels = tokens_[:, 1:].contiguous()
        tokens = tokens_[:, :-1].contiguous()

        # Get the masks and position ids.
        attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
            tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, True
        )
        return tokens, labels, loss_mask, attention_mask, position_ids

    if accelerator.state.megatron_lm_plugin.custom_get_batch_function is not None:
        return accelerator.state.megatron_lm_plugin.custom_get_batch_function
    if megatron_dataset_flag:
        try:
            # Use '--no-use-pep517 -e' to pip install nvidia's megatron from source
            from pretrain_gpt import get_batch

            return get_batch
        except ImportError:
            pass
        return get_batch_megatron
    else:
        return get_batch_transformer
def __init__(  # pylint: disable=super-init-not-called
    self,
    tokenizer,
    datasets,
    seq_length=2048,
):
    self.tokenizer = tokenizer
    self.concat_token_id = tokenizer.eos_token_id
    self.datasets: List[IterableDataset] = datasets
    self.seq_length = seq_length

    vocab_size = len(tokenizer.get_vocab())

    if vocab_size <= torch.iinfo(torch.int16).max:
        self.tokens_dtype = torch.int16
    elif vocab_size <= torch.iinfo(torch.int32).max:
        self.tokens_dtype = torch.int32
    else:
        self.tokens_dtype = torch.int64
def construct(self): # The dataset items fill = Rectangle(height=0.46,width=0.46).set_stroke(width=0) columns = [ VGroup(*[Rectangle(height=0.25,width=0.25,color="green") for i in range(8)]).arrange(RIGHT,buff=0) for j in range(4) ] dataset_recs = VGroup(*columns).arrange(UP, buff=0) dataset_text = Text("Dataset", font_size=24) dataset = Group(dataset_recs,dataset_text).arrange(DOWN, buff=0.5, aligned_edge=DOWN) dataset.move_to([-2,0,0]) self.add(dataset) code = Code( code="dataloader = DataLoader(...)\nfor batch in dataloader():\n\t...", tab_width=4, background="window", language="Python", font="Monospace", font_size=14, corner_radius=.2, insert_line_no=False, line_spacing=.75, style=Code.styles_list[1], ) code.move_to([-3.5, 2.5, 0]) self.add(code) # The dataloader itself dataloader = Group( Rectangle(color="red", height=2, width=2), Text("DataLoader", font_size=24) ).arrange(DOWN, buff=.5, aligned_edge=DOWN) sampler = Group( Rectangle(color="blue", height=1, width=1), Text("Sampler", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN) dataloader.move_to([1, 0, 0]) sampler.move_to([.75,.25,0]) self.add(dataloader) self.add(sampler) gpu_1 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 1", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, 2, 0]) gpu_2 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 2", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, .5, 0]) gpu_3 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 3", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, -1, 0]) gpu_4 = Group( Rectangle(color="white", height=1, width=1), Text("GPU 4", font_size=12) ).arrange(DOWN, buff=.25, aligned_edge=DOWN).move_to([4, -2.5, 0]) gpus = [gpu_1[0], gpu_2[0], gpu_3[0], gpu_4[0]] self.add(gpu_1, gpu_2, gpu_3, gpu_4) # Animate their existence self.play( Create(gpu_1[0], run_time=0.5), Create(gpu_2[0], run_time=0.5), Create(gpu_3[0], run_time=0.5), Create(gpu_4[0], run_time=0.5), Create(dataset_recs, run_time=1), Create(sampler[0], run_time=1), Create(dataloader[0], run_time=1) ) step_1 = MarkupText( f"Without any special care, \nthe same data is sent though each sampler, \nand the same samples are spit out on each GPU", font_size=18 ) step_1.move_to([0, -2.5, 0]) self.play( Write(step_1, run_time=4), ) first_animations = [] second_animations = [] colors = ["BLUE_E", "DARK_BROWN", "GOLD_E", "GRAY_A"] current_color = colors[0] buff = 0 lr_buff = .25 old_target = None new_datasets = [] for i,data in enumerate(dataset_recs[-1]): if i % 2 == 0: # current_color = colors[i//2] current_color = "BLUE_E" dataset_target = Rectangle(height=0.46/2,width=0.46/2).set_stroke(width=0.).set_fill(current_color, opacity=0.7) dataset_target.move_to(data) dataset_target.generate_target() aligned_edge = ORIGIN if i % 2 == 0: old_target = dataset_target.target buff -= .25 aligned_edge = LEFT dataset_target.target.next_to( sampler, buff=buff, direction=UP, aligned_edge=LEFT ) else: dataset_target.target.next_to( old_target, direction=RIGHT, buff=0.01, ) new_datasets.append(dataset_target) first_animations.append(data.animate(run_time=0.5).set_stroke(current_color)) second_animations.append(MoveToTarget(dataset_target, run_time=1.5)) self.play(*first_animations) self.play(*second_animations) self.wait() move_animation = [] for j,gpu in enumerate(gpus): buff = 0 for i,data in enumerate(new_datasets): if i % 2 == 0: current_color = colors[i//2] if j != 3: data = data.copy() data.generate_target() aligned_edge 
= ORIGIN if i % 2 == 0: old_target = data.target buff -= .25 aligned_edge = LEFT data.target.next_to( gpu, buff=buff, direction=UP, aligned_edge=LEFT ) else: data.target.next_to( old_target, direction=RIGHT, buff=0.01, ) move_animation.append(MoveToTarget(data, run_time=1.5)) self.play(*move_animation) self.remove(step_1) step_2 = MarkupText( f"This behavior is undesireable, because we want\neach GPU to see different data for efficient training.", font_size=18 ) step_2.move_to([0, -2.5, 0]) self.play( Write(step_2, run_time=2.5), ) self.wait()
def __init__( self, instance_data_root, instance_prompt, tokenizer, class_data_root=None, class_prompt=None, size=512, center_crop=False, ): self.size = size self.center_crop = center_crop self.tokenizer = tokenizer self.instance_data_root = Path(instance_data_root) if not self.instance_data_root.exists(): raise ValueError("Instance images root doesn't exists.") self.instance_images_path = list(Path(instance_data_root).iterdir()) self.num_instance_images = len(self.instance_images_path) self.instance_prompt = instance_prompt self._length = self.num_instance_images if class_data_root is not None: self.class_data_root = Path(class_data_root) self.class_data_root.mkdir(parents=True, exist_ok=True) self.class_images_path = list(self.class_data_root.iterdir()) self.num_class_images = len(self.class_images_path) self._length = max(self.num_class_images, self.num_instance_images) self.class_prompt = class_prompt else: self.class_data_root = None self.image_transforms = transforms.Compose( [ transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR), transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size), transforms.ToTensor(), transforms.Normalize([0.5], [0.5]), ] )
def train_dataset(self): return self._train_dataset
---
title: Dataset Preprocessing
description: How datasets are processed
---
Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the [dataset format](../dataset-formats/) and prompt strategies to:
- parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
The processing of the datasets can happen one of two ways:
1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
2. When training is started
What are the benefits of pre-processing? When training interactively or for sweeps
(e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
training parameters so that it will intelligently pull from its cache when possible.
The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
data is in the cache.
What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
prompt template. Because the trainer cannot readily detect these changes, we cannot change the
calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
and change your prompt templating logic, it may not pick up the changes you made and you will be
training over the old prompt.
class IterableDatasetShard(IterableDataset):
    """
    Wraps a PyTorch `IterableDataset` to generate samples for one of the processes only. Instances of this class will
    always yield a number of samples that is a round multiple of the actual batch size (depending of the value of
    `split_batches`, this is either `batch_size` or `batch_size x num_processes`). Depending on the value of the
    `drop_last` attribute of the batch sampler passed, it will either stop the iteration at the first batch that would
    be too small or loop with indices from the beginning.

    Args:
        dataset (`torch.utils.data.dataset.IterableDataset`):
            The batch sampler to split in several shards.
        batch_size (`int`, *optional*, defaults to 1):
            The size of the batches per shard (if `split_batches=False`) or the size of the batches (if
            `split_batches=True`).
        drop_last (`bool`, *optional*, defaults to `False`):
            Whether or not to drop the last incomplete batch or complete the last batches by using the samples from
            the beginning.
        num_processes (`int`, *optional*, defaults to 1):
            The number of processes running concurrently.
        process_index (`int`, *optional*, defaults to 0):
            The index of the current process.
        split_batches (`bool`, *optional*, defaults to `False`):
            Whether the shards should be created by splitting a batch to give a piece of it on each process, or by
            yielding different full batches on each process.

            On two processes with an iterable dataset yielding of `[0, 1, 2, 3, 4, 5, 6, 7]`, this will result in:

            - the shard on process 0 to yield `[0, 1, 2, 3]` and the shard on process 1 to yield `[4, 5, 6, 7]` if
              this argument is set to `False`.
            - the shard on process 0 to yield `[0, 1, 4, 5]` and the sampler on process 1 to yield `[2, 3, 6, 7]` if
              this argument is set to `True`.
    """

    def __init__(
        self,
        dataset: IterableDataset,
        batch_size: int = 1,
        drop_last: bool = False,
        num_processes: int = 1,
        process_index: int = 0,
        split_batches: bool = False,
    ):
        if split_batches and batch_size > 1 and batch_size % num_processes != 0:
            raise ValueError(
                f"To use `IterableDatasetShard` in `split_batches` mode, the batch size ({batch_size}) "
                f"needs to be a round multiple of the number of processes ({num_processes})."
            )
        self.dataset = dataset
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.num_processes = num_processes
        self.process_index = process_index
        self.split_batches = split_batches

    def set_epoch(self, epoch):
        self.epoch = epoch
        if hasattr(self.dataset, "set_epoch"):
            self.dataset.set_epoch(epoch)

    def __len__(self):
        # We will just raise the downstream error if the underlying dataset is not sized
        if self.drop_last:
            return (len(self.dataset) // (self.batch_size * self.num_processes)) * self.batch_size
        else:
            return math.ceil(len(self.dataset) / (self.batch_size * self.num_processes)) * self.batch_size

    def __iter__(self):
        if (
            not hasattr(self.dataset, "set_epoch")
            and hasattr(self.dataset, "generator")
            and isinstance(self.dataset.generator, torch.Generator)
        ):
            self.dataset.generator.manual_seed(self.epoch)
        real_batch_size = self.batch_size if self.split_batches else (self.batch_size * self.num_processes)
        process_batch_size = (self.batch_size // self.num_processes) if self.split_batches else self.batch_size
        process_slice = range(self.process_index * process_batch_size, (self.process_index + 1) * process_batch_size)

        first_batch = None
        current_batch = []
        for element in self.dataset:
            current_batch.append(element)
            # Wait to have a full batch before yielding elements.
            if len(current_batch) == real_batch_size:
                for i in process_slice:
                    yield current_batch[i]
                if first_batch is None:
                    first_batch = current_batch.copy()
                current_batch = []

        # Finished if drop_last is True, otherwise complete the last batch with elements from the beginning.
        if not self.drop_last and len(current_batch) > 0:
            if first_batch is None:
                first_batch = current_batch.copy()
            while len(current_batch) < real_batch_size:
                current_batch += first_batch
            for i in process_slice:
                yield current_batch[i]
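A short usage sketch, assuming the `IterableDatasetShard` class above is in scope together with its `math` and `torch` imports; it reproduces the sharding behaviour described in the docstring on a toy iterable dataset:

```python
from torch.utils.data import IterableDataset


class RangeDataset(IterableDataset):
    """Toy iterable dataset yielding 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        yield from range(self.n)


# Two simulated processes, batch_size=4, split_batches=False.
for process_index in range(2):
    shard = IterableDatasetShard(
        RangeDataset(8),
        batch_size=4,
        num_processes=2,
        process_index=process_index,
    )
    print(process_index, list(shard))
# Per the docstring, process 0 should yield [0, 1, 2, 3] and process 1 should yield [4, 5, 6, 7].
```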
class PetsDataset(Dataset):
    def __init__(self, file_names, image_transform=None, label_to_id=None):
        self.file_names = file_names
        self.image_transform = image_transform
        self.label_to_id = label_to_id

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        fname = self.file_names[idx]
        raw_image = PIL.Image.open(fname)
        image = raw_image.convert("RGB")
        if self.image_transform is not None:
            image = self.image_transform(image)
        label = extract_label(fname)
        if self.label_to_id is not None:
            label = self.label_to_id[label]
        return {"image": image, "label": label}
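A small usage sketch, assuming the `PetsDataset` class above is in scope with its imports; the file paths, label mapping, and the `extract_label` helper below are stand-ins for whatever your data actually uses:

```python
from pathlib import Path

from torch.utils.data import DataLoader
from torchvision import transforms


def extract_label(fname):
    # Stand-in: derive the label from the file name, e.g. "images/cat_001.jpg" -> "cat".
    return Path(fname).name.split("_")[0]


file_names = ["images/cat_001.jpg", "images/dog_042.jpg"]  # placeholder paths
label_to_id = {"cat": 0, "dog": 1}

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = PetsDataset(file_names, image_transform=image_transform, label_to_id=label_to_id)
loader = DataLoader(dataset, batch_size=2)

for batch in loader:
    print(batch["image"].shape, batch["label"])
```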
---
title: Pre-training
description: Data format for a pre-training completion task.
order: 1
---
For pretraining, there is no prompt template or roles. The only required field is `text`:
```{.json filename="data.jsonl"}
{"text": "first row"}
{"text": "second row"}
...
```
:::{.callout-note}
Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
pretraining_dataset: # hf path only
...
:::
See also the FAQs and the debugging guide.
If you encounter a 'CUDA out of memory' error, it means your GPU ran out of memory during the training process. Here's how to resolve it.

Please reduce any of the following:

- `micro_batch_size`
- `eval_batch_size`
- `gradient_accumulation_steps`
- `sequence_len`

If that does not help, try running without DeepSpeed and without Accelerate (replace "accelerate launch" with "python") in the command.

Using `adamw_bnb_8bit` might also save you some memory.
`failed (exitcode: -9)`

This usually means your system has run out of system memory. Similarly, you should consider reducing the same settings as when you run out of VRAM. Additionally, look into upgrading your system RAM, which should be simpler than GPU upgrades.
`RuntimeError: expected scalar type Float but found Half`

Try setting `fp16: true`.
`NotImplementedError: No operator found for memory_efficient_attention_forward ...`

Try turning off xformers.
accelerate config missing
It's safe to ignore it.
NCCL Timeouts during training
See the NCCL guide.
For many formats, Axolotl constructs prompts by concatenating token ids after tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
If you decode a prompt constructed by axolotl, you might see spaces between tokens (or a lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always run

python -m axolotl.cli.preprocess your_config.yml --debug

and then decode the first few rows with your model's tokenizer. Misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See this blog post for a concrete example.
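A rough sketch of that check, assuming a single prepared dataset exists under `last_run_prepared/` and that your tokenizer is the one configured in your YAML (the model name here is a placeholder):

```python
from pathlib import Path

from datasets import load_from_disk
from transformers import AutoTokenizer

# Placeholder paths/model; use the tokenizer configured in your config file.
prepared_dir = next(Path("last_run_prepared").iterdir())
dataset = load_from_disk(str(prepared_dir))
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Decode the first row and eyeball the spacing around delimiters and special tokens.
row = dataset[0]
print(tokenizer.decode(row["input_ids"]))
```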
---
title: Dataset Formats
description: Supported dataset formats.
listing:
fields: [title, description]
type: table
sort-ui: false
filter-ui: false
max-description-length: 250
---
Axolotl supports a variety of dataset formats. It is recommended to use a JSONL format. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
Below are these various formats organized by task:
---
title: Custom Pre-Tokenized Dataset
description: How to use a custom pre-tokenized dataset.
order: 5
---
- Pass an empty `type:` in your axolotl config.
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
- To indicate that a token should be ignored during training, set its corresponding label to `-100`.
- Do not add BOS/EOS. Axolotl will add them for you based on the default tokenizer for the model you're using.
- For pretraining, do not truncate/pad documents to the context window length.
- For instruction training, documents must be truncated/padded as desired.
Sample config:
```{.yaml filename="config.yml"}
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
    type:
```
Sample jsonl:
{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}
{"input_ids":[87,227,8383,12],"attention_mask":[1,1,1,1],"labels":[87,227,8383,12]}
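A hedged sketch of producing such a file with a Hugging Face tokenizer, masking the prompt portion with `-100`; the model name, field names, and sample data are placeholders, not part of axolotl itself:

```python
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model

samples = [{"prompt": "What is 2+2?", "response": " 4"}]  # placeholder data

with open("file.jsonl", "w") as f:
    for sample in samples:
        # Do not add BOS/EOS here; per the notes above, axolotl adds them for you.
        prompt_ids = tokenizer(sample["prompt"], add_special_tokens=False)["input_ids"]
        response_ids = tokenizer(sample["response"], add_special_tokens=False)["input_ids"]
        input_ids = prompt_ids + response_ids
        row = {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            # -100 masks the prompt tokens so only the response contributes to the loss.
            "labels": [-100] * len(prompt_ids) + response_ids,
        }
        f.write(json.dumps(row) + "\n")
```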
---
title: Instruction Tuning
description: Instruction tuning formats for supervised fine-tuning.
order: 2
---
## alpaca
instruction; input(optional)
```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "output": "..."}
```
question and answer
{"question": "...", "category": "...", "answer": "..."}
instruction
{"INSTRUCTION": "...", "RESPONSE": "..."}
instruction; input(optional)
{"instruction": "...", "input": "...", "response": "..."}
instruction with reflect; input(optional)
{"instruction": "...", "input": "...", "output": "...", "reflection": "...", "corrected": "..."}
question, choices, (solution OR explanation)
{"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
question, choices, (solution OR explanation)
{"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
article and summary
{"article": "...", "summary": "..."}
basic instruct for alpaca chat
{"instruction": "...", "input": "...", "response": "..."}
question and answer for alpaca chat
{"question": "...", "answer": "..."}
question and answer for alpaca chat, for concise answers
{"instruction": "...", "input": "...", "response": "..."}
question and answer for alpaca chat, for load_camel_ai
{"message_1": "...", "message_2": "..."}
support for open orca datasets with included system prompts, instruct
{"system_prompt": "...", "question": "...", "response": "..."}
in context question answering from an article
{"article": "...", "question": "...", "answer": "..."}
in context question answering (alternate)
{"context": "...", "question": "...", "answer": "..."}
in context question answering from an article, with default response for no answer from context
{"article": "...", "unanswerable_question": "..."}
instruction and revision
{"instruction": "...", "revision": "..."}
critique
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "..."}
critique and revise
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
instruction, adds additional eos tokens
{"prompt": "...", "generation": "..."}
For a dataset that is preprocessed for instruction purposes:
{"input": "...", "output": "..."}
You can use this example in your YAML config:
datasets:
  - path: repo
    type:
      system_prompt: ""
      field_system: system
      field_instruction: input
      field_output: output
      format: "[INST] {instruction} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
See the full config options here.
---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 3
---
## sharegpt
conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```
Note: `type: sharegpt` opens special configs:

- `conversation`: enables conversions to many Conversation types. Refer to the 'name' here for options.
- `roles`: allows you to specify the roles for input and output. This is useful for datasets with custom roles such as `tool` etc. to support masking.
- `field_human`: specify the key to use instead of `human` in the conversation.
- `field_model`: specify the key to use instead of `gpt` in the conversation.

datasets:
  - path: ...
    type: sharegpt
    conversation:  # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
    field_human:  # Optional[str]. Human key to use for conversation.
    field_model:  # Optional[str]. Assistant key to use for conversation.
    # Add additional keys from your dataset as input or output roles
    roles:
      input:  # Optional[List[str]]. These will be masked based on train_on_input
      output:  # Optional[List[str]].
conversations where `role` is used instead of `from`

{"conversations": [{"role": "...", "value": "..."}]}

conversations where `from` is `prompter`/`assistant` instead of the default sharegpt roles

{"conversations": [{"from": "...", "value": "..."}]}

creates a chat where the bot is asked to tell a joke, then explain why the joke is funny

{"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
Run
accelerate launch -m axolotl.cli.train your_config.yml
[!TIP] You can also reference a config file that is hosted on a public URL, for example
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.

- Set `dataset_prepared_path:` to a local folder for saving and loading the pre-tokenized dataset.
- Set `push_dataset_to_hub: hf_user/repo` to push it to Huggingface.
- Use `--debug` to see preprocessed examples.

python -m axolotl.cli.preprocess your_config.yml
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
deepspeed: deepspeed_configs/zero1.json
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you log in to wandb with `wandb login`.
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
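Under the hood this maps onto the standard Transformers tokenizer calls; a rough sketch of the equivalent manual steps (the model name is a placeholder, and this is not axolotl's exact code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the special tokens and the extra delimiter tokens.
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])

# The embedding matrix must grow to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
```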