Phorm.ai
OpenAccess-AI-Collective/axolotl

    what is prompt_style?

    prompt_style is an attribute used in the Axolotl codebase to specify the style or format of the prompts generated for interacting with language models. It determines how prompts are structured, which can significantly affect the model's performance and the relevance of its responses. The attribute appears in classes that generate or manage prompts, so the prompt format can be tailored to the task at hand or to the desired style of model output.

    Different values of prompt_style correspond to different formatting strategies for the prompts. For example, in the Axolotl codebase, the PromptStyle enumeration defines several styles such as INSTRUCT, CHAT, and CHATML, each representing a unique way of structuring prompts:

    class PromptStyle(Enum):
        """
        Enum for prompt styles
        """

        INSTRUCT = "instruct"
        CHAT = "chat"
        CHATML = "chatml"

    These styles tailor the prompts for different types of interactions: instructional prompts, chat-like USER/ASSISTANT conversations, or ChatML-formatted conversations delimited by <|im_start|> and <|im_end|> tokens. Depending on the selected prompt_style, the generated prompts follow a specific template, which influences how the language model interprets the input and generates its responses.

    For instance, when a Prompter class instance sets its prompt_style to INSTRUCT, it might generate prompts that are more directive or instructional in nature, guiding the model to perform a specific task. Conversely, setting prompt_style to CHAT might result in prompts that mimic a conversational exchange between a user and the model.

    Here's an example of how a Prompter class might use prompt_style to generate different types of prompts:

    class ReflectAlpacaPrompter:
        def __init__(self, prompt_style="instruct"):
            self.prompt_style = prompt_style
            self.match_prompt_style()

        def match_prompt_style(self):
            if self.prompt_style == PromptStyle.INSTRUCT.value:
                # Set up prompt template for instructional style
                ...
            elif self.prompt_style == PromptStyle.CHAT.value:
                # Set up prompt template for chat style
                ...

    In this example, the ReflectAlpacaPrompter class initializes with a prompt_style and uses the match_prompt_style method to configure its prompt generation strategy based on the selected style.
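    To make the difference concrete, here is a small usage sketch (assuming, as the search results below suggest, that PromptStyle and ReflectAlpacaPrompter are importable from axolotl.prompters):

    ```python
    from axolotl.prompters import PromptStyle, ReflectAlpacaPrompter

    # Same prompter class, two different prompt styles
    instruct_prompter = ReflectAlpacaPrompter(prompt_style=PromptStyle.INSTRUCT.value)
    chat_prompter = ReflectAlpacaPrompter(prompt_style=PromptStyle.CHAT.value)

    # The instruct style produces "### Instruction:/### Response:" sections,
    # while the chat style produces "USER:/ASSISTANT:" turns (see prompters.py below)
    print(instruct_prompter.prompt_no_input)
    print(chat_prompter.prompt_no_input)
    ```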

    Search results
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    class PromptStyle(Enum):
        """
        Enum for prompt styles
        """

        INSTRUCT = "instruct"
        CHAT = "chat"
        CHATML = "chatml"
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/user_defined.py
    def match_prompt_style(self):
        self.turn_format = turn_format
        self.turn_no_input_format = turn_no_input_format
        self.system_format = system_format
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    def match_prompt_style(self): self.turn_format = "USER: {instruction}\n{input}\nASSISTANT:" self.turn_no_input_format = "USER: {instruction}\nASSISTANT:"
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/context_qa.py
    def match_prompt_style(self):  # pylint: disable=duplicate-code
        self.turn_format = "{instruction}\n{input}"
        self.turn_no_input_format = "{instruction}"
        self.system_format = "{system}"
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    def match_prompt_style(self):
        if self.prompt_style == PromptStyle.INSTRUCT.value:
            self.prompt_input = (
                self.system_prompt
                + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
            )
            self.prompt_no_input = (
                self.system_no_input_prompt
                + "### Instruction:\n{instruction}\n\n### Response:\n"
            )
            self.agent_label = "### Thought:\n{output}\n\n### Agent Reflection:\n{reflection}\n\n### Final Response:\n{corrected}"
            self.response_split = "### Final Response:"
        if self.prompt_style == PromptStyle.CHAT.value:
            self.prompt_input = (
                self.system_prompt + "USER: {instruction}\n{input}\nASSISTANT:"
            )
            self.prompt_no_input = (
                self.system_no_input_prompt + "USER: {instruction}\nASSISTANT:"
            )
            self.agent_label = (
                "\nTHOUGHT: {output}\nASSISTANT REFLECTION: {reflection}\nASSISTANT:"
            )
            self.response_split = "ASSISTANT:"
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    def match_prompt_style(self):  # pylint: disable=duplicate-code
        if self.prompt_style == PromptStyle.INSTRUCT.value:
            self.turn_format = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
            self.turn_no_input_format = (
                "### Instruction:\n{instruction}\n\n### Response:\n"
            )
            self.system_format = "{system}\n\n"
        if self.prompt_style == PromptStyle.CHAT.value:
            self.turn_format = "USER: {instruction}\n{input}\nASSISTANT:"
            self.turn_no_input_format = "USER: {instruction}\nASSISTANT:"
            self.system_format = "SYSTEM: {system}\n"
        if self.prompt_style == PromptStyle.CHATML.value:
            self.turn_format = "<|im_start|>user\n{instruction}\n{input}<|im_end|>\n<|im_start|>assistant\n"
            self.turn_no_input_format = (
                "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"
            )
            self.system_format = "<|im_start|>system\n{system}<|im_end|>\n"
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    def match_prompt_style(self): self.turn_format = "USER: Summarize the following article as a TL;DR.\n{instruction}\n{input}\nASSISTANT:" self.turn_no_input_format = "USER: Summarize the following article as a TL;DR.\n{instruction}\nASSISTANT:"
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    def __init__(self, prompt_style="instruct"):
        self.prompt_style = prompt_style
        self.match_prompt_style()
    [openaccess-ai-collective/axolotl] src/axolotl/prompters.py
    def __init__(self, prompt_style=PromptStyle.INSTRUCT.value):
        self.prompt_style = prompt_style if prompt_style else PromptStyle.INSTRUCT.value
        self.match_prompt_style()
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/alpaca_w_system.py
    def match_prompt_style(self):  # pylint: disable=duplicate-code
        if self.prompt_style == PromptStyle.INSTRUCT.value:
            self.turn_format = "### Human:\n{instruction}\n### Additional Context:\n{input}\n### Assistant:\n"
            self.turn_no_input_format = "### Human:\n{instruction}\n### Assistant:\n"
            self.system_format = "### System:\n{system}\n"
        if self.prompt_style == PromptStyle.CHAT.value:
            self.turn_format = "USER: {instruction}\n{input}\nASSISTANT:"
            self.turn_no_input_format = "USER: {instruction}\nASSISTANT:"
            self.system_format = "SYSTEM: {system}\n"
        if self.prompt_style == PromptStyle.CHATML.value:
            self.turn_format = "<|im_start|>user\n{instruction}\n{input}<|im_end|>\n<|im_start|>assistant\n"
            self.turn_no_input_format = (
                "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"
            )
            self.system_format = "<|im_start|>system\n{system}<|im_end|>\n"
    [huggingface/peft] docs/source/task_guides/prompt_based_methods.md
    <!--Copyright 2024 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/peft] docs/source/conceptual_guides/prompting.md

    Prompt tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/prompt-tuning.png"/> </div> <small>Only train and store a significantly smaller set of task-specific prompt parameters <a href="https://hf.co/papers/2104.08691">(image source)</a>.</small>

    Prompt tuning was developed for text classification tasks on T5 models, and all downstream tasks are cast as a text generation task. For example, sequence classification usually assigns a single class label to a sequence of text. By casting it as a text generation task, the tokens that make up the class label are generated. Prompts are added to the input as a series of tokens. Typically, the model parameters are fixed which means the prompt tokens are also fixed by the model parameters.

    The key idea behind prompt tuning is that prompt tokens have their own parameters that are updated independently. This means you can keep the pretrained model's parameters frozen, and only update the gradients of the prompt token embeddings. The results are comparable to the traditional method of training the entire model, and prompt tuning performance scales as model size increases.

    Take a look at Prompt tuning for causal language modeling for a step-by-step guide on how to train a model with prompt tuning.

    [huggingface/transformers] src/transformers/tools/prompts.py
    DEFAULT_PROMPTS_REPO = "huggingface-tools/default-prompts" PROMPT_FILES = {"chat": "chat_prompt_template.txt", "run": "run_prompt_template.txt"}
    [huggingface/peft] docs/source/package_reference/prompt_tuning.md
    <!--Copyright 2023 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/transformers] docs/source/en/tasks/prompting.md
    <!--Copyright 2023 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/peft] docs/source/package_reference/prompt_tuning.md

    Prompt tuning

    Prompt tuning adds task-specific prompts to the input, and these prompt parameters are updated independently of the pretrained model parameters which are frozen.

    The abstract from the paper is:

    In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.

    PromptTuningConfig

    [[autodoc]] tuners.prompt_tuning.config.PromptTuningConfig

    PromptEmbedding

    [[autodoc]] tuners.prompt_tuning.model.PromptEmbedding

    [huggingface/peft] examples/loftq_finetuning/train_gsm8k_llama.py
    def prompt_process(sent_1, sent_2, prompt_1="", prompt_2="", prompt_3=""): sent_2 = sent_2.replace("####", "The final answer is") return prompt_1 + sent_1 + prompt_2 + sent_2 + prompt_3
    [huggingface/peft] docs/source/conceptual_guides/prompting.md

    P-tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/p-tuning.png"/> </div> <small>Prompt tokens can be inserted anywhere in the input sequence, and they are optimized by a prompt encoder <a href="https://hf.co/papers/2103.10385">(image source)</a>.</small>

    P-tuning is designed for natural language understanding (NLU) tasks and all language models. It is another variation of a soft prompt method; P-tuning also adds a trainable embedding tensor that can be optimized to find better prompts, and it uses a prompt encoder (a bidirectional long-short term memory network or LSTM) to optimize the prompt parameters. Unlike prefix tuning though:

    • the prompt tokens can be inserted anywhere in the input sequence, and it isn't restricted to only the beginning
    • the prompt tokens are only added to the input instead of adding them to every layer of the model
    • introducing anchor tokens can improve performance because they indicate characteristics of a component in the input sequence

    The results suggest that P-tuning is more efficient than manually crafting prompts, and it enables GPT-like models to compete with BERT-like models on NLU tasks.

    Take a look at P-tuning for sequence classification for a step-by-step guide on how to train a model with P-tuning.

    [huggingface/transformers] docs/source/en/custom_tools.md

    Customizing the whole prompt

    To give the user maximum flexibility, the whole prompt template as explained above can be overwritten by the user. In this case, make sure that your custom prompt includes an introduction section, a tool section, an example section, and an unfinished example section. If you want to overwrite the run prompt template, you can do so as follows:

    template = """ [...] """ agent = HfAgent(your_endpoint, run_prompt_template=template)
    <Tip warning={true}>

    Please make sure to have the <<all_tools>> string and the <<prompt>> string defined somewhere in the template so that the agent is aware of the tools it has available and can correctly insert the user's prompt.

    </Tip>
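    For example, a stripped-down custom run template might look like the sketch below. The wording is placeholder text (a real template should also contain the example sections noted above); only the <<all_tools>> and <<prompt>> markers are required:

    ```python
    from transformers import HfAgent

    # Sketch of a custom run prompt template; <<all_tools>> and <<prompt>> are the
    # placeholders the agent fills with the tool descriptions and the user's task.
    template = """I will ask you to perform a task, and you should write code using these tools:
    <<all_tools>>
    Task: "<<prompt>>"
    I will use the following"""

    your_endpoint = "https://api-inference.huggingface.co/models/bigcode/starcoder"
    agent = HfAgent(your_endpoint, run_prompt_template=template)
    ```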

    Similarly, one can overwrite the chat prompt template. Note that the chat mode always uses the following format for the exchanges:

    Human: <<task>>

    Assistant:

    Therefore it is important that the examples of the custom chat prompt template also make use of this format. You can overwrite the chat template at instantiation as follows.

    template = """ [...] """ agent = HfAgent(url_endpoint=your_endpoint, chat_prompt_template=template)
    <Tip warning={true}>

    Please make sure to have the <<all_tools>> string defined somewhere in the template so that the agent is aware of the tools it has available.

    </Tip>

    In both cases, you can pass a repo ID instead of the prompt template if you would like to use a template hosted by someone in the community. The default prompts live in this repo as an example.

    To upload your custom prompt to a repo on the Hub and share it with the community, just make sure:

    • to use a dataset repository
    • to put the prompt template for the run command in a file named run_prompt_template.txt
    • to put the prompt template for the chat command in a file named chat_prompt_template.txt
    [huggingface/peft] docs/source/conceptual_guides/prompting.md

    Soft prompts

    Training large pretrained language models is very time-consuming and compute-intensive. As they continue to grow in size, there is increasing interest in more efficient training methods such as prompting. Prompting primes a frozen pretrained model for a specific downstream task by including a text prompt that describes the task or even demonstrates an example of the task. With prompting, you can avoid fully training a separate model for each downstream task, and use the same frozen pretrained model instead. This is a lot easier because you can use the same model for several different tasks, and it is significantly more efficient to train and store a smaller set of prompt parameters than to train all the model's parameters.

    There are two categories of prompting methods:

    • hard prompts are manually handcrafted text prompts with discrete input tokens; the downside is that it requires a lot of effort to create a good prompt
    • soft prompts are learnable tensors concatenated with the input embeddings that can be optimized to a dataset; the downside is that they aren't human readable because you aren't matching these "virtual tokens" to the embeddings of a real word

    This conceptual guide provides a brief overview of the soft prompt methods included in 🤗 PEFT: prompt tuning, prefix tuning, P-tuning, and multitask prompt tuning.

    Prompt tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/prompt-tuning.png"/> </div> <small>Only train and store a significantly smaller set of task-specific prompt parameters <a href="https://hf.co/papers/2104.08691">(image source)</a>.</small>

    Prompt tuning was developed for text classification tasks on T5 models, and all downstream tasks are cast as a text generation task. For example, sequence classification usually assigns a single class label to a sequence of text. By casting it as a text generation task, the tokens that make up the class label are generated. Prompts are added to the input as a series of tokens. Typically, the model parameters are fixed which means the prompt tokens are also fixed by the model parameters.

    The key idea behind prompt tuning is that prompt tokens have their own parameters that are updated independently. This means you can keep the pretrained model's parameters frozen, and only update the gradients of the prompt token embeddings. The results are comparable to the traditional method of training the entire model, and prompt tuning performance scales as model size increases.

    Take a look at Prompt tuning for causal language modeling for a step-by-step guide on how to train a model with prompt tuning.
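    As a minimal sketch of what this looks like with 🤗 PEFT (the base model and the initialization text below are arbitrary choices for illustration):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

    model_name = "bigscience/bloomz-560m"  # assumption: any causal LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Learn 8 virtual prompt tokens, initialized from a task description,
    # while the base model's parameters stay frozen
    peft_config = PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        prompt_tuning_init=PromptTuningInit.TEXT,
        prompt_tuning_init_text="Classify whether the tweet is a complaint or not:",
        num_virtual_tokens=8,
        tokenizer_name_or_path=model_name,
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # only the prompt embeddings are trainable
    ```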

    Prefix tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/prefix-tuning.png"/> </div> <small>Optimize the prefix parameters for each task <a href="https://hf.co/papers/2101.00190">(image source)</a>.</small>

    Prefix tuning was designed for natural language generation (NLG) tasks on GPT models. It is very similar to prompt tuning; prefix tuning also prepends a sequence of task-specific vectors to the input that can be trained and updated while keeping the rest of the pretrained model's parameters frozen.

    The main difference is that the prefix parameters are inserted in all of the model layers, whereas prompt tuning only adds the prompt parameters to the model input embeddings. The prefix parameters are also optimized by a separate feed-forward network (FFN) instead of training directly on the soft prompts because it causes instability and hurts performance. The FFN is discarded after updating the soft prompts.

    As a result, the authors found that prefix tuning demonstrates comparable performance to fully finetuning a model, despite having 1000x fewer parameters, and it performs even better in low-data settings.

    Take a look at Prefix tuning for conditional generation for a step-by-step guide on how to train a model with prefix tuning.
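    In 🤗 PEFT this is a small configuration change; a minimal sketch (the T5 checkpoint below is an arbitrary choice):

    ```python
    from transformers import AutoModelForSeq2SeqLM
    from peft import PrefixTuningConfig, TaskType, get_peft_model

    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # assumption: any seq2seq model

    # 20 trainable prefix vectors are prepended in every layer; the base model stays frozen
    peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, num_virtual_tokens=20)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    ```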

    P-tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/p-tuning.png"/> </div> <small>Prompt tokens can be inserted anywhere in the input sequence, and they are optimized by a prompt encoder <a href="https://hf.co/papers/2103.10385">(image source)</a>.</small>

    P-tuning is designed for natural language understanding (NLU) tasks and all language models. It is another variation of a soft prompt method; P-tuning also adds a trainable embedding tensor that can be optimized to find better prompts, and it uses a prompt encoder (a bidirectional long-short term memory network or LSTM) to optimize the prompt parameters. Unlike prefix tuning though:

    • the prompt tokens can be inserted anywhere in the input sequence, and it isn't restricted to only the beginning
    • the prompt tokens are only added to the input instead of adding them to every layer of the model
    • introducing anchor tokens can improve performance because they indicate characteristics of a component in the input sequence

    The results suggest that P-tuning is more efficient than manually crafting prompts, and it enables GPT-like models to compete with BERT-like models on NLU tasks.

    Take a look at P-tuning for sequence classification for a step-by-step guide on how to train a model with P-tuning.

    Multitask prompt tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt.png"/> </div> <small><a href="https://hf.co/papers/2103.10385">Multitask prompt tuning enables parameter-efficient transfer learning</a>.</small>

    Multitask prompt tuning (MPT) learns a single prompt from data for multiple task types that can be shared for different target tasks. Other existing approaches learn a separate soft prompt for each task, which then needs to be retrieved or aggregated for adaptation to target tasks. MPT consists of two stages:

    1. source training - for each task, its soft prompt is decomposed into task-specific vectors. The task-specific vectors are multiplied together to form another matrix W, and the Hadamard product is used between W and a shared prompt matrix P to generate a task-specific prompt matrix. The task-specific prompts are distilled into a single prompt matrix that is shared across all tasks. This prompt is trained with multitask training.
    2. target adaptation - to adapt the single prompt for a target task, a target prompt is initialized and expressed as the Hadamard product of the shared prompt matrix and the task-specific low-rank prompt matrix.
    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt-decomposition.png"/> </div> <small><a href="https://hf.co/papers/2103.10385">Prompt decomposition</a>.</small>
    [huggingface/peft] docs/source/package_reference/p_tuning.md

    P-tuning

    P-tuning adds trainable prompt embeddings to the input that is optimized by a prompt encoder to find a better prompt, eliminating the need to manually design prompts. The prompt tokens can be added anywhere in the input sequence, and p-tuning also introduces anchor tokens for improving performance.

    The abstract from the paper is:

    While GPTs with traditional fine-tuning fail to achieve strong results on natural language understanding (NLU), we show that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method P-tuning -- which employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64% (P@1) of world knowledge without any additional text provided during test time, which substantially improves the previous best by 20+ percentage points. On the SuperGlue benchmark, GPTs achieve comparable and sometimes better performance to similar-sized BERTs in supervised learning. Importantly, we find that P-tuning also improves BERTs' performance in both few-shot and supervised settings while largely reducing the need for prompt engineering. Consequently, P-tuning outperforms the state-of-the-art approaches on the few-shot SuperGlue benchmark.

    PromptEncoderConfig

    [[autodoc]] tuners.p_tuning.config.PromptEncoderConfig

    PromptEncoder

    [[autodoc]] tuners.p_tuning.model.PromptEncoder
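    A minimal usage sketch for these classes (the RoBERTa checkpoint and hyperparameters below are illustrative assumptions):

    ```python
    from transformers import AutoModelForSequenceClassification
    from peft import PromptEncoderConfig, TaskType, get_peft_model

    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

    # The prompt encoder reparameterizes 20 virtual tokens; only these prompt
    # parameters are trained, the base model stays frozen
    peft_config = PromptEncoderConfig(
        task_type=TaskType.SEQ_CLS,
        num_virtual_tokens=20,
        encoder_hidden_size=128,
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    ```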

    [huggingface/peft] src/peft/tuners/prompt_tuning/config.py
    class PromptTuningInit(str, enum.Enum): TEXT = "TEXT" RANDOM = "RANDOM"
    [huggingface/peft] docs/source/conceptual_guides/prompting.md

    Multitask prompt tuning

    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt.png"/> </div> <small><a href="https://hf.co/papers/2103.10385">Multitask prompt tuning enables parameter-efficient transfer learning</a>.</small>

    Multitask prompt tuning (MPT) learns a single prompt from data for multiple task types that can be shared for different target tasks. Other existing approaches learn a separate soft prompt for each task, which then needs to be retrieved or aggregated for adaptation to target tasks. MPT consists of two stages:

    1. source training - for each task, its soft prompt is decomposed into task-specific vectors. The task-specific vectors are multiplied together to form another matrix W, and the Hadamard product is used between W and a shared prompt matrix P to generate a task-specific prompt matrix. The task-specific prompts are distilled into a single prompt matrix that is shared across all tasks. This prompt is trained with multitask training.
    2. target adaptation - to adapt the single prompt for a target task, a target prompt is initialized and expressed as the Hadamard product of the shared prompt matrix and the task-specific low-rank prompt matrix.
    <div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt-decomposition.png"/> </div> <small><a href="https://hf.co/papers/2103.10385">Prompt decomposition</a>.</small>
    [huggingface/transformers] src/transformers/tools/agents.py
    def format_prompt(self, task, chat_mode=False): description = "\n".join([f"- {name}: {tool.description}" for name, tool in self.toolbox.items()]) if chat_mode: if self.chat_history is None: prompt = self.chat_prompt_template.replace("<<all_tools>>", description) else: prompt = self.chat_history prompt += CHAT_MESSAGE_PROMPT.replace("<<task>>", task) else: prompt = self.run_prompt_template.replace("<<all_tools>>", description) prompt = prompt.replace("<<prompt>>", task) return prompt
    [huggingface/transformers] docs/source/en/llm_optims.md

    Prompt lookup decoding

    Prompt lookup decoding is a variant of speculative decoding that is also compatible with greedy search and sampling. Prompt lookup works especially well for input-grounded tasks - such as summarization - where there are often overlapping words between the prompt and the output. These overlapping n-grams are used as the LLM candidate tokens.

    To enable prompt lookup decoding, specify the number of tokens that should be overlapping in the prompt_lookup_num_tokens parameter. Then you can pass this parameter to the [~GenerationMixin.generate] method.

    <hfoptions id="pld"> <hfoption id="greedy decoding">
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
    inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
    assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
    outputs = model.generate(**inputs, prompt_lookup_num_tokens=3)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    ['The second law of thermodynamics states that entropy increases with temperature. ']
    </hfoption> <hfoption id="sampling">

    For prompt lookup decoding with sampling, add the do_sample and temperature parameters to the [~GenerationMixin.generate] method.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
    inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
    outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, do_sample=True, temperature=0.7)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    ["The second law of thermodynamics states that energy cannot be created nor destroyed. It's not a"]
    </hfoption> </hfoptions>
    [huggingface/transformers] src/transformers/models/code_llama/tokenization_code_llama_fast.py
    # fmt: off DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \ answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure\ that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \ correct. If you don't know the answer to a question, please don't share false information."""
    [huggingface/peft] examples/boft_dreambooth/utils/dataset.py
    def __init__(self, prompt, num_samples): self.prompt = prompt self.num_samples = num_samples
    [huggingface/peft] src/peft/peft_model.py
    def get_prompt(self, batch_size: int, task_ids: Optional[torch.Tensor] = None) -> torch.Tensor: """ Returns the virtual prompts to use for Peft. Only applicable when using a prompt learning method. """ peft_config = self.active_peft_config prompt_encoder = self.prompt_encoder[self.active_adapter] prompt_tokens = ( self.prompt_tokens[self.active_adapter] .unsqueeze(0) .expand(batch_size, -1) .to(prompt_encoder.embedding.weight.device) ) if peft_config.peft_type == PeftType.PREFIX_TUNING: prompt_tokens = prompt_tokens[:, : peft_config.num_virtual_tokens] if peft_config.inference_mode: past_key_values = prompt_encoder.embedding.weight.repeat(batch_size, 1, 1) else: past_key_values = prompt_encoder(prompt_tokens) if self.base_model_torch_dtype is not None: past_key_values = past_key_values.to(self.base_model_torch_dtype) past_key_values = past_key_values.view( batch_size, peft_config.num_virtual_tokens, peft_config.num_layers * 2, peft_config.num_attention_heads, peft_config.token_dim // peft_config.num_attention_heads, ) if peft_config.num_transformer_submodules == 2: past_key_values = torch.cat([past_key_values, past_key_values], dim=2) past_key_values = past_key_values.permute([2, 0, 3, 1, 4]).split( peft_config.num_transformer_submodules * 2 ) if TRANSFORMERS_MODELS_TO_PREFIX_TUNING_POSTPROCESS_MAPPING.get(self.config.model_type, None) is not None: post_process_fn = TRANSFORMERS_MODELS_TO_PREFIX_TUNING_POSTPROCESS_MAPPING[self.config.model_type] past_key_values = post_process_fn(past_key_values) return past_key_values else: if peft_config.peft_type == PeftType.MULTITASK_PROMPT_TUNING: prompts = prompt_encoder(prompt_tokens, task_ids) else: if peft_config.inference_mode: prompts = prompt_encoder.embedding.weight.repeat(batch_size, 1, 1) else: prompts = prompt_encoder(prompt_tokens) return prompts
    [huggingface/transformers] src/transformers/models/cohere/tokenization_cohere_fast.py
    "{{ '\n\n# System Preamble' }}" "{{ '\n## Basic Rules' }}" "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}" "{{ '\n\n# User Preamble' }}" "{{ '\n' + system_message }}" "{{ '<|END_OF_TURN_TOKEN|>'}}" "{% for message in loop_messages %}" # Loop over all non-system messages "{% set content = message['content'] %}" "{% if message['role'] == 'user' %}" # After all of that, handle messages/roles in a fairly normal way "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}" "{% elif message['role'] == 'system' %}" "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}" "{% elif message['role'] == 'assistant' %}" "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}" "{% endif %}" "{% endfor %}" "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}" "{{ '<results>' }}" "{% for document in documents %}" # Loop over all non-system messages "{{ '\nDocument: ' }}" "{{ loop.index0 }}\n" "{% for key, value in document.items() %}" "{{ key }}: {{value}}\n" "{% endfor %}" "{% endfor %}" "{{ '</results>'}}" "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}" "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}" "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}" "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}" "{% if citation_mode=='accurate' %}" "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}" "{% endif %}" "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}" "{{ '<|END_OF_TURN_TOKEN|>' }}" "{% if add_generation_prompt %}" "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}" "{% endif %}" ) default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'") rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message) return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template} def apply_tool_use_template( self, conversation: Union[List[Dict[str, str]], "Conversation"], tools: List[Dict], **kwargs, ) -> Union[str, List[int]]: """Create a Command-R tool-use prompt. 
Once rendered, the prompt instructs the model to generate a list of actions to perform on a set of user supplied tools to help carry out the user's requests. Conceptually, this works in the same way as `apply_chat_format`, but takes an additional `tools` parameter. Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of available tools for the model to use into a prompt string, or a list of token ids. This method will use the tokenizer's `default_tool_use_template` template specified at the class level. You can override the default template using the `tool_use_template` kwarg but the quality of your results may decrease. Args: conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts with "role" and "content" keys, representing the chat history so far. tools (List[Dict]): a list of tools to render into the prompt for the model to choose from. See an example at the bottom of the docstring. The format should be: * name (str): The name of the tool to be called. Valid names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit. * description (str): The description of what the tool does, the model uses the description to choose when and how to call the function. * parameter_definitions (List[Dict]): The input parameters of the tool. Accepts a dictionary where the key is the name of the parameter and the value is the parameter spec. Valid parameter names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit. Parameter specs are as follows: * description (str): The description of the parameter. * type (str): the type of the parameter - most effective for python builtin data types, such as 'str', 'bool' * required: boolean: Denotes whether the parameter is always present (required) or not. Defaults to not required. add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect. tokenize (`bool`, defaults to `True`): Whether to tokenize the output. If `False`, the output will be a string. padding (`bool`, defaults to `False`): Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`. truncation (`bool`, defaults to `False`): Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`. max_length (`int`, *optional*): Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If not specified, the tokenizer's `max_length` attribute will be used as a default. return_tensors (`str` or [`~utils.TensorType`], *optional*): If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable values are: - `'tf'`: Return TensorFlow `tf.Tensor` objects. - `'pt'`: Return PyTorch `torch.Tensor` objects. - `'np'`: Return NumPy `np.ndarray` objects. - `'jax'`: Return JAX `jnp.ndarray` objects. return_dict (`bool`, *optional*, defaults to `False`): Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`. **tokenizer_kwargs: Additional kwargs to pass to the tokenizer. Returns: `str`: A rendered prompt string. or if tokenize=True: `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. 
This output is ready to pass to the model, either directly or via methods like `generate()`. Examples: ```python >> tokenizer = CohereTokenizerFast.from_pretrained("CohereForAI/c4ai-command-r-v01") >> tools = [ { "name": "internet_search", "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet", "parameter_definitions": { "query": { "description": "Query to search the internet with", "type": "str", "required": True } } }, { "name': "directly_answer", "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history", "parameter_definitions": {} } ] >> conversation = [ {"role": "user", "content": "Whats the biggest penguin in the world?"} ] >> # render the prompt, ready for user to inspect, or for input into the model: >> prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=False, add_generation_prompt=True) >> print(prompt) <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral. # System Preamble ## Basic Rules You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions. # User Preamble ## Task and Context You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging. ## Style Guide Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling. ## Available Tools Here is a list of tools that you have available to you: \\`\\`\\`python def internet_search(query: str) -> List[Dict]: \"\"\"Returns a list of relevant document snippets for a textual query retrieved from the internet Args: query (str): Query to search the internet with \"\"\" pass \\`\\`\\` \\`\\`\\`python def directly_answer() -> List[Dict]: \"\"\"Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history \"\"\" pass \\`\\`\\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. 
The list of actions you want to call should be formatted as a list of json objects, for example: \\`\\`\\`json [ { "tool_name": title of the tool in the specification, "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters } ]\\`\\`\\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|> ``` >> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt') >> outputs = model.generate(inputs, max_new_tokens=128) >> print(tokenizer.decode(outputs[0])) Action: ```json [ { "tool_name": "internet_search", "parameters": { "query": "biggest penguin in the world" } } ] ``` """ return self.apply_chat_template( conversation, chat_template="tool_use", tools=tools, **kwargs, ) def apply_grounded_generation_template( self, conversation: Union[List[Dict[str, str]], "Conversation"], documents: List[Dict], citation_mode: Literal["fast", "accurate"] = "accurate", **kwargs, ) -> Union[str, List[int]]: """Create a Command-R grounded generation (aka RAG) prompt. Once rendered, the prompt instructs the model to generate a response with citations in, based on supplied documents. Conceptually, this works in the same way as `apply_chat_format`, but takes additional `documents` and parameter `citation_mode` parameters. Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of documents for the model to ground its response on into a prompt string, or a list of token ids. This method will use the tokenizer's `grounded_generation_template` template specified at the class level. You can override the default template using the `grounded_generation_template` kwarg but the quality of your results may decrease. Args: conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts with "role" and "content" keys, representing the chat history so far. documents (List[Dict[str, str]): A list of dicts, representing documents or tool outputs to ground your generation on. A document is a semistructured dict, wiht a string to string mapping. Common fields are `url`, `title`, `snippet` etc but should be descriptive of the key. They will get rendered into the prompt. citation_mode: either "accurate" (prompt the model to generate an answer first, then rewrite it with citation spans in) or "fast", where the prompt instructs the model to generate an answer with citations in directly. The former has higher quality citations, the latter requires fewer tokens to be generated. add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect. tokenize (`bool`, defaults to `True`): Whether to tokenize the output. If `False`, the output will be a string. padding (`bool`, defaults to `False`): Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`. truncation (`bool`, defaults to `False`): Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`. max_length (`int`, *optional*): Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If not specified, the tokenizer's `max_length` attribute will be used as a default. 
return_tensors (`str` or [`~utils.TensorType`], *optional*): If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable values are:
    [huggingface/transformers] docs/source/en/tasks/prompting.md

    Best practices of LLM prompting

    In this section of the guide we have compiled a list of best practices that tend to improve the prompt results:

    • When choosing the model to work with, the latest and most capable models are likely to perform better.
    • Start with a simple and short prompt, and iterate from there.
    • Put the instructions at the beginning of the prompt, or at the very end. When working with large context, models apply various optimizations to prevent Attention complexity from scaling quadratically. This may make a model more attentive to the beginning or end of a prompt than the middle.
    • Clearly separate instructions from the text they apply to - more on this in the next section.
    • Be specific and descriptive about the task and the desired outcome - its format, length, style, language, etc.
    • Avoid ambiguous descriptions and instructions.
    • Favor instructions that say "what to do" instead of those that say "what not to do".
    • "Lead" the output in the right direction by writing the first word (or even begin the first sentence for the model).
    • Use advanced techniques like Few-shot prompting and Chain-of-thought (see the sketch after this list).
    • Test your prompts with different models to assess their robustness.
    • Version and track the performance of your prompts.
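    As a small illustration of the few-shot technique from the list above (a sketch; the checkpoint and the examples are arbitrary):

    ```python
    from transformers import pipeline

    generator = pipeline("text-generation", model="openai-community/gpt2")  # assumption: any causal LM

    # Few-shot prompt: demonstrate the task twice, then "lead" the output for the real input
    prompt = (
        "Text: The delivery was late and the box was damaged.\nSentiment: negative\n\n"
        "Text: Amazing support team, they fixed my issue in minutes.\nSentiment: positive\n\n"
        "Text: The app keeps crashing every time I open it.\nSentiment:"
    )
    print(generator(prompt, max_new_tokens=2, do_sample=False)[0]["generated_text"])
    ```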
    [huggingface/peft] examples/lora_dreambooth/train_dreambooth.py
    def __init__(self, prompt, num_samples): self.prompt = prompt self.num_samples = num_samples
    [huggingface/peft] examples/oft_dreambooth/train_dreambooth.py
    def __getitem__(self, index): example = {} example["prompt"] = self.prompt example["index"] = index return example
    [huggingface/peft] examples/stable_diffusion/train_dreambooth.py
    def __init__(self, prompt, num_samples): self.prompt = prompt self.num_samples = num_samples
    [huggingface/transformers] tests/models/opt/test_modeling_flax_opt.py
    def prompts(self): return [ "Today is a beautiful day and I want", "In the city of", "Paris is the capital of France and", "Computers and mobile phones have taken", ]
    [huggingface/transformers] tests/models/opt/test_modeling_opt.py
    def prompts(self): return [ "Today is a beautiful day and I want", "In the city of", "Paris is the capital of France and", "Computers and mobile phones have taken", ]
    [huggingface/accelerate] benchmarks/big_model_inference.py
    PROMPTS = [ "Hello, my name is", "Are unicorns real? Unicorns are", "For the first time in several years,", "My name is Julien and I am", "The goal of life is", "Whenever I'm sad, I like to", ]
    [openaccess-ai-collective/axolotl] docs/dataset-formats/template_free.qmd
    ---
    title: Template-Free
    description: Construct prompts without a template.
    order: 4
    ---
    
    See [these docs](../input_output.qmd).
    
    
    [openaccess-ai-collective/axolotl] docs/dataset-formats/inst_tune.qmd
    ---
    title: Instruction Tuning
    description: Instruction tuning formats for supervised fine-tuning.
    order: 2
    ---
    
    ## alpaca
    
    instruction; input(optional)
    
    ```{.json filename="data.jsonl"}
    {"instruction": "...", "input": "...", "output": "..."}
    

    jeopardy

    question and answer

    {"question": "...", "category": "...", "answer": "..."}
    

    oasst

    instruction

    {"INSTRUCTION": "...", "RESPONSE": "..."}
    

    gpteacher

    instruction; input(optional)

    {"instruction": "...", "input": "...", "response": "..."}
    

    reflection

    instruction with reflect; input(optional)

    {"instruction": "...", "input": "...", "output": "...", "reflection": "...", "corrected": "..."}
    

    explainchoice

    question, choices, (solution OR explanation)

    {"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
    

    concisechoice

    question, choices, (solution OR explanation)

    {"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
    

    summarizetldr

    article and summary

    {"article": "...", "summary": "..."}
    

    alpaca_chat

    basic instruct for alpaca chat

    {"instruction": "...", "input": "...", "response": "..."}
    

    alpaca_chat.load_qa

    question and answer for alpaca chat

    {"question": "...", "answer": "..."}
    

    alpaca_chat.load_concise

    question and answer for alpaca chat, for concise answers

    {"instruction": "...", "input": "...", "response": "..."}
    

    alpaca_chat.load_camel_ai

    question and answer for alpaca chat, for load_camel_ai

    {"message_1": "...", "message_2": "..."}
    

    alpaca_w_system.load_open_orca

    support for open orca datasets with included system prompts, instruct

    {"system_prompt": "...", "question": "...", "response": "..."}
    

    context_qa

    in context question answering from an article

    {"article": "...", "question": "...", "answer": "..."}
    

    context_qa.load_v2

    in context question answering (alternate)

    {"context": "...", "question": "...", "answer": "..."}
    

    context_qa.load_404

    in context question answering from an article, with default response for no answer from context

    {"article": "...", "unanswerable_question": "..."}
    

    creative_acr.load_answer

    instruction and revision

    {"instruction": "...", "revision": "..."}
    

    creative_acr.load_critique

    critique

    {"scores": "...", "critiques": "...", "instruction": "...", "answer": "..."}
    

    creative_acr.load_revise

    critique and revise

    {"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
    

    metharme

    instruction, adds additional eos tokens

    {"prompt": "...", "generation": "..."}
    

    How to add custom prompt format

    For a dataset that is preprocessed for instruction purposes:

    {"input": "...", "output": "..."}
    

    You can use this example in your YAML config:

    datasets:
      - path: repo
        type:
          system_prompt: ""
          field_system: system
          field_instruction: input
          field_output: output
          format: "[INST] {instruction} [/INST]"
          no_input_format: "[INST] {instruction} [/INST]"
    

    See the full config options here.
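    As a rough, hypothetical illustration of what this mapping does (this is not Axolotl's actual code path, just the string substitution the config implies):

    ```python
    # One row of the preprocessed dataset
    row = {"input": "Summarize the article in one sentence.", "output": "The article argues ..."}

    fmt = "[INST] {instruction} [/INST]"            # `format` from the YAML above
    prompt = fmt.format(instruction=row["input"])   # field_instruction: input
    full_text = prompt + row["output"]              # field_output: output

    print(full_text)
    # [INST] Summarize the article in one sentence. [/INST]The article argues ...
    ```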

    [huggingface/accelerate] examples/inference/pippy/llama.py
    prompts = ("I would like to", "I really like to", "The weather is pretty")
    [openaccess-ai-collective/axolotl] docs/input_output.qmd
    ---
    title: Template-free prompt construction
    description: "Template-free prompt construction with the `input_output` format"
    ---
    
    <!-- TOC -->
    
    - [Background](#background)
        - [Masking Inputs](#masking-inputs)
        - [You may not want prompt templates](#you-may-not-want-prompt-templates)
        - [The `input_output` format](#the-input_output-format)
    - [Usage](#usage)
        - [1. Prepare Data](#1-prepare-data)
        - [2. Use `type: input_output`](#2-use-type-input_output)
        - [3. Check the prompts](#3-check-the-prompts)
    
    <!-- /TOC -->
    
    <a id="markdown-background" name="background"></a>
    
    ## Background
    
    <a id="markdown-masking-inputs" name="masking-inputs"></a>
    
    ### Masking Inputs
    
    One of the most popular features of
    [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
    setting the following configuration value:
    
    
    ```yaml
    train_on_inputs: false
    ```
    

    If you declare a dataset format such as alpaca or chatml, axolotl knows what is an input (i.e. human) vs. an output (i.e. the assistant) and masks the input labels so that your model can focus on predicting the outputs only.

    <a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>

    You may not want prompt templates

    However, there are many situations where you don't want to use one of these formats or templates. This is because they can:

    • Add unnecessary boilerplate to your prompts.
    • Create artifacts like special delimiters <|im_start|> that can quickly become footguns if you don't include them correctly at inference time.
    • Enforce a chat interface when you do not want one. Sometimes you just want to fine-tune a model to a very specific task and do NOT want multi-turn conversations, roles, etc.
    • Limit you to only certain roles that the template allows.

    <a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>

    The input_output format

    You can construct your prompts without a template by using the input_output format: set type: input_output in your configuration file like this:

    config.yml

    train_on_inputs: false  # Mask segments of your data
    datasets:
      - path: output.jsonl
        type: input_output  # use template free prompt construction

    Unlike type: completion, which is also template-free, type: input_output allows you to mask segments of your text. More details on how this works are described below.

    <a id="markdown-usage" name="usage"></a>

    Usage

    This is how you can use the input_output format:

    <a id="markdown-1-prepare-data" name="1-prepare-data"></a>

    1. Prepare Data

    To use the input_output format, collect your data in the following format into a jsonl file (below is the first row from the file output.jsonl, pretty printed):

    $ head -n1 output.jsonl | python -m json.tool

    :::{.cell-output .cell-output-stdout}
    {
        "segments": [
            { "label": true, "text": "<s>Hello\n" },
            { "label": true, "text": "hi there!. " },
            { "label": false, "text": "goodbye " },
            { "label": true, "text": "farewell</s>" }
        ]
    }
    :::

    Set label:false when you want to mask a segment of text so that the model isn't trained on it. Some things to keep in mind:

    [!IMPORTANT]

    1. EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl concatenates all the segments as-is. The tokenizer doesn't add anything additional. Notice how I added spaces, newlines, <s> (BOS), and </s> (EOS) myself.
    2. Make sure you check the materialized output to validate that the prompt is getting assembled how you like.

    <a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>

    2. Use type: input_output

    Let's materialize data with our output.jsonl file by setting type: input_output in our axolotl config:

    # training_config.yaml
    base_model: mistralai/Mistral-7B-v0.1
    data_seed: 49
    seed: 49
    datasets:
      - path: output.jsonl
        type: input_output
    val_set_size: 0.1
    sequence_len: 896
    sample_packing: false
    micro_batch_size: 2
    gradient_accumulation_steps: 3
    eval_batch_size: 2
    num_epochs: 1
    learning_rate: 0.0002
    train_on_inputs: false
    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"

    You can use the following command to materialize your data. The `--debug` flag will print the tokens along with the labels, so you can verify that the correct items are being ignored:

    ```bash
    $ python -m axolotl.cli.preprocess training_config.yaml --debug

    ...
    [2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557) (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
    ```

    The format is `decoded_token(label, token_id)`; for example, `<s>(1, 1)` means that the token is `<s>`, the label is `1` and the token_id is `1`. When the label is `-100`, that token is ignored for training.
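    For context, `-100` is the default `ignore_index` of PyTorch's cross-entropy loss (Hugging Face models follow the same convention), so masked positions simply contribute nothing to the loss. A tiny sketch, assuming PyTorch is available:

    ```python
    import torch
    import torch.nn.functional as F

    # Two positions, vocabulary of 5. The second label is -100, so only the first
    # position contributes to the loss (ignore_index=-100 is the default).
    logits = torch.randn(2, 5)
    labels = torch.tensor([3, -100])
    loss = F.cross_entropy(logits, labels)  # equals the loss on position 0 alone
    print(loss)
    ```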

    <a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>

    ### 3. Check the prompts

    Here is another way to check the materialized output:

    ```python
    from transformers import AutoTokenizer
    from datasets import load_from_disk
    import yaml

    # `!ls` is IPython/Jupyter syntax for running a shell command and capturing its output
    directory = !ls last_run_prepared/

    with open('training_config.yaml', 'r') as f:
        cfg = yaml.safe_load(f)
    model_id = cfg['base_model']
    tok = AutoTokenizer.from_pretrained(model_id)
    ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
    ```

    ```python
    >>> row = ds[0]
    >>> print(tok.decode(row['input_ids']))
    <s> Hello
     hi there!. goodbye farewell</s>
    ```

    We can check that the right tokens are ignored by comparing the labels to each token:

    ```python
    import pandas as pd

    pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
                  for i, l in zip(row['input_ids'], row['labels'])])
    ```

    |    | token | label |    id |
    |---:|:------|------:|------:|
    |  0 | <s>   |     1 |     1 |
    |  1 | Hello | 22557 | 22557 |
    |  2 | \n    |    13 |    13 |
    |  3 | hi    | 12014 | 12014 |
    |  4 | there |   736 |   736 |
    |  5 | !     | 28808 | 28808 |
    |  6 | .     | 28723 | 28723 |
    |  7 |       | 28705 | 28705 |
    |  8 | good  |  -100 |  1179 |
    |  9 | bye   |  -100 | 17664 |
    | 10 |       |  -100 | 28705 |
    | 11 | fare  | 19111 | 19111 |
    | 12 | well  |  5458 |  5458 |
    | 13 | </s>  |     2 |     2 |

    If we look at the input data, the above table seems correct! (The jsonl version is repeated below for reference):

    $ head -n1 output.jsonl | python -m json.tool

    :::{.cell-output .cell-output-stdout}
    ```json
    {
      "segments": [
        {"label": true, "text": "<s>Hello\n"},
        {"label": true, "text": "hi there!. "},
        {"label": false, "text": "goodbye "},
        {"label": true, "text": "farewell</s>"}
      ]
    }
    ```
    :::
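    As one last sanity check (a hypothetical snippet, not from the docs, continuing from the DataFrame code above), you can list just the masked tokens and compare them against the `label: false` segment:

    ```python
    import pandas as pd

    # Same DataFrame as above: one row per token with its label and id.
    df = pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
                       for i, l in zip(row['input_ids'], row['labels'])])

    masked = df[df['label'] == -100]['token'].tolist()
    print(masked)  # expected to show the tokens from the label:false segment, e.g. 'good', 'bye'
    ```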

    [huggingface/accelerate] examples/inference/distributed/phi2.py
    # Split into batches # We will get the following results: # [ ["I would like to", "hello how are you"], [ "what is going on", "roses are red and"], [ "welcome to the hotel"] ] formatted_prompts = [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)] # Apply padding on the left since we are doing generation padding_side_default = tokenizer.padding_side tokenizer.padding_side = "left"
    [huggingface/accelerate] src/accelerate/commands/config/config.py
    description = "Launches a series of prompts to create and save a `default_config.yaml` configuration file for your training system. Should always be ran first on your machine"
    [huggingface/accelerate] src/accelerate/commands/menu/selection_menu.py
    def __init__(self, prompt: str = None, choices: list = []): self.position = 0 self.choices = choices self.prompt = prompt if sys.platform == "win32": self.arrow_char = "*" else: self.arrow_char = "➔ "
    [openaccess-ai-collective/axolotl] docs/config.qmd
    # currently only supported on Llama and Mistral
    neftune_noise_alpha:
    
    # Whether to use bettertransformers
    flash_optimum:
    # Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
    xformers_attention:
    # Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
    flash_attention:
    flash_attn_cross_entropy:  # Whether to use flash-attention cross entropy implementation - advanced use only
    flash_attn_rms_norm:  # Whether to use flash-attention rms norm implementation - advanced use only
    flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
    flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
    # Whether to use scaled-dot-product attention
    # https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
    sdp_attention:
    # Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
    s2_attention:
    # Resume from a specific checkpoint dir
    resume_from_checkpoint:
    # If resume_from_checkpoint isn't set and you simply want it to start where it left off.
    # Be careful with this being turned on between different models.
    auto_resume_from_checkpoints: false
    
    # Don't mess with this, it's here for accelerate and torchrun
    local_rank:
    
    # Add or change special tokens.
    # If you add tokens here, you don't need to add them to the `tokens` list.
    special_tokens:
      # bos_token: "<s>"
      # eos_token: "</s>"
      # unk_token: "<unk>"
      # pad_token: "[PAD]"
    
    # Add extra tokens.
    tokens:
    
    # FSDP
    fsdp:
    fsdp_config:
    
    # Deepspeed config path. e.g., deepspeed_configs/zero3.json
    deepspeed:
    
    # Advanced DDP Arguments
    ddp_timeout:
    ddp_bucket_cap_mb:
    ddp_broadcast_buffers:
    
    # Path to torch distx for optim 'adamw_anyprecision'
    torchdistx_path:
    
    # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
    pretraining_dataset:
    
    # Debug mode
    debug:
    
    # Seed
    seed:
    
    # Allow overwrite yml config using from cli
    strict:
    
    [openaccess-ai-collective/axolotl] docs/dataset-formats/pretraining.qmd
    ---
    title: Pre-training
    description: Data format for a pre-training completion task.
    order: 1
    ---
    
    For pretraining, there is no prompt template or roles.  The only required field is `text`:
    
    ```{.json filename="data.jsonl"}
    {"text": "first row"}
    {"text": "second row"}
    ...
    ```

    :::{.callout-note}

    Streaming is recommended for large datasets

    Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:

    ```yaml
    pretraining_dataset: # hf path only
    ...
    ```

    :::

    [huggingface/accelerate] docs/source/quicktour.md
    <!--Copyright 2021 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/accelerate] src/accelerate/commands/launch.py
    def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.titles = [ "Hardware Selection Arguments", "Resource Selection Arguments", "Training Paradigm Arguments", "positional arguments", "optional arguments", ]
    [openaccess-ai-collective/axolotl] docs/dataset_preprocessing.qmd
    ---
    title: Dataset Preprocessing
    description: How datasets are processed
    ---
    
    Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
    the [dataset format](../dataset-formats/) and prompt strategies to:
     - parse the dataset based on the *dataset format*
     - transform the dataset to how you would interact with the model based on the *prompt strategy*
     - tokenize the dataset based on the configured model & tokenizer
     - shuffle and merge multiple datasets together if using more than one
    
    The processing of the datasets can happen one of two ways:
    
    1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
    2. When training is started
    
    What are the benefits of pre-processing? When training interactively or for sweeps
    (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
    slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
    training parameters so that it will intelligently pull from its cache when possible.
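    Conceptually, this caching works like keying a prepared-data directory by a hash of the parameters that affect formatting and tokenization. A simplified sketch of the idea (not axolotl's actual implementation; the parameter names are illustrative):

    ```python
    import hashlib
    import json

    def cache_key(params: dict) -> str:
        """Hash the parameters that affect how data is formatted/tokenized."""
        canonical = json.dumps(params, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    key = cache_key({
        "base_model": "mistralai/Mistral-7B-v0.1",  # illustrative values
        "datasets": [{"path": "output.jsonl", "type": "input_output"}],
        "sequence_len": 896,
    })
    prepared_dir = f"last_run_prepared/{key}"  # reuse this dir if it already exists
    print(prepared_dir)
    ```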
    
    The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
    YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
    
    If `dataset_prepared_path:` is left empty, when training, the processed dataset will be cached in a
    default path of `./last_run_prepared/`, but will ignore anything already cached there. By explicitly
    setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
    data is in the cache.
    
    What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
    prompt template. Because the trainer cannot readily detect these changes, the calculated hash value
    for the pre-processed dataset does not change. If you have `dataset_prepared_path: ...` set and you
    change your prompt templating logic, the trainer may not pick up the changes and you will be
    training over the old prompts.
    
    
    [huggingface/accelerate] examples/inference/distributed/phi2.py
    # Tokenize each batch tokenized_prompts = [ tokenizer(formatted_prompt, padding=True, pad_to_multiple_of=pad_to_multiple_of, return_tensors="pt") for formatted_prompt in formatted_prompts ] # Put back the original padding behavior tokenizer.padding_side = padding_side_default completions_per_process = []
    [huggingface/accelerate] docs/source/package_reference/deepspeed.md
    <!--Copyright 2021 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->
    [huggingface/accelerate] src/accelerate/commands/launch.py
    class CustomHelpFormatter(argparse.HelpFormatter): """ This is a custom help formatter that will hide all arguments that are not used in the command line when the help is called. This is useful for the case where the user is using a specific platform and only wants to see the arguments for that platform. """ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.titles = [ "Hardware Selection Arguments", "Resource Selection Arguments", "Training Paradigm Arguments", "positional arguments", "optional arguments", ] def add_argument(self, action: argparse.Action): if "accelerate" in sys.argv[0] and "launch" in sys.argv[1:]: args = sys.argv[2:] else: args = sys.argv[1:] if len(args) > 1: args = list(map(clean_option, args)) used_platforms = [arg for arg in args if arg in options_to_group.keys()] used_titles = [options_to_group[o] for o in used_platforms] if action.container.title not in self.titles + used_titles: action.help = argparse.SUPPRESS elif action.container.title == "Hardware Selection Arguments": if set(action.option_strings).isdisjoint(set(args)): action.help = argparse.SUPPRESS else: action.help = action.help + " (currently selected)" elif action.container.title == "Training Paradigm Arguments": if set(action.option_strings).isdisjoint(set(args)): action.help = argparse.SUPPRESS else: action.help = action.help + " (currently selected)" action.option_strings = [s for s in action.option_strings if "-" not in s[2:]] super().add_argument(action) def end_section(self): if len(self._current_section.items) < 2: self._current_section.items = [] self._current_section.heading = "" super().end_section()
    [openaccess-ai-collective/axolotl] docs/dataset-formats/conversation.qmd
    ---
    title: Conversation
    description: Conversation format for supervised fine-tuning.
    order: 3
    ---
    
    ## sharegpt
    
    conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
    
    ```{.json filename="data.jsonl"}
    {"conversations": [{"from": "...", "value": "..."}]}
    ```

    Note: `type: sharegpt` opens special configs:

    • `conversation`: enables conversions to many Conversation types. Refer to the 'name' here for options.
    • `roles`: allows you to specify the roles for input and output. This is useful for datasets with custom roles such as `tool` etc. to support masking.
    • `field_human`: specify the key to use instead of `human` in the conversation.
    • `field_model`: specify the key to use instead of `gpt` in the conversation.

    ```yaml
    datasets:
      path: ...
      type: sharegpt

      conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
      field_human: # Optional[str]. Human key to use for conversation.
      field_model: # Optional[str]. Assistant key to use for conversation.

      # Add additional keys from your dataset as input or output roles
      roles:
        input: # Optional[List[str]]. These will be masked based on train_on_input
        output: # Optional[List[str]].
    ```

    ## pygmalion

    ```{.json filename="data.jsonl"}
    {"conversations": [{"role": "...", "value": "..."}]}
    ```

    ## sharegpt.load_role

    conversations where `role` is used instead of `from`

    ```{.json filename="data.jsonl"}
    {"conversations": [{"role": "...", "value": "..."}]}
    ```

    ## sharegpt.load_guanaco

    conversations where `from` is `prompter`/`assistant` instead of default sharegpt

    ```{.json filename="data.jsonl"}
    {"conversations": [{"from": "...", "value": "..."}]}
    ```

    ## sharegpt_jokes

    creates a chat where bot is asked to tell a joke, then explain why the joke is funny

    ```{.json filename="data.jsonl"}
    {"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
    ```
    
    [openaccess-ai-collective/axolotl] docs/dataset-formats/index.qmd
    ---
    title: Dataset Formats
    description: Supported dataset formats.
    listing:
      fields: [title, description]
      type: table
      sort-ui: false
      filter-ui: false
      max-description-length: 250
    ---
    
    Axolotl supports a variety of dataset formats.  It is recommended to use a JSONL format.  The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
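    For example, the two are interchangeable in the sense that the HF dataset's columns correspond to the JSONL fields. A small sketch (the field names are illustrative, and `data.jsonl` is assumed to exist):

    ```python
    from datasets import Dataset, load_dataset

    # Equivalent inputs: a JSONL file on disk...
    ds_from_jsonl = load_dataset("json", data_files="data.jsonl", split="train")

    # ...or an in-memory/Hub dataset whose columns mirror the JSONL fields.
    ds_from_columns = Dataset.from_list([
        {"instruction": "Say hi", "output": "hi there!"},  # illustrative fields
    ])

    print(ds_from_jsonl.column_names, ds_from_columns.column_names)
    ```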
    
    Below are these various formats organized by task:
    
    