Phorm.ai
OpenAccess-AI-Collective/axolotl

    Is there a quick, easy and effective way to force the Llama tokenizer to split up text into single character tokens for finetuning Llama 3.1?

    Yes, you can customize the Llama tokenizer to emit single-character tokens by overriding its tokenize method. The Llama tokenizers are built on the PreTrainedTokenizer / PreTrainedTokenizerFast classes from the Hugging Face transformers library, so you can subclass whichever class your checkpoint uses (LlamaTokenizer or LlamaTokenizerFast for Llama 1/2; Llama 3.1 checkpoints load a plain PreTrainedTokenizerFast via AutoTokenizer) and override tokenize to split the text into individual characters.

    Here's an example of how you could implement this:

    from transformers import LlamaTokenizer

    class CharLevelLlamaTokenizer(LlamaTokenizer):
        def tokenize(self, text, **kwargs):
            # Split text into individual characters
            tokens = list(text)
            return tokens

    # Initialize the tokenizer
    tokenizer = CharLevelLlamaTokenizer.from_pretrained("Llama-model-identifier")

    # Example text
    text = "Hello, Llama!"

    # Tokenize the text into single characters
    tokens = tokenizer.tokenize(text)
    print(tokens)

    In this example, CharLevelLlamaTokenizer subclasses LlamaTokenizer and overrides tokenize so that the input text is split into its constituent characters. Keep in mind that each character still has to be looked up in the existing vocabulary when tokens are converted to ids, so characters that do not exist as standalone tokens (whitespace in a SentencePiece vocabulary, for example) can end up mapped to the unknown token; verify the round trip on a sample of your data before fine-tuning.
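    Note that Llama 3.1 checkpoints load their tokenizer as a plain PreTrainedTokenizerFast via AutoTokenizer, so subclassing LlamaTokenizer does not apply there directly. A tokenizer-agnostic alternative is to encode each character separately and concatenate the resulting ids. Below is a minimal sketch, assuming a "meta-llama/Meta-Llama-3.1-8B" checkpoint (substitute whatever model identifier you actually use):

    from transformers import AutoTokenizer

    # Assumed checkpoint; substitute the model identifier you are fine-tuning.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

    def encode_char_level(text):
        # Encode each character on its own and concatenate the resulting ids.
        # Multi-byte characters can still yield more than one id each.
        ids = []
        for ch in text:
            ids.extend(tokenizer.encode(ch, add_special_tokens=False))
        return ids

    print(encode_char_level("Hello, Llama!"))

    Because the Llama 3.1 tokenizer is a byte-level BPE, every character can be represented this way without falling back to an unknown token, at the cost of slower preprocessing.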

    Please replace "Llama-model-identifier" with the actual model identifier for the Llama model you are using (e.g., "meta-llama/Meta-Llama-3.1-8B").

    Keep in mind that tokenizing text at the character level significantly increases sequence length, which affects both model quality and training/inference time. You may also need to raise the maximum sequence length you pass to the tokenizer and model (max_length, or sequence_len in an axolotl config) to accommodate the longer character-level sequences.
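    As a rough illustration of that blow-up, you can compare the default tokenization with a character-level split of the same text (a minimal sketch, again assuming a "meta-llama/Meta-Llama-3.1-8B" checkpoint):

    from transformers import AutoTokenizer

    # Assumed checkpoint; substitute your own model identifier.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

    text = "The quick brown fox jumps over the lazy dog."
    default_len = len(tokenizer.encode(text, add_special_tokens=False))
    char_level_len = len(list(text))  # at least one token per character

    # Character-level sequences are typically several times longer than BPE output.
    print(default_len, char_level_len)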

    Search results
    [huggingface/transformers] src/transformers/models/llama/tokenization_llama_fast.py
    class LlamaTokenizerFast(PreTrainedTokenizerFast): """ Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. This uses notably ByteFallback and no normalization. ```python >>> from transformers import LlamaTokenizerFast >>> tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer") >>> tokenizer.encode("Hello this is a test") [1, 15043, 445, 338, 263, 1243] ``` If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, checkout [post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation. This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. Args: vocab_file (`str`, *optional*): [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer. tokenizer_file (`str`, *optional*): [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that contains everything needed to load the tokenizer. clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`): Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces. unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`): The end of sequence token. add_bos_token (`bool`, *optional*, defaults to `True`): Whether or not to add an `bos_token` at the start of sequences. add_eos_token (`bool`, *optional*, defaults to `False`): Whether or not to add an `eos_token` at the end of sequences. use_default_system_prompt (`bool`, *optional*, defaults to `False`): Whether or not the default system prompt for Llama should be used legacy (`bool`, *optional*): Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622 and #25224 which includes fixes to properly handle tokens that appear after special tokens. Make sure to also set `from_slow` to `True`. A simple example: - `legacy=True`: ```python >>> from transformers import LlamaTokenizerFast >>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=True, from_slow=True) >>> tokenizer.encode("Hello <s>.") # 869 is '▁.' [1, 15043, 29871, 1, 869] ``` - `legacy=False`: ```python >>> from transformers import LlamaTokenizerFast >>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=False, from_slow=True) >>> tokenizer.encode("Hello <s>.") # 29889 is '.' [1, 15043, 29871, 1, 29889] ``` Checkout the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details. 
add_prefix_space (`bool`, *optional*): Whether or not the tokenizer should automatically add a prefix space """ vocab_files_names = VOCAB_FILES_NAMES slow_tokenizer_class = LlamaTokenizer padding_side = "left" model_input_names = ["input_ids", "attention_mask"] def __init__( self, vocab_file=None, tokenizer_file=None, clean_up_tokenization_spaces=False, unk_token="<unk>", bos_token="<s>", eos_token="</s>", add_bos_token=True, add_eos_token=False, use_default_system_prompt=False, legacy=None, add_prefix_space=None, **kwargs, ): if legacy is None: logger.warning_once( f"You are using the default legacy behaviour of the {self.__class__}. This is" " expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you." " If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it" " means, and thoroughly read the reason why this was added as explained in" " https://github.com/huggingface/transformers/pull/24565" ) legacy = True self.legacy = legacy if add_prefix_space is not None: logger.warning_once( "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers" ) kwargs["from_slow"] = True super().__init__( vocab_file=vocab_file, tokenizer_file=tokenizer_file, clean_up_tokenization_spaces=clean_up_tokenization_spaces, unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, add_bos_token=add_bos_token, add_eos_token=add_eos_token, use_default_system_prompt=use_default_system_prompt, **kwargs, ) self._add_bos_token = add_bos_token self._add_eos_token = add_eos_token self.update_post_processor() self.use_default_system_prompt = use_default_system_prompt self.vocab_file = vocab_file @property def can_save_slow_tokenizer(self) -> bool: return os.path.isfile(self.vocab_file) if self.vocab_file else False def update_post_processor(self): """ Updates the underlying post processor with the current `bos_token` and `eos_token`. """ bos = self.bos_token bos_token_id = self.bos_token_id if bos is None and self.add_bos_token: raise ValueError("add_bos_token = True but bos_token = None") eos = self.eos_token eos_token_id = self.eos_token_id if eos is None and self.add_eos_token: raise ValueError("add_eos_token = True but eos_token = None") single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}" pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}" special_tokens = [] if self.add_bos_token: special_tokens.append((bos, bos_token_id)) if self.add_eos_token: special_tokens.append((eos, eos_token_id)) self._tokenizer.post_processor = processors.TemplateProcessing( single=single, pair=pair, special_tokens=special_tokens ) @property def add_eos_token(self): return self._add_eos_token @property def add_bos_token(self): return self._add_bos_token @add_eos_token.setter def add_eos_token(self, value): self._add_eos_token = value self.update_post_processor() @add_bos_token.setter def add_bos_token(self, value): self._add_bos_token = value self.update_post_processor() def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: if not self.can_save_slow_tokenizer: raise ValueError( "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow " "tokenizer." 
) if not os.path.isdir(save_directory): logger.error(f"Vocabulary path ({save_directory}) should be a directory") return out_vocab_file = os.path.join( save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] ) if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file): copyfile(self.vocab_file, out_vocab_file) return (out_vocab_file,) @property # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template def default_chat_template(self): """ LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages. Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering rather than needing special tokens. The system message is partly 'embedded' in the first user message, which results in an unusual token ordering when it is present. This template should definitely be changed if you wish to fine-tune a model with more flexible role ordering! The output should look something like: <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos> <bos>[INST] Prompt [/INST] The reference for this chat template is [this code snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362) in the original repository. """ template = ( "{% if messages[0]['role'] == 'system' %}" "{% set loop_messages = messages[1:] %}" # Extract system message if it's present "{% set system_message = messages[0]['content'] %}" "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}" "{% set loop_messages = messages %}" # Or use the default system message if the flag is set "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}" "{% else %}" "{% set loop_messages = messages %}" "{% set system_message = false %}" "{% endif %}" "{% for message in loop_messages %}" # Loop over all non-system messages "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}" "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}" "{% endif %}" "{% if loop.index0 == 0 and system_message != false %}" # Embed system message in first message "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}" "{% else %}" "{% set content = message['content'] %}" "{% endif %}" "{% if message['role'] == 'user' %}" # After all of that, handle messages/roles in a fairly normal way "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}" "{% elif message['role'] == 'system' %}" "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}" "{% elif message['role'] == 'assistant' %}" "{{ ' ' + content.strip() + ' ' + eos_token }}" "{% endif %}" "{% endfor %}" ) template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false") default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'") template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message) return template # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): bos_token_id = [self.bos_token_id] if self.add_bos_token else [] 
eos_token_id = [self.eos_token_id] if self.add_eos_token else [] output = bos_token_id + token_ids_0 + eos_token_id if token_ids_1 is not None: output = output + bos_token_id + token_ids_1 + eos_token_id return output
    [huggingface/transformers] src/transformers/models/llama/tokenization_llama.py
    # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
    def tokenize(self, text: "TextInput", **kwargs) -> List[str]:
        """
        Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
        first token is special.
        """
        if self.legacy or len(text) == 0:
            return super().tokenize(text, **kwargs)

        text = text.replace(SPIECE_UNDERLINE, " ")
        if self.add_prefix_space:
            text = SPIECE_UNDERLINE + text

        tokens = super().tokenize(text, **kwargs)

        if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
            tokens = tokens[1:]
        return tokens
    [huggingface/transformers] src/transformers/models/llama/tokenization_llama.py
    """Tokenization classes for LLaMA."""
    [huggingface/transformers] tests/models/llama/test_tokenization_llama.py
    class LlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase): from_pretrained_id = ["hf-internal-testing/llama-tokenizer", "meta-llama/Llama-2-7b-hf"] tokenizer_class = LlamaTokenizer rust_tokenizer_class = LlamaTokenizerFast test_rust_tokenizer = False test_sentencepiece = True from_pretrained_kwargs = {} def setUp(self): super().setUp() # We have a SentencePiece fixture for testing tokenizer = LlamaTokenizer(SAMPLE_VOCAB, keep_accents=True) tokenizer.pad_token = tokenizer.eos_token tokenizer.save_pretrained(self.tmpdirname) def get_tokenizers(self, **kwargs): kwargs.update({"pad_token": "<PAD>"}) return super().get_tokenizers(**kwargs) def test_full_tokenizer(self): tokenizer = LlamaTokenizer(SAMPLE_VOCAB, keep_accents=True) tokens = tokenizer.tokenize("This is a test") self.assertListEqual(tokens, ["▁This", "▁is", "▁a", "▁t", "est"]) self.assertListEqual( tokenizer.convert_tokens_to_ids(tokens), [285, 46, 10, 170, 382], ) tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.") self.assertListEqual( tokens, [ SPIECE_UNDERLINE + "I", SPIECE_UNDERLINE + "was", SPIECE_UNDERLINE + "b", "or", "n", SPIECE_UNDERLINE + "in", SPIECE_UNDERLINE + "", "9", "2", "0", "0", "0", ",", SPIECE_UNDERLINE + "and", SPIECE_UNDERLINE + "this", SPIECE_UNDERLINE + "is", SPIECE_UNDERLINE + "f", "al", "s", "é", ".", ], ) ids = tokenizer.convert_tokens_to_ids(tokens) self.assertListEqual( ids, [8, 21, 84, 55, 24, 19, 7, 0, 602, 347, 347, 347, 3, 12, 66, 46, 72, 80, 6, 0, 4], ) back_tokens = tokenizer.convert_ids_to_tokens(ids) self.assertListEqual( back_tokens, [ SPIECE_UNDERLINE + "I", SPIECE_UNDERLINE + "was", SPIECE_UNDERLINE + "b", "or", "n", SPIECE_UNDERLINE + "in", SPIECE_UNDERLINE + "", "<unk>", "2", "0", "0", "0", ",", SPIECE_UNDERLINE + "and", SPIECE_UNDERLINE + "this", SPIECE_UNDERLINE + "is", SPIECE_UNDERLINE + "f", "al", "s", "<unk>", ".", ], ) @unittest.skip("Let's wait for the fast tokenizer!") def test_save_pretrained(self): self.tokenizers_list += (self.rust_tokenizer_class, "hf-internal-testing/llama-tokenizer", {}) for tokenizer, pretrained_name, kwargs in self.tokenizers_list: with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs) tmpdirname2 = tempfile.mkdtemp() tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2) tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) # Checks it save with the same files + the tokenizer.json file for the fast one self.assertTrue(any("tokenizer.json" in f for f in tokenizer_r_files)) tokenizer_r_files = tuple(f for f in tokenizer_r_files if "tokenizer.json" not in f) self.assertSequenceEqual(tokenizer_r_files, tokenizer_p_files) # Checks everything loads correctly in the same way tokenizer_rp = tokenizer_r.from_pretrained(tmpdirname2) tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2) # Check special tokens are set accordingly on Rust and Python for key in tokenizer_pp.special_tokens_map: self.assertTrue(hasattr(tokenizer_rp, key)) shutil.rmtree(tmpdirname2) # Save tokenizer rust, legacy_format=True tmpdirname2 = tempfile.mkdtemp() tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2, legacy_format=True) tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) # Checks it save with the same files self.assertSequenceEqual(tokenizer_r_files, tokenizer_p_files) # Checks everything loads correctly in the same way tokenizer_rp = 
tokenizer_r.from_pretrained(tmpdirname2) tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2) # Check special tokens are set accordingly on Rust and Python for key in tokenizer_pp.special_tokens_map: self.assertTrue(hasattr(tokenizer_rp, key)) shutil.rmtree(tmpdirname2) # Save tokenizer rust, legacy_format=False tmpdirname2 = tempfile.mkdtemp() tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2, legacy_format=False) tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) # Checks it saved the tokenizer.json file self.assertTrue(any("tokenizer.json" in f for f in tokenizer_r_files)) # Checks everything loads correctly in the same way tokenizer_rp = tokenizer_r.from_pretrained(tmpdirname2) tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2) # Check special tokens are set accordingly on Rust and Python for key in tokenizer_pp.special_tokens_map: self.assertTrue(hasattr(tokenizer_rp, key)) shutil.rmtree(tmpdirname2) @require_torch def test_batch_tokenization(self): if not self.test_seq2seq: return tokenizers = self.get_tokenizers() for tokenizer in tokenizers: with self.subTest(f"{tokenizer.__class__.__name__}"): # Longer text that will definitely require truncation. text = [ " UN Chief Says There Is No Military Solution in Syria", " Secretary-General Ban Ki-moon says his response to Russia's stepped up military support for" " Syria is that 'there is no military solution' to the nearly five-year conflict and more weapons" " will only worsen the violence and misery for millions of people.", ] try: batch = tokenizer( text=text, max_length=3, max_target_length=10, return_tensors="pt", ) except NotImplementedError: return self.assertEqual(batch.input_ids.shape[1], 3) # max_target_length will default to max_length if not specified batch = tokenizer(text, max_length=3, return_tensors="pt") self.assertEqual(batch.input_ids.shape[1], 3) batch_encoder_only = tokenizer(text=text, max_length=3, max_target_length=10, return_tensors="pt") self.assertEqual(batch_encoder_only.input_ids.shape[1], 3) self.assertEqual(batch_encoder_only.attention_mask.shape[1], 3) self.assertNotIn("decoder_input_ids", batch_encoder_only) @unittest.skip("Unfortunately way too slow to build a BPE with SentencePiece.") def test_save_slow_from_fast_and_reload_fast(self): pass def test_special_tokens_initialization(self): for tokenizer, pretrained_name, kwargs in self.tokenizers_list: with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): added_tokens = [AddedToken("<special>", lstrip=True)] tokenizer_r = self.rust_tokenizer_class.from_pretrained( pretrained_name, additional_special_tokens=added_tokens, **kwargs ) r_output = tokenizer_r.encode("Hey this is a <special> token") special_token_id = tokenizer_r.encode("<special>", add_special_tokens=False)[0] self.assertTrue(special_token_id in r_output) if self.test_slow_tokenizer: tokenizer_cr = self.rust_tokenizer_class.from_pretrained( pretrained_name, additional_special_tokens=added_tokens, **kwargs, # , from_slow=True <- unfortunately too slow to convert ) tokenizer_p = self.tokenizer_class.from_pretrained( pretrained_name, additional_special_tokens=added_tokens, **kwargs ) p_output = tokenizer_p.encode("Hey this is a <special> token") cr_output = tokenizer_cr.encode("Hey this is a <special> token") self.assertEqual(p_output, r_output) self.assertEqual(cr_output, r_output) self.assertTrue(special_token_id in p_output) self.assertTrue(special_token_id in cr_output) @slow def test_tokenizer_integration(self): expected_encoding = {'input_ids': 
[[1, 4103, 689, 414, 313, 24784, 368, 2998, 408, 282, 3637, 25350, 29899, 9067, 414, 322, 282, 3637, 25350, 29899, 1457, 3018, 1312, 29899, 2151, 29897, 8128, 2498, 29899, 15503, 4220, 6956, 1973, 313, 13635, 29911, 29892, 402, 7982, 29899, 29906, 29892, 1528, 13635, 29911, 29874, 29892, 1060, 26369, 29892, 6652, 309, 29933, 814, 29892, 1060, 29931, 6779, 11410, 363, 18385, 17088, 7634, 11235, 313, 25103, 29965, 29897, 322, 18385, 17088, 28203, 313, 25103, 29954, 29897, 411, 975, 29871, 29941, 29906, 29974, 758, 3018, 1312, 4733, 297, 29871, 29896, 29900, 29900, 29974, 10276, 322, 6483, 1006, 3372, 3097, 1546, 435, 1165, 29892, 10772, 29911, 25350, 322, 323, 6073, 17907, 29889], [1, 350, 20161, 338, 8688, 304, 758, 29899, 14968, 6483, 21000, 8684, 284, 22540, 515, 443, 29880, 24025, 1426, 491, 14002, 368, 4195, 292, 373, 1716, 2175, 322, 1492, 3030, 297, 599, 15359, 29889], [1, 450, 4996, 17354, 1701, 29916, 432, 17204, 975, 278, 17366, 11203, 29889]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]} # fmt: skip self.tokenizer_integration_test_util( expected_encoding=expected_encoding, model_name="hf-internal-testing/llama-tokenizer", revision="0984d03108b1a041ed679bd253b6519b7e1a4778", padding=False, ) def test_picklable(self): with tempfile.NamedTemporaryFile() as f: shutil.copyfile(SAMPLE_VOCAB, f.name) tokenizer = LlamaTokenizer(f.name, keep_accents=True) pickled_tokenizer = pickle.dumps(tokenizer) pickle.loads(pickled_tokenizer) @unittest.skip("worker 'gw4' crashed on CI, passing locally.") def test_pickle_subword_regularization_tokenizer(self): pass @unittest.skip("worker 'gw4' crashed on CI, passing locally.") def test_subword_regularization_tokenizer(self): pass def test_add_prefix_space(self): pretrained_name = "hf-internal-testing/llama-tokenizer-non-normalized" inputs = "Hey how are you doing" EXPECTED_WITH_SPACE = [1, 18637, 920, 526, 366, 2599] EXPECTED_WO_SPACE = [1, 29950, 1032, 920, 526, 366, 2599] slow_ = self.tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=False, legacy=False) fast_ = self.rust_tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=False, legacy=False) self.assertEqual(slow_.encode(inputs), EXPECTED_WO_SPACE) self.assertEqual(slow_.encode(inputs), fast_.encode(inputs)) self.assertEqual(slow_.tokenize(inputs), ["H", "ey", "▁how", "▁are", "▁you", "▁doing"]) self.assertEqual(slow_.decode(EXPECTED_WO_SPACE, skip_special_tokens=True), inputs) self.assertEqual( slow_.decode(EXPECTED_WO_SPACE, skip_special_tokens=True), fast_.decode(EXPECTED_WO_SPACE, skip_special_tokens=True), ) slow_ = self.tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=True, legacy=False) fast_ = self.rust_tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=True, legacy=False) self.assertEqual(slow_.encode(inputs), EXPECTED_WITH_SPACE) self.assertEqual(slow_.encode(inputs), fast_.encode(inputs)) self.assertEqual(slow_.tokenize(inputs), ["▁Hey", "▁how", "▁are", "▁you", "▁doing"]) self.assertEqual(slow_.decode(EXPECTED_WITH_SPACE, skip_special_tokens=True), inputs) self.assertEqual( 
slow_.decode(EXPECTED_WITH_SPACE, skip_special_tokens=True), fast_.decode(EXPECTED_WITH_SPACE, skip_special_tokens=True), )
    [huggingface/accelerate] examples/inference/pippy/llama.py
    # sdpa implementation which is the default torch>2.1.2 fails with the tracing + attention mask kwarg
    # with attn_implementation="eager" mode, the forward is very slow for some reason
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf", low_cpu_mem_usage=True, attn_implementation="sdpa"
    )
    model.eval()

    # Input configs
    # Create example inputs for the model
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/llama2_chat.py
    def tokenize_prompt(self, prompt): conv = next(self.prompter.build_prompt(prompt)) conversation_str = conv.get_prompt() # Tokenize conversations input_ids = self.tokenizer( conversation_str, return_tensors="pt", padding="max_length", max_length=self.sequence_len, truncation=True, ).input_ids[0] target = input_ids.clone() # Mask targets. Only compute loss on the assistant outputs. sep = conv.roles[1] total_len = int(target.ne(self.tokenizer.pad_token_id).sum()) turns = conversation_str.split(conv.sep2) cur_len = 1 target[:cur_len] = IGNORE_TOKEN_ID for turn in turns: if turn == "": break turn_len = len(self.tokenizer(turn).input_ids) parts = turn.split(sep) if len(parts) != 2: break parts[0] += sep # "-1" is hardcoded for the LLaMA tokenizer to make the offset correct. instruction_len = len(self.tokenizer(parts[0]).input_ids) - 1 # Ignore the user instructions target[cur_len - 1 : cur_len + instruction_len] = IGNORE_TOKEN_ID cur_len += turn_len + 2 # due to length of role token target[cur_len:] = IGNORE_TOKEN_ID if cur_len < self.sequence_len: if cur_len != total_len: target[:] = IGNORE_TOKEN_ID logging.warning( f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)" ) attention_mask = input_ids.ne(self.tokenizer.pad_token_id).tolist() input_ids = input_ids.tolist() target = target.tolist() # this is a fix for the tokenizer which tokenizes [ differently with eos tokens and # follows the original llama implementation for i in range(2, total_len - 2): if input_ids[i] == 29961: input_ids[i] = 518 if target[i] == 29961: target[i] = 518 return { "input_ids": input_ids, "labels": target, "attention_mask": attention_mask, }
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/llama2_chat.py
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer.add_special_tokens(
            {"pad_token": getattr(self.tokenizer, "pad_token", "<pad>")}
        )  # https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/added_tokens.json
    [openaccess-ai-collective/axolotl] examples/tiny-llama/pretrain.yml
    base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false max_steps: 200 pretraining_dataset: path: c4 name: en type: pretrain dataset_prepared_path: val_set_size: 0.0 output_dir: ./outputs/model-out sequence_len: 2048 sample_packing: true wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 10 evals_per_epoch: eval_table_size: saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/llama2_chat.py
    class LLama2ChatTokenizingStrategy(PromptTokenizingStrategy): """ Tokenizing strategy for ShareGPT prompts. adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py """ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.tokenizer.add_special_tokens( {"pad_token": getattr(self.tokenizer, "pad_token", "<pad>")} ) # https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/added_tokens.json def tokenize_prompt(self, prompt): conv = next(self.prompter.build_prompt(prompt)) conversation_str = conv.get_prompt() # Tokenize conversations input_ids = self.tokenizer( conversation_str, return_tensors="pt", padding="max_length", max_length=self.sequence_len, truncation=True, ).input_ids[0] target = input_ids.clone() # Mask targets. Only compute loss on the assistant outputs. sep = conv.roles[1] total_len = int(target.ne(self.tokenizer.pad_token_id).sum()) turns = conversation_str.split(conv.sep2) cur_len = 1 target[:cur_len] = IGNORE_TOKEN_ID for turn in turns: if turn == "": break turn_len = len(self.tokenizer(turn).input_ids) parts = turn.split(sep) if len(parts) != 2: break parts[0] += sep # "-1" is hardcoded for the LLaMA tokenizer to make the offset correct. instruction_len = len(self.tokenizer(parts[0]).input_ids) - 1 # Ignore the user instructions target[cur_len - 1 : cur_len + instruction_len] = IGNORE_TOKEN_ID cur_len += turn_len + 2 # due to length of role token target[cur_len:] = IGNORE_TOKEN_ID if cur_len < self.sequence_len: if cur_len != total_len: target[:] = IGNORE_TOKEN_ID logging.warning( f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)" ) attention_mask = input_ids.ne(self.tokenizer.pad_token_id).tolist() input_ids = input_ids.tolist() target = target.tolist() # this is a fix for the tokenizer which tokenizes [ differently with eos tokens and # follows the original llama implementation for i in range(2, total_len - 2): if input_ids[i] == 29961: input_ids[i] = 518 if target[i] == 29961: target[i] = 518 return { "input_ids": input_ids, "labels": target, "attention_mask": attention_mask, }
    [openaccess-ai-collective/axolotl] examples/llama-2/fft_optimized.yml
    base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
    [openaccess-ai-collective/axolotl] src/axolotl/prompt_strategies/llama2_chat.py
    def load(tokenizer, cfg) -> LLama2ChatTokenizingStrategy:
        return LLama2ChatTokenizingStrategy(
            Llama2ChatPrompter(),
            tokenizer,
            cfg.train_on_inputs,
            cfg.sequence_len,
        )
    [openaccess-ai-collective/axolotl] examples/openllama-3b/config.yml
    base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: lora_alpha: lora_dropout: lora_target_modules: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/openllama-out gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false float16: true bf16: false fp16: false tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
    [openaccess-ai-collective/axolotl] examples/llama-3/instruct-lora-8b.yml
    base_model: meta-llama/Meta-Llama-3-8B-Instruct model_type: LlamaForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: true load_in_4bit: false strict: false chat_template: llama3 datasets: - path: fozziethebeat/alpaca_messages_2k_test type: chat_template chat_template: llama3 field_messages: messages message_field_role: role message_field_content: content roles: user: - user assistant: - assistant dataset_prepared_path: val_set_size: 0.05 output_dir: ./outputs/lora-out sequence_len: 4096 sample_packing: false pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true s2_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config:
    [huggingface/accelerate] examples/inference/pippy/llama.py
    # bs = 3
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    # Create a pipeline stage from the model
    # Using `auto` is equivalent to letting `device_map="auto"` figure
    # out device mapping and will also split the model according to the
    # number of total GPUs available if it fits on one GPU
    model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
    [openaccess-ai-collective/axolotl] examples/openllama-3b/qlora.yml
    base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: true strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: qlora lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: 8 lora_alpha: 32 lora_dropout: 0.05 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/qlora-out gradient_accumulation_steps: 1 micro_batch_size: 2 num_epochs: 4 optimizer: paged_adamw_32bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: false fp16: true tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
    [openaccess-ai-collective/axolotl] src/axolotl/utils/models.py
    def load_tokenizer(cfg): model_config = load_model_config(cfg) tokenizer_kwargs = {} use_fast = True # this is the default if cfg.tokenizer_use_fast is not None: use_fast = cfg.tokenizer_use_fast if cfg.tokenizer_legacy is not None: # True is the default w/ https://github.com/huggingface/transformers/pull/25224 tokenizer_kwargs["legacy"] = cfg.tokenizer_legacy tokenizer_cls = AutoTokenizer if cfg.tokenizer_type: tokenizer_cls = getattr(transformers, cfg.tokenizer_type) tokenizer = tokenizer_cls.from_pretrained( cfg.tokenizer_config, trust_remote_code=cfg.trust_remote_code or False, use_fast=use_fast, **tokenizer_kwargs, ) if ( tokenizer.__class__.__name__ in [ "LlamaTokenizer", "LlamaTokenizerFast", "CodeLlamaTokenizer", "CodeLlamaTokenizerFast", ] and hasattr(tokenizer, "pad_token") and not tokenizer.pad_token ): # set a pad_token, but use eos_token so we don't add a new token tokenizer.pad_token = LLAMA_DEFAULT_EOS_TOKEN if tokenizer.__class__.__name__ == "GPTNeoXTokenizerFast": tokenizer.add_special_tokens({"pad_token": "[PAD]"}) os.environ["TOKENIZERS_PARALLELISM"] = "false" # Mistral's official FA implementation requires left padding if cfg.is_mistral_derived_model and cfg.flash_attention and not cfg.sample_packing: tokenizer.padding_side = "left" # Qwen base only has single token, so we need to set the special tokens if cfg.is_qwen_derived_model: token_ids = ["bos_token_id", "eos_token_id", "pad_token_id", "unk_token_id"] for attr_name in token_ids: if getattr(tokenizer, attr_name) is None: setattr(tokenizer, attr_name, tokenizer.eod_id) token_names = ["bos_token", "eos_token", "pad_token", "unk_token"] for attr_name in token_names: if getattr(tokenizer, attr_name) is None: setattr(tokenizer, attr_name, "<|endoftext|>") additional_special_tokens = None if cfg.special_tokens: special_tokens = cfg.special_tokens.to_dict() additional_special_tokens = special_tokens.pop( "additional_special_tokens", None ) lora_modules_to_save = get_linear_embedding_layers(model_config.model_type) for k, val in special_tokens.items(): # check if new special token is not already in tokenizer and # is adapter training to make sure lora_modules_to_save is set # pylint: disable=too-many-boolean-expressions if ( (getattr(tokenizer, k) is None or getattr(tokenizer, k) != val) and (len(tokenizer.encode(val, add_special_tokens=False)) > 2) and cfg.adapter and ( not cfg.lora_modules_to_save or not all( x in cfg.lora_modules_to_save for x in lora_modules_to_save ) ) ): lora_modules_to_save = ", ".join( [f"`{x}`" for x in lora_modules_to_save] ) raise ValueError( f"Please set lora_modules_to_save to [{lora_modules_to_save}] when using an adapter and changing the special tokens." ) tokenizer.add_special_tokens( {k: AddedToken(val, rstrip=False, lstrip=False, normalized=False)} ) # If we add bos_token and eos_token, we need to update the post processor to # handle them correctly. # https://github.com/huggingface/transformers/pull/24132 bos_or_eos_in_special_tokens = ( "bos_token" in cfg.special_tokens and "eos_token" in cfg.special_tokens ) if ( tokenizer.__class__.__name__ in ( "LlamaTokenizerFast", "CodeLlamaTokenizerFast", ) and bos_or_eos_in_special_tokens ): tokenizer.update_post_processor() if cfg.tokens: tokenizer.add_tokens( [ AddedToken(token, rstrip=False, lstrip=False, normalized=False) for token in cfg.tokens ] ) # Additional special tokens are a List, and need to be treated differently than regular special # tokens. 
We add them after we have called `add_tokens` in case these additional special tokens # are new tokens. # # Usage: # # ```py # special_tokens: # additional_special_tokens: ["<|im_start|>", "<|im_end|>"] # ``` if additional_special_tokens is not None: tokenizer.add_special_tokens( {"additional_special_tokens": additional_special_tokens} ) with zero_only(): LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}") LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}") LOG.debug(f"PAD: {tokenizer.pad_token_id} / {tokenizer.pad_token}") LOG.debug(f"UNK: {tokenizer.unk_token_id} / {tokenizer.unk_token}") if cfg.chat_template: chat_template_string = chat_templates(cfg.chat_template) if cfg.default_system_message and cfg.chat_template == "chatml": chat_template_string = chat_template_string.replace( "You are a helpful assistant.", cfg.default_system_message ) tokenizer.chat_template = chat_template_string else: LOG.info( "No Chat template selected. Consider adding a chat template for easier inference." ) return tokenizer
    [huggingface/peft] docs/source/conceptual_guides/adapter.md

    Llama-Adapter

    Llama-Adapter is a method for adapting Llama into an instruction-following model. To help adapt the model for instruction-following, the adapter is trained with a 52K instruction-output dataset.

    A set of learnable adaption prompts are prefixed to the input instruction tokens. These are inserted into the upper layers of the model because it is better to learn with the higher-level semantics of the pretrained model. The instruction-output tokens prefixed to the input guide the adaption prompt to generate a contextual response.

    Figure: LLaMA-Adapter architecture (image: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/llama-adapter.png). Source: LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention (https://hf.co/papers/2303.16199)

    To avoid adding noise to the tokens, the adapter uses zero-initialized attention. On top of this, the adapter adds a learnable gating factor (initialized with zeros) to progressively add information to the model during training. This prevents overwhelming the model's pretrained knowledge with the newly learned instructions.
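    A minimal sketch of applying this method through PEFT's AdaptionPromptConfig (the base checkpoint and hyperparameter values below are illustrative assumptions, not prescriptions):

    from transformers import AutoModelForCausalLM
    from peft import AdaptionPromptConfig, get_peft_model

    # Assumed Llama-family base model; any causal Llama checkpoint should work.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # 10 learnable adaption-prompt tokens inserted into the top 30 layers.
    config = AdaptionPromptConfig(adapter_len=10, adapter_layers=30, task_type="CAUSAL_LM")
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # only the adaption prompts and gates are trainable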

    [huggingface/peft] examples/token_classification/peft_lora_token_cls.ipynb
            # the entire model is fine-tuned.
            tokens += [sep_token]
            token_boxes += [sep_token_box]
            actual_bboxes += [[0, 0, width, height]]
            label_ids += [pad_token_label_id]
            if sep_token_extra:
                # roberta uses an extra separator b/w pairs of sentences
                tokens += [sep_token]
                token_boxes += [sep_token_box]
                actual_bboxes += [[0, 0, width, height]]
                label_ids += [pad_token_label_id]
            segment_ids = [sequence_a_segment_id] * len(tokens)
    
            if cls_token_at_end:
                tokens += [cls_token]
                token_boxes += [cls_token_box]
                actual_bboxes += [[0, 0, width, height]]
                label_ids += [pad_token_label_id]
                segment_ids += [cls_token_segment_id]
            else:
                tokens = [cls_token] + tokens
                token_boxes = [cls_token_box] + token_boxes
                actual_bboxes = [[0, 0, width, height]] + actual_bboxes
                label_ids = [pad_token_label_id] + label_ids
                segment_ids = [cls_token_segment_id] + segment_ids
    
            input_ids = tokenizer.convert_tokens_to_ids(tokens)
    
            # The mask has 1 for real tokens and 0 for padding tokens. Only real
            # tokens are attended to.
            input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
    
            # Zero-pad up to the sequence length.
            padding_length = max_seq_length - len(input_ids)
            if pad_on_left:
                input_ids = ([pad_token] * padding_length) + input_ids
                input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
                segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
                label_ids = ([pad_token_label_id] * padding_length) + label_ids
                token_boxes = ([pad_token_box] * padding_length) + token_boxes
            else:
                input_ids += [pad_token] * padding_length
                input_mask += [0 if mask_padding_with_zero else 1] * padding_length
                segment_ids += [pad_token_segment_id] * padding_length
                label_ids += [pad_token_label_id] * padding_length
                token_boxes += [pad_token_box] * padding_length
    
            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length
            assert len(label_ids) == max_seq_length
            assert len(token_boxes) == max_seq_length
    
            if ex_index < 5:
                logger.info("*** Example ***")
                logger.info("guid: %s", example.guid)
                logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
                logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
                logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
                logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
                logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
                logger.info("boxes: %s", " ".join([str(x) for x in token_boxes]))
                logger.info("actual_bboxes: %s", " ".join([str(x) for x in actual_bboxes]))
    
            features.append(
                InputFeatures(
                    input_ids=input_ids,
                    input_mask=input_mask,
                    segment_ids=segment_ids,
                    label_ids=label_ids,
                    boxes=token_boxes,
                    actual_bboxes=actual_bboxes,
                    file_name=file_name,
                    page_size=page_size,
                )
            )
        return features
    
    from transformers import LayoutLMTokenizer
    
    # from .unilm.layoutlm.data.funsd import FunsdDataset, InputFeatures
    from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
    
    batch_size = 16
    args = {
        "local_rank": -1,
        "overwrite_cache": True,
        "data_dir": "/home/sourab/temp/data/",
        "model_name_or_path": "microsoft/layoutlm-base-uncased",
        "max_seq_length": 512,
        "model_type": "layoutlm",
    }
    
    
    # class to turn the keys of a dict into attributes (thanks Stackoverflow)
    class AttrDict(dict):
        def __init__(self, *args, **kwargs):
            super(AttrDict, self).__init__(*args, **kwargs)
            self.__dict__ = self
    
    
    args = AttrDict(args)
    
    tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
    
    # the LayoutLM authors already defined a specific FunsdDataset, so we are going to use this here
    train_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="train")
    train_sampler = RandomSampler(train_dataset)
    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
    
    eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="test")
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=batch_size)
    
    len(train_dataloader)
    
    len(eval_dataloader)
    
    batch = next(iter(train_dataloader))
    input_ids = batch[0][0]
    tokenizer.decode(input_ids)
    

    Define and fine-tune the model

    As this is a sequence labeling task, we are going to load LayoutLMForTokenClassification (the base sized model) from the hub. We are going to fine-tune it on a downstream task, namely FUNSD.

    from peft import get_peft_config, PeftModel, get_peft_model, LoraConfig, TaskType
    
    peft_config = LoraConfig(
        task_type=TaskType.TOKEN_CLS, inference_mode=False, r=16, lora_alpha=16, lora_dropout=0.1, bias="all"
    )
    peft_config
    
    from transformers import LayoutLMForTokenClassification
    import torch
    from transformers import set_seed
    
    seed = 100
    set_seed(seed)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=num_labels)
    model = get_peft_model(model, peft_config)
    model.to(device)
    
    print(model.model.layoutlm.encoder.layer[0].attention.self.query.weight)
    print(model.model.layoutlm.encoder.layer[0].attention.self.query.lora_A.weight)
    print(model.model.classifier.weight)
    

    Now we can start training:

    from transformers import AdamW, get_linear_schedule_with_warmup
    from tqdm import tqdm
    
    num_train_epochs = 100
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0.06 * (len(train_dataloader) * num_train_epochs),
        num_training_steps=(len(train_dataloader) * num_train_epochs),
    )
    
    
    global_step = 0
    
    t_total = len(train_dataloader) * num_train_epochs  # total number of training steps
    
    # put the model in training mode
    model.train()
    for epoch in range(num_train_epochs):
        for batch in tqdm(train_dataloader, desc="Training"):
            input_ids = batch[0].to(device)
            bbox = batch[4].to(device)
            attention_mask = batch[1].to(device)
            token_type_ids = batch[2].to(device)
            labels = batch[3].to(device)
    
            # forward pass
            outputs = model(
                input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels
            )
            loss = outputs.loss
    
            # print loss every 10 steps
            if global_step % 10 == 0:
                print(f"Loss after {global_step} steps: {loss.item()}")
    
            # backward pass to get the gradients
            loss.backward()
    
            # print("Gradients on classification head:")
            # print(model.classifier.weight.grad[6,:].sum())
    
            # update
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            global_step += 1
    
    import numpy as np
    from seqeval.metrics import (
        classification_report,
        f1_score,
        precision_score,
        recall_score,
    )
    
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = None
    out_label_ids = None
    
    # put model in evaluation mode
    model.eval()
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        with torch.no_grad():
            input_ids = batch[0].to(device)
            bbox = batch[4].to(device)
            attention_mask = batch[1].to(device)
            token_type_ids = batch[2].to(device)
            labels = batch[3].to(device)
    
            # forward pass
            outputs = model(
                input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels
            )
            # get the loss and logits
            tmp_eval_loss = outputs.loss
            logits = outputs.logits
    
            eval_loss += tmp_eval_loss.item()
            nb_eval_steps += 1
    
            # compute the predictions
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = labels.detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
    
    # compute average evaluation loss
    eval_loss = eval_loss / nb_eval_steps
    preds = np.argmax(preds, axis=2)
    
    out_label_list = [[] for _ in range(out_label_ids.shape[0])]
    preds_list = [[] for _ in range(out_label_ids.shape[0])]
    
    for i in range(out_label_ids.shape[0]):
        for j in range(out_label_ids.shape[1]):
            if out_label_ids[i, j] != pad_token_label_id:
                out_label_list[i].append(label_map[out_label_ids[i][j]])
                preds_list[i].append(label_map[preds[i][j]])
    
    results = {
        "loss": eval_loss,
        "precision": precision_score(out_label_list, preds_list),
        "recall": recall_score(out_label_list, preds_list),
        "f1": f1_score(out_label_list, preds_list),
    }
    print(results)
    
    model.print_trainable_parameters()
    
    model.save_pretrained("peft_layoutlm")
    
    !du -h "peft_layoutlm/adapter_model.bin"
    
    [huggingface/peft] examples/loftq_finetuning/train_gsm8k_llama.py
    def preprocess_function_test(examples):
        sources = examples[source_column]
        labels = examples[target_column]

        inputs = [source + task_prompt for source in sources]

        model_inputs = tokenizer(inputs, max_length=args.max_source_length, padding=padding, truncation=True)
        labels = tokenizer(labels, max_length=args.max_target_length, padding=padding, truncation=True)

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    [huggingface/peft] docs/source/package_reference/llama_adapter.md

    Llama-Adapter

    Llama-Adapter is a PEFT method specifically designed for turning Llama into an instruction-following model. The Llama model is frozen and only a set of adaptation prompts prefixed to the input instruction tokens are learned. Since randomly initialized modules inserted into the model can cause the model to lose some of its existing knowledge, Llama-Adapter uses zero-initialized attention with zero gating to progressively add the instructional prompts to the model.

    The abstract from the paper is:

    We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the input text tokens at higher transformer layers. Then, a zero-init attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With efficient training, LLaMA-Adapter generates high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Furthermore, our approach can be simply extended to multi-modal input, e.g., images, for image-conditioned LLaMA, which achieves superior reasoning capacity on ScienceQA. We release our code at https://github.com/ZrrSkywalker/LLaMA-Adapter.

    AdaptionPromptConfig

    [[autodoc]] tuners.adaption_prompt.config.AdaptionPromptConfig

    AdaptionPromptModel

    [[autodoc]] tuners.adaption_prompt.model.AdaptionPromptModel

    [huggingface/peft] examples/pissa_finetuning/preprocess.py
    tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_name_or_path)
    tokenizer.pad_token_id = tokenizer.eos_token_id
    lora_config = LoraConfig(
        r=script_args.lora_r,
        lora_alpha=script_args.lora_alpha,
        init_lora_weights=script_args.init_lora_weights,
        lora_dropout=script_args.lora_dropout,
        target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    peft_model = get_peft_model(model, lora_config)
    [openaccess-ai-collective/axolotl] README.md

    Train

    Run

    accelerate launch -m axolotl.cli.train your_config.yml

    [!TIP] You can also reference a config file that is hosted on a public URL, for example accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml

    Preprocess dataset

    You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.

    • Set dataset_prepared_path: to a local folder for saving and loading pre-tokenized dataset.
    • (Optional): Set push_dataset_to_hub: hf_user/repo to push it to Huggingface.
    • (Optional): Use --debug to see preprocessed examples.
    python -m axolotl.cli.preprocess your_config.yml

    Multi-GPU

    Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.

    DeepSpeed

    Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated

    We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.

    deepspeed: deepspeed_configs/zero1.json
    accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
    FSDP
    • llama FSDP
    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_offload_params: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
    FSDP + QLoRA

    Axolotl supports training with FSDP and QLoRA, see these docs for more information.
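
    As a rough sketch, a config combining the two might include settings like the following (illustrative only; see the linked docs for the authoritative FSDP + QLoRA options):

    adapter: qlora
    load_in_4bit: true
    fsdp:
      - full_shard
      - auto_wrap
    fsdp_config:
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer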

    Weights & Biases Logging

    Make sure your WANDB_API_KEY environment variable is set (recommended) or you login to wandb with wandb login.

    • wandb options
    wandb_mode:
    wandb_project:
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
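
    A filled-in sketch (the project, entity, and run names are placeholders):

    wandb_mode: online
    wandb_project: my-axolotl-project
    wandb_entity: my-team
    wandb_name: llama-finetune-run-1
    wandb_watch:
    wandb_log_model:
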
    Special Tokens

    It is important to have special tokens such as delimiters, end-of-sequence, and beginning-of-sequence tokens in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:

    special_tokens:
      bos_token: "<s>"
      eos_token: "</s>"
      unk_token: "<unk>"
    tokens: # these are delimiters
      - "<|im_start|>"
      - "<|im_end|>"

    When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.

    [huggingface/accelerate] examples/by_feature/cross_validation.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/memory.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/multi_process_metrics.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/early_stopping.py
    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs
    [huggingface/accelerate] examples/by_feature/megatron_lm_gpt_pretraining.py
    def tokenize_function(examples):
        return tokenizer(examples[text_column_name])
    [openaccess-ai-collective/axolotl] README.md

    Common Errors 🧰

    See also the FAQs and debugging guide.

    If you encounter a 'Cuda out of memory' error, it means your GPU ran out of memory during the training process. Here's how to resolve it:

    Please reduce any of the below:

    • micro_batch_size
    • eval_batch_size
    • gradient_accumulation_steps
    • sequence_len

    If it does not help, try running without deepspeed and without accelerate (replace "accelerate launch" with "python") in the command.

    Using adamw_bnb_8bit might also save you some memory.
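
    As a sketch, a memory-constrained run might dial these settings down to something like the following (values are illustrative starting points, not recommendations):

    micro_batch_size: 1
    eval_batch_size: 1
    gradient_accumulation_steps: 1
    sequence_len: 1024
    optimizer: adamw_bnb_8bit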

    failed (exitcode: -9)

    This usually means your system has run out of system memory. As with VRAM issues, consider reducing the same settings listed above. Additionally, look into upgrading your system RAM, which should be simpler than a GPU upgrade.

    RuntimeError: expected scalar type Float but found Half

    Try setting fp16: true

    NotImplementedError: No operator found for memory_efficient_attention_forward ...

    Try turning off xformers.

    accelerate config missing

    It's safe to ignore it.

    NCCL Timeouts during training

    See the NCCL guide.

    Tokenization Mismatch b/w Inference & Training

    For many formats, Axolotl constructs prompts by concatenating token ids after tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.

    If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:

    1. Materialize some data using python -m axolotl.cli.preprocess your_config.yml --debug, and then decode the first few rows with your model's tokenizer.
    2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
    3. Make sure the inference string from #2 looks exactly like the data you fine-tuned on from #1, including spaces and new lines. If they aren't the same, adjust your inference server accordingly.
    4. As an additional troubleshooting step, you can compare the token ids from steps 1 and 2 to make sure they are identical.

    Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See this blog post for a concrete example.

    [openaccess-ai-collective/axolotl] README.md

    Advanced Setup

    Environment

    Docker

    docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest

    Or run on the current files for development:

    docker compose up -d

    [!Tip] If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.

    <details> <summary>Docker advanced</summary>

    A more powerful Docker command to run would be this:

    docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

    It additionally:

    • Prevents memory issues when running e.g. deepspeed (you could otherwise hit a SIGBUS/signal 7 error) through the --ipc and --ulimit args.
    • Persists the downloaded HF data (models etc.) and your modifications to axolotl code through --mount/-v args.
    • The --name argument simply makes it easier to refer to the container in vscode (Dev Containers: Attach to Running Container...) or in your terminal.
    • The --privileged flag gives all capabilities to the container.
    • The --shm-size 10g argument increases the shared memory size. Use this if you see exitcode: -7 errors using deepspeed.

    More information on nvidia website

    </details>

    Conda/Pip venv

    1. Install python >=3.10

    2. Install pytorch stable https://pytorch.org/get-started/locally/

    3. Install Axolotl along with python dependencies

      pip3 install packaging
      pip3 install -e '.[flash-attn,deepspeed]'
    4. (Optional) Login to Huggingface to use gated models/datasets.

      huggingface-cli login

      Get the token at huggingface.co/settings/tokens

    Cloud GPU

    For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest

    Bare Metal Cloud GPU

    LambdaLabs
    <details> <summary>Click to Expand</summary>
    1. Install python
    sudo apt update
    sudo apt install -y python3.10
    sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
    sudo update-alternatives --config python # pick 3.10 if given option
    python -V # should be 3.10
    2. Install pip
    wget https://bootstrap.pypa.io/get-pip.py
    python get-pip.py
    3. Install Pytorch https://pytorch.org/get-started/locally/

    4. Follow instructions on quickstart.

    5. Run
    pip3 install protobuf==3.20.3
    pip3 install -U --ignore-installed requests Pillow psutil scipy
    6. Set path
    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
    </details>
    GCP
    <details> <summary>Click to Expand</summary>

    Use a Deep Learning Linux OS image with CUDA and PyTorch installed. Then follow the instructions on the quickstart.

    Make sure to run the below to uninstall xla.

    pip uninstall -y torch_xla[tpu]
    </details>

    Windows

    Please use WSL or Docker!

    Mac

    Use the below instead of the install method in QuickStart.

    pip3 install -e '.'
    

    More info: mac.md

    Google Colab

    Please use this example notebook.

    Launching on public clouds via SkyPilot

    To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:

    pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds sky check

    Get the example YAMLs of using Axolotl to finetune mistralai/Mistral-7B-v0.1:

    git clone https://github.com/skypilot-org/skypilot.git
    cd skypilot/llm/axolotl
    

    Use one command to launch:

    # On-demand
    HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

    # Managed spot (auto-recovery on preemption)
    HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET

    Launching on public clouds via dstack

    To launch on GPU instances (both on-demand and spot) on public clouds (GCP, AWS, Azure, Lambda Labs, TensorDock, Vast.ai, and CUDO), you can use dstack.

    Write a job description in YAML as below:

    # dstack.yaml
    type: task
    image: winglian/axolotl-cloud:main-20240429-py3.11-cu121-2.2.2
    env:
      - HUGGING_FACE_HUB_TOKEN
      - WANDB_API_KEY
    commands:
      - accelerate launch -m axolotl.cli.train config.yaml
    ports:
      - 6006
    resources:
      gpu:
        memory: 24GB..
        count: 2

    Then, simply run the job with the dstack run command. Append the --spot option if you want a spot instance. dstack run will show you the instance with the cheapest price across multiple cloud services:

    pip install dstack
    HUGGING_FACE_HUB_TOKEN=xxx WANDB_API_KEY=xxx dstack run . -f dstack.yaml # --spot

    For more fine-grained use cases, please refer to the official dstack documents and the detailed description of the axolotl example in the official repository.

    Dataset

    Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.

    See these docs for more information on how to use different dataset formats.
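
    For instance, with the alpaca prompt format each line of the JSONL is a single JSON record; a made-up example row might look like:

    {"instruction": "Summarize the following text.", "input": "Axolotl supports a variety of dataset formats.", "output": "Axolotl is flexible about dataset formats."}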

    Config

    See the examples for a quick start. It is recommended to duplicate an example config and modify it to your needs. The most important options are:

    • model

      base_model: ./llama-7b-hf # local or huggingface repo

      Note: The code will load the right architecture.

    • dataset

      datasets:
        # huggingface repo
        - path: vicgalle/alpaca-gpt4
          type: alpaca

        # huggingface repo with specific configuration/subset
        - path: EleutherAI/pile
          name: enron_emails
          type: completion # format from earlier
          field: text # Optional[str] default: text, field to use for completion data

        # huggingface repo with multiple named configurations/subsets
        - path: bigcode/commitpackft
          name:
            - ruby
            - python
            - typescript
          type: ... # unimplemented custom format

        # fastchat conversation
        # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
        - path: ...
          type: sharegpt
          conversation: chatml # default: vicuna_v1.1

        # local
        - path: data.jsonl # or json
          ds_type: json # see other options below
          type: alpaca

        # dataset with splits, but no train split
        - path: knowrohit07/know_sql
          type: context_qa.load_v2
          train_on_split: validation

        # loading from s3 or gcs
        # s3 creds will be loaded from the system default and gcs only supports public access
        - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
          ...

        # Loading Data From a Public URL
        # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
        - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
          ds_type: json # this is the default, see other options below.
    • loading

      load_in_4bit: true
      load_in_8bit: true
      bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
      fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
      tf32: true # require >=ampere
      bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
      float16: true # use instead of fp16 when you don't want AMP

      Note: Repo does not do 4-bit quantization.

    • lora

      adapter: lora # 'qlora' or leave blank for full finetune
      lora_r: 8
      lora_alpha: 16
      lora_dropout: 0.05
      lora_target_modules:
        - q_proj
        - v_proj

    All Config Options

    See these docs for all config options.

    Inference Playground

    Axolotl allows you to load your model in an interactive terminal playground for quick experimentation. The config file is the same config file used for training.

    Pass the appropriate flag to the inference command, depending upon what kind of model was trained:

    • Pretrained LORA:
      python -m axolotl.cli.inference examples/your_config.yml --lora_model_dir="./lora-output-dir"
    • Full weights finetune:
      python -m axolotl.cli.inference examples/your_config.yml --base_model="./completed-model"
    • Full weights finetune w/ a prompt from a text file:
      cat /tmp/prompt.txt | python -m axolotl.cli.inference examples/your_config.yml \
        --base_model="./completed-model" --prompter=None --load_in_8bit=True

    • With gradio hosting

    python -m axolotl.cli.inference examples/your_config.yml --gradio

    Please use --sample_packing False if you have it on and receive an error similar to the one below:

    RuntimeError: stack expects each tensor to be equal size, but got [1, 32, 1, 128] at entry 0 and [1, 32, 8, 128] at entry 1

    Merge LORA to base

    The following command will merge your LORA adapter with your base model. You can optionally pass the argument --lora_model_dir to specify the directory where your LORA adapter was saved; otherwise, this will be inferred from output_dir in your axolotl config file. The merged model is saved in the sub-directory {lora_model_dir}/merged.

    python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"

    You may need to use the gpu_memory_limit and/or lora_on_cpu config options to avoid running out of memory. If you still run out of CUDA memory, you can try to merge in system RAM with

    CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora ...

    although this will be very slow, and using the config options above is recommended instead.
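
    A sketch of the config options mentioned above (the memory limit value is a placeholder; adjust for your hardware):

    gpu_memory_limit: 20GiB   # cap GPU memory used during the merge (placeholder value)
    lora_on_cpu: true         # keep LoRA weights on CPU while merging to reduce GPU memory pressure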

    [openaccess-ai-collective/axolotl] docs/config.qmd
    optim_target_modules:
    # - self_attn  # for llama
    # - mlp
    
    # Specify weight decay
    weight_decay:
    # adamw hyperparams
    adam_beta1:
    adam_beta2:
    adam_epsilon:
    # Gradient clipping max norm
    max_grad_norm:
    
    # Augmentation techniques
    # NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
    # currently only supported on Llama and Mistral
    neftune_noise_alpha:
    
    # Whether to use bettertransformers
    flash_optimum:
    # Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
    xformers_attention:
    # Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
    flash_attention:
    flash_attn_cross_entropy:  # Whether to use flash-attention cross entropy implementation - advanced use only
    flash_attn_rms_norm:  # Whether to use flash-attention rms norm implementation - advanced use only
    flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
    flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
    # Whether to use scaled-dot-product attention
    # https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
    sdp_attention:
    # Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
    s2_attention:
    # Resume from a specific checkpoint dir
    resume_from_checkpoint:
    # If resume_from_checkpoint isn't set and you simply want it to start where it left off.
    # Be careful with this being turned on between different models.
    auto_resume_from_checkpoints: false
    
    # Don't mess with this, it's here for accelerate and torchrun
    local_rank:
    
    # Add or change special tokens.
    # If you add tokens here, you don't need to add them to the `tokens` list.
    special_tokens:
      # bos_token: "<s>"
      # eos_token: "</s>"
      # unk_token: "<unk>"
      # pad_token: "[PAD]"
    
    # Add extra tokens.
    tokens:
    
    # FSDP
    fsdp:
    fsdp_config:
    
    # Deepspeed config path. e.g., deepspeed_configs/zero3.json
    deepspeed:
    
    # Advanced DDP Arguments
    ddp_timeout:
    ddp_bucket_cap_mb:
    ddp_broadcast_buffers:
    
    # Path to torch distx for optim 'adamw_anyprecision'
    torchdistx_path:
    
    # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
    pretraining_dataset:
    
    # Debug mode
    debug:
    
    # Seed
    seed:
    
    # Allow overwrite yml config using from cli
    strict:
    