Yes, you can customize the Llama tokenizer to tokenize text into single-character tokens by overriding its `tokenize` method. Since the Llama tokenizer is based on the `PreTrainedTokenizer` class from the Hugging Face `transformers` library, you can subclass the specific tokenizer used for Llama (e.g., `LlamaTokenizer` or `LlamaTokenizerFast`) and modify the `tokenize` method to split the text into individual characters.
Here's an example of how you could implement this:
```python
from transformers import LlamaTokenizer


class CharLevelLlamaTokenizer(LlamaTokenizer):
    def tokenize(self, text, **kwargs):
        # Split text into individual characters
        tokens = list(text)
        return tokens


# Initialize the tokenizer
tokenizer = CharLevelLlamaTokenizer.from_pretrained("Llama-model-identifier")

# Example text
text = "Hello, Llama!"

# Tokenize the text into single characters
tokens = tokenizer.tokenize(text)
print(tokens)
```
In this example, `CharLevelLlamaTokenizer` is a subclass of `LlamaTokenizer` that overrides the `tokenize` method. The new `tokenize` method simply splits the input text into a list of its constituent characters. This approach allows you to tokenize any text into single-character tokens, which can then be used for fine-tuning the Llama model.
Please replace `"Llama-model-identifier"` with the actual model identifier for the Llama model you are using (e.g., `"meta-llama/Llama-2-7b-hf"`).
Keep in mind that tokenizing text at the character level can significantly increase the sequence length, which may impact the model's performance and training/inference time. Additionally, you may need to adjust the model's maximum sequence length (`max_length`) to accommodate longer sequences of character-level tokens.
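To make that caveat concrete, here is a small sketch comparing sequence lengths under the default subword tokenizer and the character-level subclass defined above; the model identifier is a placeholder and the `model_max_length` value is only an example:

```python
# A minimal sketch, assuming the CharLevelLlamaTokenizer subclass above is defined
# and that the placeholder checkpoint below is accessible.
from transformers import LlamaTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; use your own model identifier

default_tok = LlamaTokenizer.from_pretrained(model_id)
char_tok = CharLevelLlamaTokenizer.from_pretrained(model_id)

text = "Hello, Llama!"
print(len(default_tok.tokenize(text)))  # a handful of subword tokens
print(len(char_tok.tokenize(text)))     # one token per character (13 here)

# Character-level tokens inflate sequence lengths, so keep the tokenizer's advertised
# maximum length in line with what the model (and your hardware) can actually handle.
char_tok.model_max_length = 4096  # example value only
```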
class LlamaTokenizerFast(PreTrainedTokenizerFast): """ Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. This uses notably ByteFallback and no normalization. ```python >>> from transformers import LlamaTokenizerFast >>> tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer") >>> tokenizer.encode("Hello this is a test") [1, 15043, 445, 338, 263, 1243] ``` If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, checkout [post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation. This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. Args: vocab_file (`str`, *optional*): [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer. tokenizer_file (`str`, *optional*): [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that contains everything needed to load the tokenizer. clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`): Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces. unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`): The end of sequence token. add_bos_token (`bool`, *optional*, defaults to `True`): Whether or not to add an `bos_token` at the start of sequences. add_eos_token (`bool`, *optional*, defaults to `False`): Whether or not to add an `eos_token` at the end of sequences. use_default_system_prompt (`bool`, *optional*, defaults to `False`): Whether or not the default system prompt for Llama should be used legacy (`bool`, *optional*): Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622 and #25224 which includes fixes to properly handle tokens that appear after special tokens. Make sure to also set `from_slow` to `True`. A simple example: - `legacy=True`: ```python >>> from transformers import LlamaTokenizerFast >>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=True, from_slow=True) >>> tokenizer.encode("Hello <s>.") # 869 is '▁.' [1, 15043, 29871, 1, 869] ``` - `legacy=False`: ```python >>> from transformers import LlamaTokenizerFast >>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=False, from_slow=True) >>> tokenizer.encode("Hello <s>.") # 29889 is '.' [1, 15043, 29871, 1, 29889] ``` Checkout the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details. 
add_prefix_space (`bool`, *optional*): Whether or not the tokenizer should automatically add a prefix space """ vocab_files_names = VOCAB_FILES_NAMES slow_tokenizer_class = LlamaTokenizer padding_side = "left" model_input_names = ["input_ids", "attention_mask"] def __init__( self, vocab_file=None, tokenizer_file=None, clean_up_tokenization_spaces=False, unk_token="<unk>", bos_token="<s>", eos_token="</s>", add_bos_token=True, add_eos_token=False, use_default_system_prompt=False, legacy=None, add_prefix_space=None, **kwargs, ): if legacy is None: logger.warning_once( f"You are using the default legacy behaviour of the {self.__class__}. This is" " expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you." " If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it" " means, and thoroughly read the reason why this was added as explained in" " https://github.com/huggingface/transformers/pull/24565" ) legacy = True self.legacy = legacy if add_prefix_space is not None: logger.warning_once( "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers" ) kwargs["from_slow"] = True super().__init__( vocab_file=vocab_file, tokenizer_file=tokenizer_file, clean_up_tokenization_spaces=clean_up_tokenization_spaces, unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, add_bos_token=add_bos_token, add_eos_token=add_eos_token, use_default_system_prompt=use_default_system_prompt, **kwargs, ) self._add_bos_token = add_bos_token self._add_eos_token = add_eos_token self.update_post_processor() self.use_default_system_prompt = use_default_system_prompt self.vocab_file = vocab_file @property def can_save_slow_tokenizer(self) -> bool: return os.path.isfile(self.vocab_file) if self.vocab_file else False def update_post_processor(self): """ Updates the underlying post processor with the current `bos_token` and `eos_token`. """ bos = self.bos_token bos_token_id = self.bos_token_id if bos is None and self.add_bos_token: raise ValueError("add_bos_token = True but bos_token = None") eos = self.eos_token eos_token_id = self.eos_token_id if eos is None and self.add_eos_token: raise ValueError("add_eos_token = True but eos_token = None") single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}" pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}" special_tokens = [] if self.add_bos_token: special_tokens.append((bos, bos_token_id)) if self.add_eos_token: special_tokens.append((eos, eos_token_id)) self._tokenizer.post_processor = processors.TemplateProcessing( single=single, pair=pair, special_tokens=special_tokens ) @property def add_eos_token(self): return self._add_eos_token @property def add_bos_token(self): return self._add_bos_token @add_eos_token.setter def add_eos_token(self, value): self._add_eos_token = value self.update_post_processor() @add_bos_token.setter def add_bos_token(self, value): self._add_bos_token = value self.update_post_processor() def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: if not self.can_save_slow_tokenizer: raise ValueError( "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow " "tokenizer." 
) if not os.path.isdir(save_directory): logger.error(f"Vocabulary path ({save_directory}) should be a directory") return out_vocab_file = os.path.join( save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] ) if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file): copyfile(self.vocab_file, out_vocab_file) return (out_vocab_file,) @property # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template def default_chat_template(self): """ LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages. Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering rather than needing special tokens. The system message is partly 'embedded' in the first user message, which results in an unusual token ordering when it is present. This template should definitely be changed if you wish to fine-tune a model with more flexible role ordering! The output should look something like: <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos> <bos>[INST] Prompt [/INST] The reference for this chat template is [this code snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362) in the original repository. """ template = ( "{% if messages[0]['role'] == 'system' %}" "{% set loop_messages = messages[1:] %}" # Extract system message if it's present "{% set system_message = messages[0]['content'] %}" "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}" "{% set loop_messages = messages %}" # Or use the default system message if the flag is set "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}" "{% else %}" "{% set loop_messages = messages %}" "{% set system_message = false %}" "{% endif %}" "{% for message in loop_messages %}" # Loop over all non-system messages "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}" "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}" "{% endif %}" "{% if loop.index0 == 0 and system_message != false %}" # Embed system message in first message "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}" "{% else %}" "{% set content = message['content'] %}" "{% endif %}" "{% if message['role'] == 'user' %}" # After all of that, handle messages/roles in a fairly normal way "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}" "{% elif message['role'] == 'system' %}" "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}" "{% elif message['role'] == 'assistant' %}" "{{ ' ' + content.strip() + ' ' + eos_token }}" "{% endif %}" "{% endfor %}" ) template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false") default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'") template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message) return template # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): bos_token_id = [self.bos_token_id] if self.add_bos_token else [] 
eos_token_id = [self.eos_token_id] if self.add_eos_token else [] output = bos_token_id + token_ids_0 + eos_token_id if token_ids_1 is not None: output = output + bos_token_id + token_ids_1 + eos_token_id return output
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize def tokenize(self, text: "TextInput", **kwargs) -> List[str]: """ Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the first token is special. """ if self.legacy or len(text) == 0: return super().tokenize(text, **kwargs) text = text.replace(SPIECE_UNDERLINE, " ") if self.add_prefix_space: text = SPIECE_UNDERLINE + text tokens = super().tokenize(text, **kwargs) if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens: tokens = tokens[1:] return tokens
"""Tokenization classes for LLaMA."""
class LlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase): from_pretrained_id = ["hf-internal-testing/llama-tokenizer", "meta-llama/Llama-2-7b-hf"] tokenizer_class = LlamaTokenizer rust_tokenizer_class = LlamaTokenizerFast test_rust_tokenizer = False test_sentencepiece = True from_pretrained_kwargs = {} def setUp(self): super().setUp() # We have a SentencePiece fixture for testing tokenizer = LlamaTokenizer(SAMPLE_VOCAB, keep_accents=True) tokenizer.pad_token = tokenizer.eos_token tokenizer.save_pretrained(self.tmpdirname) def get_tokenizers(self, **kwargs): kwargs.update({"pad_token": "<PAD>"}) return super().get_tokenizers(**kwargs) def test_full_tokenizer(self): tokenizer = LlamaTokenizer(SAMPLE_VOCAB, keep_accents=True) tokens = tokenizer.tokenize("This is a test") self.assertListEqual(tokens, ["▁This", "▁is", "▁a", "▁t", "est"]) self.assertListEqual( tokenizer.convert_tokens_to_ids(tokens), [285, 46, 10, 170, 382], ) tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.") self.assertListEqual( tokens, [ SPIECE_UNDERLINE + "I", SPIECE_UNDERLINE + "was", SPIECE_UNDERLINE + "b", "or", "n", SPIECE_UNDERLINE + "in", SPIECE_UNDERLINE + "", "9", "2", "0", "0", "0", ",", SPIECE_UNDERLINE + "and", SPIECE_UNDERLINE + "this", SPIECE_UNDERLINE + "is", SPIECE_UNDERLINE + "f", "al", "s", "é", ".", ], ) ids = tokenizer.convert_tokens_to_ids(tokens) self.assertListEqual( ids, [8, 21, 84, 55, 24, 19, 7, 0, 602, 347, 347, 347, 3, 12, 66, 46, 72, 80, 6, 0, 4], ) back_tokens = tokenizer.convert_ids_to_tokens(ids) self.assertListEqual( back_tokens, [ SPIECE_UNDERLINE + "I", SPIECE_UNDERLINE + "was", SPIECE_UNDERLINE + "b", "or", "n", SPIECE_UNDERLINE + "in", SPIECE_UNDERLINE + "", "<unk>", "2", "0", "0", "0", ",", SPIECE_UNDERLINE + "and", SPIECE_UNDERLINE + "this", SPIECE_UNDERLINE + "is", SPIECE_UNDERLINE + "f", "al", "s", "<unk>", ".", ], ) @unittest.skip("Let's wait for the fast tokenizer!") def test_save_pretrained(self): self.tokenizers_list += (self.rust_tokenizer_class, "hf-internal-testing/llama-tokenizer", {}) for tokenizer, pretrained_name, kwargs in self.tokenizers_list: with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs) tmpdirname2 = tempfile.mkdtemp() tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2) tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) # Checks it save with the same files + the tokenizer.json file for the fast one self.assertTrue(any("tokenizer.json" in f for f in tokenizer_r_files)) tokenizer_r_files = tuple(f for f in tokenizer_r_files if "tokenizer.json" not in f) self.assertSequenceEqual(tokenizer_r_files, tokenizer_p_files) # Checks everything loads correctly in the same way tokenizer_rp = tokenizer_r.from_pretrained(tmpdirname2) tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2) # Check special tokens are set accordingly on Rust and Python for key in tokenizer_pp.special_tokens_map: self.assertTrue(hasattr(tokenizer_rp, key)) shutil.rmtree(tmpdirname2) # Save tokenizer rust, legacy_format=True tmpdirname2 = tempfile.mkdtemp() tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2, legacy_format=True) tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) # Checks it save with the same files self.assertSequenceEqual(tokenizer_r_files, tokenizer_p_files) # Checks everything loads correctly in the same way tokenizer_rp = 
tokenizer_r.from_pretrained(tmpdirname2) tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2) # Check special tokens are set accordingly on Rust and Python for key in tokenizer_pp.special_tokens_map: self.assertTrue(hasattr(tokenizer_rp, key)) shutil.rmtree(tmpdirname2) # Save tokenizer rust, legacy_format=False tmpdirname2 = tempfile.mkdtemp() tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2, legacy_format=False) tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) # Checks it saved the tokenizer.json file self.assertTrue(any("tokenizer.json" in f for f in tokenizer_r_files)) # Checks everything loads correctly in the same way tokenizer_rp = tokenizer_r.from_pretrained(tmpdirname2) tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2) # Check special tokens are set accordingly on Rust and Python for key in tokenizer_pp.special_tokens_map: self.assertTrue(hasattr(tokenizer_rp, key)) shutil.rmtree(tmpdirname2) @require_torch def test_batch_tokenization(self): if not self.test_seq2seq: return tokenizers = self.get_tokenizers() for tokenizer in tokenizers: with self.subTest(f"{tokenizer.__class__.__name__}"): # Longer text that will definitely require truncation. text = [ " UN Chief Says There Is No Military Solution in Syria", " Secretary-General Ban Ki-moon says his response to Russia's stepped up military support for" " Syria is that 'there is no military solution' to the nearly five-year conflict and more weapons" " will only worsen the violence and misery for millions of people.", ] try: batch = tokenizer( text=text, max_length=3, max_target_length=10, return_tensors="pt", ) except NotImplementedError: return self.assertEqual(batch.input_ids.shape[1], 3) # max_target_length will default to max_length if not specified batch = tokenizer(text, max_length=3, return_tensors="pt") self.assertEqual(batch.input_ids.shape[1], 3) batch_encoder_only = tokenizer(text=text, max_length=3, max_target_length=10, return_tensors="pt") self.assertEqual(batch_encoder_only.input_ids.shape[1], 3) self.assertEqual(batch_encoder_only.attention_mask.shape[1], 3) self.assertNotIn("decoder_input_ids", batch_encoder_only) @unittest.skip("Unfortunately way too slow to build a BPE with SentencePiece.") def test_save_slow_from_fast_and_reload_fast(self): pass def test_special_tokens_initialization(self): for tokenizer, pretrained_name, kwargs in self.tokenizers_list: with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): added_tokens = [AddedToken("<special>", lstrip=True)] tokenizer_r = self.rust_tokenizer_class.from_pretrained( pretrained_name, additional_special_tokens=added_tokens, **kwargs ) r_output = tokenizer_r.encode("Hey this is a <special> token") special_token_id = tokenizer_r.encode("<special>", add_special_tokens=False)[0] self.assertTrue(special_token_id in r_output) if self.test_slow_tokenizer: tokenizer_cr = self.rust_tokenizer_class.from_pretrained( pretrained_name, additional_special_tokens=added_tokens, **kwargs, # , from_slow=True <- unfortunately too slow to convert ) tokenizer_p = self.tokenizer_class.from_pretrained( pretrained_name, additional_special_tokens=added_tokens, **kwargs ) p_output = tokenizer_p.encode("Hey this is a <special> token") cr_output = tokenizer_cr.encode("Hey this is a <special> token") self.assertEqual(p_output, r_output) self.assertEqual(cr_output, r_output) self.assertTrue(special_token_id in p_output) self.assertTrue(special_token_id in cr_output) @slow def test_tokenizer_integration(self): expected_encoding = {'input_ids': 
[[1, 4103, 689, 414, 313, 24784, 368, 2998, 408, 282, 3637, 25350, 29899, 9067, 414, 322, 282, 3637, 25350, 29899, 1457, 3018, 1312, 29899, 2151, 29897, 8128, 2498, 29899, 15503, 4220, 6956, 1973, 313, 13635, 29911, 29892, 402, 7982, 29899, 29906, 29892, 1528, 13635, 29911, 29874, 29892, 1060, 26369, 29892, 6652, 309, 29933, 814, 29892, 1060, 29931, 6779, 11410, 363, 18385, 17088, 7634, 11235, 313, 25103, 29965, 29897, 322, 18385, 17088, 28203, 313, 25103, 29954, 29897, 411, 975, 29871, 29941, 29906, 29974, 758, 3018, 1312, 4733, 297, 29871, 29896, 29900, 29900, 29974, 10276, 322, 6483, 1006, 3372, 3097, 1546, 435, 1165, 29892, 10772, 29911, 25350, 322, 323, 6073, 17907, 29889], [1, 350, 20161, 338, 8688, 304, 758, 29899, 14968, 6483, 21000, 8684, 284, 22540, 515, 443, 29880, 24025, 1426, 491, 14002, 368, 4195, 292, 373, 1716, 2175, 322, 1492, 3030, 297, 599, 15359, 29889], [1, 450, 4996, 17354, 1701, 29916, 432, 17204, 975, 278, 17366, 11203, 29889]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]} # fmt: skip self.tokenizer_integration_test_util( expected_encoding=expected_encoding, model_name="hf-internal-testing/llama-tokenizer", revision="0984d03108b1a041ed679bd253b6519b7e1a4778", padding=False, ) def test_picklable(self): with tempfile.NamedTemporaryFile() as f: shutil.copyfile(SAMPLE_VOCAB, f.name) tokenizer = LlamaTokenizer(f.name, keep_accents=True) pickled_tokenizer = pickle.dumps(tokenizer) pickle.loads(pickled_tokenizer) @unittest.skip("worker 'gw4' crashed on CI, passing locally.") def test_pickle_subword_regularization_tokenizer(self): pass @unittest.skip("worker 'gw4' crashed on CI, passing locally.") def test_subword_regularization_tokenizer(self): pass def test_add_prefix_space(self): pretrained_name = "hf-internal-testing/llama-tokenizer-non-normalized" inputs = "Hey how are you doing" EXPECTED_WITH_SPACE = [1, 18637, 920, 526, 366, 2599] EXPECTED_WO_SPACE = [1, 29950, 1032, 920, 526, 366, 2599] slow_ = self.tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=False, legacy=False) fast_ = self.rust_tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=False, legacy=False) self.assertEqual(slow_.encode(inputs), EXPECTED_WO_SPACE) self.assertEqual(slow_.encode(inputs), fast_.encode(inputs)) self.assertEqual(slow_.tokenize(inputs), ["H", "ey", "▁how", "▁are", "▁you", "▁doing"]) self.assertEqual(slow_.decode(EXPECTED_WO_SPACE, skip_special_tokens=True), inputs) self.assertEqual( slow_.decode(EXPECTED_WO_SPACE, skip_special_tokens=True), fast_.decode(EXPECTED_WO_SPACE, skip_special_tokens=True), ) slow_ = self.tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=True, legacy=False) fast_ = self.rust_tokenizer_class.from_pretrained(pretrained_name, add_prefix_space=True, legacy=False) self.assertEqual(slow_.encode(inputs), EXPECTED_WITH_SPACE) self.assertEqual(slow_.encode(inputs), fast_.encode(inputs)) self.assertEqual(slow_.tokenize(inputs), ["▁Hey", "▁how", "▁are", "▁you", "▁doing"]) self.assertEqual(slow_.decode(EXPECTED_WITH_SPACE, skip_special_tokens=True), inputs) self.assertEqual( 
slow_.decode(EXPECTED_WITH_SPACE, skip_special_tokens=True), fast_.decode(EXPECTED_WITH_SPACE, skip_special_tokens=True), )
```python
# sdpa implementation which is the default torch>2.1.2 fails with the tracing + attention mask kwarg
# with attn_implementation="eager" mode, the forward is very slow for some reason
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", low_cpu_mem_usage=True, attn_implementation="sdpa"
)
model.eval()

# Input configs
# Create example inputs for the model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
```
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false max_steps: 200 pretraining_dataset: path: c4 name: en type: pretrain dataset_prepared_path: val_set_size: 0.0 output_dir: ./outputs/model-out sequence_len: 2048 sample_packing: true wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 10 evals_per_epoch: eval_table_size: saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens:
class LLama2ChatTokenizingStrategy(PromptTokenizingStrategy): """ Tokenizing strategy for ShareGPT prompts. adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py """ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.tokenizer.add_special_tokens( {"pad_token": getattr(self.tokenizer, "pad_token", "<pad>")} ) # https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/added_tokens.json def tokenize_prompt(self, prompt): conv = next(self.prompter.build_prompt(prompt)) conversation_str = conv.get_prompt() # Tokenize conversations input_ids = self.tokenizer( conversation_str, return_tensors="pt", padding="max_length", max_length=self.sequence_len, truncation=True, ).input_ids[0] target = input_ids.clone() # Mask targets. Only compute loss on the assistant outputs. sep = conv.roles[1] total_len = int(target.ne(self.tokenizer.pad_token_id).sum()) turns = conversation_str.split(conv.sep2) cur_len = 1 target[:cur_len] = IGNORE_TOKEN_ID for turn in turns: if turn == "": break turn_len = len(self.tokenizer(turn).input_ids) parts = turn.split(sep) if len(parts) != 2: break parts[0] += sep # "-1" is hardcoded for the LLaMA tokenizer to make the offset correct. instruction_len = len(self.tokenizer(parts[0]).input_ids) - 1 # Ignore the user instructions target[cur_len - 1 : cur_len + instruction_len] = IGNORE_TOKEN_ID cur_len += turn_len + 2 # due to length of role token target[cur_len:] = IGNORE_TOKEN_ID if cur_len < self.sequence_len: if cur_len != total_len: target[:] = IGNORE_TOKEN_ID logging.warning( f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)" ) attention_mask = input_ids.ne(self.tokenizer.pad_token_id).tolist() input_ids = input_ids.tolist() target = target.tolist() # this is a fix for the tokenizer which tokenizes [ differently with eos tokens and # follows the original llama implementation for i in range(2, total_len - 2): if input_ids[i] == 29961: input_ids[i] = 518 if target[i] == 29961: target[i] = 518 return { "input_ids": input_ids, "labels": target, "attention_mask": attention_mask, }
base_model: NousResearch/Llama-2-7b-hf model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false datasets: - path: mhenrichsen/alpaca_2k_test type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.05 output_dir: ./outputs/out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true adapter: lora_model_dir: lora_r: lora_alpha: lora_dropout: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 1 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true flash_attn_cross_entropy: false flash_attn_rms_norm: true flash_attn_fuse_qkv: false flash_attn_fuse_mlp: true warmup_steps: 100 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: #deepspeed_configs/zero2.json # multi-gpu only weight_decay: 0.1 fsdp: fsdp_config: special_tokens:
```python
def load(tokenizer, cfg) -> LLama2ChatTokenizingStrategy:
    return LLama2ChatTokenizingStrategy(
        Llama2ChatPrompter(),
        tokenizer,
        cfg.train_on_inputs,
        cfg.sequence_len,
    )
```
base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: false strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.02 adapter: lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: lora_alpha: lora_dropout: lora_target_modules: lora_target_linear: lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/openllama-out gradient_accumulation_steps: 1 micro_batch_size: 1 num_epochs: 4 optimizer: adamw_bnb_8bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.000003 train_on_inputs: false group_by_length: false float16: true bf16: false fp16: false tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
base_model: meta-llama/Meta-Llama-3-8B-Instruct model_type: LlamaForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: true load_in_4bit: false strict: false chat_template: llama3 datasets: - path: fozziethebeat/alpaca_messages_2k_test type: chat_template chat_template: llama3 field_messages: messages message_field_role: role message_field_content: content roles: user: - user assistant: - assistant dataset_prepared_path: val_set_size: 0.05 output_dir: ./outputs/lora-out sequence_len: 4096 sample_packing: false pad_to_sequence_len: true adapter: lora lora_model_dir: lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: gradient_accumulation_steps: 4 micro_batch_size: 2 num_epochs: 4 optimizer: adamw_bnb_8bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true s2_attention: warmup_steps: 10 evals_per_epoch: 4 eval_table_size: eval_max_new_tokens: 128 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config:
```python
# bs = 3
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# Create a pipeline stage from the model
# Using `auto` is equivalent to letting `device_map="auto"` figure
# out device mapping and will also split the model according to the
# number of total GPUs available if it fits on one GPU
model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
```
base_model: openlm-research/open_llama_3b_v2 model_type: LlamaForCausalLM tokenizer_type: LlamaTokenizer load_in_8bit: false load_in_4bit: true strict: false push_dataset_to_hub: datasets: - path: teknium/GPT4-LLM-Cleaned type: alpaca dataset_prepared_path: val_set_size: 0.05 adapter: qlora lora_model_dir: sequence_len: 1024 sample_packing: true lora_r: 8 lora_alpha: 32 lora_dropout: 0.05 lora_target_modules: lora_target_linear: true lora_fan_in_fan_out: wandb_project: wandb_entity: wandb_watch: wandb_name: wandb_log_model: output_dir: ./outputs/qlora-out gradient_accumulation_steps: 1 micro_batch_size: 2 num_epochs: 4 optimizer: paged_adamw_32bit torchdistx_path: lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: false fp16: true tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true gptq_groupsize: gptq_model_v1: warmup_steps: 20 evals_per_epoch: 4 saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.1 fsdp: fsdp_config: special_tokens: bos_token: "<s>" eos_token: "</s>" unk_token: "<unk>"
def load_tokenizer(cfg): model_config = load_model_config(cfg) tokenizer_kwargs = {} use_fast = True # this is the default if cfg.tokenizer_use_fast is not None: use_fast = cfg.tokenizer_use_fast if cfg.tokenizer_legacy is not None: # True is the default w/ https://github.com/huggingface/transformers/pull/25224 tokenizer_kwargs["legacy"] = cfg.tokenizer_legacy tokenizer_cls = AutoTokenizer if cfg.tokenizer_type: tokenizer_cls = getattr(transformers, cfg.tokenizer_type) tokenizer = tokenizer_cls.from_pretrained( cfg.tokenizer_config, trust_remote_code=cfg.trust_remote_code or False, use_fast=use_fast, **tokenizer_kwargs, ) if ( tokenizer.__class__.__name__ in [ "LlamaTokenizer", "LlamaTokenizerFast", "CodeLlamaTokenizer", "CodeLlamaTokenizerFast", ] and hasattr(tokenizer, "pad_token") and not tokenizer.pad_token ): # set a pad_token, but use eos_token so we don't add a new token tokenizer.pad_token = LLAMA_DEFAULT_EOS_TOKEN if tokenizer.__class__.__name__ == "GPTNeoXTokenizerFast": tokenizer.add_special_tokens({"pad_token": "[PAD]"}) os.environ["TOKENIZERS_PARALLELISM"] = "false" # Mistral's official FA implementation requires left padding if cfg.is_mistral_derived_model and cfg.flash_attention and not cfg.sample_packing: tokenizer.padding_side = "left" # Qwen base only has single token, so we need to set the special tokens if cfg.is_qwen_derived_model: token_ids = ["bos_token_id", "eos_token_id", "pad_token_id", "unk_token_id"] for attr_name in token_ids: if getattr(tokenizer, attr_name) is None: setattr(tokenizer, attr_name, tokenizer.eod_id) token_names = ["bos_token", "eos_token", "pad_token", "unk_token"] for attr_name in token_names: if getattr(tokenizer, attr_name) is None: setattr(tokenizer, attr_name, "<|endoftext|>") additional_special_tokens = None if cfg.special_tokens: special_tokens = cfg.special_tokens.to_dict() additional_special_tokens = special_tokens.pop( "additional_special_tokens", None ) lora_modules_to_save = get_linear_embedding_layers(model_config.model_type) for k, val in special_tokens.items(): # check if new special token is not already in tokenizer and # is adapter training to make sure lora_modules_to_save is set # pylint: disable=too-many-boolean-expressions if ( (getattr(tokenizer, k) is None or getattr(tokenizer, k) != val) and (len(tokenizer.encode(val, add_special_tokens=False)) > 2) and cfg.adapter and ( not cfg.lora_modules_to_save or not all( x in cfg.lora_modules_to_save for x in lora_modules_to_save ) ) ): lora_modules_to_save = ", ".join( [f"`{x}`" for x in lora_modules_to_save] ) raise ValueError( f"Please set lora_modules_to_save to [{lora_modules_to_save}] when using an adapter and changing the special tokens." ) tokenizer.add_special_tokens( {k: AddedToken(val, rstrip=False, lstrip=False, normalized=False)} ) # If we add bos_token and eos_token, we need to update the post processor to # handle them correctly. # https://github.com/huggingface/transformers/pull/24132 bos_or_eos_in_special_tokens = ( "bos_token" in cfg.special_tokens and "eos_token" in cfg.special_tokens ) if ( tokenizer.__class__.__name__ in ( "LlamaTokenizerFast", "CodeLlamaTokenizerFast", ) and bos_or_eos_in_special_tokens ): tokenizer.update_post_processor() if cfg.tokens: tokenizer.add_tokens( [ AddedToken(token, rstrip=False, lstrip=False, normalized=False) for token in cfg.tokens ] ) # Additional special tokens are a List, and need to be treated differently than regular special # tokens. 
We add them after we have called `add_tokens` in case these additional special tokens # are new tokens. # # Usage: # # ```py # special_tokens: # additional_special_tokens: ["<|im_start|>", "<|im_end|>"] # ``` if additional_special_tokens is not None: tokenizer.add_special_tokens( {"additional_special_tokens": additional_special_tokens} ) with zero_only(): LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}") LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}") LOG.debug(f"PAD: {tokenizer.pad_token_id} / {tokenizer.pad_token}") LOG.debug(f"UNK: {tokenizer.unk_token_id} / {tokenizer.unk_token}") if cfg.chat_template: chat_template_string = chat_templates(cfg.chat_template) if cfg.default_system_message and cfg.chat_template == "chatml": chat_template_string = chat_template_string.replace( "You are a helpful assistant.", cfg.default_system_message ) tokenizer.chat_template = chat_template_string else: LOG.info( "No Chat template selected. Consider adding a chat template for easier inference." ) return tokenizer
Llama-Adapter is a method for adapting Llama into an instruction-following model. To help adapt the model for instruction-following, the adapter is trained with a 52K instruction-output dataset.
A set of learnable adaption prompts are prefixed to the input instruction tokens. These are inserted into the upper layers of the model because it is better to learn with the higher-level semantics of the pretrained model. The instruction-output tokens prefixed to the input guide the adaption prompt to generate a contextual response.
Figure: [LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention](https://hf.co/papers/2303.16199) (diagram: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/llama-adapter.png)
To avoid adding noise to the tokens, the adapter uses zero-initialized attention. On top of this, the adapter adds a learnable gating factor (initialized with zeros) to progressively add information to the model during training. This prevents overwhelming the model's pretrained knowledge with the newly learned instructions.
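The zero-init attention mechanism described above can be sketched in a few lines of PyTorch. This is an illustrative toy module, not the PEFT implementation: the `attn` callable, the prompt length, and the module name are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class ZeroInitGatedPromptAttention(nn.Module):
    """Toy sketch of Llama-Adapter-style zero-init attention (illustrative only)."""

    def __init__(self, hidden_dim: int, num_prompts: int = 10):
        super().__init__()
        # learnable adaption prompts, attended to alongside the regular sequence
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)
        # gating factor initialized to zero, so the prompts contribute nothing at first
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states, attn):
        # `attn(query_states, key_value_states)` is an assumed attention callable
        base = attn(hidden_states, hidden_states)  # ordinary self-attention
        prompt_kv = self.prompts.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        prompt_out = attn(hidden_states, prompt_kv)  # attention over the adaption prompts
        # tanh(0) == 0 at initialization, so training starts from the frozen model's
        # behaviour and the instructional signal is blended in gradually.
        return base + torch.tanh(self.gate) * prompt_out
```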
tokens += [sep_token]
token_boxes += [sep_token_box]
actual_bboxes += [[0, 0, width, height]]
label_ids += [pad_token_label_id]
if sep_token_extra:
# roberta uses an extra separator b/w pairs of sentences
tokens += [sep_token]
token_boxes += [sep_token_box]
actual_bboxes += [[0, 0, width, height]]
label_ids += [pad_token_label_id]
segment_ids = [sequence_a_segment_id] * len(tokens)
if cls_token_at_end:
tokens += [cls_token]
token_boxes += [cls_token_box]
actual_bboxes += [[0, 0, width, height]]
label_ids += [pad_token_label_id]
segment_ids += [cls_token_segment_id]
else:
tokens = [cls_token] + tokens
token_boxes = [cls_token_box] + token_boxes
actual_bboxes = [[0, 0, width, height]] + actual_bboxes
label_ids = [pad_token_label_id] + label_ids
segment_ids = [cls_token_segment_id] + segment_ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
# Zero-pad up to the sequence length.
padding_length = max_seq_length - len(input_ids)
if pad_on_left:
input_ids = ([pad_token] * padding_length) + input_ids
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
label_ids = ([pad_token_label_id] * padding_length) + label_ids
token_boxes = ([pad_token_box] * padding_length) + token_boxes
else:
input_ids += [pad_token] * padding_length
input_mask += [0 if mask_padding_with_zero else 1] * padding_length
segment_ids += [pad_token_segment_id] * padding_length
label_ids += [pad_token_label_id] * padding_length
token_boxes += [pad_token_box] * padding_length
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(label_ids) == max_seq_length
assert len(token_boxes) == max_seq_length
if ex_index < 5:
logger.info("*** Example ***")
logger.info("guid: %s", example.guid)
logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
logger.info("boxes: %s", " ".join([str(x) for x in token_boxes]))
logger.info("actual_bboxes: %s", " ".join([str(x) for x in actual_bboxes]))
features.append(
InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_ids=label_ids,
boxes=token_boxes,
actual_bboxes=actual_bboxes,
file_name=file_name,
page_size=page_size,
)
)
return features
from transformers import LayoutLMTokenizer
# from .unilm.layoutlm.data.funsd import FunsdDataset, InputFeatures
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
batch_size = 16
args = {
"local_rank": -1,
"overwrite_cache": True,
"data_dir": "/home/sourab/temp/data/",
"model_name_or_path": "microsoft/layoutlm-base-uncased",
"max_seq_length": 512,
"model_type": "layoutlm",
}
# class to turn the keys of a dict into attributes (thanks Stackoverflow)
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
args = AttrDict(args)
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
# the LayoutLM authors already defined a specific FunsdDataset, so we are going to use this here
train_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="train")
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="test")
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=batch_size)
len(train_dataloader)
len(eval_dataloader)
batch = next(iter(train_dataloader))
input_ids = batch[0][0]
tokenizer.decode(input_ids)
As this is a sequence labeling task, we are going to load `LayoutLMForTokenClassification` (the base sized model) from the hub. We are going to fine-tune it on a downstream task, namely FUNSD.
from peft import get_peft_config, PeftModel, get_peft_model, LoraConfig, TaskType
peft_config = LoraConfig(
task_type=TaskType.TOKEN_CLS, inference_mode=False, r=16, lora_alpha=16, lora_dropout=0.1, bias="all"
)
peft_config
from transformers import LayoutLMForTokenClassification
import torch
from transformers import set_seed
seed = 100
set_seed(seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=num_labels)
model = get_peft_model(model, peft_config)
model.to(device)
print(model.model.layoutlm.encoder.layer[0].attention.self.query.weight)
print(model.model.layoutlm.encoder.layer[0].attention.self.query.lora_A.weight)
print(model.model.classifier.weight)
Now we can start training:
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm
num_train_epochs = 100
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=0.06 * (len(train_dataloader) * num_train_epochs),
num_training_steps=(len(train_dataloader) * num_train_epochs),
)
global_step = 0
t_total = len(train_dataloader) * num_train_epochs # total number of training steps
# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
for batch in tqdm(train_dataloader, desc="Training"):
input_ids = batch[0].to(device)
bbox = batch[4].to(device)
attention_mask = batch[1].to(device)
token_type_ids = batch[2].to(device)
labels = batch[3].to(device)
# forward pass
outputs = model(
input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels
)
loss = outputs.loss
# print loss every 10 steps
if global_step % 10 == 0:
print(f"Loss after {global_step} steps: {loss.item()}")
# backward pass to get the gradients
loss.backward()
# print("Gradients on classification head:")
# print(model.classifier.weight.grad[6,:].sum())
# update
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
global_step += 1
import numpy as np
from seqeval.metrics import (
classification_report,
f1_score,
precision_score,
recall_score,
)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
# put model in evaluation mode
model.eval()
for batch in tqdm(eval_dataloader, desc="Evaluating"):
with torch.no_grad():
input_ids = batch[0].to(device)
bbox = batch[4].to(device)
attention_mask = batch[1].to(device)
token_type_ids = batch[2].to(device)
labels = batch[3].to(device)
# forward pass
outputs = model(
input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels
)
# get the loss and logits
tmp_eval_loss = outputs.loss
logits = outputs.logits
eval_loss += tmp_eval_loss.item()
nb_eval_steps += 1
# compute the predictions
if preds is None:
preds = logits.detach().cpu().numpy()
out_label_ids = labels.detach().cpu().numpy()
else:
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
# compute average evaluation loss
eval_loss = eval_loss / nb_eval_steps
preds = np.argmax(preds, axis=2)
out_label_list = [[] for _ in range(out_label_ids.shape[0])]
preds_list = [[] for _ in range(out_label_ids.shape[0])]
for i in range(out_label_ids.shape[0]):
for j in range(out_label_ids.shape[1]):
if out_label_ids[i, j] != pad_token_label_id:
out_label_list[i].append(label_map[out_label_ids[i][j]])
preds_list[i].append(label_map[preds[i][j]])
results = {
"loss": eval_loss,
"precision": precision_score(out_label_list, preds_list),
"recall": recall_score(out_label_list, preds_list),
"f1": f1_score(out_label_list, preds_list),
}
print(results)
model.print_trainable_parameters()
model.save_pretrained("peft_layoutlm")
!du -h "peft_layoutlm/adapter_model.bin"
```python
def preprocess_function_test(examples):
    sources = examples[source_column]
    labels = examples[target_column]

    inputs = [source + task_prompt for source in sources]
    model_inputs = tokenizer(inputs, max_length=args.max_source_length, padding=padding, truncation=True)
    labels = tokenizer(labels, max_length=args.max_target_length, padding=padding, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
Llama-Adapter is a PEFT method specifically designed for turning Llama into an instruction-following model. The Llama model is frozen and only a set of adaptation prompts prefixed to the input instruction tokens are learned. Since randomly initialized modules inserted into the model can cause the model to lose some of its existing knowledge, Llama-Adapter uses zero-initialized attention with zero gating to progressively add the instructional prompts to the model.
The abstract from the paper is:
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the input text tokens at higher transformer layers. Then, a zero-init attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With efficient training, LLaMA-Adapter generates high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Furthermore, our approach can be simply extended to multi-modal input, e.g., images, for image-conditioned LLaMA, which achieves superior reasoning capacity on ScienceQA. We release our code at https://github.com/ZrrSkywalker/LLaMA-Adapter.
[[autodoc]] tuners.adaption_prompt.config.AdaptionPromptConfig
[[autodoc]] tuners.adaption_prompt.model.AdaptionPromptModel
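For orientation, a minimal usage sketch with these classes might look like the following; the adapter hyperparameters and the base model identifier are illustrative assumptions, not recommended values:

```python
from peft import AdaptionPromptConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Example values only; tune adapter_len/adapter_layers for your model and task.
config = AdaptionPromptConfig(
    adapter_len=10,      # number of adaption prompt tokens per adapted layer
    adapter_layers=30,   # how many of the top transformer layers receive prompts
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder id
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the prompts and gates are trainable
```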
```python
tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_name_or_path)
tokenizer.pad_token_id = tokenizer.eos_token_id

lora_config = LoraConfig(
    r=script_args.lora_r,
    lora_alpha=script_args.lora_alpha,
    init_lora_weights=script_args.init_lora_weights,
    lora_dropout=script_args.lora_dropout,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
```
Run
accelerate launch -m axolotl.cli.train your_config.yml
[!TIP] You can also reference a config file that is hosted on a public URL, for example
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.
- Set `dataset_prepared_path:` to a local folder for saving and loading the pre-tokenized dataset.
- Set `push_dataset_to_hub: hf_user/repo` to push it to Huggingface.
- Use `--debug` to see preprocessed examples.

python -m axolotl.cli.preprocess your_config.yml
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
Deepspeed is an optimization suite for multi-gpu systems allowing you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for deepspeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
deepspeed: deepspeed_configs/zero1.json
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your `WANDB_API_KEY` environment variable is set (recommended), or log in to wandb with `wandb login`.
```yaml
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
```
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
```
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
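To sanity-check that tokens from such a config actually ended up in the vocabulary, you can do the equivalent by hand with `transformers`; this is a small sketch, and the model identifier is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder id
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Mirror what the config above declares: special tokens plus delimiter tokens.
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
num_added = tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])

# Any newly added tokens need embedding rows, otherwise indexing will fail.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

print(tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"]))  # real ids, not unk
```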
```python
def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs
```
```python
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])
```
See also the FAQs and the debugging guide.
If you encounter a 'Cuda out of memory' error, it means your GPU ran out of memory during the training process. Here's how to resolve it:
Please reduce any of the below:
- micro_batch_size
- eval_batch_size
- gradient_accumulation_steps
- sequence_len
If it does not help, try running without deepspeed and without accelerate (replace "accelerate launch" with "python") in the command.
Using adamw_bnb_8bit might also save you some memory.
failed (exitcode: -9)
Usually means your system has run out of system memory. Similarly, you should consider reducing the same settings as when you run out of VRAM. Additionally, look into upgrading your system RAM which should be simpler than GPU upgrades.
RuntimeError: expected scalar type Float but found Half
Try setting `fp16: true`.
NotImplementedError: No operator found for memory_efficient_attention_forward ...
Try to turn off xformers.
accelerate config missing
It's safe to ignore it.
NCCL Timeouts during training
See the NCCL guide.
For many formats, Axolotl constructs prompts by concatenating token ids after tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:
Run `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See this blog post for a concrete example.
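A quick way to do that check, assuming the preprocessed dataset was written to `dataset_prepared_path` (the folder name and model identifier below are placeholders):

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # your base model
ds = load_from_disk("last_run_prepared/your_dataset_hash")  # placeholder path

# Decode a few rows to verify spacing around delimiters and special tokens.
for row in ds.select(range(3)):
    print(repr(tokenizer.decode(row["input_ids"])))
```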
docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest
Or run on the current files for development:
docker compose up -d
Docker advanced

[!Tip] If you want to debug axolotl or prefer to use Docker as your development environment, see the debugging guide's section on Docker.
A more powerful Docker command to run would be this:
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
It additionally:
- passes the `--ipc` and `--ulimit` args.
- bind-mounts your working copy of axolotl and your Hugging Face cache into the container via the `--mount`/`-v` args.
- The `--name` argument simply makes it easier to refer to the container in vscode (`Dev Containers: Attach to Running Container...`) or in your terminal.
- The `--privileged` flag gives all capabilities to the container.
- The `--shm-size 10g` argument increases the shared memory size. Use this if you see `exitcode: -7` errors using deepspeed.

More information on the nvidia website.
Install python >=3.10
Install pytorch stable https://pytorch.org/get-started/locally/
Install Axolotl along with python dependencies
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
(Optional) Login to Huggingface to use gated models/datasets.
huggingface-cli login
Get the token at huggingface.co/settings/tokens
For cloud GPU providers that support docker images, use winglian/axolotl-cloud:main-latest
sudo apt update
sudo apt install -y python3.10
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --config python # pick 3.10 if given option
python -V # should be 3.10
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
Install Pytorch https://pytorch.org/get-started/locally/
Follow instructions on quickstart.
Run
pip3 install protobuf==3.20.3
pip3 install -U --ignore-installed requests Pillow psutil scipy
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

</details>
Use a Deeplearning linux OS with cuda and pytorch installed. Then follow instructions on quickstart.
Make sure to run the below to uninstall xla.
pip uninstall -y torch_xla[tpu]

</details>
Please use WSL or Docker!
Use the below instead of the install method in QuickStart.
pip3 install -e '.'
More info: mac.md
Please use this example notebook.
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use SkyPilot:
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds sky check
Get the example YAMLs of using Axolotl to finetune mistralai/Mistral-7B-v0.1:
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/axolotl
Use one command to launch:
# On-demand
HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

# Managed spot (auto-recovery on preemption)
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
To launch on GPU instance (both on-demand and spot instances) on public clouds (GCP, AWS, Azure, Lambda Labs, TensorDock, Vast.ai, and CUDO), you can use dstack.
Write a job description in YAML as below:
# dstack.yaml
type: task
image: winglian/axolotl-cloud:main-20240429-py3.11-cu121-2.2.2
env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  - accelerate launch -m axolotl.cli.train config.yaml
ports:
  - 6006
resources:
  gpu:
    memory: 24GB..
    count: 2
Then, simply run the job with the dstack run command. Append the --spot option if you want a spot instance. The dstack run command will show you the instance with the cheapest price across multi-cloud services:
pip install dstack
HUGGING_FACE_HUB_TOKEN=xxx WANDB_API_KEY=xxx dstack run . -f dstack.yaml # --spot
For further and more fine-grained use cases, please refer to the official dstack documents and the detailed description of the axolotl example in the official repository.
Axolotl supports a variety of dataset formats. It is recommended to use a JSONL. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
See these docs for more information on how to use different dataset formats.
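To make the JSONL idea concrete, here is a small sketch of an alpaca-style file (instruction/input/output is the usual field convention for that format; adapt the fields to the prompt format you actually configure) and how the same file loads as a HuggingFace dataset:

```python
import json
from datasets import load_dataset

# Write a tiny alpaca-style JSONL file; the field names follow the common convention.
rows = [
    {"instruction": "Translate to French.", "input": "Hello, world!", "output": "Bonjour, le monde !"},
    {"instruction": "What is 2 + 2?", "input": "", "output": "4"},
]
with open("data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# The same file can be loaded as a HuggingFace dataset with the json loader.
ds = load_dataset("json", data_files="data.jsonl", split="train")
print(ds.column_names)  # e.g. ['instruction', 'input', 'output']
```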
See examples for quick start. It is recommended to duplicate and modify to your needs. The most important options are:
model
base_model: ./llama-7b-hf # local or huggingface repo
Note: The code will load the right architecture.
dataset
datasets:
  # huggingface repo
  - path: vicgalle/alpaca-gpt4
    type: alpaca

  # huggingface repo with specific configuration/subset
  - path: EleutherAI/pile
    name: enron_emails
    type: completion # format from earlier
    field: text # Optional[str] default: text, field to use for completion data

  # huggingface repo with multiple named configurations/subsets
  - path: bigcode/commitpackft
    name:
      - ruby
      - python
      - typescript
    type: ... # unimplemented custom format

  # fastchat conversation
  # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
  - path: ...
    type: sharegpt
    conversation: chatml # default: vicuna_v1.1

  # local
  - path: data.jsonl # or json
    ds_type: json # see other options below
    type: alpaca

  # dataset with splits, but no train split
  - path: knowrohit07/know_sql
    type: context_qa.load_v2
    train_on_split: validation

  # loading from s3 or gcs
  # s3 creds will be loaded from the system default and gcs only supports public access
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
    ...

  # Loading Data From a Public URL
  # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
  - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
    ds_type: json # this is the default, see other options below.
loading
load_in_4bit: true
load_in_8bit: true
bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
tf32: true # require >=ampere
bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
float16: true # use instead of fp16 when you don't want AMP
Note: Repo does not do 4-bit quantization.
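For orientation, load_in_8bit / load_in_4bit correspond to bitsandbytes quantization in the transformers loading path. The snippet below is a raw-transformers sketch (placeholder model id; it requires the bitsandbytes package and a CUDA GPU) and is not axolotl's internal loading code:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder model identifier; requires bitsandbytes and a CUDA-capable GPU.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", quantization_config=quant_config)
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")
```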
lora
adapter: lora # 'qlora' or leave blank for full finetune
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
See these docs for all config options.
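For intuition, the adapter settings above roughly correspond to a peft LoraConfig like the following (a sketch for orientation only, not how axolotl builds its adapter internally):

```python
from peft import LoraConfig

# Roughly mirrors lora_r / lora_alpha / lora_dropout / lora_target_modules above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
print(lora_config)
```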
Run
accelerate launch -m axolotl.cli.train your_config.yml
[!TIP] You can also reference a config file that is hosted on a public URL, for example
accelerate launch -m axolotl.cli.train https://yourdomain.com/your_config.yml
You can optionally pre-tokenize the dataset with the following before finetuning. This is recommended for large datasets.
- Set dataset_prepared_path: to a local folder for saving and loading the pre-tokenized dataset.
- Set push_dataset_to_hub: hf_user/repo to push it to Huggingface.
- Use --debug to see preprocessed examples.

python -m axolotl.cli.preprocess your_config.yml
Below are the options available in axolotl for training with multiple GPUs. Note that DeepSpeed is the recommended multi-GPU option currently because FSDP may experience loss instability.
DeepSpeed is an optimization suite for multi-GPU systems that allows you to train much larger models than you might typically be able to fit into your GPU's VRAM. More information about the various optimization types for DeepSpeed is available at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#what-is-integrated
We provide several default deepspeed JSON configurations for ZeRO stage 1, 2, and 3.
deepspeed: deepspeed_configs/zero1.json
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
Axolotl supports training with FSDP and QLoRA, see these docs for more information.
Make sure your WANDB_API_KEY environment variable is set (recommended), or log in to wandb with wandb login.
wandb_mode:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
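Conceptually this is similar to the raw transformers calls below (a sketch only, with a placeholder model id; it is not axolotl's actual code path):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model identifier for illustration.
model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the special tokens and the extra delimiter tokens, then grow the embeddings to match.
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])
model.resize_token_embeddings(len(tokenizer))

print(len(tokenizer))  # vocabulary size now includes the added tokens
```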
Axolotl allows you to load your model in an interactive terminal playground for quick experimentation. The config file is the same config file used for training.
Pass the appropriate flag to the inference command, depending upon what kind of model was trained:
python -m axolotl.cli.inference examples/your_config.yml --lora_model_dir="./lora-output-dir"
python -m axolotl.cli.inference examples/your_config.yml --base_model="./completed-model"
cat /tmp/prompt.txt | python -m axolotl.cli.inference examples/your_config.yml \
  --base_model="./completed-model" --prompter=None --load_in_8bit=True
-- With gradio hosting
python -m axolotl.cli.inference examples/your_config.yml --gradio
Please use --sample_packing False
if you have it on and receive an error similar to the one below:
RuntimeError: stack expects each tensor to be equal size, but got [1, 32, 1, 128] at entry 0 and [1, 32, 8, 128] at entry 1
The following command will merge your LoRA adapter with your base model. You can optionally pass the argument --lora_model_dir to specify the directory where your LoRA adapter was saved; otherwise, this will be inferred from output_dir in your axolotl config file. The merged model is saved in the sub-directory {lora_model_dir}/merged.
python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"
You may need to use the gpu_memory_limit
and/or lora_on_cpu
config options to avoid running out of memory. If you still run out of CUDA memory, you can try to merge in system RAM with
CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora ...
although this will be very slow; using the config options above is recommended instead.
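If you would rather do the merge directly in Python than via the CLI, a rough peft-based equivalent looks like this (paths are placeholders; the axolotl CLI command above remains the supported route):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths for illustration.
base_model_path = "./llama-7b-hf"
lora_model_dir = "./lora-output-dir"

base_model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, lora_model_dir)

# Fold the LoRA weights into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(base_model_path).save_pretrained("./merged-model")
```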
optim_target_modules:
# - self_attn # for llama
# - mlp
# Specify weight decay
weight_decay:
# adamw hyperparams
adam_beta1:
adam_beta2:
adam_epsilon:
# Gradient clipping max norm
max_grad_norm:
# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
neftune_noise_alpha:
# Whether to use bettertransformers
flash_optimum:
# Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
xformers_attention:
# Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
flash_attention:
flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
# Whether to use scaled-dot-product attention
# https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
sdp_attention:
# Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
s2_attention:
# Resume from a specific checkpoint dir
resume_from_checkpoint:
# If resume_from_checkpoint isn't set and you simply want it to start where it left off.
# Be careful with this being turned on between different models.
auto_resume_from_checkpoints: false
# Don't mess with this, it's here for accelerate and torchrun
local_rank:
# Add or change special tokens.
# If you add tokens here, you don't need to add them to the `tokens` list.
special_tokens:
# bos_token: "<s>"
# eos_token: "</s>"
# unk_token: "<unk>"
# pad_token: "[PAD]"
# Add extra tokens.
tokens:
# FSDP
fsdp:
fsdp_config:
# Deepspeed config path. e.g., deepspeed_configs/zero3.json
deepspeed:
# Advanced DDP Arguments
ddp_timeout:
ddp_bucket_cap_mb:
ddp_broadcast_buffers:
# Path to torch distx for optim 'adamw_anyprecision'
torchdistx_path:
# Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
pretraining_dataset:
# Debug mode
debug:
# Seed
seed:
# Allow overwriting the yml config using the cli
strict: