Sofia Ng

Hugging 2 llamas: how to use huggingface and llama2

Updated: Oct 19, 2023

In this article, we fine-tune a language model called LLaMA 2 using a method called QLoRA. QLoRA is an efficient fine-tuning technique: it quantizes the pre-trained model to 4 bits, freezes it, and trains small low-rank adapter layers on top. The QLoRA paper reports fine-tuning models with up to 65 billion parameters on a single 48 GB GPU while matching the task performance of full 16-bit fine-tuning.

In practice, this means you can adapt a very large model to your own data without needing super expensive hardware. So, if you're keen on using powerful language models without breaking the bank, exploring QLoRA could be a fantastic option for you!

Alpaca my bags, we're going on an adventure!

1. Set Up Development Environment:

Install the required libraries and set up your HuggingFace account and SageMaker with the appropriate permissions.

pip install "transformers==4.31.0" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet

# Login to your HuggingFace account
huggingface-cli login --token YOUR_TOKEN

2. Load and prepare the dataset:

Load the Dolly dataset and format the samples with a given structure for training.


The Dolly dataset is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several behavioral categories, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarisation. To start, we'll load this dataset using the HuggingFace datasets library.

from datasets import load_dataset
from transformers import AutoTokenizer
from random import randrange
import sagemaker
sess = sagemaker.Session()

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Define a formatting function for samples
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # Join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

# Test the formatting function on a random example
print(format_dolly(dataset[randrange(len(dataset))]))

# Use AutoTokenizer to tokenize the formatted samples
model_id = "meta-llama/Llama-2-70b-chat-hf"  # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

# Define a helper function to pack multiple samples into sequences of a given length and then tokenize them.
from random import randint
from itertools import chain
from functools import partial

# Template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# Apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# Define a helper function to chunk samples into sequences of a given length
def chunk(sample, chunk_length=2048):
    text_list = sample["text"] if isinstance(sample["text"], list) else [sample["text"]]
    result = []
    for text in text_list:
        for i in range(0, len(text), chunk_length):
            result.append({"text": text[i: i + chunk_length]})
    return result

# Transform and chunk the dataset
transformed_examples = []
for example in dataset:
    transformed_examples.extend(chunk(example))

# Create a new dataset from the transformed examples
transformed_dataset = dataset.from_dict({"text": [ex["text"] for ex in transformed_examples]})

# Tokenize the transformed dataset
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, max_length=512, padding="max_length")

lm_dataset = transformed_dataset.map(
    tokenize_function,
    load_from_cache_file=False,  # To avoid issues with cached data
)

# Save the processed dataset to S3
training_input_path = f's3://{sess.default_bucket()}/processed/llama/dolly/train'
lm_dataset.save_to_disk(training_input_path)

print("Uploaded data to:")
print(f"Training dataset: {training_input_path}")
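To make the formatting and chunking steps above concrete, here is a minimal sketch that runs without downloading the dataset or the tokenizer. It repeats `format_dolly` and `chunk` as defined above; the sample record and the tiny `chunk_length` are made up for illustration. Note that `chunk` splits by characters, not tokens, so the token length of each piece is only bounded later by the tokenizer's `max_length`.

```python
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # Drop the context block when the record has no context
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)

def chunk(sample, chunk_length=2048):
    text_list = sample["text"] if isinstance(sample["text"], list) else [sample["text"]]
    result = []
    for text in text_list:
        for i in range(0, len(text), chunk_length):
            result.append({"text": text[i: i + chunk_length]})
    return result

# Hypothetical record in the Dolly schema (not taken from the real dataset)
toy_sample = {"instruction": "Name a pack animal.", "context": "", "response": "The llama."}
prompt = format_dolly(toy_sample)
print(prompt)

# Use a tiny chunk_length so the character-level split is easy to see
pieces = chunk({"text": prompt}, chunk_length=20)
print([len(p["text"]) for p in pieces])
```

Because the record has an empty context, the "### Context" block is skipped entirely, and joining the pieces back together gives the original prompt.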

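3. Fine-Tune LLaMA 2 13B with QLoRA on Amazon SageMaker:

Use QLoRA and PEFT to fine-tune LLaMA 2 13B on SageMaker, reducing the memory footprint without sacrificing performance. The model_id defined in step 2 points at the 70B chat checkpoint, so change it to a 13B checkpoint if you want to follow this section as titled. Select the appropriate instance type based on your model size and context length.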

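Note: The example below runs in SageMaker local mode (instance_type='local_gpu') with a placeholder IAM role and a local dataset path. To train on SageMaker itself, switch to a GPU instance type that fits your model, use a real execution role, and point the job at the S3 dataset path from step 2.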
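As a rough guide for picking an instance, the arithmetic below estimates how much GPU memory the frozen base weights alone occupy at different precisions. This is a back-of-the-envelope sketch: the function name is made up, and it ignores activations, the LoRA adapters, optimizer state, and quantization overhead, all of which add to the total and grow with context length.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Nominal parameter counts for the three LLaMA 2 sizes
for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"LLaMA 2 {name}: ~{fp16:.1f} GB in 16-bit, ~{int4:.1f} GB in 4-bit")
```

The 4x gap between the two columns is the headroom that quantization buys before any training memory is counted.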

import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# Define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
sagemaker_session = sagemaker.Session()
dummy_role = "arn:aws:iam::111111111111:role/dummy-role"
role = dummy_role #sagemaker.get_execution_role()

# Define hyperparameters
hyperparameters = {
    'model_id': model_id,  # Pre-trained model
    'dataset_path': 'file:///Users/User/OneDrive - Ava Technology Solutions/Customers/Blog/S3Emulation',  # Local path to the processed training dataset
    'epochs': 3,  # Number of training epochs
    'per_device_train_batch_size': 2,  # Batch size for training
    'lr': 2e-4,  # Learning rate used during training
    'hf_token': HfFolder.get_token(),  # HuggingFace token to access LLaMA 2
    'merge_weights': True,  # Whether to merge LoRA into the model (needs more memory)
}

# Create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='',  # Train script (make sure it's in the root directory)
    source_dir='.',  # Set this to the root directory where the train script is located
    instance_type='local_gpu',  # Run training locally
    instance_count=1,  # Number of local instances (1 in this case)
    base_job_name=job_name,  # The name of the training job
    role=dummy_role,  # A dummy IAM role for local training (not used)
    volume_size=30,  #300 # The size of the EBS volume in GB
    transformers_version='4.28',  # The transformers version used in the training job
    pytorch_version='2.0',  # The PyTorch version used in the training job
    py_version='py310',  # The Python version used in the training job
    hyperparameters=hyperparameters,  # The hyperparameters passed to the training job
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # Set env variable to cache models in /tmp
)

# Define a data input dictionary with our local dataset path
# When training on SageMaker instead, use the S3 path: f's3://{sess.default_bucket()}/processed/llama/dolly/train'
data = {'training': 'file:///Users/User/OneDrive - Ava Technology Solutions/Customers/Blog/S3Emulation'}

# Start the train job with our local dataset as input
huggingface_estimator.fit(data, wait=True)

After fine-tuning, you can deploy your fine-tuned LLaMA model to a SageMaker endpoint for inference and use it for various language tasks.

Additional options:

- Explore other PEFT techniques like Prefix Tuning, P-Tuning, Prompt Tuning, and IA3 for efficient adaptation of pre-trained language models.

- Access different sizes of LLaMA 2 (7B, 13B, 70B) and choose the one suitable for your specific use case.

Remember to accept the LLaMA 2 license and modify the S3 bucket path if needed during the process. By following these steps, you can leverage the power of LLaMA 2 and QLoRA for your language-related tasks effectively and efficiently.

Helpful definitions

QLoRA (Quantized Low-Rank Adaptation):

QLoRA is a technique introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. The primary purpose of QLoRA is to reduce the memory footprint of large language models during fine-tuning without compromising their performance.

The key steps of QLoRA are as follows:

1. Quantization: The pretrained language model is quantized, which means its weights are converted from a higher-precision format (e.g., 16 or 32 bits) to a lower-precision one (4 bits in QLoRA, using a data type the paper calls 4-bit NormalFloat). Quantization cuts the memory needed to hold the model on the GPU.

2. Freezing: The quantized model is then "frozen," meaning its parameters are fixed and not updated during the fine-tuning process. This step preserves the original model's capabilities while making it more memory-efficient.

3. Low-Rank Adapters (LoRA): Small, trainable adapter layers are attached to the frozen quantized model. These adapters are added to capture the task-specific information during fine-tuning.

4. Fine-Tuning: During the fine-tuning process, only the adapter layers are updated, while the quantized model remains fixed, reducing the number of parameters that need to be fine-tuned. This allows for more efficient fine-tuning of large models, even on resource-constrained devices.

The combination of quantization and adapter layers in QLoRA enables efficient fine-tuning of large language models with up to 65 billion parameters on a single 48 GB GPU while preserving the task performance of full 16-bit fine-tuning.
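The four steps above can be sketched on a tiny weight matrix in plain Python. This is a toy illustration, not the paper's implementation: it uses simple absmax rounding to 4-bit integer codes rather than the NormalFloat data type, and every name and number in it is made up for the example.

```python
def quantize_absmax_4bit(weights):
    """Map each weight to an integer code in [-7, 7] plus one shared scale."""
    scale = max(abs(w) for row in weights for w in row) / 7
    codes = [[round(w / scale) for w in row] for row in weights]
    return codes, scale

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

d_in, d_out, rank = 8, 8, 2

# Steps 1 and 2: quantize the "pretrained" weights once, then never update them
pretrained = [[((i * d_in + j) % 7 - 3) / 10 for j in range(d_in)] for i in range(d_out)]
codes, scale = quantize_absmax_4bit(pretrained)
frozen_codes = [row[:] for row in codes]  # copy kept only to verify nothing changes

# Step 3: low-rank adapter; B starts at zero so the adapter is a no-op at first
A = [[0.1 * ((i + j) % 3 - 1) for j in range(d_in)] for i in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

def forward(x):
    base = matvec([[c * scale for c in row] for row in codes], x)  # dequantize on the fly
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

def loss(x, target):
    return 0.5 * sum((y - t) ** 2 for y, t in zip(forward(x), target))

x, target = [1.0] * d_in, [1.0] * d_out
loss_before = loss(x, target)

# Step 4: one gradient step on the adapter only. With B = 0 the gradient
# for A is zero on this first step, so only B moves here.
lr = 0.5
error = [y - t for y, t in zip(forward(x), target)]
Ax = matvec(A, x)
for i in range(d_out):
    for k in range(rank):
        B[i][k] -= lr * error[i] * Ax[k]

loss_after = loss(x, target)
trainable = rank * (d_in + d_out)
print(f"trainable adapter params: {trainable} vs frozen base params: {d_in * d_out}")
print(f"loss before: {loss_before:.4f}, after one adapter step: {loss_after:.4f}")
```

On an 8x8 matrix the adapter is not much smaller than the base, but the adapter grows with rank times width while the base grows with width squared, so at real model sizes the trainable share becomes tiny.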

PEFT (Parameter Efficient Fine-Tuning):

PEFT stands for "Parameter Efficient Fine-Tuning," and it is an open-source library developed by HuggingFace. PEFT aims to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. It includes various techniques designed to achieve parameter efficiency during fine-tuning.

PEFT offers different techniques, and some of them are:

- (Q)LoRA: PEFT implements LoRA (low-rank adaptation). Combined with 4-bit quantization of the base model, typically through the bitsandbytes library, this gives QLoRA as described above.

- Prefix Tuning: This technique prepends a small set of trainable continuous vectors (a "prefix") to the model's hidden states at each layer while the base model stays frozen. The related P-Tuning v2 paper shows that this style of prompt tuning can be comparable to fine-tuning across scales and tasks.

- P-Tuning: Introduced in the paper "GPT Understands, Too", P-Tuning learns continuous prompt embeddings through a small trainable prompt encoder instead of relying on hand-written prompts.

- Prompt Tuning: Introduced in the paper "The Power of Scale for Parameter-Efficient Prompt Tuning", this technique learns a small set of "soft prompt" embeddings that are prepended to the input, leaving the model weights untouched.

- IA3: Short for "Infused Adapter by Inhibiting and Amplifying Inner Activations", IA3 rescales the model's inner activations with small learned vectors, adding very few trainable parameters.

Overall, PEFT provides a suite of techniques that researchers and developers can use to fine-tune large language models efficiently for various language tasks, making it easier to deploy these models in resource-constrained environments.
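To give a sense of what "parameter efficient" means in numbers, the sketch below counts the trainable parameters LoRA would add to a 13B model. The figures are assumptions for illustration: a hidden size of 5120 and 40 transformer layers, adapters on the query and value projections only, and a rank of 8. The function name is made up.

```python
def lora_trainable_params(hidden_size, num_layers, rank, adapted_matrices_per_layer):
    """Each adapted square matrix gets A (rank x hidden) and B (hidden x rank)."""
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * adapted_matrices_per_layer * num_layers

# Assumed shapes for a 13B model; adapters on the query and value projections
adapter_params = lora_trainable_params(
    hidden_size=5120, num_layers=40, rank=8, adapted_matrices_per_layer=2
)
total_params = 13e9  # nominal size of the frozen base model
print(f"adapter parameters: {adapter_params:,}")
print(f"share of the full model: {adapter_params / total_params:.4%}")
```

Under these assumptions the adapters amount to a few million parameters, a small fraction of one percent of the base model, which is why only the adapter weights need gradients and optimizer state.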

