Fine-Tuning Large Language Models with PEFT (LoRA) and Rouge Score: A Comprehensive Hands-On Guide
Master the Art of Efficient Model Adaptation: Explore PEFT and LoRA Techniques for Fine-Tuning LLMs and Dive into Rouge Score Evaluation
Hi there, today we will take a large language model, fine-tune it with a Parameter-Efficient Fine-Tuning (PEFT) technique called LoRA (Low-Rank Adaptation of Large Language Models), and evaluate it with the Rouge score.
So let's get started quickly. FYI: I used a Kaggle notebook with a GPU P100 accelerator.
#install and load the important libraries
!pip install transformers==4.27.2
!pip install datasets==2.11.0
!pip install evaluate==0.4.0
!pip install rouge_score==0.1.2
!pip install loralib==0.1.1
!pip install peft==0.3.0
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer,GenerationConfig, TrainingArguments,Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
Load the train, test and validation datasets. We can also get the dataset from Hugging Face: huggingface_dataset_name = "knkarthick/dialogsum"
train_path = "/kaggle/input/training-dataset-peft-lora/dataset_peft_lora/knkarthick_dialogsum/train.csv"
test_path = "/kaggle/input/training-dataset-peft-lora/dataset_peft_lora/knkarthick_dialogsum/test.csv"
validation_path = "/kaggle/input/training-dataset-peft-lora/dataset_peft_lora/knkarthick_dialogsum/validation.csv"
import pandas as pd
dataset_train = pd.read_csv(train_path)
dataset_train.head(3)
dataset_test = pd.read_csv(test_path)
dataset_test.head(4)
dataset_validation = pd.read_csv(validation_path)
dataset_validation.head(4)
Now load the model
#########################
# Load the model
###############################
model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name,torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
torch_dtype=torch.bfloat16 (brain floating point) reduces memory requirements and speeds up computation.
AutoModelForSeq2SeqLM is used to load any seq2seq (encoder-decoder) architecture, like T5 and BART, while AutoModelForCausalLM is used for auto-regressive language models like the GPT family.
AutoTokenizer.from_pretrained() loads the tokenizer that matches the model checkpoint.
Tokenization:
Tokenization is the process of breaking down text into smaller units, such as words, subwords, or characters, that the language model can process. These units are called tokens. In the context of LLMs (Large Language Models), tokenization allows the model to convert the text into a format that can be input into the neural network. The model then processes these tokens sequentially, understanding and generating language based on them. Simply put, it converts String to Numbers as we know a model requires numbers/integers to process.
Embeddings:
Embeddings are numerical representations of tokens that capture the meaning and context of words or phrases in a high-dimensional space. In LLMs, once text is tokenized, each token is mapped to an embedding vector. These vectors encode semantic relationships, such that tokens with similar meanings have embeddings close to each other in this space. Embeddings allow the model to understand and manipulate language in a way that reflects the underlying relationships between words, making them a crucial component in tasks like language understanding, translation, and generation.
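To make these two ideas concrete, here is a small sketch using the tokenizer and model we loaded above (the exact token ids and shapes you see will depend on the tokenizer and checkpoint):
#quick sketch: string -> token ids -> embedding vectors
sample = "Summarize the following conversation."
encoded = tokenizer(sample, return_tensors="pt")
print(encoded.input_ids) #integer token ids
print(tokenizer.convert_ids_to_tokens(encoded.input_ids[0].tolist())) #the subword tokens behind those ids
embeddings = original_model.get_input_embeddings()(encoded.input_ids) #each id is mapped to a vector
print(embeddings.shape) #(batch_size, sequence_length, hidden_size)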
#PRINT the number of trainable parameters
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {trainable_model_params/all_model_params*100:.2f}%"
print(print_number_of_trainable_model_parameters(original_model))
The output shows how much of the large language model is actually available to train and fine-tune for a specific use case.
If most of the model's parameters are updated during fine-tuning, there is a higher risk of catastrophic forgetting: the model can lose general capabilities such as basic vocabulary, translation, or summarization.
Catastrophic Forgetting in the context of Large Language Models (LLMs) refers to the phenomenon where a model, after being trained on new data, loses or significantly degrades its ability to recall or perform on tasks it previously learned. This occurs because the model’s weights are updated to accommodate new information, potentially overwriting the knowledge gained from earlier data.
Example:
Imagine an LLM that has been trained on a dataset of scientific articles, allowing it to generate detailed explanations on scientific topics. Later, the same model is fine-tuned on a dataset of literary fiction to enhance its ability to generate creative stories.
Before Fine-tuning:
The model can accurately describe concepts like quantum mechanics or photosynthesis.
After Fine-tuning:
The model becomes more proficient in generating imaginative and stylistically rich text.
However, when asked to explain quantum mechanics or photosynthesis, the model might now produce less accurate or less detailed responses than before.
It’s a significant challenge in scenarios where models need to maintain a wide range of knowledge over time, especially when continuously learning from new data.
##################################
# PERFORM FULL FINE-TUNING
#################################
tokenized_datasets = {"train":Dataset.from_pandas(dataset_train), "test":Dataset.from_pandas(dataset_test),
"validation":Dataset.from_pandas(dataset_validation)}
def tokenize_function(example):
    start_prompt = "Summarize the following conversation \n\n"
    end_prompt = '\n\n Summary:'
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example["input_ids"] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example["labels"] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example
#tokenizing the datasets: train, validation, test
for k, v in tokenized_datasets.items():
    tokenized_datasets[k] = tokenized_datasets[k].map(tokenize_function, batched=True)
    tokenized_datasets[k] = tokenized_datasets[k].remove_columns(['id', 'topic', 'dialogue', 'summary'])
    tokenized_datasets[k] = tokenized_datasets[k].filter(lambda example, index: index % 500 == 0, with_indices=True) #keeps only every 500th example to speed things up; lower the modulus (e.g. 100) to keep more data and get better accuracy
#view tokenized dataset
tokenized_datasets
Now we will fine-tune the Flan T5 model with the datasets
#################################################
#FINE-TUNE THE MODEL WITH THE PREPROCESSED DATASET
#################################################
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
# training_args = TrainingArguments(
# output_dir = output_dir,
# learning_rate=1e-5,
# num_train_epochs=1,
# weight_decay=0.01,
# logging_steps = 1,
# max_steps=1
# )
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1, #increase for better training
    learning_rate=1e-4,
    output_dir=output_dir,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    report_to="none",
)
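A quick note on these settings: with per_device_train_batch_size=1 and gradient_accumulation_steps=4, gradients are accumulated over 4 forward/backward passes before each optimizer step, so the effective batch size is 1 x 4 = 4 per device.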
trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)
Call trainer.train() to start training the model.
trainer.train()
#output: TrainOutput(global_step=6, training_loss=44.125, metrics={'train_runtime': 444.022, 'train_samples_per_second': 0.056, 'train_steps_per_second': 0.014, 'total_flos': 16434176458752.0, 'train_loss': 44.125, 'epoch': 0.96})
trainer.save_model()
#note: is_trainable is not a tokenizer argument; it matters later, when reloading the PEFT adapter with PeftModel.from_pretrained(), where is_trainable=False blocks further training
Now we will load the model that we just saved to the output directory and call it "instruct_model".
#load the model we just trained and saved in output directory
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(output_dir,torch_dtype=torch.bfloat16)
device="cuda"
instruct_model.to(device)
#add .to(device) so the model lives on the GPU; otherwise you may hit: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
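If you want the same notebook to run on a CPU-only machine as well, a more portable variation (my own addition, not part of the original run) is to pick the device based on availability:
device = "cuda" if torch.cuda.is_available() else "cpu"
instruct_model = instruct_model.to(device)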
EVALUATE
First, we will evaluate the model manually.
###################################################
#EVALUATE THE MODEL QUALITATIVELY (HUMAN EVALUATION)
####################################################
index = 2 #pick a single test example
dialogue = dataset_test['dialogue'][index]
human_baseline_summary = dataset_test['summary'][index]
prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
#add .to(device) else RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
original_model_outputs = original_model.generate(input_ids = input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1,))
original_model_test_output = tokenizer.decode(original_model_outputs[0],skip_special_tokens=True)
instruct_model_outputs = instruct_model.generate(input_ids = input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_test_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
print(prompt)
print(f"BASELINE HUMAN SUMMARY:\n{human_baseline_summary}")
print("++++++++++++++++++")
print(f"ORIGINAL MODEL:\n{original_model_test_output}")
print("+++++++")
print(f"INSTRUCT MODEL:\n{instruct_model_test_output}")
#output
'''
Summarize the following conversation.
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Where were we?
#Person2#: This applies to internal and external communications.
#Person1#: Yes. Any employee who persists in using Instant Messaging will first receive a warning and be placed on probation. At second offense, the employee will face termination. Any questions regarding this new policy may be directed to department heads.
#Person2#: Is that all?
#Person1#: Yes. Please get this memo typed up and distributed to all employees before 4 pm.
Summary:
BASELINE HUMAN SUMMARY:
Ms. Dawson takes a dictation for #Person1# about prohibiting the use of Instant Message programs in the office. They argue about its reasonability but #Person1# still insists.
++++++++++++++++++
ORIGINAL MODEL:
A memo is being distributed to all employees by this afternoon.
+++++++
INSTRUCT MODEL:
This memo is to be distributed to all employees by this afternoon.
'''
#EVALUATING WITH 10 ROWS
dialogues = dataset_test['dialogue'][0:10]
human_baseline_summaries = dataset_test['summary'][0:10]
original_model_summaries = []
instruct_model_summaries = []
for _, dialogue in enumerate(dialogues):
    prompt = f"""
{dialogue}
Summary:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)
    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=["human_baseline_summaries", "original_model_summaries", "instruct_model_summaries"])
df
Now we will use the Rouge Metric
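Quick refresher: ROUGE compares a generated summary against a reference by counting overlapping n-grams. ROUGE-1 and ROUGE-2 measure unigram and bigram overlap, while ROUGE-L is based on the longest common subsequence. Here is a tiny standalone sanity check of the metric on toy strings (the scores it prints are only about these two toy sentences, not about our models):
#toy sanity check of the rouge metric
rouge = evaluate.load('rouge')
toy_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat lay on the mat"],
    use_stemmer=True,
)
print(toy_scores) #dict with rouge1, rouge2, rougeL, rougeLsum scores between 0 and 1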
######################################################
#EVALUATE THE MODEL QUANTITATIVELY (with ROUGE METRIC)
#####################################################
#initialize the rouge metric
rouge = evaluate.load('rouge')
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print("ORIGINAL MODEL")
print(original_model_results)
print("INSTRUCT MODEL")
print(instruct_model_results)
Let's summarize the improvement of the instruct model over the original model.
print("ABSOLUTE improvement of INSTRUCT MODEL over ORIGINAL MODEL (in ROUGE points)")
improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value:.2f}')
PART 2: LoRA vs PEFT
LoRA (Low-Rank Adaptation)
LoRA is a specific technique used within the broader category of parameter-efficient fine-tuning (PEFT) methods. It focuses on reducing the number of parameters that need to be fine-tuned in a large model by introducing low-rank updates. The core idea is to decompose the weight updates during fine-tuning into a product of two low-rank matrices, significantly reducing the number of parameters that need to be trained.
How LoRA Works:
- Low-Rank Matrices: During fine-tuning, instead of updating the full weight matrix in a neural network layer, LoRA approximates these updates by learning two smaller matrices. These matrices, when multiplied together, approximate the changes that would have been made to the full matrix.
- Efficiency: By using these low-rank approximations, LoRA can significantly reduce the number of parameters that need to be stored and computed during fine-tuning, making the process more efficient both in terms of memory and computation.
Applications:
LoRA is particularly useful when fine-tuning very large models, such as GPT-style models, where updating the entire model would be prohibitively expensive. It allows for fine-tuning to be done with fewer computational resources while still achieving effective model adaptation to new tasks or domains.
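To make the low-rank idea concrete, here is a minimal torch sketch of what LoRA does to a single weight matrix. The sizes and values are purely illustrative; in the real implementation the two factors are trained while W stays frozen.
import torch
d, r = 512, 8 #hidden size and LoRA rank (illustrative numbers)
W = torch.randn(d, d) #frozen pre-trained weight
A = torch.randn(r, d) * 0.01 #trainable low-rank factor A
B = torch.zeros(d, r) #trainable low-rank factor B (zero-initialized, so training starts from the original W)
alpha = 16 #scaling factor
delta_W = (alpha / r) * (B @ A) #the low-rank update
effective_W = W + delta_W #what the adapted layer effectively uses
print(d * d, d * r + r * d) #262144 full parameters vs 8192 trainable LoRA parameters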
PEFT (Parameter-Efficient Fine-Tuning)
PEFT is a broader category of techniques aimed at making the fine-tuning of large pre-trained models more efficient by only modifying a small subset of the model’s parameters or introducing lightweight auxiliary components. LoRA is one specific method within the PEFT framework.
Key Concepts in PEFT:
- Adapters: These are small neural networks inserted into a pre-trained model that can be fine-tuned while keeping the original model’s parameters mostly unchanged.
- Partial Fine-Tuning: Instead of updating all the layers of the model, only certain layers or components (like the final layers or attention heads) are fine-tuned.
- Modular Fine-Tuning: PEFT methods often allow for the re-use of pre-trained models across multiple tasks by swapping out or retraining only the relevant parts, making it easier and more efficient to adapt models to new tasks.
Benefits of PEFT:
- Resource Efficiency: Reduces the computational and memory requirements for fine-tuning large models.
- Flexibility: Makes it easier to adapt large models to multiple tasks without requiring extensive retraining.
- Preservation of General Knowledge: By fine-tuning only a small subset of parameters, PEFT methods help retain the general knowledge encoded in the pre-trained model.
- Reduced Overfitting: By fine-tuning fewer parameters, there’s often a reduced risk of overfitting to the new task, preserving the generalization capability of the model.
Summary:
- LoRA is a specific PEFT method that uses low-rank matrices to efficiently fine-tune large models by reducing the number of parameters that need updating.
- PEFT is a broader category that includes LoRA and other techniques designed to make fine-tuning large models more efficient by only adjusting a small subset of parameters or adding lightweight auxiliary components.
LLM Transformer Architecture
PEFT with Lora Architecture
The diagram shows where LoRA sits alongside the attention and feed-forward weights.
In the diagrams above, we can see that only a few parts of the model architecture are tuned. Inside the LoRA block we have the pre-trained weights (frozen) as well as the new fine-tuned weights from PEFT/LoRA training, and the two are summed.
Common Types of PEFT
1. Adapters
Adapters are small neural networks inserted into each layer of a pre-trained model. During fine-tuning, only the parameters of these adapter layers are updated, while the rest of the model remains frozen.
- How It Works: Adapters add a small bottleneck layer in each transformer layer. These bottleneck layers have far fewer parameters than the original model layers, making them much lighter to train.
- Use Case: Adapters are useful in multi-task learning, where the same model is fine-tuned on multiple tasks, each with its own set of adapters.
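As a rough sketch of the adapter idea described above (a generic illustration, not taken from any particular adapter library), an adapter is just a small bottleneck MLP with a residual connection:
import torch.nn as nn
class Adapter(nn.Module):
    #bottleneck adapter: down-project, non-linearity, up-project, residual add
    def __init__(self, hidden_size, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()
    def forward(self, x):
        return x + self.up(self.act(self.down(x))) #residual keeps the frozen path intact
Only these few extra parameters are trained; the surrounding transformer layer stays frozen.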
2. LoRA (Low-Rank Adaptation):
As already explained above.
3. Prefix-Tuning
Prefix-tuning fine-tunes a model by learning a small set of continuous "prefix" vectors while the original model weights stay frozen.
- How It Works: Trainable prefix vectors are prepended to the keys and values of each attention layer; only these vectors are optimized during fine-tuning.
- Use Case: Prefix-tuning is effective in tasks where the input context can significantly steer the model's output, such as text generation or language translation.
4. Prompt-Tuning
Prompt-tuning involves fine-tuning a model by learning a small set of additional parameters that are prepended or appended to the input prompt.
- How It Works: Similar to prefix-tuning, prompt-tuning adjusts special tokens that are added to the input prompt. These tokens are optimized during fine-tuning to guide the model toward desired outputs.
- Use Case: Prompt-tuning is particularly useful for large language models when adapting to new tasks with minimal computational overhead.
5. BitFit (Bias-Only Fine-Tuning)
BitFit fine-tunes only the bias terms in the model’s layers, leaving the other parameters unchanged.
- How It Works: During fine-tuning, only the bias parameters of the model’s layers are updated. This drastically reduces the number of parameters being modified, while still allowing the model to adapt to new tasks.
- Use Case: BitFit is used when the goal is to perform minimal adjustments to the model, making it a very lightweight fine-tuning approach.
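A minimal BitFit-style sketch (a generic pattern, my own illustration): freeze everything, then re-enable gradients only for parameters whose name contains "bias". Note that T5-family models like FLAN-T5 have few or no bias terms, so BitFit is more meaningful for BERT-style models; the pattern itself is model-agnostic.
#BitFit-style freezing: train only the bias terms
def apply_bitfit(model):
    for name, param in model.named_parameters():
        param.requires_grad = "bias" in name #only bias parameters stay trainable
    return model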
6. Layer-Freezing
In layer-freezing, certain layers of the model are kept fixed (frozen) during fine-tuning, while only a subset of layers (usually the higher or lower ones) are fine-tuned.
- How It Works: Typically, only the last few layers or the first few layers of the model are fine-tuned, depending on where the new task differs most from the pre-trained tasks.
- Use Case: Layer-freezing is beneficial when the model’s earlier layers (which capture more general features) are expected to remain useful across tasks, reducing the need to fine-tune the entire model.
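A minimal layer-freezing sketch on a fresh copy of our base model (an illustrative variation, not something we train in this article; the module names below follow the Hugging Face T5 implementation):
#freeze everything, then unfreeze only the last decoder block and the output head
frozen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
for param in frozen_model.parameters():
    param.requires_grad = False #freeze the whole model
for param in frozen_model.decoder.block[-1].parameters():
    param.requires_grad = True #unfreeze the last decoder block
for param in frozen_model.lm_head.parameters():
    param.requires_grad = True #unfreeze the output head (this part alone would be head-tuning)
print(print_number_of_trainable_model_parameters(frozen_model))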
7. Head-Tuning
Head-tuning focuses on fine-tuning only the output layer or “head” of the model, which is responsible for producing the final task-specific output.
- How It Works: The entire model is kept frozen except for the final layer, which is fine-tuned to produce task-specific outputs, like classification labels.
- Use Case: Head-tuning is effective in classification tasks where the underlying representations do not need to change, only the final mapping to labels.
8. Side-Tuning
Side-tuning involves adding a small side network that is trained alongside the pre-trained model. The side network’s output is combined with the main model’s output during inference.
- How It Works: A small neural network is added to the side of the pre-trained model. This side network is trained for the new task, while the main model remains frozen. The outputs from both networks are combined to make the final prediction.
- Use Case: Side-tuning is useful when integrating new information or tasks without disrupting the original model’s performance.
9. HyperNetworks
HyperNetworks generate weights for certain parts of the model based on a smaller auxiliary model. The generated weights are used for specific tasks.
- How It Works: A smaller network (the hypernetwork) generates task-specific weights for parts of the main model. These generated weights are used instead of, or in addition to, the original model weights during fine-tuning.
- Use Case: HyperNetworks are useful when fine-tuning on highly specialized tasks where task-specific weight adjustments can significantly improve performance.
Now let’s get back to our code. In this article, we will use LoRA.
################################################
#PERFORM PARAMETER-EFFICIENT FINE-TUNING (PEFT)
################################################
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    r=32, #rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q", "v"], #apply LoRA to the query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM #FLAN-T5 is a sequence-to-sequence model
)
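A few notes on these settings: r is the rank of the two low-rank matrices, and in peft the update is scaled by lora_alpha / r (here 32 / 32 = 1). target_modules=["q", "v"] applies LoRA only to the query and value projections of the attention blocks, which mirrors the choice in the original LoRA paper and keeps the number of trainable parameters small.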
#print number of trainable parameters in the PEFT MODEL
peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))
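As far as I know, peft also ships a built-in helper that reports the same information, so you can use it instead of our custom function:
peft_model.print_trainable_parameters() #prints trainable params, all params, and the trainable percentage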
Add LoRA adapter layers/parameters to the original LLM to be trained
###################################################
#Add LoRA adapter layers/parameters to the original LLM to be trained
##################################################
#TRAIN PEFT Adapter
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, #higher learning rate than full fine-tuning
    num_train_epochs=1, #increase for more accuracy
    logging_steps=1,
    max_steps=1,
    report_to="none" #else it will ask for a https://wandb.ai/ api key
)
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
#if you were preparing the model for further training, you would set is_trainable=True when loading the PEFT model
Now Train the PEFT model
#################################
#Train the peft model
################################
peft_trainer.train()
peft_model_path="./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)
Load the PEFT MODEL
############################################################
#Load pre-trained peft model from save dir
##########################################################
from peft import PeftModel, PeftConfig
peft_model_base= AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base",torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base") #same tokenizer as used for original_model & instruct_model
peft_model = PeftModel.from_pretrained(peft_model_base,
                                       './peft-dialogue-summary-checkpoint-local',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)
#the number of trainable parameters will be 0 due to is_trainable=False
print(print_number_of_trainable_model_parameters(peft_model))
EVALUATE
###################################################
#EVALUATE the MODEL Qualitatively(HUMAN EVALUATION)
###################################################
index=200
dialogue = dataset_test['dialogue'][index]
human_baseline_summary = dataset_test['summary'][index]
prompt= f"""
Summarize the following conversation.
{dialogue}
Summary: """
input_ids= tokenizer(prompt, return_tensors="pt").input_ids.to(device)
original_model_outputs = original_model.generate(input_ids = input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1,))
original_model_text_output = tokenizer.decode(original_model_outputs[0],skip_special_tokens=True)
instruct_model_outputs = instruct_model.generate(input_ids = input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
peft_model_outputs = peft_model.generate(input_ids = input_ids,generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0],skip_special_tokens=True)
print(f"BASELINE HUMAN SUMMARY:\n{human_baseline_summary}")
print("++++++++++++++++++")
print(f"ORIGNIAL MODEL:\n{original_model_text_output}")
print("+++++++")
print(f"INSTURCT MODEL:\n{instruct_model_text_output}")
print("+++++++++")
print(f"PEFT MODEL:\n{peft_model_text_output}")
Evaluate with Rouge Metric
#####################################################
#EVALUATE THE MODEL QUANTITATIVELY (with ROUGE Metric)
#####################################################
dialogues = dataset_test['dialogue'][0:10]
human_baseline_summaries = dataset_test['summary'][0:10]
original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []
for _, dialogue in enumerate(dialogues):
    prompt = f"""
{dialogue}
Summary:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)
    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_model_text_output)
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries,peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns = ["human_baseline_summaries","original_model_summaries","instruct_model_summaries","peft_model_summaries"])
df
####################################################
#compute rouge
#####################################################
rouge = evaluate.load('rouge')
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print("ORIGINAL MODEL")
print(original_model_results)
print("INSTRUCT MODEL")
print(instruct_model_results)
print("PEFT MODEL")
print(peft_model_results)
Evaluate with a larger dataset
#Compare the PEFT model's performance against the other models. Here we simply reuse the DataFrame of summaries we just built; for a more reliable comparison, generate summaries for a larger slice of the test set.
results = df
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries = results['peft_model_summaries'].values
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print("ORIGINAL MODEL")
print(original_model_results)
print("INSTRUCT MODEL")
print(instruct_model_results)
print("PEFT MODEL")
print(peft_model_results)
print("ABSOLUTE percentage improvement of PEFT MODEL over HUMAN BASELINE")
improvement = (np.array(list(peft_model_results.values())) - np.array(list(oringial_model_results.values())))
for key,value in zip(instruct_model_results.keys(),improvement):
print(f'{key}:{value:.2f}')
print("ABSOLUTE percentage improvement of PEFT MODEL over INSTRUCT MODEL")
improvement = (np.array(list(peft_model_results.values())) - np.array(list(instrct_model_results.values())))
for key,value in zip(instruct_model_results.keys(),improvement):
print(f'{key}:{value:.2f}')
That's it, we are done!
You can check out the whole implementation on Kaggle; the link is below. Make sure you enable a GPU accelerator for faster processing.
I hope you enjoyed the article; I tried my best to keep it simple. You can also use it as a template to get started with more complex use cases. Enjoy!
Next, we will learn how to train large language models (LLMs) with reinforcement learning algorithms such as Proximal Policy Optimization (PPO).
Until then, feel free to reach out. Thanks for your time; if you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo. https://medium.com/@bobrupakroy
Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.
Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy
Let me know if you need anything. Talk Soon.
Until then Enjoy Machine Learning!