Fine-Tuning LLMs with Reinforcement Learning (PPO) and PEFT: Automating RLHF

Rupak (Bob) Roy - II

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm in which an agent takes actions in an environment to maximize its cumulative reward. The agent's behavior is governed by a policy, and the goal of reinforcement learning is to learn an optimal, or nearly optimal, policy that maximizes the reward function.
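To build some intuition for what PPO actually optimizes, here is a minimal, illustrative sketch (not TRL's actual code) of the clipped surrogate objective that gives PPO its name: the probability ratio between the new and the old policy is clipped, so a single update cannot push the policy too far from where it started.

#illustrative sketch of PPO's clipped surrogate loss (not the TRL implementation)
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)                      # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()                      # minimize the negative objective

The TRL library we use below implements this objective (together with a value loss and KL control) inside its PPOTrainer, so we never have to write it by hand.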

Backyard Berries … Malki Forest, Shillong

Hi everyone. Previously we walked through a comprehensive guide to training an LLM with PEFT and LoRA; here is the link to that article.

Today we will take that trained model further by applying reinforcement learning with Proximal Policy Optimization (PPO).

This article will be more about code than theory, so let's start our journey.

We’ll fine-tune a FLAN-T5 model to generate less toxic content using Meta AI’s hate speech reward model. This reward model is a binary classifier that predicts whether a given text is “not hate” or “hate.” We’ll use Proximal Policy Optimization (PPO) to fine-tune the model and reduce its toxicity.

#install the libraries
%pip install --upgrade pip
%pip install --disable-pip-version-check \
torch==1.13.1 \
torchdata==0.5.1

%pip install \
transformers==4.27.2 \
datasets==2.11.0 \
evaluate==0.4.0 \
rouge_score==0.1.2 \
peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd
#!pip install trl==0.4.4 #for PPO

#!pip install loralib==0.1.1

#load the libs

from datasets import Dataset
from transformers import pipeline, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer

#trl: Transformer Reinforcement Learning Library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import time
import evaluate
import pandas as pd
import numpy as np

#tqdm library makes the loops show a smart progress meter
from tqdm import tqdm
tqdm.pandas()

#load the datasets
train_path = r"/kaggle/input/training-dataset-peft-lora/dataset_peft_lora/knkarthick_dialogsum/train.csv"
test_path = r"/kaggle/input/training-dataset-peft-lora/dataset_peft_lora/knkarthick_dialogsum/validation.csv"
import pandas as pd
dataset_train = pd.read_csv(train_path)
dataset_train.head(3)
dataset_test = pd.read_csv(test_path)
dataset_test.head(4)
dataset preview
#Pre-Process the dataset

def build_dataset(model_name,
                  dataset,
                  input_min_text_length,
                  input_max_text_length):

    #load dataset (only the "train" part will be enough for this lab)
    #dataset = load_dataset(dataset_name, split="train")
    dataset = Dataset.from_pandas(dataset)
    #filter the dialogues of length between input_min_text_length and input_max_text_length characters
    dataset = dataset.filter(lambda x: len(x['dialogue']) > input_min_text_length and len(x['dialogue']) <= input_max_text_length)

    #prepare tokenizer, setting device_map="auto" allows switching between GPU and CPU automatically
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):

        #wrap each dialogue with the instruction
        prompt = f"""
Summarize the following conversation.
{sample['dialogue']}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        #this must be called "query", which is a requirement of our PPO library
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    #tokenize each dialogue
    dataset = dataset.map(tokenize)
    dataset.set_format(type="torch")

    #split the dataset into train and test parts
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits


model_name = 'google/flan-t5-base'

dataset = build_dataset(model_name=model_name,
                        dataset=dataset_train,
                        input_min_text_length=200,
                        input_max_text_length=1000)

Load pre-trained PEFT model from local (trained previously)

#FUNCTION TO PRINT the number of trainable parameters
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model\n parameters:{trainable_model_params}\n all model parameters {all_model_params}\n percentage of trainable model: {trainable_model_params/all_model_params*100}"

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, #RANK
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM #FLAN-T5
)
############################################################
#Load pre-trained peft model from local (trained previously)
##########################################################
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       '/kaggle/input/peft-dialogue-summary-checkpoint-local/transformers/default/1/peft-dialogue-summary-checkpoint-local/',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False,
                                       #device_map="auto", #added
                                       )

print(print_number_of_trainable_model_parameters(peft_model))

trainable model
parameters:0
all model parameters 251116800
percentage of trainable model: 0.0

Create a PPO model from peft_model and a frozen reference model

############################
# PPO MODEL + REFERENCE MODEL
############################
#create a ppo model (with a value head) from peft_model
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f"PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}")
print(ppo_model.v_head)

#Only a few parameters, the "ValueHead", will be updated during PPO.
#The number of trainable parameters can be computed as (n+1)*m,
#where n is the number of input units (here n=768) and m is the number of output units (here m=1).
#The +1 term accounts for the bias.

#Next we will create a frozen copy (a reference model) of the PPO model which will not be fine-tuned.
#The reference model represents the LLM before detoxification.

PPO model parameters to be updated (ValueHead + 769 params):
trainable model
parameters:769
all model parameters 251117569
percentage of trainable model: 0.0003062310626302694

ValueHead(
(dropout): Dropout(p=0.1, inplace=False)
(summary): Linear(in_features=768, out_features=1, bias=True)
(flatten): Flatten(start_dim=1, end_dim=-1) )
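As a quick sanity check of the (n+1)*m formula from the comment above (n=768 input units, m=1 output unit):

#quick check of the (n + 1) * m formula for the ValueHead parameter count
n, m = 768, 1
print((n + 1) * m)   #769 = 768 weights + 1 bias, matching the trainable parameter count above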

#Reference Model: a model that is not fine-tuned at all, not even with PEFT.
ref_model = create_reference_model(ppo_model)
print(f"Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n")

Reference model parameters to be updated:
trainable model
parameters:0
all model parameters 251117569
percentage of trainable model: 0.0

Now we will prepare a REWARD MODEL

Reinforcement Learning (RL) is a type of machine learning where agents make decisions in an environment to maximize their cumulative rewards. The agent’s actions are guided by a policy, and the goal of reinforcement learning is to enable the agent to learn an optimal, or nearly-optimal, policy that maximizes the reward function.

In the previous section, the original policy was based on the instruct PEFT model, which represents the LLM before any detoxification. One could involve human labelers to provide feedback on the toxicity of the model’s outputs, but this approach can be costly if used throughout the entire fine-tuning process. A more practical alternative is to use a reward model that encourages the agent to detoxify the dialogue summaries. An intuitive method would involve sentiment analysis across two categories (nothate and hate), rewarding the model more when the output is classified as nothate.

For instance, instead of relying on human labelers for the entire fine-tuning process, you can utilize a reward model to generate feedback.

To achieve this, you’ll use Meta AI’s RoBERTa-based hate speech model as the reward model. This model will produce logits and predict probabilities across the two classes: nothate and hate. The logit associated with the nothate output will be treated as a positive reward, and the model will be fine-tuned using PPO based on these reward values.

To proceed, you need to create an instance of the appropriate model class for the RoBERTa model. Additionally, load a tokenizer to test the model. In this setup, label 0 corresponds to the class nothate, and label 1 corresponds to hate.
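In other words, the reward for a piece of text is simply the logit of the nothate class. As a minimal sketch (assuming the toxicity_model, toxicity_tokenizer, and device that are defined in the code below), the mapping looks like this:

#illustrative helper: the nothate logit of the classifier is the PPO reward
def get_nothate_reward(text):
    ids = toxicity_tokenizer(text, return_tensors="pt").input_ids.to(device)
    logits = toxicity_model(input_ids=ids).logits   #shape [1, 2] -> [nothate, hate]
    return logits[0, 0].item()                      #higher = less toxic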

########################################
# PREPARE A REWARD MODEL
#######################################

#Reinforcement Learning (RL) is a part of machine learning in which agents take actions in an environment aiming to maximize their cumulative rewards.
#The agent's behavior is defined by the policy.


toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

#take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, probabilities, and the corresponding reward that will be used for fine-tuning

Check the toxicity of the Reward model

################################
#1 TEST TOXICITY of Reward model
################################
device = "cuda"

#add .to(device) else RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)


non_toxic_text = " i want to kiss you"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids.to(device) #added

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f"logits [not hate, hate]: {logits.tolist()[0]}")

#print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=1).tolist()[0]
print(f"probabilities [not hate, hate]: {probabilities}")

#get the logits for "not hate" - this is the reward
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f"reward (high): {nothate_reward}")

logits [not hate, hate]: [1.6786898374557495, -1.5461454391479492]
probabilities [not hate, hate]: [0.9617581963539124, 0.03824174404144287]
reward (high): [1.6786898374557495]

################################
#2 TEST TOXICITY of Reward model
################################

toxic_text = "you are disgusting and terrible and i damm hate you "

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids.to(device) #added

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f"logits [not hate, hate]: {logits.tolist()[0]}")

#print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=1).tolist()[0]
print(f"probabilities [not hate, hate]: {probabilities}")

#get the logits for "not hate" - this is the reward (low for toxic text)
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f"reward (low): {nothate_reward}")

logits [not hate, hate]: [-2.809473991394043, 2.475832462310791]
probabilities [not hate, hate]: [0.005039950367063284, 0.9949600696563721]
reward (low): [-2.809473991394043]

device = 0 if torch.cuda.is_available() else "cpu"


sentiment_pipe = pipeline("sentiment-analysis",
                          model=toxicity_model_name,
                          #framework="pt", #add if error, "pt" = PyTorch
                          device=device)

reward_logits_kwargs = {
    "top_k": None, #return all scores
    "function_to_apply": "none", #set to "none" to retrieve raw logits
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, #return all scores
    "function_to_apply": "softmax", #set to "softmax" to apply softmax and retrieve probabilities
    "batch_size": 16
}

print("Reward model output for non-toxic text:")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("\nReward model output for toxic text:")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output for non-toxic text:
[{'label': 'nothate', 'score': 1.6786898374557495}, {'label': 'hate', 'score': -1.5461454391479492}]
[{'label': 'nothate', 'score': 0.9617581963539124}, {'label': 'hate', 'score': 0.03824174404144287}]

Reward model output for toxic text:
[{'label': 'hate', 'score': 2.475832462310791}, {'label': 'nothate', 'score': -2.809473991394043}]
[{'label': 'hate', 'score': 0.9949600696563721}, {'label': 'nothate', 'score': 0.005039950367063284}]

Evaluate the Toxicity

#########################
#EVALUATE Toxicity
#######################

toxicity_evaluator = evaluate.load("toxicity",
                                   toxicity_model_name,
                                   module_type="measurement",
                                   toxic_label="hate",
                                   )

toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])
print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.03824174404144287]

Toxicity score for toxic text:
[0.9949600696563721]

#function to evaluate toxicity across a dataset of prompts

def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):
    max_new_tokens = 100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]
        if i > num_samples:
            break
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)
        response_token_ids = model.generate(input_ids=input_ids, generation_config=generation_config)
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    #compute mean & std
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

Evaluate Toxicity with PEFT_fine_tune_model/BEFORE DETOX

#########################################################
#Evaluate Toxicity with PEFT_fine_tune_model/BEFORE DETOX
#########################################################
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto") #'google/flan-t5-base'
mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model, #create_reference_model(ppo_model/peft_fine_tune_model)
                                                                           toxicity_evaluator=toxicity_evaluator,
                                                                           tokenizer=tokenizer, #'google/flan-t5-base'
                                                                           dataset=dataset["test"],
                                                                           num_samples=10)
print(f'toxicity [mean,std] before detox:{mean_before_detoxification,std_before_detoxification}')

toxicity [mean,std] before detox:(0.010048318749547681, 0.014725097147429488)

PERFORM Fine-Tuning to Detoxify the summaries

#####################################
## PERFORM Fine-Tuning to Detoxify the summaries
#########################################


# Initialize the PPOTrainer
# Load the ppo_model and tokenizer, along with a frozen version of the model called ref_model.
# The first model will be optimized, while the second model is used as a reference to calculate the KL-divergence
# from the starting point. This serves as an additional reward signal during PPO training, ensuring the optimized model
# does not deviate significantly from the original LLM.
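Roughly speaking, the reward that PPO optimizes combines the reward-model score with a penalty proportional to the KL divergence between the policy and the frozen reference model. A simplified sketch (illustrative only, not TRL's exact implementation):

#simplified view of the KL-shaped reward used during PPO training (illustrative only)
def shaped_reward(reward_model_score, logprob_policy, logprob_ref, kl_coef=0.2):
    kl_estimate = logprob_policy - logprob_ref         #how far the policy drifted from the reference
    return reward_model_score - kl_coef * kl_estimate  #high reward and low drift is best

In TRL the KL coefficient is managed for you (adaptively by default), so below we only set the standard PPO hyperparameters.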

learning_rate = 1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model, #original model from the previous lab
                         tokenizer=tokenizer,
                         dataset=dataset['train'],
                         data_collator=collator)
#############
# FINE-TUNE THE MODEL
############

output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, #return all scores
    "function_to_apply": "none",
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    #break when we reach max_steps
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    #get responses from the FLAN-T5/PEFT LLM
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    #this needs to be called "response"
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    #compute reward outputs
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)
    #print(not_hate_index, rewards)
    #use the nothate item because this is the score for the positive nothate class
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

    #run PPO step
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f"objective/kl: {stats['objective/kl']}")
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

Now EVALUATE THE MODEL QUANTITATIVELY #AFTER DETOX

################################################
#EVALUATE THE MODEL QUANTITATIVELY #AFTER DETOX
################################################

#load the PPO/PEFT model back from disk and use the test dataset split to evaluate the toxicity score of the RL-fine-tuned model

mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model,
                                                                        toxicity_evaluator=toxicity_evaluator,
                                                                        tokenizer=tokenizer, #'google/flan-t5-base'
                                                                        dataset=dataset["test"],
                                                                        num_samples=10)

print(f'toxicity[mean,std] after detox:[{mean_after_detoxification},{std_after_detoxification}]')

toxicity[mean,std] after detox:[0.021689416068098086,0.0420182407059758]
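Optionally, you can express the change as a relative improvement over the baseline. Keep in mind that with only 10 evaluation samples and max_ppo_steps = 10 the estimates are quite noisy, so the mean toxicity may not drop in such a short run (as happened here); more PPO steps and more evaluation samples usually help.

#optional: relative change of the toxicity score after detoxification
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification
print("Percentage change of toxicity score after detoxification:")
print(f"mean: {mean_improvement * 100:.2f}%")
print(f"std: {std_improvement * 100:.2f}%")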

CREATE RESPONSE DATASET

########################################
# CREATE RESPONSE DATASET
#######################################

batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]

#display the results

pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted
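As an optional summary statistic, the average reward gain across the compared batch gives a single number for how much the detoxified model improved:

#optional: average reward improvement across the compared batch
print(f"mean reward_diff: {df_compare_results_sorted['reward_diff'].mean():.4f}")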

I know it's a long article, so I have made the full notebook available on my Kaggle.

Proximal Policy Optimization (PPO) Workflow

I hope this article is helpful. Feel free to use the template and re-innovate Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Until then, enjoy machine learning.

Next, we will go deeper into training large language models (LLMs) with reinforcement learning algorithms such as Proximal Policy Optimization (PPO).

Feel free to reach out. Thanks for your time; if you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo. https://medium.com/@bobrupakroy

Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.

Malki Forest, Shillong, Meghalaya
