Evaluating RAG Models with RAGAS: A New Benchmark for RAG
Traditional metrics often fall short when it comes to assessing the nuanced abilities of large language models (LLMs), especially in specialized tasks. One of the most promising advancements in this area is the use of Retrieval-Augmented Generation (RAG) combined with Retrieval-Augmented Generation Assisted Scoring (RAGAS). This article explores how RAG works, why RAGAS is essential for evaluation, and how these tools can be used to assess the effectiveness of LLMs.
We already know what RAG is and how it works; a detailed description of RAG Fusion was covered in my previous article.
So, in this article, we will concentrate on evaluating RAG.
RAGAS
Evaluating RAG models for production-ready use cases poses unique challenges. Traditional metrics like BLEU, ROUGE, and METEOR focus on the overlap between generated and reference texts, which might not fully capture the quality of responses generated by RAG models. These metrics often overlook the relevance of retrieved information and how well it is integrated into the final output.
RAGAS, or Retrieval-Augmented Generation Assisted Scoring, addresses these limitations by providing a more holistic evaluation framework. RAGAS considers both the retrieval and generation aspects of a RAG model, ensuring that the entire process — from document retrieval to response generation — is assessed. This is particularly important for tasks like open-domain question answering, where the accuracy and relevance of information are paramount.
Key Components of RAGAS Evaluation
RAGAS evaluates RAG models using a multi-faceted approach, focusing on:
- Relevance: How well do the retrieved documents match the query? RAGAS assesses whether the information brought in by the retrieval step is pertinent to the user’s query.
- Integration: How effectively is the retrieved information incorporated into the generated response? RAGAS evaluates the coherence between the retrieved content and the generated text, ensuring that the response is both accurate and contextually appropriate.
- Semantic Accuracy: Beyond word-level matching, RAGAS considers the semantic content of the response, ensuring that the meaning aligns with the query’s intent and the retrieved information.
- Contextual Appropriateness: RAGAS also looks at how well the generated response fits within the broader context of the conversation or task at hand, ensuring that it is not only factually correct but also relevant and useful.
What’s intriguing about RAGAS is that it originally emerged as a framework for “reference-free” evaluation. This approach doesn’t rely on human-annotated ground truth labels in the evaluation dataset. Instead, RAGAS utilizes LLMs to perform the evaluations autonomously.
To evaluate a RAG pipeline, RAGAS requires the following information:
- question: The user query that serves as the input to the RAG pipeline.
- answer: The generated response produced by the RAG pipeline.
- contexts: The contexts retrieved from the external knowledge source used to generate the answer.
- ground_truths: The correct answer to the question, which is the only human-annotated information needed. This is required only for the context_recall metric (see Evaluation Metrics).
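Put together, a single evaluation record has roughly the following shape (the values below are made-up placeholders; ground_truths can be dropped if you do not use context_recall):
# A hypothetical single evaluation record in the format RAGAS expects
sample_record = {
    "question": "What did the prime minister say about climate change?",
    "answer": "The prime minister committed to reducing carbon emissions.",
    "contexts": ["...retrieved passage 1...", "...retrieved passage 2..."],
    "ground_truths": ["The prime minister committed to reducing carbon emissions."],
}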
Evaluation Metrics: RAGAS provides a wide range of metrics to evaluate the RAG pipeline. Here they are:
Component-Wise Evaluation
- Faithfulness
- Answer relevancy
- Context recall
- Context precision
- Context utilization
- Context entity recall
- Summarization Score
Metrics and their definitions:
- ragas.metrics.answer_relevancy: Scores the relevancy of the answer according to the given question.
- ragas.metrics.answer_similarity: Scores the semantic similarity between the ground truth and the generated answer.
- ragas.metrics.answer_correctness: Measures answer correctness compared to the ground truth as a combination of factuality and semantic similarity.
- ragas.metrics.context_precision: Uses average precision to evaluate whether the relevant retrieved items are ranked higher than the irrelevant ones.
- ragas.metrics.context_recall: Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.
- ragas.metrics.context_entity_recall: Calculates recall based on the entities present in the ground truth and the context.
- ragas.metrics.summarization_score: Measures how well the summary captures the important information from the contexts. The intuition behind this metric is that a good summary should contain all the important information present in the context.
All metrics are scaled to the range [0, 1], with higher values indicating a better performance.
There are also other end-to-end RAG pipeline evaluation metrics available like Answer semantic similarity, Answer Correctness, and more.
Here is the link to the official RAGAS documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
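Depending on your ragas version, the metric objects listed above can usually be imported directly from ragas.metrics; the exact names may differ slightly between releases, so treat this as a sketch:
# Importing the metric objects discussed above (names may vary by ragas version)
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_precision,
    context_recall,
    context_entity_recall,
)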
Now let’s see how we can implement RAGAS.
# install the libraries
!pip install ragas
################################################
# Load the dataset
################################################
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# Dataset link: https://huggingface.co/datasets/explodinggradients/amnesty_qa
#from datasets import load_dataset
#loading the V2 dataset
#amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
loader = TextLoader(r'state_of_the_union.txt',encoding="utf8")
documents = loader.load()
# Chunk the data
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
################################################
############ CREATE Vector Store ###############
################################################
#-------------CREATE Vector EMBEDDINGS------------------------------------
#----------- Load the Embedding model ---------------------
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cpu"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
#-------------Create Vector Store--------------------------
# Use FAISS vector DB
from langchain.vectorstores.faiss import FAISS
vc_db = FAISS.from_documents(chunks, embeddings)
vc_db.save_local("vc_db_ragas")
vc_db = FAISS.load_local("vc_db_ragas", embeddings,allow_dangerous_deserialization=True)
############### Retrival ########################
#################################################
# Initialize the retriever
# retriever = vectordb.as_retriever(search_kwargs={"k": 3})
retriever = vc_db.as_retriever()
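As a quick sanity check (the query string below is just an example), you can look at what the retriever returns before wiring it into a chain:
# Inspect the top retrieved chunks for a sample query
sample_docs = retriever.get_relevant_documents("What did the president say about the economy?")
print(len(sample_docs))
print(sample_docs[0].page_content[:200])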
Now we will create the RAG pipeline.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
# Define LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Define prompt template
template = """Utilize the retrieved context below to answer the question.
If you're unsure of the answer, simply state that you don't know and apologize.
Keep your response concise, limited to two sentences.
Question: {question}
Context: {context}
"""
prompt = ChatPromptTemplate.from_template(template)
# Setup RAG pipeline
rag_chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
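Before building the full evaluation set, it helps to smoke-test the chain on a single question (the question below is just an illustration):
# Single-query smoke test of the RAG chain
print(rag_chain.invoke("What did the prime minister say about climate change?"))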
Preparing Evaluation Data from RAG
#the amnesty_qa dataset already contains 'question', 'ground_truths', 'answer', 'contexts'
#you can also add your own
from datasets import Dataset
questions = ["What did the prime minister say about climate change?",
"What did the prime minister say about the new healthcare policy?",
"What did the prime minister say about the economic recovery?",
]
ground_truths = [["The prime minister emphasized the urgency of tackling climate change and committed to reducing carbon emissions."],
["The prime minister announced that the new healthcare policy will expand access to affordable healthcare for all citizens."],
["The prime minister highlighted the positive signs of economic recovery and promised further support for businesses."]]
answers = []
contexts = []
# Inference
for query in questions:
answers.append(rag_chain.invoke(query))
contexts.append([docs.page_content for docs in retriever.get_relevant_documents(query)])
data = {
"question": questions,
"answer": answers,
"contexts": contexts,
"ground_truths": ground_truths
}
#If the context_recall metric is not relevant to you, there's no need to provide the ground_truths information
# Convert dict to dataset
dataset = Dataset.from_dict(data)
Initialize the RAGAS RAG Evaluator
###################################################
# EVALUATING THE RAG ##############################
###################################################
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision,
)
#Evaluate
###############################################
result = evaluate(
dataset = dataset,
llm=llm,
embeddings=embeddings,
metrics=[
context_precision,
context_recall,
faithfulness,
answer_relevancy,
],)
df = result.to_pandas()
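result holds the aggregate score for each metric, while the DataFrame gives the per-question breakdown, so a quick inspection might look like:
# Aggregate metric scores
print(result)
# Per-question scores alongside question, answer, and contexts
print(df.head())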
##################################################
# amnesty_qa dataset
##################################################
# For the amnesty_qa dataset you don't have to create a RAG pipeline,
# because we are validating questions and answers already generated by another model
from datasets import load_dataset
edataset = load_dataset("explodinggradients/amnesty_qa","english_v2")
edataset["eval"]['question']
result = evaluate(
edataset["eval"],
metrics=[
context_precision,
faithfulness,
answer_relevancy,
context_recall,
],
)
result
df = result.to_pandas()
df.head()
References to the metrics used.
- Faithfulness — Measures the factual consistency of the answer to the context based on the question.
- Context_precision — Measures how relevant the retrieved context is to the question, conveying the quality of the retrieval pipeline.
- Answer_relevancy — Measures how relevant the answer is to the question.
- Context_recall — Measures the retriever’s ability to retrieve all necessary information required to answer the question.
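To make the intuition concrete, faithfulness is essentially the fraction of claims in the answer that the retrieved context supports. The toy calculation below only illustrates that ratio; it is not the actual RAGAS implementation, which uses an LLM to extract and verify the claims:
# Toy illustration of the faithfulness ratio (not the real implementation)
supported_claims = 3   # claims in the answer that the context backs up
total_claims = 4       # all claims extracted from the answer
faithfulness_score = supported_claims / total_claims  # 0.75, always in [0, 1]
print(faithfulness_score)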
That's it!
Troubleshooting — “Bring your own LLMs”
You may get some errors while using other models; here are some workarounds I have found.
#Approach 1 ################################################
from langchain_core.language_models import BaseLanguageModel
from langchain_core.embeddings import Embeddings
langchain_llm = llm  # any langchain LLM instance
langchain_embeddings = embeddings  # any langchain Embeddings instance
from ragas.metrics import faithfulness
from ragas import evaluate
results = evaluate(metrics=[faithfulness], dataset=dataset, llm=llm, embeddings=embeddings)
#-------------------------------------------------------------
# override the llm and embeddings for a specific metric
from ragas.metrics import answer_relevancy
answer_relevancy.llm = langchain_llm
answer_relevancy.embeddings = langchain_embeddings
# You can also init a new metric with the llm and embeddings of your choice
from ragas.metrics import AnswerRelevancy
answer_relevancy_duplicate = AnswerRelevancy(llm=langchain_llm, embeddings=langchain_embeddings)
# pass to evaluate
result = evaluate(dataset=dataset, metrics=[answer_relevancy_duplicate, answer_relevancy])
#-------------------------------------------
from ragas.metrics import answer_relevancy
answer_relevancy.llm = llm  # any langchain model
answer_relevancy.embeddings = embeddings  # any langchain Embeddings instance
result = evaluate(dataset=dataset, metrics=[answer_relevancy])
#Approach 2 #############################################
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.llms import HuggingFaceHub
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
hugging_llm = HuggingFaceHub(
repo_id="HuggingFaceH4/zephyr-7b-beta",
task="text-generation",
model_kwargs={
"max_new_tokens": 512,
"top_k": 30,
"temperature": 0.1,
"repetition_penalty": 1.03,
},
)
llm = LangchainLLMWrapper(hugging_llm)
embeddings = LangchainEmbeddingsWrapper(embedding_model)
from ragas import evaluate
from ragas.metrics import context_precision
result = evaluate(
dataset=dataset,
llm=llm,
embeddings=embeddings,
metrics=[
context_precision,
],
)
#APPROACH 3########################################
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
llm = ChatOllama(model="mistral")
embeddings = OllamaEmbeddings(model = "mistral")
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM
inference_server_url = "http://localhost:8080/v1"
# create vLLM Langchain instance
chat = ChatOpenAI(
model="HuggingFaceH4/zephyr-7b-alpha",
openai_api_key="no-key",
openai_api_base=inference_server_url,
max_tokens=5,
temperature=0,
)
# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLM(llm=chat)
from ragas.metrics import (
context_precision,
faithfulness,
context_recall,
)
from ragas.metrics.critique import harmfulness
# change the LLM
faithfulness.llm = vllm
context_precision.llm = vllm
context_recall.llm = vllm
harmfulness.llm = vllm
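With the metric-level LLMs swapped to the vLLM-backed instance, the evaluation call itself stays the same as earlier; here is a sketch against the dataset built above:
# Evaluate with the vLLM-backed metrics (sketch)
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        context_precision,
        context_recall,
        harmfulness,
    ],
)
print(result)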
It’s quite fascinating to observe the above RAG evaluation in action within a production environment.
I hope this helps!
Next, we will explore ARES, which leverages synthetic data and LLM-based evaluators and includes ranking metrics such as MRR and NDCG, making it well suited to environments that require ongoing updates and training, with an emphasis on response ranking and relevance.
Thanks for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.
Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy
Let me know if you need anything. Talk Soon.