Evaluating RAG Models with ARES: A Scalable Approach to Automated Retrieval and Generation Scoring

Rupak (Bob) Roy - II
Aug 19, 2024 · 4 min read


ARES (Automated RAG Evaluation System) is an evaluation framework designed to assess the performance of Retrieval-Augmented Generation (RAG) models.

ARES focuses on providing an automated, scalable way to evaluate how well RAG systems retrieve relevant information and generate accurate, coherent responses based on that information.

Key Features of ARES:

  • Automated Evaluation: ARES automates the evaluation process, reducing the need for extensive human intervention, making it suitable for environments where models are frequently updated or retrained.
  • Focus on Retrieval and Generation: ARES evaluates both the retrieval and generation components of a RAG system, ensuring that the information retrieved is relevant and that the generated responses are accurate and contextually appropriate.
  • Use of Synthetic Data: Similar to other modern evaluation frameworks, ARES can utilize synthetic data, which allows for testing in scenarios where labeled data may be scarce or unavailable.
  • Scalability: Designed for large-scale systems, ARES can efficiently evaluate complex models across a variety of tasks, ensuring that RAG models perform well even as they grow in complexity and size.

What Does ARES Evaluate in RAG Models?

ARES performs a detailed evaluation of Retrieval-Augmented Generation (RAG) models, focusing on context relevance, answer faithfulness, and answer relevance. This comprehensive assessment ensures a deep understanding of the RAG system’s performance.

How Does ARES Automate the Evaluation Process?

ARES reduces the reliance on human labeling by using fine-tuned LLM judges and synthetic training data. Its Prediction-Powered Inference (PPI) component combines the judge's scores on a large unlabeled set with a small set of human annotations, correcting for judge error and attaching statistical confidence intervals to the results, so accurate assessments are possible with only a modest amount of manual labeling.
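To make the PPI idea concrete, here is a minimal, illustrative sketch (not ARES's actual implementation): an LLM judge scores a large unlabeled set, its bias is measured on a small human-labeled set, and the corrected estimate comes with a rough normal-approximation confidence interval.

import numpy as np

def ppi_estimate(judge_unlabeled, judge_labeled, gold_labels):
    """Toy prediction-powered estimate of a pass rate (illustrative only)."""
    judge_unlabeled = np.asarray(judge_unlabeled, dtype=float)
    judge_labeled = np.asarray(judge_labeled, dtype=float)
    gold_labels = np.asarray(gold_labels, dtype=float)

    # Correct the judge's average score on the unlabeled data by the bias
    # it shows against human annotations on the small labeled set.
    bias = judge_labeled.mean() - gold_labels.mean()
    estimate = judge_unlabeled.mean() - bias

    # Rough 95% normal-approximation interval combining both noise sources.
    variance = (judge_unlabeled.var(ddof=1) / len(judge_unlabeled)
                + (judge_labeled - gold_labels).var(ddof=1) / len(gold_labels))
    half_width = 1.96 * np.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)

# Made-up binary judgments, just to show the shapes involved.
estimate, interval = ppi_estimate(
    judge_unlabeled=[1, 1, 0, 1, 0, 1, 1, 0],
    judge_labeled=[1, 0, 1, 1],
    gold_labels=[1, 0, 0, 1],
)
print(estimate, interval)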

Is ARES Compatible with My Custom RAG Model?

Absolutely. ARES is a model-agnostic tool that allows you to generate synthetic queries and answers from your documents. You can use ARES to evaluate the performance of your custom RAG model with these generated queries and answers.

To implement ARES for evaluating your RAG system and comparing it with other RAG configurations, you will need three key components:

  1. A human preference validation set consisting of annotated query, document, and answer triples, aligned with the evaluation criteria (e.g., context relevance, answer faithfulness, and/or answer relevance). Ideally, this set should contain several hundred examples, with a minimum of 50 (a sketch of what such a file can look like follows this list).
  2. A collection of few-shot examples to score context relevance, answer faithfulness, and/or answer relevance within your system.
  3. A much larger set of unlabeled query-document-answer triples generated by your RAG system for scoring.
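For illustration, here is one way such a human preference validation set could be assembled as a TSV. The column names are assumptions made for this sketch; check them against the NQ example files downloaded below, which show the exact layout ARES expects.

import pandas as pd

# Hypothetical annotated (query, document, answer) triples with binary labels;
# column names are illustrative, verify them against the ARES example files.
validation_rows = [
    {
        "Query": "Who wrote the novel Dracula?",
        "Document": "Dracula is an 1897 Gothic horror novel by Irish author Bram Stoker.",
        "Answer": "Bram Stoker wrote Dracula.",
        "Context_Relevance_Label": 1,
        "Answer_Faithfulness_Label": 1,
        "Answer_Relevance_Label": 1,
    },
    {
        "Query": "Who wrote the novel Dracula?",
        "Document": "Frankenstein was written by Mary Shelley and published in 1818.",
        "Answer": "Mary Shelley wrote Dracula.",
        "Context_Relevance_Label": 0,
        "Answer_Faithfulness_Label": 0,
        "Answer_Relevance_Label": 0,
    },
]

pd.DataFrame(validation_rows).to_csv("my_validation_set.tsv", sep="\t", index=False)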

Let’s get started

!pip install ares-ai

export OPENAI_API_KEY=<your key here>
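Since the `!pip` prefix above suggests a notebook environment, you can also set the key from Python instead of the shell; this is just the standard `os.environ` approach, assuming the judge reads the same `OPENAI_API_KEY` variable.

import os

# Notebook-friendly alternative to the shell `export` above:
# set the OpenAI API key for the current process.
os.environ["OPENAI_API_KEY"] = "<your key here>"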

#Download Dataset
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_judge_scoring.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_synthetic_query_generation.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_unlabeled_output.tsv

# Optional: instead of the wget downloads above, fetch the NQ datasets directly through ARES
from ares import ARES
ares = ARES()
# ares.KILT_dataset("nq")
# Fetches NQ datasets with ratios including 0.5, 0.6, 0.7, etc.
# For this quick start, nq_ratio_0.5 is renamed to nq_unlabeled_output and nq_labeled_output.
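Optionally, a quick sanity check on the downloaded files; this sketch only assumes they are tab-separated and uses pandas, which ARES itself does not require.

import pandas as pd

# Peek at the labeled and unlabeled example files to confirm they downloaded
# correctly and to see which columns they contain.
for path in ["nq_labeled_output.tsv", "nq_unlabeled_output.tsv"]:
    df = pd.read_csv(path, sep="\t")
    print(path, df.shape)
    print(list(df.columns))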

Run the following to retrieve the UES/IDP scores (UES = unlabeled evaluation set, IDP = in-domain prompts): an LLM judge scores each unlabeled triple directly, guided by the few-shot judging prompts.

from ares import ARES
ues_idp_config = {
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    "model_choice": "gpt-3.5-turbo-0125"
}

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

Run the following to retrieve ARES’s PPI scores

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "llm_judge": "gpt-3.5-turbo-1106",
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
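The config above scores only context relevance. If your labeled and few-shot files also cover the other criteria, you can list additional label columns in the same run; the extra column names below are assumptions that mirror Context_Relevance_Label, so confirm them against the ARES documentation.

# Optionally score all three criteria in one PPI run; the extra label-column
# names are assumed to mirror Context_Relevance_Label (check the ARES docs).
ppi_config["labels"] = [
    "Context_Relevance_Label",
    "Answer_Faithfulness_Label",
    "Answer_Relevance_Label",
]
ares = ARES(ppi=ppi_config)
print(ares.evaluate_RAG())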


Run the following to see ARES’s synthetic generation in action

from ares import ARES

synth_config = {
    "document_filepaths": ["nq_labeled_output.tsv"],
    "few_shot_prompt_filename": "nq_few_shot_prompt_for_synthetic_query_generation.tsv",
    "synthetic_queries_filenames": ["synthetic_queries_1.tsv"],
    "documents_sampled": 6189
}

ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)
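The PPI example below references a fine-tuned judge checkpoint (the .pt file). In the ARES workflow that checkpoint comes from a classifier-training step that fine-tunes a judge on the synthetic queries generated above plus your labeled data. Here is a sketch of that step; the config keys follow the ARES quick-start documentation, but verify them against the version you have installed.

from ares import ARES

# Fine-tune a judge for the Context_Relevance_Label criterion.
# Config keys follow the ARES quick-start docs; adjust to your installed version.
classifier_config = {
    "training_dataset": ["synthetic_queries_1.tsv"],  # synthetic data generated above
    "validation_set": ["nq_labeled_output.tsv"],      # human-labeled triples
    "label_column": ["Context_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)
# Training should leave a checkpoint such as Context_Relevance_Label_*.pt,
# which is what the PPI step below expects.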

Run the following to see ARES’s PPI in action

from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "checkpoints": ["Context_Relevance_Label_nq_labeled_output_date_time.pt"],
    "rag_type": "question_answering",
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

Output:
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300

In other words, with only 300 human-annotated examples to calibrate the judge (whose raw accuracy on the ground-truth labels is 0.789), ARES's prediction of about 0.606 closely tracks the true context-relevance performance of 0.6 and falls within the reported confidence interval.

Run ARES with a local model served through vLLM. These examples assume a vLLM OpenAI-compatible server is already running at the host_url you pass in (for example, one started with `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-hf`, depending on your vLLM version).

#####################################
# Local Model Execution with vLLM ###
#####################################


#1) UES/IDP w/ vLLM
from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    "model_choice": "meta-llama/Llama-2-13b-hf",  # Specify vLLM model
    "vllm": True,  # Toggle vLLM to True
    "host_url": "http://0.0.0.0:8000/v1"  # Replace with the server hosting the model, followed by "/v1"
}

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)


#2) PPI w/ vLLM

from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "llm_judge": "meta-llama/Llama-2-13b-hf",  # Specify vLLM model
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",
    "vllm": True,  # Toggle vLLM to True
    "host_url": "http://0.0.0.0:8000/v1"  # Replace with the server hosting the model, followed by "/v1"
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

ARES shows promise as a scalable way to evaluate RAG pipelines across different scenarios, whether you judge with a hosted OpenAI model or a locally served open-weights model via vLLM.

Enjoyed this article? Let me know if you need anything. Talk Soon.

Next, we will look at some advanced RAG techniques such as bi-encoders and cross-encoders for re-ranking, and Dense Passage Retrieval (DPR).

Thanks for your time! If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy

Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy
