Large language models (LLMs) like those powering chatbots or virtual assistants have transformed how we interact with information. However, when tasked with answering questions in specialized areas—such as corporate policies, medical research, or recent news—these models often struggle. Their knowledge is limited to what they were trained on, which may not include the latest or most specific data. To overcome this, researchers have developed techniques like retrieval-augmented generation (RAG) and fine-tuning.
A newer approach, Retrieval-Augmented Fine-Tuning (RAFT), combines the strengths of both, offering a powerful solution for domain-specific tasks. Developed by researchers at UC Berkeley, RAFT is like teaching a model to study for an open-book exam, ensuring it knows the material and how to use external resources effectively.
This article explains RAFT in simple terms, using analogies, examples, and structured details to make it accessible. We’ll explore how it works, its benefits, and how it can be applied, complete with tables and a coding example for clarity.
Introduction to LLMs and Their Challenges
LLMs are trained on vast datasets, enabling them to generate human-like text. However, they face challenges in specialized domains:
- Limited Knowledge: They may not know proprietary or recent information.
- Data Requirements: Adapting them to new domains often needs large, labeled datasets.
- Accuracy Risks: Without proper context, they might produce incorrect or “hallucinated” answers.
To address these, two main techniques are used:
- RAG: Retrieves relevant documents during use to provide context.
- Fine-Tuning: Trains the model on domain-specific data to embed knowledge.
Both have limitations, which RAFT aims to overcome by blending their advantages.
Understanding RAG and Fine-Tuning
Before diving into RAFT, let’s break down its building blocks.
Retrieval-Augmented Generation (RAG)
RAG enhances LLMs by allowing them to access external documents when answering questions. Here’s how it works:
- Retrieval: A query (e.g., “What’s IBM’s parental leave policy?”) is used to search a database, often with vector-based methods, to fetch relevant documents.
- Generation: The model combines the query and retrieved documents to generate an answer.
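The two steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the word-overlap scoring stands in for real vector search, and the function names (`retrieve`, `build_prompt`) are illustrative.

```python
# Toy sketch of the two RAG steps: retrieve, then assemble the model input.
def retrieve(query, documents, top_k=2):
    """Rank documents by how many query words they share (a stand-in
    for vector similarity search)."""
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(query, retrieved):
    """Combine the query and the retrieved documents into one model input."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "IBM parental leave policy grants 12 weeks of paid leave",
    "IBM retirement plan covers 401(k) contributions",
]
query = "How much parental leave does IBM offer"
prompt = build_prompt(query, retrieve(query, docs))
```

The generated `prompt` would then be passed to the LLM, which answers using the retrieved context.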
Pros:
- Accesses up-to-date or domain-specific information.
- No need to retrain the model.
Cons:
- Depends heavily on the retriever’s quality.
- The model may struggle to filter irrelevant documents.
Fine-Tuning
Fine-tuning adjusts a pre-trained LLM’s parameters using a domain-specific dataset. The process involves:
- Dataset Creation: Collecting labeled data (e.g., questions and answers about a company’s policies).
- Training: Updating the model’s weights to better handle the domain.
Pros:
- Deeply embeds domain knowledge.
- Improves performance on specific tasks.
Cons:
- Requires significant labeled data.
- Risks overfitting or becoming outdated.
- Can’t easily adapt to new information.
The Analogy: Studying for an Exam
To understand RAFT’s role, consider this analogy:
- Fine-Tuning: Studying for a closed-book exam. You memorize everything but can’t access notes during the test. If you studied the wrong material, you’re stuck.
- RAG: Taking an open-book exam without studying. You can use the book, but if you don’t know where to look, you might fail.
- RAFT: Studying for an open-book exam. You learn the material and know how to use the book effectively, ensuring success.
RAFT is like teaching the model to “fish” for answers, rather than just giving it the “fish” (pre-trained knowledge or retrieved documents).
What Is RAFT?
Retrieval-Augmented Fine-Tuning (RAFT) is a hybrid technique that fine-tunes LLMs to excel at RAG in specific domains. Developed by UC Berkeley researchers, including Tianjun Zhang and Shishir G. Patil, RAFT trains models to use retrieved documents effectively by distinguishing between relevant (golden) and irrelevant (distractor) documents. This makes the model more accurate, robust, and transparent in its responses.
Why RAFT Matters
RAFT addresses key limitations:
- RAG’s Weakness: Poor retrieval can lead to irrelevant documents, confusing the model.
- Fine-Tuning’s Weakness: Static knowledge can’t handle new or external data.
By combining both, RAFT creates models that are both knowledgeable and adaptable, ideal for enterprise applications like answering policy questions or analyzing proprietary data.
How RAFT Works
RAFT’s core innovation lies in its training process, which simulates real-world RAG scenarios. Here’s a detailed breakdown:
Training Data Setup
Each training sample consists of:
- Query (Q): A question, e.g., “How much parental leave does IBM offer?”
- Documents:
  - Golden Documents (D*): Contain the answer, e.g., IBM’s HR policy stating “12 weeks of paid parental leave.”
  - Distractor Documents (D_i): Irrelevant, e.g., retirement plans or code of conduct.
- Answer (A*): The correct response, often with reasoning.
Two types of document sets are created:
- Set One: Includes golden and distractor documents (P% of the data).
- Set Two: Includes only distractor documents ((1-P)% of the data).
This setup mimics real-world retrieval, where the retriever may fetch irrelevant documents.
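The two document sets might be constructed as follows. This is a minimal sketch, assuming a `p_golden` split parameter corresponding to P above; the function name is illustrative.

```python
import random

def make_document_set(golden_doc, distractors, p_golden, k, rng=random):
    """Build one training sample's document set: with probability
    p_golden, the golden document plus k-1 distractors (Set One);
    otherwise k distractors only (Set Two)."""
    if rng.random() < p_golden:
        docs = [golden_doc] + list(distractors[:k - 1])
    else:
        docs = list(distractors[:k])
    rng.shuffle(docs)  # the golden document's position must not be a clue
    return docs

distractors = ["Retirement plan", "Code of conduct", "IT policy"]
with_golden = make_document_set("Parental leave policy", distractors,
                                p_golden=1.0, k=3)
without_golden = make_document_set("Parental leave policy", distractors,
                                   p_golden=0.0, k=3)
```

Shuffling matters: if the golden document always appeared first, the model could learn its position rather than its content.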
Chain-of-Thought (CoT) Reasoning
RAFT trains the model to generate answers using chain-of-thought reasoning, where it explains its steps and cites documents. For example:
Query: “How much parental leave does IBM offer?”
Documents:
- [Golden] “IBM Parental Leave Policy: Employees are entitled to 12 weeks of paid parental leave.”
- [Distractor] “IBM Retirement Plan: Details about 401(k) contributions.”
- [Distractor] “IBM Code of Conduct: Guidelines for employee behavior.”
CoT Response:
- Identify the query’s focus: parental leave policy.
- Review the documents:
  - Document 1 mentions 12 weeks of paid parental leave — relevant.
  - Documents 2 and 3 discuss retirement and conduct — irrelevant.
- Answer: IBM offers 12 weeks of paid parental leave, based on Document 1.
This reasoning improves transparency and teaches the model to filter distractions.
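A CoT training target like the one above could be assembled programmatically. The sketch below assumes a 1-based `golden_idx` (or `None` when no document contains the answer); the function name and exact wording of the steps are illustrative.

```python
def format_cot_target(query, documents, golden_idx, answer):
    """Compose a chain-of-thought training target that cites documents.
    golden_idx is the 1-based index of the golden document, or None."""
    lines = [f"Step 1: The query asks about: {query}"]
    for i, doc in enumerate(documents, start=1):
        relevance = "relevant" if i == golden_idx else "irrelevant"
        lines.append(f"Step 2.{i}: Document {i} is {relevance}: {doc}")
    if golden_idx is None:
        lines.append("Answer: I don't know; no document addresses this.")
    else:
        lines.append(f"Answer: {answer} (based on Document {golden_idx}).")
    return "\n".join(lines)

target = format_cot_target(
    "How much parental leave does IBM offer?",
    ["12 weeks of paid parental leave.", "401(k) contribution details."],
    golden_idx=1,
    answer="IBM offers 12 weeks of paid parental leave",
)
```

The `None` branch produces the “I don’t know” target used for distractor-only training samples.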
Training Objective
- With Golden Documents: Use them to generate the correct answer.
- With Only Distractors: Recognize the lack of relevant information and either rely on internal knowledge or say “I don’t know.”
The proportion of golden documents (P%) varies by dataset, optimizing performance.
Table: RAFT Training Data Structure
| Component | Description | Example |
|---|---|---|
| Query (Q) | Question to answer | “How much parental leave does IBM offer?” |
| Golden Document (D*) | Contains the answer | “IBM Parental Leave Policy: 12 weeks of paid parental leave.” |
| Distractor Document (D_i) | Irrelevant to the query | “IBM Retirement Plan: Details about 401(k).” |
| Document Set One | Golden + distractors (P% of data) | [Golden, Distractor1, Distractor2] |
| Document Set Two | Only distractors ((1-P)% of data) | [Distractor1, Distractor2, Distractor3] |
| Answer (A*) | Correct response with CoT reasoning | “IBM offers 12 weeks, per Document 1.” |
Benefits of RAFT
RAFT offers significant advantages:
- Improved Accuracy:
- By learning to focus on golden documents, RAFT enhances performance on domain-specific questions.
- Example: On the HotpotQA dataset, RAFT achieved 35.28% accuracy, a gain of 30.87 percentage points over domain-specific fine-tuning with RAG.
- Reduced Hallucinations:
- Training with distractor-only sets teaches the model to avoid fabricating answers, minimizing errors.
- Example: In cases with no relevant documents, RAFT is more likely to say “I don’t know.”
- Enhanced Transparency:
- CoT reasoning makes the model’s process clear, citing specific documents for traceability.
- Example: Users can verify the model’s answer by checking cited sources.
- Scalability:
- RAFT is adaptable to various domains, from healthcare (PubMed) to software (HuggingFace).
Table: RAFT Performance Across Datasets
| Dataset | RAFT Accuracy | DSF + RAG | GPT-3.5 + RAG | RAFT Gain over DSF + RAG |
|---|---|---|---|---|
| PubMed | 73.30% | 71.60% | 71.60% | +1.70% |
| HotpotQA | 35.28% | 4.41% | 41.50% | +30.87% |
| HuggingFace | 74.00% | 42.59% | 29.08% | +31.41% |
| Torch Hub | 84.95% | 82.80% | 60.21% | +2.15% |
| TensorFlow | 86.86% | 60.29% | 65.59% | +26.57% |
Implementation and Examples
Implementing RAFT involves creating a training dataset and fine-tuning the model. Here’s a practical example and a coding outline.
Real-World Example
Scenario: A company wants an LLM to answer employee questions about benefits.
Query: “How much parental leave does IBM offer?”
Training Data:
- Golden Document: IBM’s HR policy document.
- Distractors: Documents on retirement or IT protocols.
- Answer: “12 weeks of paid parental leave, per the HR policy.”
Training:
- For 70% of samples (P=0.7), include the golden document.
- For 30%, include only distractors, teaching the model to recognize missing information.
Outcome: The model learns to cite the HR policy accurately, ignoring irrelevant documents.
Coding Example
Below is simplified Python pseudocode for RAFT training. `training_data` and `model.train` are placeholders for a real dataset and fine-tuning API:

```python
import random

P = 0.7  # fraction of samples that include the golden document
k = 3    # total number of documents per sample

# Function to retrieve distractor documents (placeholder implementation)
def retrieve_distractors(query, num_docs):
    return [f"Distractor_Doc_{i}" for i in range(num_docs)]

# Training loop
for query, golden_doc, answer in training_data:
    # Decide which document set this sample gets
    if random.random() < P:  # P% of samples include the golden document
        documents = [golden_doc] + retrieve_distractors(query, k - 1)
    else:
        documents = retrieve_distractors(query, k)

    # Prepare the model input
    input_text = f"Query: {query}\nDocuments:\n" + "\n".join(documents) + "\nAnswer:"

    # Target output: a chain-of-thought answer
    cot_answer = f"Step 1: Analyze the query.\nStep 2: Review the documents.\nAnswer: {answer}"

    # Fine-tune the model on this (input, target) pair
    model.train(input_text, cot_answer)
```

This loop simulates RAFT’s training process, balancing golden and distractor document sets.
Wrap-Up
Retrieval-Augmented Fine-Tuning (RAFT) is a breakthrough in adapting LLMs for domain-specific tasks. By training models to use retrieved documents effectively, RAFT improves accuracy, reduces errors, and enhances transparency. Its ability to handle both relevant and irrelevant information makes it ideal for applications like enterprise chatbots or research assistants. As AI evolves, RAFT’s scalable approach will likely play a key role in making models more reliable and adaptable.
For further reading, explore these resources:
- RAFT: Adapting Language Model to Domain Specific RAG
- RAFT: A new way to teach LLMs to be better at RAG
- How to fine-tune LLMs for better RAG performance

FAQs
What is RAFT in simple terms?
RAFT is a way to make AI language models (like chatbots) smarter by combining two tricks: retrieval (looking up information) and fine-tuning (teaching the model specific knowledge). Imagine you’re studying for a test. RAFT is like learning the material and knowing how to use your notes during the exam. It trains the AI to find and use the right information from a database to answer questions accurately, especially for specific topics like company policies or medical data.
How is RAFT different from regular AI training?
Regular AI training (like fine-tuning) is like memorizing a textbook—you learn a lot, but you can’t look up new info. Another method, called RAG (Retrieval-Augmented Generation), is like using Google during a test—you can search, but you might grab the wrong info if you’re not careful. RAFT mixes both: it teaches the AI to “study” specific knowledge and how to search for the right information, making it better at handling tricky, topic-specific questions.
Why would someone use RAFT?
RAFT is great when you need an AI to answer questions about a specific area, like a company’s rules or a technical field. It helps the AI:
- Give accurate answers by focusing on relevant information.
- Avoid making up stuff (called “hallucinations”) when it doesn’t have the right data.
- Show its work, so you know where the answer came from, which is super important for businesses or legal stuff.
For example, a RAFT-trained AI could answer, “What’s the vacation policy at Google?” by pulling the exact policy document and explaining its reasoning.
What kind of data does RAFT need?
To train with RAFT, you need a dataset with three things:
- A question (e.g., “How many sick days do employees get?”).
- A set of documents—some relevant (like an HR handbook) and some irrelevant (like a marketing report).
- A correct answer, often written step-by-step to show how to use the right documents.
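Put together, a single training record could look like the sketch below. The field names (`question`, `documents`, `is_golden`, `answer`) are illustrative, not a required schema.

```python
# One hypothetical RAFT training record; the field names are
# illustrative, not a required schema.
sample = {
    "question": "How many sick days do employees get?",
    "documents": [
        {"text": "HR handbook: employees receive 10 paid sick days per year.",
         "is_golden": True},
        {"text": "Marketing report: Q3 campaign results.",
         "is_golden": False},
    ],
    "answer": (
        "Document 1 (the HR handbook) states the allowance; Document 2 "
        "is unrelated. Answer: 10 paid sick days per year."
    ),
}
```

Note that the answer walks through the documents before stating the result, matching the chain-of-thought format described earlier.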
Can RAFT stop AI from giving wrong answers?
Not completely, but it helps a lot! RAFT trains the AI to recognize when it doesn’t have enough good information. Instead of guessing, it might say, “I don’t know,” or use what it already knows. This cuts down on wrong answers (or “hallucinations”), which is a big deal in places like hospitals or law firms where mistakes can cause trouble.
Is RAFT hard to set up?
It’s not super easy, but it’s doable with the right tools. You need:
- A dataset with questions, documents, and answers.
- A way to fine-tune an AI model (like using Python libraries such as Hugging Face).
- Some computing power, since training can be resource-heavy.
Can RAFT work with any AI model?
Pretty much! RAFT is flexible and can be used with many language models, like BERT, GPT, or custom ones. It’s more about the training method than the model itself. You can tweak RAFT to fit different models or tasks, making it handy for all kinds of projects.
What’s an example of RAFT in real life?
Imagine a chatbot for a hospital that answers, “What are the side effects of this medicine?” A RAFT-trained chatbot would:
- Search a medical database for drug information.
- Ignore unrelated documents (like hospital billing policies).
- Give a clear answer, like, “Based on the drug’s info sheet, side effects include nausea and dizziness,” and explain which document it used.
This makes the chatbot trustworthy and useful for doctors or patients.
Are there any downsides to RAFT?
Like anything, RAFT has some challenges:
- Data Prep: You need good, organized data, which can take time to create.
- Cost: Training an AI with RAFT can use a lot of computer power, which might cost money.
- Retriever Quality: If the system that grabs documents is weak, RAFT can’t work its magic as well.