What is RAG? Building a Medical Chatbot with Llama2 and Pinecone
Large Language Models are incredible at generating human-like text, but they have a fundamental flaw: they hallucinate. In domains like healthcare, hallucination isn't just annoying — it's dangerous. That's where Retrieval Augmented Generation (RAG) comes in.
In this post, I'll explain RAG from the ground up and show you how I built a medical chatbot that provides reliable, source-cited answers.
What is RAG?
Retrieval Augmented Generation is a technique that combines:
- Retrieval: Searching a knowledge base for relevant documents
- Augmentation: Injecting those documents into the LLM's context
- Generation: Having the LLM generate an answer grounded in the retrieved documents
```
User Question ──▶ Vector Search ──▶ Relevant Documents
                                            │
                                            ▼
     Grounded Answer ◀── LLM ◀── Question + Documents
```
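The three steps above can be sketched in plain Python. This is a toy illustration only: the keyword-overlap retriever stands in for a real vector search, and the prompt would be handed to an actual LLM.

```python
import re

def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, knowledge_base, k=2):
    """Toy retriever: rank documents by word overlap with the question."""
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(words(question) & words(doc)),
        reverse=True,
    )
    return ranked[:k]

def augment(question, documents):
    """Inject the retrieved documents into the LLM prompt."""
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

kb = [
    "Aspirin is a nonsteroidal anti-inflammatory drug.",
    "Ibuprofen reduces fever and inflammation.",
    "Photosynthesis occurs in plant chloroplasts.",
]
docs = retrieve("What drug reduces inflammation?", kb)
prompt = augment("What drug reduces inflammation?", docs)  # ready for the LLM
```

Even this crude retriever surfaces the two drug documents and leaves the irrelevant one behind; the generation step then only ever sees grounded context.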
Why Not Just Use the LLM Directly?
| Approach | Pros | Cons |
|---|---|---|
| LLM only | Simple, fast | Hallucinations, outdated knowledge |
| Fine-tuning | Customized, domain-specific | Expensive, requires retraining |
| RAG | Grounded, up-to-date, citable | More complex, requires vector DB |
RAG gives you the best of all worlds: the LLM's natural language ability combined with factual grounding from your own knowledge base.
The Medical Chatbot Architecture
Here's the high-level architecture:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Medical    │     │  Embedding   │     │   Pinecone   │
│  Documents   │────▶│    Model     │────▶│  Vector DB   │
│ (PDFs, CSVs) │     │ (HuggingFace)│     │              │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
 User Question ──▶ Embed ──▶ Similarity Search ──╯
                                 │
                Relevant Chunks ◀╯
                        │
                        ▼
                ┌──────────────┐
                │    Llama2    │
                │  (Generate)  │
                └──────────────┘
                        │
                        ▼
                  Cited Answer
```
Step-by-Step Implementation
Step 1: Prepare Medical Documents
I sourced medical reference content and processed it into chunks:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# Load medical PDFs
loader = DirectoryLoader(
    "data/medical_docs/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
documents = loader.load()

# Split into chunks optimized for medical content
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # small enough for precise retrieval
    chunk_overlap=50,    # overlap to maintain context
    separators=["\n\n", "\n", ". ", " "]
)
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} pages")
```
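Under the hood, overlapping chunking amounts to a sliding window. A simplified character-based sketch (the real splitter also respects the separator hierarchy, which this version ignores):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Sliding-window splitter: each chunk repeats the last
    `chunk_overlap` characters of the previous chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(doc)
# windows start at 0, 450, 900 -> lengths 500, 500, 300
```

The 50-character overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.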
Step 2: Create Embeddings and Store in Pinecone
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
import pinecone

# General-purpose sentence embedding model (384 dimensions);
# consider a biomedical embedding model if retrieval quality demands it
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Initialize Pinecone
pc = pinecone.Pinecone(api_key="your-api-key")

# Create vector store and upsert the chunks
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="medical-chatbot"
)
```
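Conceptually, what the vector store does at query time reduces to cosine similarity over embeddings. A minimal sketch with hand-made 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions, and Pinecone uses approximate rather than exhaustive search):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=3):
    """index: list of (doc_id, vector) pairs; returns ids of the k nearest."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = [
    ("aspirin", [0.9, 0.1, 0.0]),
    ("ibuprofen", [0.8, 0.2, 0.1]),
    ("photosynthesis", [0.0, 0.1, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))  # ['aspirin', 'ibuprofen']
```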
Step 3: Set Up the Retrieval Chain
```python
from langchain_community.llms import CTransformers
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load Llama2 locally (quantized for efficiency)
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q4_0.bin",
    model_type="llama",
    config={
        "max_new_tokens": 512,
        "temperature": 0.1,  # low temperature for factual answers
    }
)

# Custom prompt that enforces source citation
PROMPT_TEMPLATE = """
Use the following pieces of medical information to answer the question.
If you don't know the answer based on the provided context, say
"I don't have enough information to answer this question. Please consult
a healthcare professional."
Do NOT make up medical information. Only use what is provided in the context.
Always cite which document the information comes from.

Context: {context}
Question: {question}

Helpful Answer:"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

# Build the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_kwargs={"k": 3}  # retrieve the top 3 most relevant chunks
    ),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)
```
Step 4: Build the Flask API
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    question = request.json.get("question")
    if not question:
        return jsonify({"error": "Question is required"}), 400

    result = qa_chain.invoke({"query": question})

    # Extract source citations
    sources = [
        {
            "page": doc.metadata.get("page", "N/A"),
            "source": doc.metadata.get("source", "Unknown"),
            "snippet": doc.page_content[:200]
        }
        for doc in result["source_documents"]
    ]

    return jsonify({
        "answer": result["result"],
        "sources": sources,
        "confidence": calculate_confidence(result)
    })

def calculate_confidence(result):
    """Rough heuristic: scale confidence by how many source documents
    were retrieved. A production system should use the retriever's
    actual similarity scores instead."""
    if not result.get("source_documents"):
        return 0.0
    # More retrieved sources means higher confidence, capped at 0.95
    return min(0.95, len(result["source_documents"]) / 3 * 0.85)
```
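A request and response look like this (the answer, source path, and page number are illustrative; with the full k=3 retrieval, the heuristic above yields a confidence of 0.85):

```
POST /chat
{"question": "What are the side effects of aspirin?"}

Response:
{
  "answer": "...",
  "sources": [
    {"page": 12, "source": "data/medical_docs/...", "snippet": "..."}
  ],
  "confidence": 0.85
}
```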
Key Design Decisions
Why Llama2 Over GPT-4?
For a medical chatbot:
- Data Privacy: Patient-related queries should not leave your infrastructure
- Cost: No per-token API fees; you pay only for your own hardware and electricity
- Latency: No network round-trips, and no provider-imposed rate limits for batch processing
- Compliance: Keeping data on infrastructure you control makes HIPAA and GDPR obligations far easier to satisfy
Why Pinecone Over Other Vector DBs?
| Vector DB | Hosted | Scaling | Speed | Cost |
|---|---|---|---|---|
| Pinecone | ✅ Fully managed | ✅ Auto-scaling | ✅ Fast | $$$ |
| ChromaDB | ❌ Self-hosted | ⚠️ Manual | ✅ Fast | Free |
| Weaviate | ✅ Cloud option | ✅ Good | ✅ Fast | $$ |
| pgvector | ❌ Self-hosted | ⚠️ Postgres-limited | ⚠️ Moderate | Free |
I chose Pinecone for the managed infrastructure and auto-scaling. For a self-hosted alternative, I'd recommend pgvector if you're already running PostgreSQL.
Chunk Size Optimization
Finding the right chunk size is critical:
- Too small (100 tokens): Missing context, fragmented information
- Too large (2000 tokens): Diluted relevance, wasted context window
- Sweet spot (300-500 tokens): Precise retrieval with sufficient context
Common RAG Pitfalls and Solutions
1. The "Lost in the Middle" Problem
LLMs tend to ignore information in the middle of long contexts. Solution: rerank retrieved documents so the most relevant ones are first and last.
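One simple way to implement that reordering, assuming the documents arrive sorted most-relevant-first (the same idea LangChain ships as a "long context reorder" transformer):

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the start and end of the
    context, pushing the least relevant ones toward the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... go toward the front
        else:
            back.append(doc)    # ranks 2, 4, 6, ... go toward the back
    return front + back[::-1]

print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```

The top-ranked document stays first, the second-ranked moves to the end, and the weakest matches land in the middle where the LLM pays the least attention.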
2. Retrieval-Generation Mismatch
The retriever finds relevant documents, but the LLM ignores them. Solution: Use a stronger prompt that explicitly instructs the LLM to base answers only on the provided context.
3. Stale Knowledge Base
Documents become outdated. Solution: Automated ingestion pipeline that re-indexes documents on a schedule.
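The change detection behind such a pipeline can be sketched with content hashes: re-embed only the documents whose text changed since the last run (the file names here are hypothetical, and a real pipeline would read the texts from disk):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_changed(current_docs, stored_hashes):
    """current_docs: {doc_id: text}; stored_hashes: {doc_id: hash from
    the last indexing run}. Returns doc_ids needing re-embedding."""
    changed = []
    for doc_id, text in current_docs.items():
        if stored_hashes.get(doc_id) != content_hash(text):
            changed.append(doc_id)
    return changed

stored = {"aspirin.pdf": content_hash("old text")}
docs = {"aspirin.pdf": "new text", "ibuprofen.pdf": "fresh doc"}
print(find_changed(docs, stored))  # ['aspirin.pdf', 'ibuprofen.pdf']
```

New and modified documents are picked up, while unchanged ones skip the (comparatively expensive) embedding step entirely.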
Explore the full source code on GitHub.