What is RAG? Building a Medical Chatbot with Llama2 and Pinecone
Large Language Models are incredible at generating human-like text, but they have a fundamental flaw: they hallucinate. In domains like healthcare, hallucination isn't just annoying — it's dangerous. That's where Retrieval Augmented Generation (RAG) comes in.
In this post, I'll explain RAG from the ground up and show you how I built a medical chatbot that provides reliable, source-cited answers.
What is RAG?
Retrieval Augmented Generation is a technique that combines:
- Retrieval: Searching a knowledge base for relevant documents
- Augmentation: Injecting those documents into the LLM's context
- Generation: Having the LLM generate an answer grounded in the retrieved documents
```
User Question ──▶ Vector Search ──▶ Relevant Documents
                                            │
                                            ▼
     Grounded Answer ◀── LLM ◀── Question + Documents
```
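The three steps above can be sketched in plain Python. This is a toy illustration only: the keyword-overlap retriever stands in for a real vector search, and the prompt would be handed to an actual LLM.

```python
import re

def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, knowledge_base, k=2):
    """Toy retriever: rank documents by word overlap with the question."""
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(words(question) & words(doc)),
        reverse=True,
    )
    return ranked[:k]

def augment(question, documents):
    """Inject the retrieved documents into the LLM prompt."""
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

kb = [
    "Aspirin is a nonsteroidal anti-inflammatory drug.",
    "Ibuprofen reduces fever and inflammation.",
    "Photosynthesis occurs in plant chloroplasts.",
]
docs = retrieve("What drug reduces inflammation?", kb)
prompt = augment("What drug reduces inflammation?", docs)  # ready for the LLM
```

Even this crude retriever surfaces the two drug documents and leaves the irrelevant one behind; the generation step then only ever sees grounded context.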
Why Not Just Use the LLM Directly?
| Approach | Pros | Cons |
|---|---|---|
| LLM only | Simple, fast | Hallucinations, outdated knowledge |
| Fine-tuning | Customized, domain-specific | Expensive, requires retraining |
| RAG | Grounded, up-to-date, citable | More complex, requires vector DB |
RAG gives you the best of all worlds: the LLM's natural language ability combined with factual grounding from your own knowledge base.
The Medical Chatbot Architecture
Here's the high-level architecture:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Medical    │     │  Embedding   │     │   Pinecone   │
│  Documents   │────▶│    Model     │────▶│  Vector DB   │
│ (PDFs, CSVs) │     │ (HuggingFace)│     │              │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
 User Question ──▶ Embed ──▶ Similarity Search ──╯
                                 │
                Relevant Chunks ◀╯
                        │
                        ▼
                ┌──────────────┐
                │    Llama2    │
                │  (Generate)  │
                └──────────────┘
                        │
                        ▼
                  Cited Answer
```
Step-by-Step Implementation
Step 1: Prepare Medical Documents
I sourced medical reference content and processed it into chunks:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# Load medical PDFs
loader = DirectoryLoader(
    "data/medical_docs/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
documents = loader.load()

# Split into chunks optimized for medical content
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # small enough for precise retrieval
    chunk_overlap=50,    # overlap to maintain context
    separators=["\n\n", "\n", ". ", " "]
)
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} pages")
```
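Under the hood, overlapping chunking amounts to a sliding window. A simplified character-based sketch (the real splitter also respects the separator hierarchy, which this version ignores):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Sliding-window splitter: each chunk repeats the last
    `chunk_overlap` characters of the previous chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(doc)
# windows start at 0, 450, 900 -> lengths 500, 500, 300
```

The 50-character overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.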
Step 2: Create Embeddings and Store in Pinecone
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
import pinecone

# General-purpose sentence embedding model (384 dimensions);
# consider a biomedical embedding model if retrieval quality demands it
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Initialize Pinecone
pc = pinecone.Pinecone(api_key="your-api-key")

# Create vector store and upsert the chunks
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="medical-chatbot"
)
```
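Conceptually, what the vector store does at query time reduces to cosine similarity over embeddings. A minimal sketch with hand-made 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions, and Pinecone uses approximate rather than exhaustive search):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=3):
    """index: list of (doc_id, vector) pairs; returns ids of the k nearest."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = [
    ("aspirin", [0.9, 0.1, 0.0]),
    ("ibuprofen", [0.8, 0.2, 0.1]),
    ("photosynthesis", [0.0, 0.1, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))  # ['aspirin', 'ibuprofen']
```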
Step 3: Set Up the Retrieval Chain
```python
from langchain_community.llms import CTransformers
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load Llama2 locally (quantized for efficiency)
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q4_0.bin",
    model_type="llama",
    config={
        "max_new_tokens": 512,
        "temperature": 0.1,  # low temperature for factual answers
    }
)

# Custom prompt that enforces source citation
PROMPT_TEMPLATE = """
Use the following pieces of medical information to answer the question.
If you don't know the answer based on the provided context, say
"I don't have enough information to answer this question. Please consult
a healthcare professional."
Do NOT make up medical information. Only use what is provided in the context.
Always cite which document the information comes from.

Context: {context}
Question: {question}

Helpful Answer:"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

# Build the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_kwargs={"k": 3}  # retrieve the top 3 most relevant chunks
    ),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)
```
Step 4: Build the Flask API
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    question = request.json.get("question")
    if not question:
        return jsonify({"error": "Question is required"}), 400

    result = qa_chain.invoke({"query": question})

    # Extract source citations
    sources = [
        {
            "page": doc.metadata.get("page", "N/A"),
            "source": doc.metadata.get("source", "Unknown"),
            "snippet": doc.page_content[:200]
        }
        for doc in result["source_documents"]
    ]

    return jsonify({
        "answer": result["result"],
        "sources": sources,
        "confidence": calculate_confidence(result)
    })

def calculate_confidence(result):
    """Rough heuristic: scale confidence by how many source documents
    were retrieved. A production system should use the retriever's
    actual similarity scores instead."""
    if not result.get("source_documents"):
        return 0.0
    # More retrieved sources means higher confidence, capped at 0.95
    return min(0.95, len(result["source_documents"]) / 3 * 0.85)
```
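A request and response look like this (the answer, source path, and page number are illustrative; with the full k=3 retrieval, the heuristic above yields a confidence of 0.85):

```
POST /chat
{"question": "What are the side effects of aspirin?"}

Response:
{
  "answer": "...",
  "sources": [
    {"page": 12, "source": "data/medical_docs/...", "snippet": "..."}
  ],
  "confidence": 0.85
}
```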
Key Design Decisions
Why Llama2 Over GPT-4?
For a medical chatbot:
- Data Privacy: Patient-related queries should not leave your infrastructure
- Cost: No per-token API fees; you pay only for your own hardware and electricity
- Latency: No network round-trips, and no provider-imposed rate limits for batch processing
- Compliance: Keeping data on infrastructure you control makes HIPAA and GDPR obligations far easier to satisfy
Why Pinecone Over Other Vector DBs?
| Vector DB | Hosted | Scaling | Speed | Cost |
|---|---|---|---|---|
| Pinecone | ✅ Fully managed | ✅ Auto-scaling | ✅ Fast | $$$ |
| ChromaDB | ❌ Self-hosted | ⚠️ Manual | ✅ Fast | Free |
| Weaviate | ✅ Cloud option | ✅ Good | ✅ Fast | $$ |
| pgvector | ❌ Self-hosted | ⚠️ Postgres-limited | ⚠️ Moderate | Free |
I chose Pinecone for the managed infrastructure and auto-scaling. For a self-hosted alternative, I'd recommend pgvector if you're already running PostgreSQL.
Chunk Size Optimization
Finding the right chunk size is critical:
- Too small (100 tokens): Missing context, fragmented information
- Too large (2000 tokens): Diluted relevance, wasted context window
- Sweet spot (300-500 tokens): Precise retrieval with sufficient context
Common RAG Pitfalls and Solutions
1. The "Lost in the Middle" Problem
LLMs tend to ignore information in the middle of long contexts. Solution: rerank retrieved documents so the most relevant ones are first and last.
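One simple way to implement that reordering, assuming the documents arrive sorted most-relevant-first (the same idea LangChain ships as a "long context reorder" transformer):

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the start and end of the
    context, pushing the least relevant ones toward the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... go toward the front
        else:
            back.append(doc)    # ranks 2, 4, 6, ... go toward the back
    return front + back[::-1]

print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```

The top-ranked document stays first, the second-ranked moves to the end, and the weakest matches land in the middle where the LLM pays the least attention.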
2. Retrieval-Generation Mismatch
The retriever finds relevant documents, but the LLM ignores them. Solution: Use a stronger prompt that explicitly instructs the LLM to base answers only on the provided context.
3. Stale Knowledge Base
Documents become outdated. Solution: Automated ingestion pipeline that re-indexes documents on a schedule.
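The change detection behind such a pipeline can be sketched with content hashes: re-embed only the documents whose text changed since the last run (the file names here are hypothetical, and a real pipeline would read the texts from disk):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_changed(current_docs, stored_hashes):
    """current_docs: {doc_id: text}; stored_hashes: {doc_id: hash from
    the last indexing run}. Returns doc_ids needing re-embedding."""
    changed = []
    for doc_id, text in current_docs.items():
        if stored_hashes.get(doc_id) != content_hash(text):
            changed.append(doc_id)
    return changed

stored = {"aspirin.pdf": content_hash("old text")}
docs = {"aspirin.pdf": "new text", "ibuprofen.pdf": "fresh doc"}
print(find_changed(docs, stored))  # ['aspirin.pdf', 'ibuprofen.pdf']
```

New and modified documents are picked up, while unchanged ones skip the (comparatively expensive) embedding step entirely.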
Explore the full source code on GitHub.