AI Security

RAG Not Working? Common Issues and How to Fix Them

DeviDevs Team
7 min read
#rag #llm #vector-database #embeddings #troubleshooting

Retrieval-Augmented Generation (RAG) promises to make your AI answer from your own documents, but it often doesn't work as expected. This guide helps you diagnose and fix the most common RAG problems.

How RAG Should Work

User Question → Embed Question → Search Vector DB → Get Relevant Chunks
                                                            ↓
                                         LLM + Context → Answer

When it fails:
- Wrong documents retrieved
- Right documents, wrong answer
- Slow performance
- No answer at all
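
For reference, here is a minimal end-to-end pipeline that the rest of this guide debugs. It is only a sketch, assuming LangChain with Chroma and OpenAI models and a placeholder policies.txt file; swap in your own loaders, vector store, and LLM.

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
 
# Index: split the source text, embed the chunks, store them
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents([Document(page_content=open("policies.txt").read())])
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory="./chroma_db")
 
# Query: retrieve relevant chunks, then answer with them as context
def answer(question: str) -> str:
    docs = vectorstore.similarity_search(question, k=3)
    context = "\n\n".join(d.page_content for d in docs)
    llm = ChatOpenAI(model="gpt-4o-mini")
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {question}").content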

Problem 1: Retrieval Returns Irrelevant Documents

Symptom: You ask "What's our refund policy?" but get documents about shipping.

Solution 1 - Check embedding quality:

from langchain_openai import OpenAIEmbeddings
import numpy as np
 
embeddings = OpenAIEmbeddings()
 
# Test semantic similarity
query = "What is the refund policy?"
doc1 = "Our refund policy allows returns within 30 days."
doc2 = "Shipping takes 3-5 business days."
 
query_emb = embeddings.embed_query(query)
doc1_emb = embeddings.embed_query(doc1)
doc2_emb = embeddings.embed_query(doc2)
 
# Calculate cosine similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
print(f"Query vs Doc1 (refund): {cosine_sim(query_emb, doc1_emb):.3f}")  # Should be high ~0.85+
print(f"Query vs Doc2 (shipping): {cosine_sim(query_emb, doc2_emb):.3f}")  # Should be low ~0.70
 
# If both are similar, your embeddings aren't distinguishing well

Solution 2 - Improve chunking:

from langchain.text_splitter import RecursiveCharacterTextSplitter
 
# ❌ Bad: Too small chunks lose context
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # Too small!
    chunk_overlap=0
)
 
# ✅ Good: Larger chunks with overlap
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Good size
    chunk_overlap=200,    # Overlap preserves context
    separators=["\n\n", "\n", ". ", " ", ""]  # Semantic breaks
)
 
# Even better: Section-aware splitting
def split_by_headers(document):
    """Split document by markdown headers."""
    import re
    sections = re.split(r'\n#{1,3}\s', document)
    return [s.strip() for s in sections if s.strip()]

Solution 3 - Add metadata filtering:

# Add metadata when indexing (as Document objects so the vector store keeps it)
from langchain.schema import Document
 
documents = []
for doc in raw_documents:
    documents.append(Document(
        page_content=doc.text,
        metadata={
            "category": doc.category,  # "refund", "shipping", etc.
            "date": doc.date,
            "source": doc.filename
        }
    ))
 
# Filter during retrieval
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"category": "refund"}  # Only search refund docs
    }
)

Solution 4 - Use hybrid search:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package
 
# BM25 for keyword matching
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 4
 
# Vector search for semantic
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
 
# Combine both
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]  # Equal weight
)
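
A quick sanity check is to compare what each retriever returns for the same query (assuming a recent LangChain where retrievers are Runnables with .invoke):

query = "What is the refund policy?"
for name, r in [("bm25", bm25_retriever), ("vector", vector_retriever), ("hybrid", ensemble_retriever)]:
    docs = r.invoke(query)
    print(name, [d.page_content[:60] for d in docs])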

Problem 2: Retrieves Right Documents, Wrong Answer

Symptom: The retrieved chunks contain the answer, but LLM still gets it wrong.

Solution 1 - Improve the prompt:

# ❌ Bad prompt
prompt = f"Answer this: {question}\nContext: {context}"
 
# ✅ Good prompt with clear instructions
prompt = f"""Answer the question based ONLY on the context below.
If the context doesn't contain the answer, say "I don't have that information."
Do not make up information not in the context.
 
Context:
{context}
 
Question: {question}
 
Answer:"""

Solution 2 - Check context is actually passed:

# Debug: Print what the LLM actually sees
def debug_rag(question, context):
    full_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    print("="*50)
    print("LLM INPUT:")
    print(full_prompt)
    print("="*50)
 
    response = llm.invoke(full_prompt)
    print("\nLLM OUTPUT:")
    print(response)
    return response

Solution 3 - Reduce chunk count if too noisy:

# Too many chunks can confuse the LLM
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 3  # Fewer, more relevant chunks
    }
)
 
# Or use relevance score filtering
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 5,
        "score_threshold": 0.7  # Only high-confidence matches
    }
)

Solution 4 - Use a re-ranker:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank  # needs COHERE_API_KEY; newer releases ship this as langchain_cohere.CohereRerank
 
# Initial retrieval gets more docs
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
 
# Re-ranker picks the best
compressor = CohereRerank(top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

Problem 3: RAG is Too Slow

Symptom: Takes 5+ seconds to answer simple questions.
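
Before tuning anything, measure which stage is slow. A rough timing sketch, assuming the vectorstore and llm objects from the earlier examples:

import time
 
def time_rag_stages(question):
    # Time retrieval
    t0 = time.perf_counter()
    docs = vectorstore.similarity_search(question, k=3)
    t1 = time.perf_counter()
 
    # Time generation
    context = "\n".join(d.page_content for d in docs)
    response = llm.invoke(f"Context: {context}\nQuestion: {question}")
    t2 = time.perf_counter()
 
    print(f"Retrieval: {t1 - t0:.2f}s, Generation: {t2 - t1:.2f}s")
    return response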

Solution 1 - Optimize vector database:

# Use persistent storage instead of in-memory
from langchain_community.vectorstores import Chroma
 
# ❌ Slow: Rebuilds every time
vectorstore = Chroma.from_documents(documents, embeddings)
 
# ✅ Fast: Persists to disk
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Solution 2 - Reduce embedding calls:

# Cache embeddings
from functools import lru_cache
 
@lru_cache(maxsize=1000)
def cached_embed(text):
    return tuple(embeddings.embed_query(text))
 
# Or pre-compute common queries
common_queries = ["refund policy", "shipping time", "contact support"]
precomputed = {q: embeddings.embed_query(q) for q in common_queries}
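
The precomputed vectors can then go straight to the vector store, skipping the embedding call at query time. Most LangChain vector stores, including Chroma, expose similarity_search_by_vector:

# Reuse the precomputed embedding instead of calling the embedding API again
docs = vectorstore.similarity_search_by_vector(precomputed["refund policy"], k=3)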

Solution 3 - Use async operations:

import asyncio
from langchain_openai import ChatOpenAI
 
async def fast_rag(question):
    # Run retrieval and LLM prep in parallel
    retrieval_task = asyncio.create_task(
        vectorstore.asimilarity_search(question, k=3)
    )
 
    # While retrieving, set up LLM
    llm = ChatOpenAI(model="gpt-4o-mini")
 
    # Wait for retrieval
    docs = await retrieval_task
 
    # Generate answer
    context = "\n".join([d.page_content for d in docs])
    response = await llm.ainvoke(f"Context: {context}\nQuestion: {question}")
 
    return response
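
From synchronous code, run it with asyncio.run (ainvoke returns a message object, so read .content):

answer = asyncio.run(fast_rag("What is the refund policy?"))
print(answer.content)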

Problem 4: Vector Store Connection Issues

Symptom: Errors like "Collection not found" or connection timeouts.

Solution for Chroma:

from langchain_community.vectorstores import Chroma
import chromadb
 
# Create persistent client
client = chromadb.PersistentClient(path="./chroma_db")
 
# Check collections exist
print(client.list_collections())
 
# Create/get collection
vectorstore = Chroma(
    client=client,
    collection_name="my_docs",
    embedding_function=OpenAIEmbeddings()
)

Solution for Pinecone:

import os
 
from pinecone import Pinecone
from langchain_pinecone import PineconeVectorStore
 
# Initialize the client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
 
# Check the index exists before connecting
if "my-index" not in pc.list_indexes().names():
    raise ValueError("Index 'my-index' not found!")
 
# Connect
vectorstore = PineconeVectorStore(
    index_name="my-index",
    embedding=OpenAIEmbeddings()
)

Problem 5: Documents Not Being Indexed

Symptom: Search returns nothing even though you added documents.

Solution 1 - Verify indexing succeeded:

# Add documents with confirmation
ids = vectorstore.add_documents(documents)
print(f"Added {len(ids)} documents")
 
# Verify they exist
test_results = vectorstore.similarity_search("test query", k=1)
print(f"Can retrieve: {len(test_results) > 0}")

Solution 2 - Check document format:

from langchain.schema import Document
 
# ❌ Risky: plain strings carry no metadata unless you also pass metadatas=
vectorstore.add_texts(["text1", "text2"])
 
# ✅ Correct format
docs = [
    Document(page_content="text1", metadata={"source": "file1"}),
    Document(page_content="text2", metadata={"source": "file2"})
]
vectorstore.add_documents(docs)

Solution 3 - Persist changes (Chroma):

# After adding documents
vectorstore.persist()
 
# Or use auto-persist
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
# Changes auto-persist in newer versions

Complete RAG Debug Checklist

def debug_rag_pipeline(question, vectorstore, llm):
    print("="*60)
    print("RAG DEBUG REPORT")
    print("="*60)
 
    # 1. Test embedding
    print("\n1. EMBEDDING TEST")
    try:
        # _embedding_function is Chroma's internal attribute; other vector stores name this differently
        emb = vectorstore._embedding_function.embed_query(question)
        print(f"   ✅ Embedding works (dim: {len(emb)})")
    except Exception as e:
        print(f"   ❌ Embedding failed: {e}")
        return
 
    # 2. Test retrieval
    print("\n2. RETRIEVAL TEST")
    try:
        docs = vectorstore.similarity_search(question, k=3)
        print(f"   ✅ Retrieved {len(docs)} documents")
        for i, doc in enumerate(docs):
            print(f"   Doc {i+1}: {doc.page_content[:100]}...")
    except Exception as e:
        print(f"   ❌ Retrieval failed: {e}")
        return
 
    # 3. Test LLM
    print("\n3. LLM TEST")
    try:
        context = "\n".join([d.page_content for d in docs])
        prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
        response = llm.invoke(prompt)
        print(f"   ✅ LLM responded")
        print(f"   Response: {response.content[:200]}...")
    except Exception as e:
        print(f"   ❌ LLM failed: {e}")
 
    print("\n" + "="*60)
 
# Usage
debug_rag_pipeline("What is the refund policy?", vectorstore, llm)

Quick Reference: Common Fixes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Wrong docs retrieved | Poor chunking | Increase chunk size + overlap |
| Low relevance scores | Bad embeddings | Try different embedding model |
| Right docs, wrong answer | Prompt issues | Improve prompt clarity |
| Slow queries | No persistence | Use persistent vector DB |
| Empty results | Not indexed | Verify add_documents + persist |

Need Help With Your RAG System?

Production RAG systems require careful tuning. Our team offers:

  • RAG architecture design
  • Retrieval optimization
  • LLM security audits
  • Performance tuning

Get RAG expertise
