Over 70% of enterprise AI projects fail to make it past the prototype stage — not because the models are bad, but because they are disconnected from real, up-to-date data. That is the core problem Retrieval-Augmented Generation (RAG) solves. By grounding large language model (LLM) responses in your own documents and databases, RAG transforms generic AI into a genuinely useful, domain-aware assistant.
In this guide, you will build a complete RAG app using MongoDB Atlas and Python from scratch. You will set up a vector-enabled database, embed documents, run semantic search queries, and wire everything together into a working question-answering pipeline. No hand-waving, no skipped steps — just a real, functional app you can extend and deploy.
Whether you are a Python developer exploring AI for the first time or a backend engineer adding LLM capabilities to an existing stack, this tutorial gives you a solid, production-friendly foundation. Let’s get into it.
What Is RAG and Why Does It Matter?
Before writing a single line of code, it helps to understand exactly what RAG is doing and why it has become the dominant pattern for building LLM-powered applications on private or proprietary data. For a deeper conceptual overview, see the guide on what is RAG.
The Problem with Vanilla LLMs
Large language models like GPT-4 or Claude are trained on massive public datasets with a fixed knowledge cutoff. Ask one about your internal product documentation, last quarter’s sales data, or a proprietary research paper, and it will either hallucinate an answer or admit it has no idea.
This is not a bug — it is a fundamental design constraint. LLMs are stateless and frozen at training time. They cannot look things up. RAG is the architectural pattern that adds that lookup capability.
How RAG Works
RAG combines two components: a retrieval system and a generative model. At query time, the system:
1. Converts the user query into a vector embedding
2. Searches a vector database for the most semantically relevant document chunks
3. Passes those chunks as context to the LLM
4. The LLM generates an answer grounded in the retrieved content
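The four steps above can be sketched end to end in a few lines. This is an illustrative skeleton only; the `embed`, `search`, and `generate` parameters are stand-ins for the real components built later in this guide, not part of any library API:

```python
from typing import Callable

def rag_answer(
    query: str,
    embed: Callable[[str], list[float]],   # step 1: query -> vector
    search: Callable[[list[float]], list[str]],  # step 2: vector -> relevant chunks
    generate: Callable[[str], str],        # step 4: prompt -> answer
) -> str:
    query_vector = embed(query)            # 1. embed the user query
    chunks = search(query_vector)          # 2. vector search for similar chunks
    context = "\n\n".join(chunks)          # 3. pass the chunks as context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                # 4. grounded generation

# Toy stand-ins to show the data flow; a real app calls an embedding
# model, a vector database, and an LLM at these three points.
answer = rag_answer(
    "What is our refund window?",
    embed=lambda q: [0.1, 0.2, 0.3],
    search=lambda v: ["Refunds are accepted within 30 days of purchase."],
    generate=lambda p: p,  # identity stub: returns the grounded prompt itself
)
```

The rest of this guide replaces each stub with a real implementation: OpenAI embeddings, MongoDB Atlas Vector Search, and a chat-completion call.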
The result: responses that are accurate, traceable, and up-to-date — without retraining the model. The original RAG paper, published by Facebook AI Research in 2020 (Lewis et al.), showed that retrieval-augmented models consistently outperform closed-book models on knowledge-intensive tasks. The pattern has since become table stakes in enterprise AI.
Why MongoDB Atlas Is a Strong Choice for RAG
MongoDB Atlas added native Vector Search in 2023, allowing you to store vector embeddings alongside your regular document data. This means you do not need a separate vector database like Pinecone or Weaviate — your application data and your AI index live in one place, with one query interface and one operations workflow.
For teams already using MongoDB, this is a significant simplification. For new projects, it removes an entire infrastructure dependency. Learn more about how MongoDB fits into modern vector databases for AI apps.
RAG vs. Fine-Tuning: Quick Comparison
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data updates | Add new docs anytime | Requires retraining |
| Cost | Low (inference + search) | High (GPU training) |
| Accuracy on private data | High with good retrieval | High if trained well |
| Time to deploy | Hours to days | Days to weeks |
| Traceability | Sources are visible | Black-box answers |
Prerequisites and Environment Setup
Before writing application code, you need a few tools and accounts in place. This section walks through everything you need and how to verify it is working.
What You Need
- Python 3.9+ — check with `python --version`
- A MongoDB Atlas account — the free tier is sufficient for this project
- An OpenAI API key — used for both embeddings and generation
- Basic familiarity with pip and virtual environments
Installing Dependencies
Create a virtual environment and install the required packages:
```shell
python -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
pip install pymongo openai python-dotenv tiktoken
```
Create a .env file in your project root:
```
MONGODB_URI="your_atlas_connection_string"
OPENAI_API_KEY="your_openai_key"
```
Setting Up MongoDB Atlas
Log in to MongoDB Atlas and create a free cluster if you do not already have one. Then:
1. Go to Database Access and create a new database user with read/write privileges.
2. Go to Network Access and add your current IP address (or 0.0.0.0/0 for development).
3. Click Connect, choose Drivers, and copy your connection string. Replace `<password>` with your database user's password.
4. Save the connection string to your .env file.
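Before moving on, it is worth confirming the connection string actually works. Here is a small verification sketch; the `redact_uri` and `check_connection` helpers are our own names, not part of pymongo — only the `ping` admin command is:

```python
import os
import re

def redact_uri(uri: str) -> str:
    """Mask the password in a MongoDB connection string before logging it."""
    return re.sub(r"(://[^:/@]+:)[^@]+(@)", r"\1*****\2", uri)

def check_connection(uri: str) -> bool:
    """Return True if the cluster answers a ping within five seconds."""
    from pymongo import MongoClient  # local import keeps the helper above standalone
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        return True
    except Exception as exc:
        print(f"Could not reach {redact_uri(uri)}: {exc}")
        return False

if __name__ == "__main__":
    from dotenv import load_dotenv
    load_dotenv()
    ok = check_connection(os.getenv("MONGODB_URI", ""))
    print("Connected!" if ok else "Check the URI, user credentials, and IP allowlist.")
```

If the ping fails, the two most common causes are an IP address missing from the Network Access allowlist and an unreplaced `<password>` placeholder in the URI.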
Creating a Vector Search Index in MongoDB Atlas
MongoDB Atlas Vector Search stores and queries high-dimensional embeddings using the Hierarchical Navigable Small World (HNSW) algorithm — one of the fastest approximate nearest neighbour methods available. As of 2024, Atlas Vector Search is generally available and supports up to 4096 dimensions per vector, covering all major embedding models.
Creating the Database and Collection
Connect to your cluster and create the database and collection you will use:
```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()

client = MongoClient(os.getenv("MONGODB_URI"))
db = client["rag_demo"]
collection = db["documents"]
```
Defining the Vector Search Index
In the Atlas UI, go to your cluster, click on Search, then Create Search Index. Choose Atlas Vector Search and paste in this index definition:
```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```
The numDimensions value of 1536 matches the output of OpenAI’s text-embedding-3-small model. If you use a different embedding model, adjust this number accordingly. The similarity metric is set to cosine, which works well for semantic text similarity.
Verifying the Index
After creating the index, Atlas will show it as Active once it has finished building. For an empty collection, this takes only a few seconds. You are now ready to load documents.
Loading, Chunking, and Embedding Your Documents
The quality of your RAG application depends heavily on how you prepare and chunk your documents. Poor chunking is one of the most common reasons RAG pipelines underperform. A 2024 analysis by LlamaIndex found that chunk size and overlap settings account for up to 30% of retrieval accuracy variance.
Loading Documents
For this example, you will work with plain text files. In production, you would add PDF parsing, HTML extraction, or database queries. Here is a simple document loader:
```python
def load_documents(directory: str) -> list[dict]:
    docs = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), "r") as f:
                docs.append({"filename": filename, "text": f.read()})
    return docs
```
Chunking Strategies
Chunking splits long documents into smaller pieces that fit within a context window and can be retrieved independently. The right chunking strategy depends on your content:
- Fixed-size chunking — split every N tokens, with optional overlap. Simple and effective for uniform documents.
- Sentence or paragraph chunking — split on natural boundaries. Better for prose and articles.
- Recursive chunking — split on larger boundaries first (sections), then smaller ones (paragraphs, sentences). Best for mixed content.
Here is a basic fixed-size chunker with overlap:
```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start += chunk_size - overlap
    return chunks
```
Generating and Storing Embeddings
Now embed each chunk using OpenAI and store it in MongoDB with its embedding vector:
```python
from openai import OpenAI

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def embed_text(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def ingest_documents(docs: list[dict]):
    for doc in docs:
        chunks = chunk_text(doc["text"])
        records = []
        for i, chunk in enumerate(chunks):
            embedding = embed_text(chunk)
            records.append({
                "source": doc["filename"],
                "chunk_index": i,
                "text": chunk,
                "embedding": embedding
            })
        collection.insert_many(records)
        print(f"Ingested {len(records)} chunks from {doc['filename']}")
```
Run this once to ingest your documents. MongoDB stores the full document data and the embedding vector together, so you get the chunk text back alongside each search result without an additional query.
Building the Retrieval and Generation Pipeline
This is where the application comes together. The retrieval pipeline takes a user query, converts it to an embedding, finds the most relevant chunks in MongoDB, and passes them to the LLM as context. According to OpenAI’s developer documentation, keeping retrieved context under 4,000 tokens per call is a practical starting point for balancing accuracy and cost.
Semantic Search with MongoDB Vector Search
MongoDB’s $vectorSearch aggregation stage handles approximate nearest-neighbour lookup. Here is the retrieval function:
```python
def retrieve_relevant_chunks(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = embed_text(query)
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": top_k * 10,
                "limit": top_k
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "source": 1,
                "chunk_index": 1,
                "score": {"$meta": "vectorSearchScore"}
            }
        }
    ]
    results = list(collection.aggregate(pipeline))
    return results
```
The numCandidates parameter controls the accuracy-speed trade-off. A value of top_k * 10 is a solid default — it examines 10x more candidates than you need, ensuring high recall without being wasteful.
Constructing the Prompt
Good prompt construction is critical for RAG quality. You want to give the LLM exactly the retrieved context it needs, tell it clearly how to use it, and prevent it from going off-script:
```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"Source: {c['source']} (chunk {c['chunk_index']})\n{c['text']}"
        for c in chunks
    )
    return f"""You are a helpful assistant. Answer the question below using ONLY \
the provided context. If the answer is not in the context, say so honestly.

Context:
{context}

Question: {query}

Answer:"""
```
Generating the Answer
Pass the constructed prompt to the LLM and return the response:
```python
def answer_question(query: str) -> dict:
    chunks = retrieve_relevant_chunks(query)
    if not chunks:
        return {"answer": "No relevant documents found.", "sources": []}
    prompt = build_prompt(query, chunks)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": list({c["source"] for c in chunks})
    }

# Example usage
result = answer_question("What is the refund policy?")
print(result["answer"])
print("Sources:", result["sources"])
```
Testing, Evaluating, and Improving Your RAG App
Building a working RAG pipeline is the starting point. Knowing whether it is actually performing well requires structured evaluation. A 2024 report from Databricks found that teams who implement RAG evaluation frameworks ship higher-quality AI features 40% faster than those who rely on manual testing alone.
Manual Spot Testing
The fastest way to verify your pipeline is to test with questions you already know the answers to. Pick 10 to 20 representative questions from your document set and check:
- Is the answer factually correct?
- Are the right source chunks being retrieved?
- Does the answer cite the right document?
- What happens when you ask something the documents do not cover?
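A tiny harness makes those checks repeatable. The sketch below accepts any answer function with the same shape as `answer_question` from the pipeline section (query in, dict with `answer` and `sources` out); the case format and function name are our own convention:

```python
def spot_test(answer_fn, cases: list[dict]) -> list[dict]:
    """Run known-answer questions through the pipeline and flag failures.

    Each case needs: "question", "expect_keyword", "expect_source".
    answer_fn must return {"answer": str, "sources": list[str]}.
    """
    report = []
    for case in cases:
        out = answer_fn(case["question"])
        keyword_ok = case["expect_keyword"].lower() in out["answer"].lower()
        source_ok = case["expect_source"] in out["sources"]
        report.append({
            "question": case["question"],
            "keyword_ok": keyword_ok,
            "source_ok": source_ok,
            "passed": keyword_ok and source_ok,
        })
    return report
```

Pass `answer_question` in as `answer_fn`, run your 10 to 20 questions, and print the failing entries: a failed `keyword_ok` with a passing `source_ok` points at generation problems, while a failed `source_ok` points at retrieval.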
Metrics to Track
For systematic evaluation, track these key metrics:
- Retrieval Precision — Of the top-K chunks returned, what fraction are actually relevant?
- Answer Faithfulness — Does the generated answer stay within the retrieved context, or does it hallucinate?
- Answer Relevance — Does the answer address what was actually asked?
- Latency — How long does a query take end to end, including embedding and retrieval?
Tools like RAGAS (Retrieval Augmented Generation Assessment) automate this evaluation using LLM-as-judge patterns and can produce numeric scores for faithfulness and relevance across a test set.
Common Fixes for Poor Performance
If your pipeline is underperforming, these are the most common culprits and how to address them:
- Chunks too large: The retrieved context dilutes the relevant information. Try reducing chunk size to 200 to 300 tokens.
- No overlap between chunks: Answers that straddle chunk boundaries are missed. Add 10 to 20% overlap.
- Weak query: Users often phrase questions poorly for retrieval. Consider query rewriting with an LLM call before embedding.
- Top-K too low: Increasing from 3 to 5 or 7 chunks often significantly improves recall without breaking the prompt budget.
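The query-rewriting fix can be a single extra chat call before embedding. A sketch, assuming the same `gpt-4o-mini` model used earlier; the instruction wording is our own, and the function takes the client as a parameter (rather than using the global `openai_client`) so it is easy to unit-test:

```python
REWRITE_INSTRUCTIONS = (
    "Rewrite the user's question as a short, keyword-rich search query. "
    "Keep all named entities. Return only the rewritten query."
)

def rewrite_query(query: str, client) -> str:
    """Ask an OpenAI-style chat client for a retrieval-friendly rephrasing."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": query},
        ],
        temperature=0.0,
    )
    rewritten = (response.choices[0].message.content or "").strip()
    return rewritten or query  # never search with an empty query
```

Calling `retrieve_relevant_chunks(rewrite_query(query, openai_client))` slots this into `answer_question` with no other changes, at the cost of one extra LLM round trip per query.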
If you are using AI coding tools to iterate faster during development, the guide on AI coding assistants for Python developers covers the best tools available and how to use them effectively.
Taking Your RAG App Toward Production
A working prototype and a production-ready application are different things. Here is what changes when real users and real data volumes enter the picture.
Scaling Ingestion
Embedding API calls are the main bottleneck during ingestion. For large document sets, parallelize with asyncio or a queue-based worker system. Also consider:
- Batch embedding: OpenAI’s embedding API accepts multiple inputs per call. Batch up to 100 chunks per request to reduce round-trip overhead.
- Incremental ingestion: Track which documents have been embedded (by file hash or last-modified timestamp) and only re-embed on change.
- Cost management: text-embedding-3-small costs $0.02 per million tokens as of 2024. For large corpora, estimate costs before full ingestion runs.
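The batching advice above fits in a few lines. OpenAI's embeddings endpoint accepts a list of strings as `input`, so each batch below becomes one API call; `batched` and `embed_batch` are our helper names, not library functions:

```python
def batched(items: list, batch_size: int = 100) -> list[list]:
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def embed_batch(texts: list[str], client) -> list[list[float]]:
    """Embed one batch of chunks in a single API round trip."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,  # list input: the whole batch in one request
    )
    # Results come back in input order, so they stay aligned with texts.
    return [item.embedding for item in response.data]
```

Usage: `vectors = [v for b in batched(chunks) for v in embed_batch(b, openai_client)]`. Compared with one call per chunk, this cuts round-trip overhead by roughly the batch size.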
Security and Access Control
Do not expose raw MongoDB Atlas credentials in client-side code or shared environments. Use:
- Environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault)
- MongoDB Atlas API keys scoped to the minimum required permissions
- Application-level access control to ensure users only retrieve documents they are authorized to see
Monitoring and Observability
Once deployed, instrument your RAG pipeline to track:
- Query latency (embedding + retrieval + generation, separately)
- Failed retrievals (queries that return zero or low-scoring chunks)
- LLM token usage and cost per query
- User feedback signals — thumbs up/down ratings are a simple but powerful signal
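One lightweight way to get the per-stage latency numbers is a timing decorator. This is a sketch, not a full observability setup, and the `STAGE_TIMINGS` registry and function names are our own:

```python
import time
from collections import defaultdict
from functools import wraps

STAGE_TIMINGS: dict[str, list[float]] = defaultdict(list)

def timed(stage: str):
    """Record the wall-clock duration of each call under the given stage name."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                STAGE_TIMINGS[stage].append(time.perf_counter() - start)
        return wrapper
    return decorator

def latency_summary() -> dict[str, float]:
    """Average seconds per stage, ready for a log line or dashboard."""
    return {stage: sum(times) / len(times) for stage, times in STAGE_TIMINGS.items()}
```

Decorate `embed_text` with `@timed("embedding")`, `retrieve_relevant_chunks` with `@timed("retrieval")`, and the chat-completion call site with `@timed("generation")` to see exactly where each query spends its time.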
MongoDB Atlas also provides built-in monitoring for query performance and index usage, which helps you tune numCandidates and index parameters over time. Learn more about how MongoEngine simplifies working with MongoDB in Python on the MongoEngine homepage.
Frequently Asked Questions
Do I need a separate vector database if I use MongoDB Atlas?
No. MongoDB Atlas has built-in Vector Search that stores and queries embeddings natively within your existing collections. For most production RAG applications, this eliminates the need for a standalone vector database like Pinecone or Weaviate, reducing infrastructure complexity and operational overhead.
What embedding model should I use for my RAG app?
OpenAI’s text-embedding-3-small is a strong default: it is fast, inexpensive, and produces 1536-dimensional vectors with excellent semantic quality. For privacy-sensitive workloads where you need on-premise or self-hosted embeddings, alternatives like Sentence Transformers (all-MiniLM-L6-v2) offer solid quality without any API dependency.
How many chunks should I retrieve (top-K)?
Start with 5. Too few chunks miss relevant content; too many dilute the context and increase hallucination risk. The ideal value depends on your average chunk size and the complexity of questions. Evaluate empirically: test top-K values of 3, 5, and 7 on a representative question set and compare answer quality.
Can I use a different LLM instead of GPT-4?
Absolutely. The generation step is modular — you can swap in any LLM that accepts a text prompt. Good alternatives include Anthropic’s Claude, Google’s Gemini, Meta’s Llama 3 (self-hosted), or any model accessible via the Hugging Face Inference API. The retrieval and embedding steps remain the same regardless of which generator you use.
How do I handle documents that change frequently?
Implement an incremental ingestion pipeline that tracks document versions by hash or modification timestamp. When a document changes, delete its existing chunks from MongoDB by filtering on the source filename, re-chunk and re-embed the updated content, and insert the new records. This keeps your index fresh without a full re-ingestion on every update.
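The change-tracking part of that workflow is short. In this sketch, `content_hash` and `upsert_document` are illustrative names of our own, `collection` is the handle created earlier, and the commented-out call refers to the `ingest_documents` function from the ingestion section:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def upsert_document(doc: dict, collection, seen_hashes: dict[str, str]) -> bool:
    """Re-ingest a document only when its content has changed.

    seen_hashes maps filename -> hash from the last run; persist it
    anywhere convenient (a small MongoDB collection works well).
    Returns True if the document was (re)ingested.
    """
    digest = content_hash(doc["text"])
    if seen_hashes.get(doc["filename"]) == digest:
        return False  # unchanged: skip chunking and embedding entirely
    collection.delete_many({"source": doc["filename"]})  # drop stale chunks
    # ingest_documents([doc])  # re-chunk, re-embed, insert fresh records
    seen_hashes[doc["filename"]] = digest
    return True
```

Because unchanged documents short-circuit before any API call, repeated ingestion runs over a mostly stable corpus cost almost nothing.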
Conclusion
You now have a fully functional RAG app built with MongoDB Atlas and Python. Let’s recap the three most important things to take away from this guide:
- RAG architecture solves the knowledge gap — by retrieving relevant content at query time, you get accurate, grounded answers from private documents without retraining any model.
- Document preparation determines retrieval quality — invest time in your chunking strategy, chunk sizes, and overlap settings. These decisions have a bigger impact on output quality than model selection.
- MongoDB Atlas simplifies the stack — combining your document store and vector index in one system reduces infrastructure complexity, especially for teams already using MongoDB.
From here, experiment with query rewriting, add metadata filters to your vector search queries, or try a re-ranking step to improve precision further. The RAG pattern is highly composable — every component can be upgraded independently as your requirements grow.
Ready to keep building? Explore the MongoEngine documentation and vector database resources on mongoengine.org for more guides, tools, and patterns for building AI-powered Python applications.

Matt Ortiz is a software engineer and technical writer with 11 years of experience building data-intensive applications with Python and MongoDB. He spent six years at Rackspace engineering cloud-hosted database infrastructure, followed by three years at a New York-based fintech startup where he led backend architecture for a real-time transaction processing system built on MongoDB Atlas. Since joining the MongoEngine editorial team in 2025, Matt has expanded his focus to the broader AI developer stack — reviewing coding assistants, vector databases, LLM APIs, RAG frameworks, and image generation tools across hundreds of real-world test scenarios. His writing is read by engineers at companies ranging from early-stage startups to Fortune 500 technology teams. When a tool earns his recommendation, it’s because he’s used it in production.
Follow on Twitter: @mattortiz40
