Simple RAG Application Using CrewAI and a Custom LLM (Ollama)
An engineering team wants internal AI assistance but cannot send proprietary documents to external APIs. Compliance policies restrict outbound data flow. Latency budgets are tight. The result is predictable: experimentation with retrieval-augmented generation stalls because the default path—cloud-hosted LLMs—conflicts with operational constraints.
This is where a simple RAG application using CrewAI and a custom LLM with Ollama becomes practical rather than experimental. By running an open-weight model locally through Ollama and pairing it with a vector store, teams can build document-grounded systems without exposing internal data to third-party inference endpoints.
Self-hosted LLM deployments have matured significantly over the past year. Tooling around quantized models, GPU acceleration, and local model management has lowered the barrier to entry. At the same time, orchestration frameworks like CrewAI provide structured multi-agent coordination for workflows that go beyond single model calls.
Product capabilities, pricing, and availability are subject to change. Evaluating tools through free tiers, trials, or sandbox environments is advisable before making commitments.
This article explains how a simple RAG architecture works with Ollama as a custom LLM backend and where CrewAI fits in the stack, then walks through a complete working source code example suitable for local experimentation.
How a Simple RAG Application Using CrewAI and Ollama Works
At a mechanical level, RAG consists of two distinct processes:
- Retrieval — Find relevant document fragments using vector similarity search.
- Generation — Use an LLM to produce a response grounded in those fragments.
Replacing a cloud API with Ollama does not change the retrieval pipeline. It changes the generation layer.
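The two processes can be sketched as a single retrieve-then-generate function. The retriever and generator below are stand-in callables for illustration, not part of any library; a real system plugs in a vector store and an Ollama-backed model:

```python
def rag_answer(query, retrieve, generate):
    """Minimal RAG loop: retrieve context, then generate a grounded answer."""
    fragments = retrieve(query)  # retrieval step: find relevant chunks
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n\n".join(fragments) +
        "\n\nQuestion: " + query
    )
    return generate(prompt)  # generation step: LLM call in a real system

# Stub components to show the flow.
docs = ["Rotate API keys every 90 days.", "Store production secrets in a vault."]
stub_retrieve = lambda q: [d for d in docs if "keys" in d] or docs
stub_generate = lambda prompt: prompt  # identity stub; a real LLM call goes here

print(rag_answer("How often should we rotate keys?", stub_retrieve, stub_generate))
```

Swapping Ollama in for a cloud API only changes what `generate` does; `retrieve` is untouched.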
The Retrieval Layer: Embeddings and Vector Search
The retrieval pipeline typically includes:
- Document ingestion
- Chunking into manageable segments
- Embedding generation
- Vector database storage
- Similarity search at query time
When a user asks a question, the system embeds the query, retrieves the most semantically similar chunks, and injects them into the prompt.
Generation is not the only step that can stay local. Ollama also supports local embedding models, so the embedding step can run on the same machine, keeping the entire RAG stack self-contained.
The Generation Layer: Ollama as a Custom LLM
Ollama runs open-weight models locally (such as Llama variants, Mistral, or other community-supported models). It exposes an HTTP API on localhost, which frameworks can integrate with.
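Because the Ollama server listens on localhost (port 11434 by default), any HTTP client can drive generation. The sketch below builds a request for Ollama's `/api/generate` endpoint; since it assumes a running Ollama instance, only the request is constructed here and the actual call is left commented out:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Construct a non-streaming generation request for the local Ollama API."""
    payload = {
        "model": model,    # e.g., "llama3"
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of streamed chunks
    }
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )

req = build_generate_request("llama3", "Summarize our key rotation policy.")
# With Ollama running locally:
#   resp = urllib.request.urlopen(req)
#   answer = json.loads(resp.read())["response"]
print(req.full_url)
```

Frameworks like CrewAI and LangChain wrap this same HTTP interface; nothing about it requires outbound network access.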
In this architecture:
- Ollama handles inference.
- CrewAI orchestrates agent tasks.
- A vector store (such as FAISS) handles similarity retrieval.
The result is a self-hosted RAG system capable of running entirely on a developer workstation or on-prem GPU server.
Architecture Overview
A minimal local RAG stack looks like this:
- Document Loader → splits content
- Embedding Model (via Ollama or compatible provider) → generates vectors
- FAISS Vector Store → stores embeddings
- CrewAI Agent → synthesizes final response
- Ollama Model → generates output
CrewAI does not manage embeddings or storage. It defines agents and tasks, clarifying responsibility boundaries in the workflow.
Complete Working Example: Simple RAG Application Using CrewAI and Custom LLM (Ollama)
This example assumes:
- Ollama installed locally
- A model pulled (e.g., llama3 or mistral)
- Python 3.10+
Step 1: Install Ollama
Install Ollama and pull a model:
ollama pull llama3
Verify it runs:
ollama run llama3
Exit after confirming.
Step 2: Project Structure
rag-crewai-ollama/
│
├── main.py
├── requirements.txt
└── docs/
└── handbook.txt
Step 3: requirements.txt
crewai
langchain
langchain-community
faiss-cpu
ollama
Install:
pip install -r requirements.txt
Step 4: Sample Internal Document
docs/handbook.txt
Security Guidelines:
- All production secrets must be stored in a vault.
- Rotate API keys every 90 days.
- Enforce multi-factor authentication for admin accounts.
Deployment Policy:
- Run automated tests before merging to main.
- Ensure CI pipeline passes without warnings.
- Document rollback plan before release.
Step 5: Full Source Code (Local RAG with CrewAI + Ollama)
from crewai import Agent, Task, Crew
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# --------------------------------------------------
# 1. Load and Split Documents
# --------------------------------------------------
loader = TextLoader("docs/handbook.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)
docs = text_splitter.split_documents(documents)
# --------------------------------------------------
# 2. Create Embeddings (Local via Ollama)
# --------------------------------------------------
# Note: a dedicated embedding model (e.g., "nomic-embed-text") typically
# improves retrieval quality; "llama3" is used here for simplicity.
embeddings = OllamaEmbeddings(model="llama3")
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# --------------------------------------------------
# 3. Initialize Ollama LLM
# --------------------------------------------------
llm = Ollama(
    model="llama3",
    temperature=0
)
# --------------------------------------------------
# 4. Define CrewAI Agent
# --------------------------------------------------
answer_agent = Agent(
    role="Internal Knowledge Assistant",
    goal="Answer questions strictly using retrieved internal documentation.",
    backstory="You are a precise assistant that avoids speculation.",
    llm=llm,
    verbose=True
)
# --------------------------------------------------
# 5. Retrieval Function
# --------------------------------------------------
def retrieve_context(query: str) -> str:
    results = retriever.get_relevant_documents(query)
    return "\n\n".join([doc.page_content for doc in results])
# --------------------------------------------------
# 6. Execute Query
# --------------------------------------------------
user_query = input("Ask a question about internal documentation: ")
context = retrieve_context(user_query)
task = Task(
    description=f"""
You are given the following internal documentation:

{context}

Using only this documentation, answer the question below.
If the answer is not present, respond with:
"The documentation does not contain this information."

Question: {user_query}
""",
    expected_output="A concise answer grounded strictly in the provided documentation.",
    agent=answer_agent
)
crew = Crew(
    agents=[answer_agent],
    tasks=[task],
    verbose=True
)
result = crew.kickoff()
print("\nFinal Answer:\n")
print(result)
Example Interaction
Input:
What security measures are required for admin accounts?
Output:
Multi-factor authentication must be enforced for admin accounts.
The model responds strictly based on retrieved document fragments.
Trade-Offs of Using Ollama in RAG Systems
Advantages
- Full data control (no external API calls)
- Predictable cost (no per-token billing)
- Offline capability
- Custom model selection and tuning
Limitations
- Hardware constraints (GPU memory limits)
- Potentially lower reasoning quality compared to frontier cloud models
- Model updates require manual management
Local LLM inference is improving rapidly, but model selection matters. Smaller quantized models may struggle with long-context reasoning or nuanced queries.
Framework Comparison for Local RAG Systems
| Feature | CrewAI + Ollama | LangChain + Cloud LLM | AutoGen + Cloud LLM |
|---|---|---|---|
| Data Residency | Fully local | External API | External API |
| Cost Model | Hardware-based | Per token | Per token |
| Orchestration | Role-based agents | Chains & tools | Conversational agents |
| Setup Complexity | Moderate | Moderate | Moderate |
| Latency | Low (local inference) | Dependent on API | Dependent on API |
For organizations prioritizing data isolation, the local stack offers structural advantages. For teams prioritizing model quality at scale, managed APIs often provide stronger performance.
Common Pitfalls
- Choosing a model too small for complex retrieval synthesis.
- Ignoring chunk size tuning.
- Overcomplicating the agent structure before validating retrieval quality.
Local deployments reward iterative tuning.
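Chunk size tuning in particular is easy to reason about with a toy splitter. The helper below mimics fixed-size splitting with overlap (the behavior RecursiveCharacterTextSplitter approximates); it is an illustration of the trade-off, not the library's implementation:

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks; each chunk repeats `overlap` chars of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1000))  # 1000-character dummy document

# Smaller chunks -> more, finer-grained vectors (better pinpointing, less context).
small = split_with_overlap(doc, chunk_size=100, overlap=20)
# Larger chunks -> fewer vectors, each carrying more surrounding context.
large = split_with_overlap(doc, chunk_size=400, overlap=50)

print(len(small), len(large))
```

Running retrieval quality checks at a few chunk sizes before touching the agent layer usually pays off faster than any prompt change.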
FAQ: Simple RAG Application Using CrewAI and Ollama
Q: Can Ollama fully replace cloud LLMs in RAG systems?
A: For many internal documentation use cases, yes. For highly complex reasoning tasks, frontier models may still outperform smaller local models.
Q: Does CrewAI require internet access?
A: No. If embeddings and LLM inference run locally through Ollama, the entire pipeline can operate offline.
Q: What hardware is required?
A: CPU inference works for small models, but GPUs significantly improve performance. Requirements depend on model size.
Q: Can this architecture scale to enterprise environments?
A: Yes, with proper GPU provisioning, monitoring, and model lifecycle management.
Q: Is this approach suitable for sensitive data?
A: It can be, since data remains local. However, security posture depends on infrastructure configuration and access controls.
Closing Thoughts
A simple RAG application using CrewAI and a custom LLM via Ollama demonstrates that document-grounded AI does not require external APIs. The architecture remains conceptually identical to cloud-based RAG: retrieve context, inject into prompt, generate grounded response. The difference lies in deployment control and infrastructure trade-offs.
For engineering teams balancing privacy, cost, and flexibility, local orchestration combined with structured agent workflows presents a viable alternative to purely managed AI stacks.
Editorial Note: This article is based on publicly available industry research, official documentation, and general informational sources. Content is reviewed and updated periodically to reflect changes in products, specifications, pricing models, and market practices.