Simple RAG Application Using CrewAI and a Custom LLM (Ollama)
An engineering team wants internal AI assistance but cannot send proprietary documents to external APIs. Compliance policies restrict outbound data flow. Latency budgets are tight. The result is predictable: experimentation with retrieval-augmented generation stalls because the default path—cloud-hosted LLMs—conflicts with operational constraints.
This is where a simple RAG application using CrewAI and a custom LLM with Ollama becomes practical rather than experimental. By running an open-weight model locally through Ollama and pairing it with a vector store, teams can build document-grounded systems without exposing internal data to third-party inference endpoints.
Self-hosted LLM deployments have matured significantly over the past year. Tooling around quantized models, GPU acceleration, and local model management has lowered the barrier to entry. At the same time, orchestration frameworks like CrewAI provide structured multi-agent coordination for workflows that go beyond single model calls.
Product capabilities, pricing, and availability are subject to change. Evaluating tools through free tiers, trials, or sandbox environments is advisable before making commitments.
This article explains how a simple RAG architecture works with Ollama as a custom LLM backend and where CrewAI fits in the stack, then walks through a complete working source code example suitable for local experimentation.
How a Simple RAG Application Using CrewAI and Ollama Works
At a mechanical level, RAG consists of two distinct processes:
- Retrieval — Find relevant document fragments using vector similarity search.
- Generation — Use an LLM to produce a response grounded in those fragments.
Replacing a cloud API with Ollama does not change the retrieval pipeline. It changes the generation layer.
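The two processes can be sketched as a single retrieve-then-generate function. The retriever and generator below are stand-in callables for illustration, not part of any library; a real system plugs in a vector store and an Ollama-backed model:

```python
def rag_answer(query, retrieve, generate):
    """Minimal RAG loop: retrieve context, then generate a grounded answer."""
    fragments = retrieve(query)  # retrieval step: find relevant chunks
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n\n".join(fragments) +
        "\n\nQuestion: " + query
    )
    return generate(prompt)  # generation step: LLM call in a real system

# Stub components to show the flow.
docs = ["Rotate API keys every 90 days.", "Store production secrets in a vault."]
stub_retrieve = lambda q: [d for d in docs if "keys" in d] or docs
stub_generate = lambda prompt: prompt  # identity stub; a real LLM call goes here

print(rag_answer("How often should we rotate keys?", stub_retrieve, stub_generate))
```

Swapping Ollama in for a cloud API only changes what `generate` does; `retrieve` is untouched.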
The Retrieval Layer: Embeddings and Vector Search
The retrieval pipeline typically includes:
- Document ingestion
- Chunking into manageable segments
- Embedding generation
- Vector database storage
- Similarity search at query time
When a user asks a question, the system embeds the query, retrieves the most semantically similar chunks, and injects them into the prompt.
Generation is not the only step that can stay local. Ollama also supports local embedding models, so the embedding step can run on the same machine, keeping the entire RAG stack self-contained.
The Generation Layer: Ollama as a Custom LLM
Ollama runs open-weight models locally (such as Llama variants, Mistral, or other community-supported models). It exposes an HTTP API on localhost, which frameworks can integrate with.
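Because the Ollama server listens on localhost (port 11434 by default), any HTTP client can drive generation. The sketch below builds a request for Ollama's `/api/generate` endpoint; since it assumes a running Ollama instance, only the request is constructed here and the actual call is left commented out:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Construct a non-streaming generation request for the local Ollama API."""
    payload = {
        "model": model,    # e.g., "llama3"
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of streamed chunks
    }
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )

req = build_generate_request("llama3", "Summarize our key rotation policy.")
# With Ollama running locally:
#   resp = urllib.request.urlopen(req)
#   answer = json.loads(resp.read())["response"]
print(req.full_url)
```

Frameworks like CrewAI and LangChain wrap this same HTTP interface; nothing about it requires outbound network access.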
In this architecture:
- Ollama handles inference.
- CrewAI orchestrates agent tasks.
- A vector store (such as FAISS) handles similarity retrieval.
The result is a self-hosted RAG system capable of running entirely on a developer workstation or on-prem GPU server.
Architecture Overview
A minimal local RAG stack looks like this:
- Document Loader → splits content
- Embedding Model (via Ollama or compatible provider) → generates vectors
- FAISS Vector Store → stores embeddings
- CrewAI Agent → synthesizes final response
- Ollama Model → generates output
CrewAI does not manage embeddings or storage. It defines agents and tasks, clarifying responsibility boundaries in the workflow.
Complete Working Example: Simple RAG Application Using CrewAI and Custom LLM (Ollama)
This example assumes:
- Ollama installed locally
- A model pulled (e.g., llama3 or mistral)
- Python 3.10+
Step 1: Install Ollama
Install Ollama and pull a model:
ollama pull llama3
Verify it runs:
ollama run llama3
Exit after confirming.
Step 2: Project Structure
rag-crewai-ollama/
│
├── main.py
├── requirements.txt
└── docs/
└── handbook.txt
Step 3: requirements.txt
crewai
langchain
langchain-community
faiss-cpu
ollama
Install:
pip install -r requirements.txt
Step 4: Sample Internal Document
docs/handbook.txt
Security Guidelines:
- All production secrets must be stored in a vault.
- Rotate API keys every 90 days.
- Enforce multi-factor authentication for admin accounts.
Deployment Policy:
- Run automated tests before merging to main.
- Ensure CI pipeline passes without warnings.
- Document rollback plan before release.
Step 5: Full Source Code (Local RAG with CrewAI + Ollama)
from crewai import Agent, Task, Crew
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# --------------------------------------------------
# 1. Load and Split Documents
# --------------------------------------------------
loader = TextLoader("docs/handbook.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)
docs = text_splitter.split_documents(documents)
# --------------------------------------------------
# 2. Create Embeddings (Local via Ollama)
# --------------------------------------------------
# Note: a dedicated embedding model (e.g., "nomic-embed-text") typically
# improves retrieval quality; "llama3" is used here for simplicity.
embeddings = OllamaEmbeddings(model="llama3")
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# --------------------------------------------------
# 3. Initialize Ollama LLM
# --------------------------------------------------
llm = Ollama(
    model="llama3",
    temperature=0
)
# --------------------------------------------------
# 4. Define CrewAI Agent
# --------------------------------------------------
answer_agent = Agent(
    role="Internal Knowledge Assistant",
    goal="Answer questions strictly using retrieved internal documentation.",
    backstory="You are a precise assistant that avoids speculation.",
    llm=llm,
    verbose=True
)
# --------------------------------------------------
# 5. Retrieval Function
# --------------------------------------------------
def retrieve_context(query: str) -> str:
    results = retriever.get_relevant_documents(query)
    return "\n\n".join([doc.page_content for doc in results])
# --------------------------------------------------
# 6. Execute Query
# --------------------------------------------------
user_query = input("Ask a question about internal documentation: ")
context = retrieve_context(user_query)
task = Task(
    description=f"""
You are given the following internal documentation:

{context}

Using only this documentation, answer the question below.
If the answer is not present, respond with:
"The documentation does not contain this information."

Question: {user_query}
""",
    expected_output="A concise answer grounded strictly in the provided documentation.",
    agent=answer_agent
)
crew = Crew(
    agents=[answer_agent],
    tasks=[task],
    verbose=True
)
result = crew.kickoff()
print("\nFinal Answer:\n")
print(result)
Example Interaction
Input:
What security measures are required for admin accounts?
Output:
Multi-factor authentication must be enforced for admin accounts.
The model responds strictly based on retrieved document fragments.
Trade-Offs of Using Ollama in RAG Systems
Advantages
- Full data control (no external API calls)
- Predictable cost (no per-token billing)
- Offline capability
- Custom model selection and tuning
Limitations
- Hardware constraints (GPU memory limits)
- Potentially lower reasoning quality compared to frontier cloud models
- Model updates require manual management
Local LLM inference is improving rapidly, but model selection matters. Smaller quantized models may struggle with long-context reasoning or nuanced queries.
Framework Comparison for Local RAG Systems
| Feature | CrewAI + Ollama | LangChain + Cloud LLM | AutoGen + Cloud LLM |
|---|---|---|---|
| Data Residency | Fully local | External API | External API |
| Cost Model | Hardware-based | Per token | Per token |
| Orchestration | Role-based agents | Chains & tools | Conversational agents |
| Setup Complexity | Moderate | Moderate | Moderate |
| Latency | Low (local inference) | Dependent on API | Dependent on API |
For organizations prioritizing data isolation, the local stack offers structural advantages. For teams prioritizing model quality at scale, managed APIs often provide stronger performance.
Common Pitfalls
- Choosing a model too small for complex retrieval synthesis.
- Ignoring chunk size tuning.
- Overcomplicating the agent structure before validating retrieval quality.
Local deployments reward iterative tuning.
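Chunk size tuning in particular is easy to reason about with a toy splitter. The helper below mimics fixed-size splitting with overlap (the behavior RecursiveCharacterTextSplitter approximates); it is an illustration of the trade-off, not the library's implementation:

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks; each chunk repeats `overlap` chars of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1000))  # 1000-character dummy document

# Smaller chunks -> more, finer-grained vectors (better pinpointing, less context).
small = split_with_overlap(doc, chunk_size=100, overlap=20)
# Larger chunks -> fewer vectors, each carrying more surrounding context.
large = split_with_overlap(doc, chunk_size=400, overlap=50)

print(len(small), len(large))
```

Running retrieval quality checks at a few chunk sizes before touching the agent layer usually pays off faster than any prompt change.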
FAQ: Simple RAG Application Using CrewAI and Ollama
Q: Can Ollama fully replace cloud LLMs in RAG systems?
A: For many internal documentation use cases, yes. For highly complex reasoning tasks, frontier models may still outperform smaller local models.
Q: Does CrewAI require internet access?
A: No. If embeddings and LLM inference run locally through Ollama, the entire pipeline can operate offline.
Q: What hardware is required?
A: CPU inference works for small models, but GPUs significantly improve performance. Requirements depend on model size.
Q: Can this architecture scale to enterprise environments?
A: Yes, with proper GPU provisioning, monitoring, and model lifecycle management.
Q: Is this approach suitable for sensitive data?
A: It can be, since data remains local. However, security posture depends on infrastructure configuration and access controls.
Closing Thoughts
A simple RAG application using CrewAI and a custom LLM via Ollama demonstrates that document-grounded AI does not require external APIs. The architecture remains conceptually identical to cloud-based RAG: retrieve context, inject into prompt, generate grounded response. The difference lies in deployment control and infrastructure trade-offs.
For engineering teams balancing privacy, cost, and flexibility, local orchestration combined with structured agent workflows presents a viable alternative to purely managed AI stacks.
Editorial Note: This article is based on publicly available industry research, official documentation, and general informational sources. Content is reviewed and updated periodically to reflect changes in products, specifications, pricing models, and market practices.