Context Pool
Open source · Self-hosted · No vector DB

Document Q&A without
embeddings or guesswork

Context Pool exhaustively scans every chunk of every document, pools positive hits, and synthesizes a final answer with verbatim citations — all running on your own infrastructure.

Get started · View on GitHub
# 3 commands to get started
$ git clone https://github.com/steve958/Context-Pool.git
$ cp config.example.yaml config/config.yaml
$ docker-compose -f docker-compose.hub.yml up
✓ backend ready http://localhost:8000
✓ frontend ready http://localhost:3000
4 LLM providers · 8 file formats · 0 vector DBs needed · 100% self-hosted
Architecture

How Context Pool works

Four deterministic phases. No semantic shortcuts. Every document, every chunk, every time.

STEP 01

Parse

Each file is converted to clean Markdown — PDF text layers, DOCX headings, HTML content, EML bodies and attachments, or OCR for scanned images.

PyMuPDF · python-docx · BeautifulSoup · OCR.space
STEP 02

Chunk

Markdown is split into token-bounded segments that respect heading boundaries and page markers. Chunk size is fully configurable.

Heading-aware · Token-windowed · Page-marker preserved
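The chunking step can be sketched in a few lines of Python. This is a minimal illustration only: a whitespace word count stands in for a real tokenizer, and the regex split is an assumption about how heading awareness might work, not Context Pool's actual splitter.

```python
import re

def chunk_markdown(md: str, max_tokens: int = 100) -> list[str]:
    """Split Markdown into token-bounded chunks that never cross a heading.

    A whitespace word count stands in for a real tokenizer here; the
    real limits are configurable via max_chunk_tokens.
    """
    # First split at headings, so every chunk stays inside one section.
    sections = re.split(r"(?m)^(?=#{1,6} )", md)
    chunks: list[str] = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Then window each section to the token budget.
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

A long section is windowed into several chunks, while short sections become a single chunk each, so no chunk ever spans two headings.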
STEP 03

Scan

Every chunk is sent to the LLM with a strict extractive prompt. Positive hits are pooled; empty chunks are discarded. No skipping, no shortcuts.

{"has_answer": true, "evidence_quotes": ["..."]}
STEP 04

Synthesize

All pooled hits are sent to the LLM in a single synthesis call. The result is a final answer with full citations: document, page, heading, verbatim quote.

{"final_answer": "...", "citations": [...]}
🔍
Exhaustive by design
Unlike vector-search RAG, Context Pool never prefilters chunks. Every segment of every document is evaluated against your question. If the answer exists somewhere in your documents, Context Pool will find it — even when the vocabulary in the question differs from the document.
Capabilities

Everything you need

Batteries included. From OCR to citations to a production-ready Docker setup.

🔎
Exhaustive scanning
Every chunk of every document is evaluated. No prefiltering, no semantic shortcuts, no missed passages.
📌
Verbatim citations
Every claim is backed by an exact quote from the source, with document name, page number, and heading path.
🏠
Fully self-hosted
Run on your own machine or server. Documents stay in your Docker volume. Your infrastructure, your data.
🔌
4 LLM providers
OpenAI, Anthropic, Google Gemini, and Ollama. Switch without changing code — just update config.yaml.
📄
8 file formats
PDF (text + scanned), DOCX, TXT, Markdown, HTML, EML (with attachments), PNG, and JPEG.
👁
OCR built in
Scanned PDFs and images are processed via OCR.space. Toggle per query — no permanent setup needed.
📧
Email-aware parsing
.eml files are parsed intelligently: body, attachments, or both — individually chunked and cited.
Real-time progress
WebSocket events stream chunk-by-chunk progress to the UI as the scan runs. No polling required.
🧩
REST + WebSocket API
Every feature is available programmatically. The UI is just a client. Build your own integration.
🗂
Workspaces
Organize documents into named workspaces. Query a single document or the entire workspace at once.
🎛
Configurable chunking
Control chunk size, overlap strategy, and token limits. Tune the accuracy vs. cost trade-off for your use case.
🔐
Production security
API key auth middleware, CORS env config, non-root Docker user, file MIME validation, and input bounds checking.
LLM Providers

Your model, your choice

Switch providers by changing one line in config.yaml. No code changes needed.

OpenAI · Recommended
gpt-4o · gpt-4o-mini · gpt-4-turbo
provider: openai
api_key: "ENV:OPENAI_API_KEY"
model: "gpt-4o-mini"
context_window_tokens: 128000
max_chunk_tokens: 24000
💡 gpt-4o-mini is the best cost/quality starting point.
Anthropic · Best reasoning
claude-3-5-sonnet · claude-3-5-haiku · claude-3-opus
provider: anthropic
api_key: "ENV:ANTHROPIC_API_KEY"
model: "claude-3-5-haiku-20241022"
context_window_tokens: 200000
max_chunk_tokens: 32000
💡 200K context window means fewer, larger chunks.
Google Gemini · Largest context
gemini-2.0-flash · gemini-1.5-pro · gemini-1.5-flash
provider: google
api_key: "ENV:GOOGLE_API_KEY"
model: "gemini-2.0-flash"
context_window_tokens: 1000000
max_chunk_tokens: 48000
💡 1M context window. Very large chunk sizes possible.
Ollama · 100% offline
llama3.2 · mistral · phi3 · deepseek-r1
provider: ollama
api_key: ""
model: "llama3.2"
context_window_tokens: 8192
max_chunk_tokens: 3000
ollama_base_url: "http://host.docker.internal:11434"
💡 Nothing leaves your machine. Requires Ollama running locally.
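Whichever provider you pick, max_chunk_tokens has to leave room in the context window for the extractive prompt and the model's JSON reply. A back-of-the-envelope check; the overhead figures below are illustrative assumptions, not Context Pool defaults:

```python
def safe_chunk_budget(context_window: int,
                      prompt_overhead: int = 1000,
                      response_reserve: int = 2000) -> int:
    """Rough upper bound for max_chunk_tokens: chunk + prompt + response
    must all fit inside the provider's context window. The overhead
    values are illustrative assumptions, not shipped defaults."""
    return context_window - prompt_overhead - response_reserve
```

The example configs above stay well under this bound, e.g. 24000 chunk tokens against a 128000-token OpenAI window, or 3000 against Ollama's 8192, leaving generous headroom for the prompt and reply.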
Installation

Up and running in minutes

Docker Compose is the fastest path. Local dev and API-only modes also supported.

1. Clone the repo
git clone https://github.com/steve958/Context-Pool.git
cd Context-Pool
2. Create config
mkdir -p config
cp config.example.yaml config/config.yaml
# Edit config/config.yaml — set provider + model
3. Set your API key
# Create .env at the project root
echo "OPENAI_API_KEY=sk-proj-..." > .env

# Optional: enable API authentication
echo "API_KEY=your-secret-here" >> .env
4. Start (pulls pre-built images — no build needed)
docker-compose -f docker-compose.hub.yml up

# UI  → http://localhost:3000
# API → http://localhost:8000/docs
REST API

First-class programmatic access

Every feature available in the UI is accessible via REST API and WebSocket. Build your own integrations.

WS /ws/query/{run_id}
Real-time events: chunk_progress · synthesis_started · synthesis_finished · error
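A client for the WebSocket stream only needs a small dispatcher over those event types. A sketch: the event names come from the list above, but the message envelope (a "type" field) and the payload fields ("done", "total") are assumptions for illustration.

```python
import json

def handle_event(raw: str, on_progress) -> bool:
    """Dispatch one message from WS /ws/query/{run_id}.

    Returns True while the run is still in flight. Only the event names
    (chunk_progress, synthesis_started, synthesis_finished, error) come
    from the docs; the "type"/"done"/"total" fields are assumed here.
    """
    event = json.loads(raw)
    kind = event["type"]
    if kind == "chunk_progress":
        on_progress(event.get("done"), event.get("total"))
        return True
    if kind in ("synthesis_finished", "error"):
        return False  # terminal events end the stream
    return True  # synthesis_started and anything else: keep listening
```

Wire this into any WebSocket client loop: read messages until `handle_event` returns False, then fetch the final result.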
Request
{
  "name": "Q3 Contracts"
}
Response
{
  "ws_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Q3 Contracts",
  "document_count": 0
}
Use cases

Built for high-stakes document work

Wherever missing a relevant passage is not an option, exhaustive scanning pays off.

⚖️Legal

Contract review

QUESTION
"What does each contract say about termination clauses and notice periods?"
RESULT
Found 7 relevant clauses across 12 contracts. Page and heading citations included.
🔬Research

Literature review

QUESTION
"Which papers discuss transformer attention mechanisms in the context of long documents?"
RESULT
Extracted relevant passages from 34 PDFs, cited by author, section, and page.
📊Finance

Due diligence

QUESTION
"Are there any contingent liabilities or pending litigation mentioned in the disclosure documents?"
RESULT
3 disclosures flagged. Verbatim evidence quotes with page references.
📧Discovery

Email archive search

QUESTION
"Find all emails that discuss the merger timeline and list the mentioned dates."
RESULT
Scanned 240 .eml files including attachments. 18 positive hits extracted.
🏥Healthcare

Clinical document review

QUESTION
"What contraindications are mentioned for Drug X across all patient records?"
RESULT
Processed scanned PDFs via OCR. 9 contraindications found across 15 records.
🛠Engineering

Technical spec analysis

QUESTION
"What are the stated load-bearing limits in each structural report?"
RESULT
Extracted 22 numeric values with units, pages, and table headings cited.
FAQ

Common questions