Unified Email Semantic Search: Architecture & Implementation Guide
Target Hardware: Intel i5 4th Gen (Haswell) with AVX2 support.
Target Deployment: Dockerized, behind Cloudflare Tunnel, CI/CD via Gitea Actions.
Memory Budget: Nominally restricted to a 4GB Docker ceiling, with flexibility to allocate more by shrinking adjacent background services.

This document outlines the end-to-end engineering plan for building a self-hosted, natural-language-searchable email aggregator. It is divided into two phases: V1 (Retrieval-Only via external embeddings) and V2 (Local Synthesis via dynamically routed LLM inference).

Phase 1: Retrieval-Only Architecture (V1)
Phase 1 establishes the ingestion pipeline, vector storage, and hybrid search interface. It relies on an external API (e.g., OpenAI text-embedding-3-small or Voyage) for generating embeddings to keep local compute and memory footprints minimal.

1. System Components

Ingestion Engine: mbsync (isync).

Role: Pulls IMAP mail from iCloud, Gmail (via OAuth2 helper scripts), and Exchange accounts into a local Maildir.

Execution: Runs via cron within the container for standard accounts, and uses IMAP IDLE for primary accounts to achieve near real-time sync.

Database: PostgreSQL with pgvector.

Role: Centralized storage for email metadata, keyword indexes (via tsvector / BM25), and semantic vectors.

Tuning: Restrict shared_buffers to 256MB and limit max_connections to prevent the Postgres daemon from dominating the RAM budget.

Indexer & Query API: Python (FastAPI).

Role: Acts as the orchestration layer. A lightweight Python script utilizes watchdog to monitor the Maildir for new files, parses MIME content (stripping HTML), calls the embedding API, and writes to Postgres. It also exposes the search endpoint for the frontend.

Frontend & Webmail: SnappyMail + Custom Search SPA.

Role: SnappyMail (PHP-FPM) provides the standard interface for reading and replying. A minimal React or Svelte SPA serves as the unified search bar, querying the FastAPI backend.

Networking: cloudflared (already active on server-side, can point to a port whenever necessary).

Role: securely exposes the frontend UI via the existing Cloudflare tunnel, bypassing the need for local reverse proxies (like Caddy/Nginx) or port forwarding.

2. Docker Compose Blueprint

Organize the docker-compose.yml into these distinct services:

db: The pgvector/pgvector:pg16 image. Mount a persistent volume for /var/lib/postgresql/data.

mail-sync: A lightweight Alpine image containing mbsync, cron, and Python (for OAuth2 token refresh scripts like mutt_oauth2.py). Mount the local Maildir volume.

api: The FastAPI Python application. Mounts the Maildir volume (read-only) to watch for changes and process new emails.

webmail: The SnappyMail PHP-FPM image paired with a minimal web server to serve the Search SPA.

3. CI/CD Pipeline (Gitea Actions)

Create a .gitea/workflows/deploy.yml file to handle automated rollouts:

Trigger: Execute on push to the main branch.

Build Phase: Use Docker Buildx to build the api and custom webmail/SPA images.

Publish Phase: Push the built images to your local Gitea container registry.

Deploy Phase: Use a local Gitea runner (or SSH action) on the host machine to execute:

docker compose pull

docker compose up -d --build

docker image prune -f (Crucial for keeping the 200GB disk clean over time).

Phase 2: Local LLM Synthesis (V2)
Phase 2 introduces a localized, quantized LLM to synthesize answers directly from the retrieved emails. Because of the Haswell CPU and tight memory constraints, this phase relies on dynamic semantic routing and highly optimized bare-metal inference.

1. The Inference Engine (llama.cpp)

To achieve usable token generation speeds (target: <10 seconds for extraction) on an i5 Haswell processor, standard Python inference wrappers will introduce too much overhead.

Implementation: Build llama.cpp from source inside your API container. You must explicitly compile it with the -mavx2 flag to leverage the Haswell architecture's Advanced Vector Extensions.

Binding: Use standard C++ bindings or a highly efficient wrapper like llama-cpp-python to interface with the FastAPI orchestration layer.

Model Selection: Llama-3.2-3B-Instruct in .gguf format, quantized to Q4_K_M. This will consume roughly 2.0GB of RAM for the weights.

2. Dynamic Semantic Routing

To prevent the LLM from executing deep reasoning on simple extraction tasks (which would blow past the 10-second latency target and exhaust the KV cache), implement a zero-cost heuristic router in the Python API layer.

The Routing Logic:

Query Ingestion: The user submits a search. The API generates the vector embedding.

Intent Classification: Compute the dot product between the query vector and pre-defined "intent vectors" (e.g., Extraction Intent vs. Synthesis Intent).

Execution Paths:

Fast Path (Extraction): Triggered by factual queries (e.g., "What is the flight number?").

Context: Truncate Postgres retrieval to the Top 1 most relevant email.

LLM Parameters: Limit output generation to max_tokens: 30.

Expected Latency: 3–5 seconds.

Slow Path (Synthesis): Triggered by complex queries (e.g., "Summarize the project delays").

Context: Expand retrieval to the Top 3 most relevant emails.

LLM Parameters: Set output generation to max_tokens: 200.

Expected Latency: 15–25 seconds.

3. Memory & Cache Management

When routing to the Slow Path, the LLM's Key-Value (KV) cache will expand significantly as it ingests 3 full emails.

Pre-Processing: Before feeding retrieved emails into the prompt context, aggressively strip all non-essential headers, massive email signatures, and HTML artifacts in the Python layer.

Context Limit: Set a hard n_ctx limit (e.g., 2048 or 4096 tokens) when initializing llama.cpp. If the retrieved emails exceed this length, truncate the oldest or lowest-ranked email rather than letting the KV cache spill over into system swap memory, which would fatally stall the container.

By keeping the routing logic in Python and the heavy compute explicitly bound to a C++ compiled engine, the system will comfortably manage the balance between speed and advanced reasoning within the available resources.