# Zero-Cost AI Agent Memory: Local GPU Embeddings with SurrealDB and OpenClaw
*How I gave my AI agent semantic memory using a local GPU, SurrealDB, and Qwen3 embeddings — zero API costs, zero data leaving my machine.*

*This is Part 1 of a two-part series. Part 2: Hybrid Search, Graph Recall, and Memory Consolidation →*

My AI agent, Kit, wakes up with amnesia every session. It reads markdown files to reconstruct its memory, but that's brute force — dump everything into context and hope the important stuff doesn't get lost in the noise.

I wanted semantic memory. Ask Kit "what happened with WhatsApp?" and get back the exact section about the gateway fix, not a wall of unrelated notes.

Here's how I built it in an afternoon. Zero API costs. Nothing leaves my machine.

## TLDR

Local GPU embeddings + SurrealDB + OpenClaw = semantic memory for your AI agent at zero cost.

## What We're Building

A fully local pipeline: OpenClaw's `memory_search` calls a small FastAPI server on localhost, the server embeds text with Qwen3-Embedding-4B on your GPU, and SurrealDB stores the memories, their embeddings, and the concept graph.

## The Problem

OpenClaw agents store memory in markdown files: `MEMORY.md` for long-term facts, `memory/YYYY-MM-DD.md` for daily notes. The built-in `memory_search` tool needs an embedding provider to work — typically OpenAI, Google, or Voyage API keys.

No API key? Memory search is disabled. Your agent falls back to loading entire files into context, burning tokens on irrelevant content.

The fix: run your own embedding model on your GPU and point OpenClaw at it.

## Why These Tools

- **SurrealDB**, because it combines document storage, graph relations, and vector search in a single binary. No need for separate Postgres + Neo4j + Pinecone. One database, one query language, one process.
- **Qwen3-Embedding-4B**, because it ranks #5 on the MTEB multilingual leaderboard and #3 on code retrieval — beating most paid API models. It's Apache 2.0 licensed, uses safetensors (no pickle security risk), and fits in 8GB of VRAM.
- **FastAPI**, because OpenClaw expects an OpenAI-compatible `/v1/embeddings` endpoint. A 60-line FastAPI wrapper makes our local model look like OpenAI to OpenClaw.

## Step 1: Install SurrealDB

SurrealDB is a single binary. No Docker, no cluster, no config files.

**License note:** SurrealDB uses the Business Source License 1.1. Free for self-hosted/internal use. After 4 years, the code converts to Apache 2.0.

Start it bound to localhost only (important — no network exposure):
```bash
# Install to ~/.surrealdb/
curl -sSf https://install.surrealdb.com | sh

# Verify
surreal version

# Create data directory
mkdir -p ~/.surrealdb/data

# Start with RocksDB backend, bound to localhost only
surreal start --bind 127.0.0.1:8000 --user root --pass root \
  rocksdb:$HOME/.surrealdb/data/kit.db &

# Verify
curl -s http://127.0.0.1:8000/health && echo "healthy"
```
### Security note

We bind to 127.0.0.1 only — no network exposure. Run as your normal user, not root.

## Step 2: Design the Memory Schema

SurrealDB supports document, graph, AND vector operations in a single query language. Here's the schema:
```sql
-- Connect:
--   surreal sql --endpoint http://127.0.0.1:8000 \
--     --username root --password root \
--     --namespace kit --database memory

-- Memory entries (the core documents)
DEFINE TABLE memory SCHEMAFULL;
DEFINE FIELD kind ON memory TYPE string
    ASSERT $value IN ["conversation","decision","context","daily_log","note","lesson"];
DEFINE FIELD title ON memory TYPE string;
DEFINE FIELD content ON memory TYPE string;
DEFINE FIELD source ON memory TYPE option<string>;
DEFINE FIELD tags ON memory TYPE array<string>;
DEFINE FIELD created_at ON memory TYPE datetime DEFAULT time::now();
DEFINE FIELD updated_at ON memory TYPE datetime DEFAULT time::now();
DEFINE FIELD embedding ON memory TYPE option<array<float>>;

-- Full-text search (BM25)
DEFINE ANALYZER memory_analyzer TOKENIZERS blank, class
    FILTERS snowball(english);
DEFINE INDEX memory_content_search ON memory
    FIELDS content SEARCH ANALYZER memory_analyzer BM25;

-- Vector index (1024 dimensions, cosine distance)
DEFINE INDEX memory_embedding_idx ON memory
    FIELDS embedding MTREE DIMENSION 1024 DIST COSINE;

-- Concept nodes (for graph relations)
DEFINE TABLE concept SCHEMAFULL;
DEFINE FIELD name ON concept TYPE string;
DEFINE FIELD description ON concept TYPE option<string>;
DEFINE INDEX concept_name_idx ON concept FIELDS name UNIQUE;

-- Graph edges
DEFINE TABLE references SCHEMAFULL TYPE RELATION
    FROM memory TO concept;
DEFINE TABLE related_to SCHEMAFULL TYPE RELATION
    FROM concept TO concept;
DEFINE TABLE follows SCHEMAFULL TYPE RELATION
    FROM memory TO memory;
```
This gives you three search modalities in one database:

- `SELECT * FROM memory WHERE content @@ "WhatsApp"` — keyword search with BM25 relevance scoring
- `SELECT *, vector::similarity::cosine(embedding, $vec) AS score FROM memory` — semantic similarity
- `SELECT ->references->concept.name FROM memory:some_entry` — traverse relationships between memories and concepts
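Outside the database, the score that `vector::similarity::cosine` computes is just the dot product of the two vectors divided by their lengths. A quick numpy sketch with toy low-dimensional vectors (illustrative, not part of the original setup):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" standing in for the real 1024-dim model output
query    = np.array([1.0, 0.0, 1.0, 0.0])
memory_a = np.array([1.0, 0.1, 0.9, 0.0])  # similar content
memory_b = np.array([0.0, 1.0, 0.0, 1.0])  # unrelated content

print(cosine_similarity(query, memory_a))  # close to 1.0
print(cosine_similarity(query, memory_b))  # close to 0.0
```

The relevance numbers in the `memory_search` output later (0.68, 0.37) are exactly this quantity over the stored embeddings.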
## Step 3: Install the Embedding Model

### Security review first

Before installing any model, run a security check. For Qwen3-Embedding-4B, the key points are the Apache 2.0 license and that the weights ship as safetensors rather than pickle files.
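The original post's exact check isn't reproduced here, but the core concern with model weights is pickle-based formats, which can execute arbitrary code on load. A minimal sketch (my own helper, not from the original) that scans a downloaded model directory and flags anything that isn't a safe format:

```python
import tempfile
from pathlib import Path

# Extensions that use Python pickle under the hood and can run code on load
UNSAFE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}

def unsafe_weight_files(model_dir) -> list[str]:
    """Return weight files under model_dir that are pickle-based, not safetensors."""
    root = Path(model_dir)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.suffix in UNSAFE_SUFFIXES
    )

# Demo on a throwaway directory: one safe file, one pickle-based file
demo = Path(tempfile.mkdtemp())
(demo / "model.safetensors").touch()
(demo / "legacy.bin").touch()
flagged = unsafe_weight_files(demo)
print(flagged)  # ['legacy.bin']
```

Point it at your Hugging Face cache directory for the model; an empty result means only safe formats are present.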
### Python setup

```bash
# Create a venv
python3 -m venv .venv && source .venv/bin/activate

# Install PyTorch with CUDA (pick the wheel index matching your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu124

# Install the rest
pip install sentence-transformers fastapi uvicorn "pydantic>=2"
```
### Download the model

```python
from sentence_transformers import SentenceTransformer

# This downloads ~7.6GB of safetensors on first run
model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")
```

First load takes a few minutes (downloading weights). Subsequent loads take ~3.4 seconds from cache.
## Step 4: Build the Embedding Server

This is the key piece — a thin FastAPI server that makes your local GPU model look like OpenAI's embedding API to OpenClaw.
```python
#!/usr/bin/env python3
"""OpenAI-compatible embedding server wrapping Qwen3-Embedding-4B."""
from contextlib import asynccontextmanager

import torch
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_NAME = "Qwen/Qwen3-Embedding-4B"
DIM = 1024      # MRL truncation from native 2560
HOST = "127.0.0.1"
PORT = 8678

model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load once at startup; free VRAM on shutdown.
    global model
    from sentence_transformers import SentenceTransformer
    # truncate_dim performs the MRL truncation to 1024 dims
    model = SentenceTransformer(MODEL_NAME, truncate_dim=DIM)
    model.max_seq_length = 8192
    yield
    del model

app = FastAPI(lifespan=lifespan)

class EmbeddingRequest(BaseModel):
    input: str | list[str]
    model: str = MODEL_NAME

@app.get("/health")
def health():
    return {"status": "ok", "model": MODEL_NAME, "dim": DIM,
            "gpu": torch.cuda.is_available()}

@app.post("/v1/embeddings")
def embeddings(req: EmbeddingRequest):
    texts = [req.input] if isinstance(req.input, str) else req.input
    vectors = model.encode(texts, normalize_embeddings=True)
    data, total_tokens = [], 0
    for i, vec in enumerate(vectors):
        total_tokens += len(texts[i]) // 4  # rough estimate
        data.append({"object": "embedding", "index": i, "embedding": vec.tolist()})
    return {"object": "list", "model": req.model, "data": data,
            "usage": {"prompt_tokens": total_tokens, "total_tokens": total_tokens}}

if __name__ == "__main__":
    uvicorn.run(app, host=HOST, port=PORT)
```
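The `DIM = 1024` setting relies on Matryoshka-style truncation: keep the first 1024 of the native 2560 dimensions, then re-normalize so cosine scores stay well-behaved. In isolation, that operation is just (numpy sketch with random stand-in values, my illustration):

```python
import numpy as np

def mrl_truncate(vec: np.ndarray, dim: int = 1024) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

# Stand-in for a native 2560-dim embedding
full = np.random.default_rng(0).normal(size=2560)
small = mrl_truncate(full, 1024)

print(small.shape)  # (1024,)
```

This is what `truncate_dim` does for you inside sentence-transformers; the sketch is only to show there's no magic in dropping dimensions from an MRL-trained model.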
### Run it

```bash
python3 embedding-server.py
```
### Test it

```bash
curl -s http://127.0.0.1:8678/health
# {"status":"ok","model":"Qwen/Qwen3-Embedding-4B","dim":1024,"gpu":true}

curl -s http://127.0.0.1:8678/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "hello world"}'
# Returns 1024-dim embedding vector in OpenAI format
```
### Make it persistent (systemd)

Save this as a user unit, e.g. `~/.config/systemd/user/kit-embedding.service`:

```ini
[Unit]
Description=Kit Embedding Server (Qwen3-Embedding-4B)
After=network.target

[Service]
Type=simple
WorkingDirectory=%h/surrealdb-prototype
ExecStart=%h/surrealdb-prototype/.venv/bin/python3 %h/surrealdb-prototype/embedding-server.py
Restart=on-failure
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=default.target
```

Then `systemctl --user daemon-reload && systemctl --user enable --now kit-embedding`.
## Step 5: Wire Into OpenClaw

This is the satisfying part. One config change and OpenClaw's native `memory_search` tool starts using your GPU. Edit `~/.openclaw/openclaw.json`:
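The original config snippet wasn't preserved in this copy, and OpenClaw's exact key names depend on your version, so check its docs. The shape is an OpenAI-compatible embedding provider pointed at localhost; hypothetically something like:

```json
{
  "memory": {
    "embedding": {
      "provider": "openai",
      "baseUrl": "http://127.0.0.1:8678/v1",
      "apiKey": "local-dummy-key",
      "model": "Qwen/Qwen3-Embedding-4B"
    }
  }
}
```

Every key here is an assumption on my part; the load-bearing parts are the base URL (your FastAPI server on port 8678) and a dummy API key, since the local server doesn't check one.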
Restart the gateway. That's it. OpenClaw will:

- watch your memory files (`MEMORY.md`, `memory/*.md`) for changes
- answer `memory_search` calls using your local embedding server

### Verify it works

Ask your agent something that requires memory recall. In the embedding server logs you'll see:
```
INFO: 127.0.0.1:51364 - "POST /v1/embeddings HTTP/1.1" 200 OK
INFO: 127.0.0.1:51368 - "POST /v1/embeddings HTTP/1.1" 200 OK
```
The agent's `memory_search` should return ranked results with relevance scores:

```
Query: "WhatsApp gateway fix"
#1 [0.68] WhatsApp Gateway Fix — memory/2026-02-12.md
#2 [0.37] Messaging / Channel Behavior — memory/2026-02-10.md
```
## The Numbers

| Metric | Value |
| --- | --- |
| Model load time (cached) | 3.4s |
| Embedding speed | ~14 entries/sec |
| VRAM usage | 7.6GB (fp16) |
| Embedding dimensions | 1024 (MRL-truncated from 2560) |
| MTEB rank (multilingual) | #5 |
| MTEB rank (code retrieval) | #3 |
| API cost | $0 |
| Data leaving your machine | None |

## Why Not Just Use an API?
You could set `OPENAI_API_KEY` and call it a day. Here's why I didn't:

- **Cost:** OpenAI's `text-embedding-3-small` costs $0.02 per million tokens. Sounds cheap until your agent re-indexes on every file change, every session start, every compaction cycle. It adds up.
- **Privacy:** My agent has access to my personal notes, legal documents, project plans, and daily journals. I'm not sending that to an API endpoint.
- **Quality:** Qwen3-Embedding-4B ranks #5 on the MTEB multilingual leaderboard. OpenAI's `text-embedding-3-small` (the default for most tools) doesn't crack the top 20. Higher quality embeddings = better recall.
- **Latency:** Localhost is faster than any API. No network round trips, no rate limits, no cold starts (the model stays loaded via systemd).
- **Control:** I can swap models, change dimensions, add custom preprocessing — all without waiting for a provider to update their API.

## What's Next

The embedding server and SurrealDB prototype give us the foundation. The roadmap: hybrid BM25+vector search, graph recall, auto-ingest of memory files, and memory consolidation.

SurrealDB's multi-model architecture makes all of this possible in a single query. The graph layer is what separates this from a plain vector database — you can ask "what concepts are related to the decisions I made last week?" and get answers that require traversing relationships, not just computing cosine similarity.

**Update:** All four of these are now built. Read Part 2: Hybrid Search, Graph Recall, and Memory Consolidation for the full implementation — hybrid BM25+vector search, graph traversal, auto-ingest, and extractive consolidation. 107 tests, zero frameworks.
## Full Source

All code is available in the `surrealdb-prototype` directory:

- `schema.surql` — SurrealDB table definitions and indexes
- `seed.surql` — Sample data
- `embed.py` — Batch embed memories into SurrealDB
- `search.py` — Semantic search CLI
- `embedding-server.py` — OpenAI-compatible FastAPI server
- `requirements.txt` — Pinned Python dependencies

*Built by a human and his AI familiar, Kit.*