Agent Memory Part 2: Hybrid Search, Graph Recall, and Memory Consolidation
We built the 'What's Next' — hybrid BM25+vector search, graph-aware recall, auto-ingest, and extractive memory consolidation. 107 tests, zero frameworks.

This is Part 2. If you haven't read it yet, start with Part 1: Zero-Cost AI Agent Memory →

In Part 1, I gave my AI agent, Kit, semantic memory — local GPU embeddings, SurrealDB, and a thin OpenAI-compatible server. The post ended with a "What's Next" section listing four features I hadn't built yet. This post is the receipt. All four are done. 107 tests passing. No frameworks. No LLM API calls for any of it. Here's what the "What's Next" became.

After Part 1, Kit could search memories by semantic similarity. Ask "WhatsApp fix" and it'd find the right daily note. But pure vector search has blind spots:

- **It misses keywords.** Ask for "RTX 5070 Ti" and vector search returns vaguely related hardware discussions instead of the exact entry mentioning that GPU. BM25 keyword search nails this.
- **It has no concept of relationships.** Vector search returns isolated documents. It can't tell you "this decision led to that consequence" or "these three memories all reference the same project." That requires a graph.
- **It drowns in old data.** Three weeks of daily notes and the important decisions are buried under routine logs. You need consolidation — compacting old entries into distilled summaries without losing the signal.
- **It doesn't stay current.** Every time Kit writes a new memory file, someone had to manually re-embed it. Real memory needs to auto-ingest.

SurrealDB already had all the primitives — BM25 indexes, vector indexes, graph relations. I just needed to wire them together. The core insight: no single retrieval method is best for everything. Keywords catch exact terms. Vectors catch meaning. Graph relations catch context. The trick is combining them. RRF is embarrassingly simple and surprisingly effective.
Given ranked lists from multiple search methods, RRF scores each result by summing 1/(k + rank) across the lists. The three modalities:

- **Vector search** — embed the query with Qwen3, cosine similarity against stored embeddings.
- **BM25 search** — full-text keyword search with English stemming.
- **Graph boost** — find memories that reference the same concepts as the top vector results.

Take a query like "what happened with the RTX 5070 Ti VRAM issue?" The graph boost is the secret weapon: when the vector results reference concepts like "GPU" and "embeddings", the graph boost pulls in other memories referencing those same concepts — even if they wouldn't rank high by text similarity alone.

Hybrid search answers "what's relevant to this query?" Graph recall answers a deeper question: "what's connected to this, and how?" Every memory in SurrealDB can reference concept nodes via `references` edges. Graph recall starts from seed memories (found via vector similarity) and walks the graph outward. Ask "why did we choose Qwen3 over other embedding models?" and graph recall doesn't just find the model-selection memory — it follows the graph to the surrounding decisions. You get a narrative thread, not isolated snippets.

Memory that requires manual re-indexing isn't memory — it's a filing cabinet. The ingest pipeline watches markdown files and keeps SurrealDB in sync. Daily notes are structured with headers; each `## Section` becomes a chunk, and each chunk gets a deterministic ID based on its source file and section title. Before embedding a chunk, we check whether its content hash has changed. If the text is identical to what's already stored, skip it. This makes re-ingestion cheap — only modified sections get re-embedded. The ingester also maintains a dictionary of known concepts (project names, tools, people); when a chunk mentions a known concept, it automatically creates a `references` edge.

The embedding server can handle batches, but we learned the hard way (500 errors) that 42 chunks in one request is too many. Batch size of 8 works reliably. In OpenClaw, ingestion runs automatically on session start. The hardest problem in agent memory isn't remembering — it's forgetting.
Three weeks of daily notes and you've got 200+ chunks, most of them routine logs. The signal-to-noise ratio degrades. Consolidation compresses old daily notes into distilled long-term entries.

The key constraint: no LLM calls. This runs on a heartbeat (every few hours); making an API call every time would be expensive and slow. Instead of asking an LLM to "summarize these notes," we use extractive methods — selecting the most information-dense sentences from each section. When multiple daily entries reference the same concept, consolidation merges them.

Why not use an LLM? Three reasons. The tradeoff: extractive summaries aren't as fluent as LLM-generated ones. They're sentence fragments stitched together. But for agent recall — where the consumer is another LLM, not a human — fluency doesn't matter. Information density does.

No daemons. No cron jobs (well, one). No polling loops. The memory system is event-driven, triggered by OpenClaw lifecycle events. The agent decides which search modality to use based on the query; simple factual lookups use plain `memory_search`, while complex contextual questions use hybrid search or graph recall.

After building all of this, I needed to know if it actually worked end-to-end. Not just unit tests — the full loop. The test: I wrote "the user's code phrase is blue bunny" into today's daily note, started a new session, and asked Kit if it remembered. It did. From SurrealDB, via hybrid search, with a relevance score. That's the difference between "here's a prototype" and "here's a memory system."

SurrealDB recently published a conversation with Agno's CEO about building agent memory systems. The key points they make are correct. But the post is a Q&A about what's possible. It describes Agno as the framework that provides the "harness" and SurrealDB as the "memory layer," and recommends developers use both together. What we built here is the memory layer without the framework. No Agno. No Agent OS. Just Python, SurrealQL, and a local GPU. The total dependency footprint: four pip packages and a database. That's it.
Frameworks are useful when you need batteries included. But when you're building a memory system for a specific agent (Kit, running in OpenClaw, with markdown-based memory files), you don't need a generic Agent OS. You need 1,266 lines of Python that do exactly what your agent needs.

- **Batch size matters.** Our first ingest attempt sent all 42 chunks to the embedding server in one request. The server returned a 500 error — Qwen3-Embedding-4B on 12GB VRAM can't handle 42 texts at once. Batch size of 8 works. Test your limits.
- **Content hashing is essential.** Without it, every session start re-embeds everything. With it, only changed sections get processed. The difference: 2.1s vs 0.3s on re-ingest.
- **Graph edges need to be auto-created.** Manual concept linking doesn't scale. The auto-linker catches ~80% of relevant connections. The remaining 20% are edge cases the agent can handle via explicit tagging.
- **Extractive consolidation is "good enough."** I expected to need an LLM for summarization. Turns out, selecting the top 3 information-dense sentences from each section and merging by concept produces surprisingly usable consolidated entries. The consumer is an LLM — it doesn't need perfect prose.
- **Event-driven beats always-on.** My first instinct was to build a file watcher daemon. But the agent only needs fresh data at session start, and consolidation only matters every few hours. Event-driven is simpler, cheaper, and just as effective.

All code from both Part 1 and Part 2: Built by a human and his AI familiar, Kit. The blue bunny remembers.

| Promised | Built | Lines of Code |
|---|---|---|
| Hybrid search | BM25 + vector + graph fusion with RRF | 238 |
| Graph-aware recall | Concept traversal + temporal chains | 311 |
| Auto-ingest | Markdown chunker + content-hashed upserts | 309 |
| Memory consolidation | Extractive summarization + concept dedup | 408 |

The Motivation
Hybrid Search: Three Signals, One Ranking
Reciprocal Rank Fusion (RRF)
RRF_score = Σ 1 / (k + rank_i)
where k is a constant (we use 60, the standard from the original paper) and rank_i is the result's position in each ranked list. Results that appear high in multiple lists get the highest combined score.

The Three Search Modalities
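As a sketch, the vector and BM25 legs can be written as SurrealQL query builders. The table and field names here (`memory`, `content`, `embedding`) and the search-index reference are assumptions, not the post's actual schema:

```python
def vector_search_query(query_vec: list[float], top_k: int = 10) -> str:
    """KNN over stored embeddings using SurrealDB's cosine similarity function.

    Assumes a `memory` table with an `embedding` field holding the Qwen3
    vectors; names are illustrative.
    """
    return (
        f"SELECT id, content, vector::similarity::cosine(embedding, {query_vec}) AS score "
        f"FROM memory ORDER BY score DESC LIMIT {top_k};"
    )

def bm25_search_query(query_text: str, top_k: int = 10) -> str:
    """Full-text keyword search against a BM25-backed search index.

    `@1@` is SurrealDB's matches operator with a score reference, read back
    via search::score(1); the index itself would be defined in schema.surql.
    """
    return (
        f"SELECT id, content, search::score(1) AS score FROM memory "
        f"WHERE content @1@ '{query_text}' ORDER BY score DESC LIMIT {top_k};"
    )
```

Each builder returns a string you'd send to SurrealDB over HTTP; keeping them as plain strings makes the fusion layer trivial to test.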
Fusing the Results
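A minimal, self-contained RRF implementation of the formula above, with k = 60. The input shape — one ranked list of result IDs per modality — is an assumption:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A document ranked well by two modalities outranks one that tops a single list — which is exactly the behavior hybrid search wants.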
Why This Beats Single-Modality Search
Graph-Aware Recall: Follow the Threads
The Concept Graph
Memories reference concept nodes via `references` edges. Concepts connect to each other via `related_to` edges. Memories chain temporally via `follows` edges.

```
memory:whatsapp_fix   --references--> concept:whatsapp
memory:whatsapp_fix   --references--> concept:openclaw
concept:whatsapp      --related_to--> concept:messaging
concept:openclaw      --related_to--> concept:gateway
memory:gateway_config --follows-->    memory:whatsapp_fix
```
Multi-Hop Traversal
```python
# Phase 1: vector search for seeds
# Phase 2: expand via concept graph
#   (follow memory -> concept -> related concept -> memory)
# Phase 3: also follow temporal chains
# Phase 4: fetch and rank expanded set
#   (score by distance from seeds, return top_k)
```
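In the real pipeline these hops are SurrealQL graph queries; here is the same four-phase logic sketched over in-memory edge maps. The dict shapes are assumptions:

```python
def graph_recall(seeds, references, related_to, follows, max_results=10):
    """Walk outward from seed memories through concepts and temporal links.

    references: memory id -> set of concept ids it references
    related_to: concept id -> set of related concept ids
    follows:    memory id -> the memory it follows (temporal chain)
    Shapes are illustrative; the real system queries SurrealDB.
    """
    # Phase 1 happened upstream: seeds come from vector search
    scores = {m: 0 for m in seeds}
    # Phase 2: memory -> concept -> related concept -> memory
    seed_concepts = set()
    for m in seeds:
        seed_concepts |= references.get(m, set())
    expanded = set(seed_concepts)
    for c in seed_concepts:
        expanded |= related_to.get(c, set())
    for mem, concepts in references.items():
        if mem in scores:
            continue
        if concepts & seed_concepts:
            scores[mem] = 1   # one hop: shares a concept with a seed
        elif concepts & expanded:
            scores[mem] = 2   # two hops: reachable via a related concept
    # Phase 3: also follow temporal chains
    for mem, prev in follows.items():
        if prev in scores and mem not in scores:
            scores[mem] = scores[prev] + 1
    # Phase 4: rank by distance from the seeds
    return sorted(scores, key=scores.get)[:max_results]
```

Ranking by hop distance keeps the seeds first and pushes loosely connected memories to the tail, which is the "narrative thread" ordering described above.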
What This Gets You
Auto-Ingest: Stay Current Automatically
Markdown Chunking
Each `## Section` heading becomes a chunk; a file with no section headers gets a single chunk whose default title is the filename.
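A minimal chunker matching that rule. Chunks here are plain dicts, and the field names are illustrative:

```python
import re

def chunk_markdown(text: str, filename: str) -> list[dict]:
    """Split a markdown file into one chunk per `## Section` heading.

    Text before the first heading becomes a chunk titled after the file,
    mirroring the default-title-is-filename rule.
    """
    chunks: list[dict] = []
    title = filename  # default title = filename
    body: list[str] = []

    def flush():
        if "\n".join(body).strip():
            chunks.append({"source": filename, "section": title,
                           "text": "\n".join(body).strip()})

    for line in text.splitlines():
        m = re.match(r"^##\s+(.*)", line)
        if m:
            flush()
            title = m.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```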
Content-Hashed Upserts
Concept Auto-Linking
The linker checks each chunk against the concept dictionary; on a match it ensures the concept record exists, then creates a `references` edge from the chunk to the concept.
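A sketch of the auto-linker, assuming the concept dictionary maps display names to record IDs. The whole-word matching rule and the exact SurrealQL statements are illustrative:

```python
import re

def link_concepts(chunk_text: str, known_concepts: dict[str, str]) -> list[str]:
    """Return SurrealQL statements linking a chunk to concepts it mentions.

    known_concepts maps names ("SurrealDB") to record ids ("concept:surrealdb").
    Matching is whole-word and case-insensitive; the post's real matcher
    may differ. `$chunk` stands in for the chunk's record id parameter.
    """
    statements: list[str] = []
    for name, concept_id in known_concepts.items():
        if re.search(rf"(?<!\w){re.escape(name)}(?!\w)", chunk_text, re.IGNORECASE):
            # Ensure concept exists
            statements.append(f"UPSERT {concept_id};")
            # Create edge
            statements.append(f"RELATE $chunk->references->{concept_id};")
    return statements
```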
Batch Embedding
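The batching itself is simple. Here's a sketch with the embedding call injected as a function, so the HTTP layer stays out of the way — `embed` is assumed to POST one batch to the OpenAI-compatible `/v1/embeddings` endpoint and return one vector per text:

```python
from typing import Callable

def embed_in_batches(
    texts: list[str],
    embed: Callable[[list[str]], list[list[float]]],
    batch_size: int = 8,
) -> list[list[float]]:
    """Send texts to the embedding server a few at a time.

    batch_size=8 is the value that survived the 500-error lesson;
    42 texts in one request did not.
    """
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed(texts[i:i + batch_size]))
    return vectors
```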
Running It
```
# One-shot ingest (on session start)
$ python ingest.py --once
Scanning 8 files...
42 chunks found, 38 concepts linked
12 chunks updated (30 unchanged)
Embedded 12 chunks in 0.9s
```
In OpenClaw, this runs automatically on `/new` (session start); one line in the agent's instructions triggers it.
Memory Consolidation: Forget Gracefully
Extractive Summarization
The scoring heuristic, sketched — the weights are kept, while the marker lists and length caps here are illustrative:

```python
import re

def score_sentence(sentence: str) -> float:
    """Score a sentence by information density."""
    score = 0.0
    # Longer sentences tend to carry more info (up to a point)
    length_factor = min(len(sentence) / 120.0, 1.0)
    score += length_factor * 0.3
    # Sentences with specific markers are more likely important
    # (marker list illustrative; the exact words aren't shown here)
    if any(m in sentence.lower() for m in ("decided", "fixed", "because", "learned", "chose")):
        score += 0.4
    # Sentences with technical terms or proper nouns
    term_density = len(re.findall(r"\b[A-Z][A-Za-z0-9]+\b|`[^`]+`", sentence)) / 10.0
    score += min(term_density, 1.0) * 0.2
    # Bullet points starting with action items
    if sentence.lstrip().startswith(("-", "*")):
        score += 0.3
    return score
```
Concept Deduplication
```python
# Find daily memories older than cutoff
# Group by shared concepts
# For each group:
#   take the best sentences across all related memories
#   extract top sentences, merge tags, create consolidated entry
#   re-embed the consolidated entry
#   archive (don't delete) the originals
```
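A sketch of the merge step, assuming memories are dicts with `concepts` and `tags` sets and reusing a caller-supplied sentence scorer; the schema and field names are illustrative:

```python
from collections import defaultdict

def consolidate(memories: list[dict], score_sentence, top_n: int = 3) -> list[dict]:
    """Merge old daily memories that share a concept into distilled entries.

    Each memory: {"id", "text", "concepts": set, "tags": set}. Groups with a
    single member are left alone; re-embedding and archiving happen downstream.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for mem in memories:
        for concept in mem["concepts"]:
            groups[concept].append(mem)
    consolidated = []
    for concept, group in groups.items():
        if len(group) < 2:
            continue  # nothing to merge
        # Take the best sentences across all related memories
        sentences = [s.strip() for m in group for s in m["text"].split(".") if s.strip()]
        best = sorted(sentences, key=score_sentence, reverse=True)[:top_n]
        consolidated.append({
            "concept": concept,
            "text": ". ".join(best) + ".",
            "tags": set().union(*(m["tags"] for m in group)),
            "archived": [m["id"] for m in group],  # originals archived, not deleted
        })
    return consolidated
```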
Why Not Use an LLM?
Wiring It Together: Event-Driven Architecture
| Event | Trigger | Action |
|---|---|---|
| Session start (`/new`) | OpenClaw gateway | Run `ingest.py --once` |
| Memory recall | Agent calls `memory_search` | Hybrid search over SurrealDB |
| Graph recall | Agent calls `graph_recall` (per AGENTS.md instructions) | Multi-hop concept traversal |
| Consolidation | OpenClaw heartbeat (every few hours) | Run `consolidation.py` on old entries |

The Agent's Instructions
In AGENTS.md, the agent (Kit) has instructions for when to use each component: simple factual lookups use `memory_search` (flat file); complex contextual questions use hybrid search or graph recall.

The Smoke Test
The Numbers (Updated)
| Component | Metric | Value |
|---|---|---|
| Ingest | 8 files, 42 chunks | 2.1s total |
| Ingest | Re-ingest (no changes) | 0.3s (hash skip) |
| Hybrid search | Query latency | <500ms |
| Graph recall (2 hops) | Query latency | <800ms |
| Consolidation | 50 entries | <1s |
| Test suite | 4 modules | 107 tests, all passing |
| Total code | 4 Python modules | 1,266 lines |
| LLM API calls | For search/ingest/consolidation | 0 |

What SurrealDB's Blog Gets Right (and What's Missing)
- `sentence-transformers` (for embeddings)
- `requests` (for HTTP to SurrealDB)
- `fastapi` + `uvicorn` (for the embedding server)

Lessons Learned
What's Actually Next
Full Source
- `schema.surql` — SurrealDB table definitions and indexes
- `embedding-server.py` — OpenAI-compatible FastAPI server (Qwen3-Embedding-4B)
- `ingest.py` — Markdown chunker + auto-ingest pipeline
- `hybrid_search.py` — BM25 + vector + graph fusion with RRF
- `graph_recall.py` — Multi-hop concept and temporal traversal
- `consolidation.py` — Extractive memory compaction
- `tests/` — 107 tests across 4 modules